17  Segregation Measures

Code
import os

import geopandas as gpd
import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd
import seaborn as sns
import contextily as ctx
from factor_analyzer import FactorAnalyzer
from geosnap import DataStore
from geosnap import io as gio
from networkx.drawing.nx_agraph import graphviz_layout
from segregation.batch import batch_compute_multigroup, batch_compute_singlegroup
from segregation.local import LocalDistortion, MultiLocationQuotient
from segregation.multigroup import MultiInfoTheory, MultiGini, MultiDiversity
from segregation.singlegroup import Dissim, Gini, Entropy
from sklearn.cluster import AffinityPropagation, AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score
from scipy.stats import zscore

%load_ext watermark
%watermark -iv -a "eli knaap"
Author: eli knaap

networkx       : 3.4.2
factor_analyzer: 0.5.1
segregation    : 2.5.2.dev4+gea53e4c.d20250128
geosnap        : 0.14.1.dev14+g0443e2a.d20250103
sklearn        : 1.6.0
contextily     : 1.6.2
pandas         : 2.2.3
seaborn        : 0.13.2
scipy          : 1.14.1
geopandas      : 1.0.1
matplotlib     : 3.10.0

In concept, segregation is about separation; when we measure residential segregation, we are asking whether people belonging to different groups share the same space, often conceived as the same ‘neighborhood’. This is a more ambiguous task than measuring, for example, educational segregation, where the shared resource, such as a school, is well-defined.

Segregation of white and colored children in public schools has a detrimental effect upon the colored children. The impact is greater when it has the sanction of the law, for the policy of separating the races is usually interpreted as denoting the inferiority of the Negro group… Any language in Plessy v. Ferguson contrary to this finding is rejected. We conclude that in the field of public education the doctrine of ‘separate but equal’ has no place. Separate educational facilities are inherently unequal.

Warren (1954)

School attendance is usually straightforward to measure: it is generally explicit which students attend which schools. In the case of residential segregation, however, we are forced to rely on the fuzzy notion of the neighborhood. In practice, this means that researchers simply adopt the tract or blockgroup as a placeholder for the neighborhood, then examine tract- or blockgroup-level segregation.

Here, we’ll use PySAL’s segregation module to analyze residential segregation by race and ethnicity in Southern California. We begin by collecting data for the entire region, then partition it into coastal and inland sections.

Code
datasets = DataStore()

socal = gio.get_acs(
    datasets,
    county_fips=["06037", "06025", "06059", "06071", "06073", "06065", "06111"],
    years=[2018],
)
socal = socal.to_crs(socal.estimate_utm_crs())
socal[["p_hispanic_persons", "geometry"]].assign(geometry=socal.geometry.simplify(100)).explore(
    column="p_hispanic_persons",
    scheme="quantiles",
    cmap="Blues",
    k=8,
    tooltip=["p_hispanic_persons"],
    style_kwds={"weight": 0.5},
    tiles='CartoDB Positron'
)
socal["county"] = socal.geoid.str[:5]
county_names = [
    "Imperial",
    "Los Angeles",
    "Orange",
    "Riverside",
    "San Bernadino",
    "San Diego",
    "Ventura",
]
county_fips = ["06025", "06037", "06059", "06065", "06071", "06073", "06111"]
namer = dict(zip(county_fips, county_names))
socal['county'] = socal.county.replace(to_replace=namer)

coastal = socal[socal.county.isin(["Los Angeles", "Orange", "San Diego", "Ventura"])]
inland = socal[socal.county.isin(['Riverside', "San Bernardino", "Imperial"])]

f, ax = plt.subplots(1,2, figsize=(10, 5))
coastal.plot(column='county', ax=ax[0])
inland.plot(column='county', ax=ax[1])

17.1 Residential Segregation Measures

The segregation package calculates dozens of segregation indices, each of which captures something different about the ways that population groups interact or remain separated in space. Most of the commonly-used statistics are global or aggregate measures, meaning they summarize the total level of segregation across all units in a study region.

17.1.1 Single-Group Indices

Single-group indices measure the partitioning of one population group relative to everyone else.

Code
dissim_hisp = Dissim(socal, "n_hispanic_persons", "n_total_pop")
dissim_black = Dissim(socal, "n_nonhisp_black_persons", "n_total_pop")

gini_hisp = Gini(socal, "n_hispanic_persons", "n_total_pop")
gini_black = Gini(socal, "n_nonhisp_black_persons", "n_total_pop")

entropy_hisp = Entropy(socal, "n_hispanic_persons", "n_total_pop")
entropy_black = Entropy(socal, "n_nonhisp_black_persons", "n_total_pop")

Each class has a statistic attribute that holds the computed value for each segregation measure:

Code
dissim_hisp.statistic
0.49957776952346783
Code
dissim_black.statistic
0.547197680270968
Code
gini_hisp.statistic
0.6602166788700566
Code
gini_black.statistic
0.7234615852802052
Code
entropy_hisp.statistic
0.2714618709533524
Code
entropy_black.statistic
0.2616509031724341

According to the Dissimilarity and Gini indices, the Black population in Southern California is more segregated than the Latinx/Hispanic population, but the reverse is true according to the Entropy index.
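
To compare these values side by side, we can gather the fitted statistics into a small table. This is just a convenience sketch that reuses the index objects fitted above.

Code
# collect the fitted single-group statistics into one comparison table
pd.DataFrame(
    {
        "Hispanic/Latinx": [dissim_hisp.statistic, gini_hisp.statistic, entropy_hisp.statistic],
        "Black": [dissim_black.statistic, gini_black.statistic, entropy_black.statistic],
    },
    index=["Dissimilarity", "Gini", "Entropy"],
).round(4)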

17.1.1.1 Batch Computation

To examine several indices at once, segregation provides a set of “batch_compute” functions. Instead of a fitted class, the batch_compute_singlegroup function returns a table of segregation indices and is a convenient way of collecting many statistics simultaneously.

Code
socal_all_singlegroup = batch_compute_singlegroup(socal, "n_hispanic_persons", "n_total_pop")
socal_all_singlegroup
Statistic
Name
AbsoluteCentralization 0.7737
AbsoluteClustering 0.2704
AbsoluteConcentration 0.6548
Atkinson 0.3726
BiasCorrectedDissim 0.4993
BoundarySpatialDissim NaN
ConProf 0.4794
CorrelationR 0.3272
Delta 0.8843
DensityCorrectedDissim 0.3856
Dissim 0.4996
DistanceDecayInteraction 0.4788
DistanceDecayIsolation 0.5741
Entropy 0.2715
Gini 0.6602
Interaction 0.3723
Isolation 0.6277
MinMax 0.6663
ModifiedDissim 0.4901
ModifiedGini 0.6509
PARDissim 0.4808
RelativeCentralization 0.0739
RelativeClustering 0.0667
RelativeConcentration 0.3460
SpatialDissim 0.3593
SpatialProxProf 0.8412
SpatialProximity 1.1710

17.1.2 Multigroup Indices

Multigroup measures capture the partitioning of several population groups simultaneously. Most multigroup measures are extensions of single-group measures and have a more recent history in the literature (Reardon & Firebaugh, 2002).

Code
pop_groups = ['n_asian_persons', 'n_hispanic_persons', 'n_nonhisp_black_persons', 'n_nonhisp_white_persons']

multi_div_coast = MultiDiversity(coastal, pop_groups)
multi_div_inland = MultiDiversity(inland, pop_groups)


multi_info_coast = MultiInfoTheory(coastal, pop_groups)
multi_info_inland = MultiInfoTheory(inland, pop_groups)

For multigroup diversity:

Code
print(f"coast: {multi_div_coast.statistic}")
print(f"inland: {multi_div_inland.statistic}")
coast: 1.1839984141407496
inland: 1.06835444467478

and for multigroup information theory:

Code
print(f"coast: {multi_info_coast.statistic}")
print(f"inland: {multi_info_inland.statistic}")
coast: 0.31802561870773133
inland: 0.20407072911853946

Regardless of which index is used, multigroup segregation is higher in the coastal region than in the inland one.
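
As a quick robustness check, we can fit a third multigroup index the same way; the sketch below uses the MultiGini class, which is imported above but not otherwise used in this section.

Code
# a third multigroup index as a robustness check on the coastal/inland comparison
multi_gini_coast = MultiGini(coastal, pop_groups)
multi_gini_inland = MultiGini(inland, pop_groups)

print(f"coast: {multi_gini_coast.statistic}")
print(f"inland: {multi_gini_inland.statistic}")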

17.1.2.1 Batch Computation

Again, the measures can be “batch computed”:

Code
socal_all_multigroup = batch_compute_multigroup(socal, groups=pop_groups)
Code
socal_all_multigroup
Statistic
Name
GlobalDistortion 335.5186
MultiDissim 0.5131
MultiDivergence 0.3498
MultiDiversity 1.1664
MultiGini 0.6766
MultiInfoTheory 0.2999
MultiNormExposure 0.3048
MultiRelativeDiversity 0.2955
MultiSquaredCoefVar 0.2508
SimpsonsConcentration 0.3517
SimpsonsInteraction 0.6483

17.1.3 Local Segregation Measures

Unlike global measures, local segregation statistics measure segregation in each geographic unit rather than summarizing segregation across the region. For example, the recently proposed Distortion index is designed to visualize how segregation varies across a region (Bézenac et al., 2022; Olteanu et al., 2019).

Trajectory-convergence analysis provides a flexible way of capturing change across all scales, beginning with small spatial units, and of describing how the rate of convergence toward the citywide average varies over space. The method thus measures how far, in spatial terms, any individual or neighborhood is from the citywide multigroup distribution.

Code
d = LocalDistortion(socal, groups=pop_groups)
ax = d.data.plot('distortion',  scheme='quantiles', cmap='RdBu_r', alpha=0.6, )
ctx.add_basemap(ax=ax, crs=socal.crs)

Code
d.data.assign(geometry=d.data.geometry.simplify(100)).explore(
    "distortion",
    cmap="RdBu_r",
    style_kwds={"weight": 0.5},
    scheme="quantiles",
    tiles="CartoDB Positron",
)
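
Because local statistics attach a value to every geographic unit, we can also summarize their distribution directly. The sketch below simply describes the distortion scores stored on the fitted LocalDistortion object.

Code
# distribution of local distortion scores across all units in the region
d.data["distortion"].describe().round(3)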

17.2 The Dimensions of Residential Segregation

With more than 20 segregation indices at our disposal, it can be confusing to understand which measure is ‘best’, or whether multiple indices really capture slightly different versions of ‘the same thing’. A seminal contribution in segregation measurement comes from Massey & Denton (1988), who were the first to examine quantitatively the wide variety of segregation indexing strategies and the information each provides. Their paper is important first because of the sheer volume of work: in the late 1980s, it was seriously laborious to gather data for a large number of metropolitan regions and compute a dozen segregation indices in each one. Its second major contribution is the way it helped clarify the relationships among the various indices that had been proposed over the years.

Like personality theory and item-response theory in psychology, Massey and Denton proposed that there are multiple ways to measure segregation that capture different concepts of the term (e.g., unevenness versus clustering), but there are probably not 20 distinct concepts the way there are 20 indices in use. Instead, they argue, those two dozen indices are different ways of capturing the five or so dimensions of segregation that really exist. This was a dramatic step forward in understanding what we are actually measuring when we study residential segregation. To conduct their analysis, Massey and colleagues gathered metropolitan data from around the country, computed every segregation index they could manage, and then ran a factor analysis on the set of results.

Assessing the outcome, they argued that segregation indices could be divided into five categories: evenness, exposure, concentration, centralization, and clustering.

Minority members may be distributed so that they are overrepresented in some areas and underrepresented in others, varying on the characteristic of evenness. They may be distributed so that their exposure to majority members is limited by virtue of rarely sharing a neighborhood with them. They may be spatially concentrated within a very small area, occupying less physical space than majority members. They may be spatially centralized, congregating around the urban core, and occupying a more central location than the majority. Finally, areas of minority settlement may be tightly clustered to form one large contiguous enclave, or be scattered widely around the urban area. (Massey & Denton, 1988, p. 283)

Following Massey & Denton (1988), Massey et al. (1996), and Massey (2012), we can look at dimensionality in segregation measures by computing two-group indices for the Black, Hispanic, and Asian populations (each versus all others) for every metro region in the country. Here we will use 2021 ACS data at the blockgroup level and a 2km radius for computing generalized spatial measures, which helps account for heterogeneity in blockgroup size.

Code
multi_groups = [
    "n_nonhisp_white_persons",
    "n_nonhisp_black_persons",
    "n_hispanic_persons",
    "n_asian_persons",
]
singles = ["n_nonhisp_black_persons", "n_hispanic_persons", "n_asian_persons"]

datasets = DataStore()
msas = datasets.msas()
msas = msas[msas["type"].str.startswith("Metro")]  # only metros not micros
msa_fips = msas.geoid.values

17.2.1 Two-Group Measures

To recreate Massey’s results, we will batch compute every segregation index available in a for-loop using data for every metropolitan region in the country. This gives us the raw data we need to ask whether the indices can be collapsed into a smaller set of factors.

Code
for group in singles:
    if not os.path.exists(f"../data/{group}_measures.csv"):
        dfs = []
        for metro in msa_fips:
            try:
                # get blockgroup-level data for the MSA
                df = gio.get_acs(datasets, msa_fips=metro, level="bg", years=[2021])
                # create a temporary 'total' population of white/other
                df["temp_total"] = df[group] + df["n_nonhisp_white_persons"]
                df = df.dropna(subset=["geometry"]).to_crs(df.estimate_utm_crs())
                # compute all seg measures with a 2km neighborhood
                seg = batch_compute_singlegroup(
                    df, group_pop_var=group, total_pop_var="temp_total", distance=2000
                )
                dfs.append(seg.Statistic.rename(metro))
            except Exception as e:  # PR (Puerto Rico) metros will fail
                print(e)
                df = None
                pass
        results = pd.concat(dfs, axis=1).T
        results.to_csv(f"../data/{group}_measures.csv")
results = pd.concat(
    [pd.read_csv(f"../data/{group}_measures.csv", index_col=0, converters={0:str}) for group in singles]
)

results
AbsoluteCentralization AbsoluteClustering AbsoluteConcentration Atkinson BiasCorrectedDissim BoundarySpatialDissim ConProf CorrelationR Delta DensityCorrectedDissim ... MinMax ModifiedDissim ModifiedGini PARDissim RelativeCentralization RelativeClustering RelativeConcentration SpatialDissim SpatialProxProf SpatialProximity
10180 0.9042 0.2060 0.9826 0.1854 0.2700 0.4386 0.0912 0.0711 0.9389 0.2354 ... 0.4268 0.2526 0.3842 0.5578 0.2732 2.8646 0.8783 0.4294 0.3103 1.1767
10420 0.6662 0.1589 0.9315 0.3920 0.5184 0.5253 0.3129 0.2655 0.7895 0.2338 ... 0.6831 0.5113 0.6634 0.6148 0.3112 1.2762 0.6891 0.5175 0.3230 1.1463
10500 0.7307 0.1498 0.7150 0.4416 0.5449 0.3733 0.6500 0.3603 0.7642 0.3928 ... 0.7058 0.5366 0.7039 0.5224 0.3250 -0.3211 0.6843 0.3549 1.2321 1.1838
10740 0.9356 0.0814 0.9715 0.1645 0.3076 NaN 0.0594 0.0435 0.9527 0.3065 ... 0.4720 0.2865 0.4018 0.5727 0.1409 1.8258 0.5618 0.5001 0.1721 1.0728
10780 0.8175 0.2732 0.8932 0.4960 0.5688 0.4437 0.4854 0.3902 0.8486 0.3596 ... 0.7253 0.5609 0.7229 0.6129 0.4747 0.6398 0.7614 0.4309 0.6865 1.2401
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49420 0.8819 0.1295 0.9884 0.3293 0.4473 0.6066 0.0821 0.0729 0.9325 0.4331 ... 0.6193 0.4248 0.5729 0.6500 0.3188 5.3510 0.7655 0.6027 0.1564 1.1197
49620 0.5282 0.0403 0.9106 0.2805 0.3691 0.6393 0.0256 0.0222 0.8133 0.3260 ... 0.5422 0.3401 0.4777 0.6631 0.2947 3.7648 0.4683 0.6369 0.0929 1.0365
49660 0.5693 0.0525 0.9268 0.4412 0.4899 0.7501 0.0290 0.0272 0.8686 0.4154 ... 0.6617 0.4498 0.6161 0.7616 0.1188 9.3811 0.4258 0.7487 0.0423 1.0503
49700 0.8876 0.2421 0.9240 0.1872 0.3156 0.3531 0.1736 0.1205 0.8744 0.2860 ... 0.4801 0.3039 0.4390 0.4757 0.1219 1.8433 0.7137 0.3512 0.3706 1.1877
49740 0.9686 0.0656 0.9851 0.2637 0.3569 0.5350 0.0465 0.0397 0.9611 0.3272 ... 0.5278 0.3238 0.4655 0.5713 0.2100 2.1184 0.7531 0.5346 0.1533 1.0529

906 rows × 27 columns

The results dataframe now stores every segregation index for every metropolitan region in the country, with each row representing a different metro region and each column storing a different segregation index. As such, we could use it to quickly glance at the most segregated places in the country in 2021. According to the Gini index, the top ten were essentially all in the South or the Midwest.

Code
results.index = results.index.astype(str)
results[["Gini"]].sort_values("Gini", ascending=False).merge(
    msas, left_index=True, right_on="geoid"
).head(10)[["Gini", "name"]]
Gini name
176 0.8751 Decatur, AL
278 0.8516 Chicago-Naperville-Elgin, IL-IN-WI
567 0.8500 Milwaukee-Waukesha, WI
29 0.8387 Beckley, WV
186 0.8327 Detroit-Warren-Dearborn, MI
739 0.8317 Rocky Mount, NC
371 0.8245 Hammond, LA
2 0.8240 Atlantic City-Hammonton, NJ
577 0.8199 Monroe, LA
87 0.8197 Cleveland-Elyria, OH

One of the easiest ways to understand the dimensionality in the dataset is a heatmap of the variable correlation matrix, which we can do using the clustermap function from seaborn. A good way to pick out groups of variables is to scan the diagonal for blocks of red.

Code
sns.clustermap(
    results.corr(),
    cmap="RdBu_r",
    annot=True,
    fmt=".2f",
    figsize=(10, 10),
    annot_kws={"size": 6},
)

The large contiguous blocks are groups of variables that are highly intercorrelated and largely capture the “same thing”. Whether the underlying concept being measured is a factor or a component is a matter of debate. In the segregation context, these different indices are probably better treated as components that capture slightly different measurements of the same construct, rather than as manifest outcomes of some underlying latent process of, e.g., “unevenness segregation”. That is, the Gini and Dissimilarity indices are probably different ways of measuring ‘unevenness’, as opposed to a social process of unevenness that manifests as these two different variables, “gini” and “dissimilarity”. As Rees (1971) describes, factor analysis applied to urban data more often adopts the alternative view: “more modest is the claim that components or factors represent concise descriptions of patterns of associations of attributes across observations.”

All the same, the Massey and Denton approach that treats dimensions as factors is a useful way of tackling the problem, and we can adopt the factor view to recreate their analysis.

17.2.2 Factor Analysis

According to the documentation for the R psych package (Revelle, 2024):

Factor analysis is an attempt to approximate a correlation or covariance matrix with one of lesser rank. The basic model is that \({}_{n}R_{n} \approx {}_{n}F_{k}\,{}_{k}F'_{n} + U^{2}\) where \(k\) is much less than \(n\). There are many ways to do factor analysis, and maximum likelihood procedures are probably the most commonly preferred (see factanal). The existence of uniquenesses is what distinguishes factor analysis from principal components analysis (e.g., principal). If variables are thought to represent a “true” or latent part then factor analysis provides an estimate of the correlations with the latent factor(s) representing the data. If variables are thought to be measured without error, then principal components provides the most parsimonious description of the data. Factor loadings will be smaller than component loadings for the latter reflect unique error in each variable. The off diagonal residuals for a factor solution will be superior (smaller) than that of a component model. Factor loadings can be thought of as the asymptotic component loadings as the number of variables loading on each factor increases.
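
To see the distinction concretely, we can compare first-component loadings with first-factor loadings estimated from the same standardized data. The sketch below uses scikit-learn’s PCA purely for illustration (it is not used elsewhere in this chapter); the pattern to look for is that the factor loadings are generally smaller than the component loadings.

Code
# a minimal sketch contrasting component and factor loadings on the same data
import numpy as np
from sklearn.decomposition import PCA

X = results.fillna(0).apply(zscore)  # standardize each index
pca = PCA(n_components=1).fit(X)
fa1 = FactorAnalyzer(n_factors=1, rotation=None).fit(X)

# component loadings rescale the eigenvector by sqrt(eigenvalue);
# factor loadings come straight from the fitted FactorAnalyzer
pd.DataFrame(
    {
        "component": np.abs(pca.components_[0]) * np.sqrt(pca.explained_variance_[0]),
        "factor": np.abs(fa1.loadings_[:, 0]),
    },
    index=results.columns,
).round(3).sort_values("component", ascending=False).head()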

To assess the number of factors we need to extract, it is common to begin exploratory factor analyses with a ‘scree plot’. Here, we fit the same number of factors as variables in the data, then plot the amount of variance accounted for by each factor. Moving to the right along the x-axis, we are looking for an “elbow” in the plot that shows where estimating additional factors does not explain much more variance.

Code
# first, look for the number of factors to extract
fa = FactorAnalyzer(rotation="oblimin", n_factors=results.shape[1])
fa.fit(results.fillna(0))
ev, v = fa.get_eigenvalues()
ev = pd.Series(ev)
ev.iloc[:10].plot(grid=True, style=".-", figsize=(6,6))

The scree plot suggests an elbow at about 3, maybe 4. Revelle advises against using the eigenvalue > 1 rule to determine the number of factors.
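
As a complement to the scree plot, the fitted FactorAnalyzer can also report the share of variance explained by each factor. The sketch below calls get_factor_variance on the exploratory fit above.

Code
# proportional and cumulative variance explained by the first five factors
variance, prop_var, cum_var = fa.get_factor_variance()
pd.DataFrame(
    {"proportional": prop_var[:5], "cumulative": cum_var[:5]},
    index=[f"F{i+1}" for i in range(5)],
).round(3)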

Code
fa = FactorAnalyzer(rotation="oblimin", n_factors=5)
fa.fit(results)
factors = pd.DataFrame.from_records(
    fa.loadings_, index=results.columns, columns=["F1", "F2", "F3", "F4", "F5"]
)

Loadings of less than 0.1 are generally considered unimportant (R and other packages suppress them); Revelle suggests ignoring loadings below 0.3.

We will follow Massey and Denton (and the clustering results above) and aim for five factors.

Code
factors = factors.mask(factors < 0.3)
factors
F1 F2 F3 F4 F5
AbsoluteCentralization NaN NaN NaN NaN 0.717832
AbsoluteClustering NaN NaN 0.695828 0.475336 NaN
AbsoluteConcentration NaN NaN NaN NaN 0.314571
Atkinson 0.927950 NaN NaN NaN NaN
BiasCorrectedDissim 0.975286 NaN NaN NaN NaN
BoundarySpatialDissim NaN 0.866075 NaN NaN NaN
ConProf 0.329183 NaN 0.560560 NaN NaN
CorrelationR 0.481336 NaN 0.670719 NaN NaN
Delta NaN 0.626410 NaN NaN 0.424750
DensityCorrectedDissim 0.526844 NaN NaN NaN NaN
Dissim 0.974563 NaN NaN NaN NaN
DistanceDecayInteraction NaN NaN NaN NaN NaN
DistanceDecayIsolation NaN NaN 0.526268 0.528260 NaN
Entropy 0.815438 NaN 0.377583 NaN NaN
Gini 0.971297 NaN NaN NaN NaN
Interaction NaN NaN NaN NaN NaN
Isolation NaN NaN 0.516079 0.360120 NaN
MinMax 0.974450 NaN NaN NaN NaN
ModifiedDissim 0.968980 NaN NaN NaN NaN
ModifiedGini 0.972560 NaN NaN NaN NaN
PARDissim 0.311544 0.859431 NaN NaN NaN
RelativeCentralization 0.314008 NaN NaN NaN 0.725198
RelativeClustering NaN 0.780695 NaN NaN NaN
RelativeConcentration NaN NaN NaN NaN 0.588542
SpatialDissim NaN 0.857307 NaN NaN NaN
SpatialProxProf NaN NaN NaN 0.864273 NaN
SpatialProximity NaN NaN 0.876189 NaN NaN

The latent factors are on the columns and the loadings for each variable are on the rows:

Code
for f in factors.columns:
    print(f"{f}:\n",factors.dropna(subset=[f])[f].sort_values(ascending=False),"\n")
F1:
 BiasCorrectedDissim       0.975286
Dissim                    0.974563
MinMax                    0.974450
ModifiedGini              0.972560
Gini                      0.971297
ModifiedDissim            0.968980
Atkinson                  0.927950
Entropy                   0.815438
DensityCorrectedDissim    0.526844
CorrelationR              0.481336
ConProf                   0.329183
RelativeCentralization    0.314008
PARDissim                 0.311544
Name: F1, dtype: float64 

F2:
 BoundarySpatialDissim    0.866075
PARDissim                0.859431
SpatialDissim            0.857307
RelativeClustering       0.780695
Delta                    0.626410
Name: F2, dtype: float64 

F3:
 SpatialProximity          0.876189
AbsoluteClustering        0.695828
CorrelationR              0.670719
ConProf                   0.560560
DistanceDecayIsolation    0.526268
Isolation                 0.516079
Entropy                   0.377583
Name: F3, dtype: float64 

F4:
 SpatialProxProf           0.864273
DistanceDecayIsolation    0.528260
AbsoluteClustering        0.475336
Isolation                 0.360120
Name: F4, dtype: float64 

F5:
 RelativeCentralization    0.725198
AbsoluteCentralization    0.717832
RelativeConcentration     0.588542
Delta                     0.424750
AbsoluteConcentration     0.314571
Name: F5, dtype: float64 

The first factor loads heavily on the Dissimilarity, Gini, MinMax, and Entropy indices (evenness).

The proximity, isolation, and clustering measures load onto the third and fourth factors (clustering/exposure).

This solution suggests there are probably closer to three distinct factors; factors 3 and 4 are almost collinear.

Code
fa.phi_
array([[ 1.        ,  0.22549582,  0.04086467,  0.11374998,  0.31756587],
       [ 0.22549582,  1.        ,  0.09986776, -0.3944805 , -0.363344  ],
       [ 0.04086467,  0.09986776,  1.        , -0.03852471,  0.26115788],
       [ 0.11374998, -0.3944805 , -0.03852471,  1.        ,  0.46093543],
       [ 0.31756587, -0.363344  ,  0.26115788,  0.46093543,  1.        ]])

The phi_ attribute stores the factor correlation matrix, which looks reasonable here. There is some correlation among the factors (expected, since we intentionally used an oblique rotation), but nothing that suggests the factors overlap entirely. In other words, while the factors may be related, each one also captures unique information.
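
For readability, we can attach the factor names to phi_ and plot it as a heatmap (a small convenience sketch using the objects defined above).

Code
# label the factor correlation matrix and visualize it
phi = pd.DataFrame(fa.phi_, index=factors.columns, columns=factors.columns)
sns.heatmap(phi, annot=True, fmt=".2f", cmap="RdBu_r", vmin=-1, vmax=1, square=True)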

A common visualization technique for factor analysis is the path diagram, which shows the relationships between the input variables and the latent factors. The factor analysis ecosystem is much less developed in Python than it is in R, but we can recreate a typical visualization using networkx and pygraphviz (specifically, by slightly tweaking the dot layout). First, we convert the factor loadings to a networkx graph object, where the factors and variables are nodes in a hierarchical directed network and the factor loadings are the edge weights between factors and variables; then we use the graphviz dot layout to draw the graph.

Code
G2 = nx.from_pandas_edgelist(
    factors.T.stack().rename("weight").reset_index().round(3),
    source="level_0",
    target="level_1",
    edge_attr="weight",
    edge_key="weight",
    create_using=nx.DiGraph,
)

f, ax = plt.subplots(figsize=(13, 15))

pos = graphviz_layout(G2, prog="dot", args='-Grankdir="LR"')
nx.draw_networkx(
    G2,
    pos=pos,
    with_labels=True,
    ax=ax,
    edge_cmap=plt.cm.Reds,
    edge_color=factors.T.stack().values,
    node_size=500,
    width=3,
    arrowsize=14,
)
labels = nx.get_edge_attributes(G2, "weight")
nx.draw_networkx_edge_labels(G2, pos, edge_labels=labels)
ax.margins(0.1, None)  # add some horizontal space to fit labels
ax.axis('off')

This plot shows the relationship between factors and variables as a kind of network graph. Each factor is displayed as a node on the left-hand side, and arrows point to each of the variables that load strongly onto it (with color and label denoting intensity). The direction of the arrow is important, as it implies that the measured variables are partial outcomes of the latent factor.

This graphic can also be drawn in pure networkx (i.e., without pygraphviz) using a multipartite layout, but the result is not as nice: it does not group the manifest variables to avoid line crossings, and the latent variables all have the same spacing, which increases overlap.

Code
f, ax = plt.subplots(figsize=(13, 14))

for layer, nodes in enumerate(reversed(tuple(nx.topological_generations(G2)))):
    # `multipartite_layout` expects the layer as a node attribute, so add the
    # numeric layer value as a node attribute
    for node in nodes:
        G2.nodes[node]["layer"] = layer

# Compute the multipartite_layout using the "layer" node attribute
pos = nx.multipartite_layout(
    G2,
    subset_key="layer",
    align="vertical",
)
nx.draw_networkx(
    G2,
    pos=pos,
    ax=ax,
    edge_cmap=plt.cm.Reds,
    edge_color=factors.T.stack().values,
    node_size=500,
    width=3,
    arrowsize=14,
)
nx.draw_networkx_edge_labels(G2, pos, edge_labels=labels)
ax.margins(0.1, None)  # add some horizontal space to fit labels
ax.axis('off')

17.2.3 Cluster Analysis

Instead of factor analysis, we could alternatively use cluster analysis to try and group segregation indices into categories.

Affinity propagation is a clustering algorithm in which the number of clusters is endogenous, so it works well for exploratory analysis when we are not sure how many clusters to expect. If we fit a cluster model to the correlation matrix of segregation indices, we are looking for groups of variables that capture the same dimension. This works for our purposes because we do not really care about the factor loadings or about measuring the latent construct per se (the segregation measures themselves are preferable for that). Instead, we are asking whether these indices provide unique information, and how many unique dimensions we should consider.

Since cluster assignments are discrete, and we consider cluster assignments ‘best’ when they are unambiguous (i.e., clusters are well-separated and each observation belongs to only one cluster), this is akin to treating the dimensions as orthogonal in factor analysis. If we instead believe the factors are correlated (as with an oblique rotation), then we would not expect the clusters to be well-separated.

Code
ap = AffinityPropagation().fit_predict(results.corr())
ap = pd.Series(dict(zip(results.columns.values, ap)), name="Index Type")
ap.nunique()
5
Code
silhouette_score(results.corr(), ap)
0.6031665663470883
Code
ap.sort_values()
SpatialProximity            0
SpatialProxProf             0
DistanceDecayIsolation      0
CorrelationR                0
ConProf                     0
Isolation                   0
AbsoluteClustering          0
RelativeClustering          1
SpatialDissim               1
DensityCorrectedDissim      1
BoundarySpatialDissim       1
PARDissim                   1
Interaction                 2
DistanceDecayInteraction    2
AbsoluteConcentration       2
ModifiedGini                3
ModifiedDissim              3
Entropy                     3
Gini                        3
Dissim                      3
BiasCorrectedDissim         3
Atkinson                    3
MinMax                      3
Delta                       4
RelativeCentralization      4
RelativeConcentration       4
AbsoluteCentralization      4
Name: Index Type, dtype: int64
Code
clust_labels = AgglomerativeClustering(n_clusters=5).fit_predict(results.corr())
clust_labels = pd.Series(dict(zip(results.columns.values, clust_labels)))
clust_labels.sort_values()
SpatialProximity            0
AbsoluteClustering          0
ConProf                     0
CorrelationR                0
Isolation                   0
DistanceDecayIsolation      0
SpatialProxProf             0
AbsoluteConcentration       1
Interaction                 1
DistanceDecayInteraction    1
ModifiedGini                2
ModifiedDissim              2
MinMax                      2
Gini                        2
Entropy                     2
BiasCorrectedDissim         2
Atkinson                    2
Dissim                      2
Delta                       3
RelativeCentralization      3
RelativeConcentration       3
AbsoluteCentralization      3
DensityCorrectedDissim      4
BoundarySpatialDissim       4
PARDissim                   4
RelativeClustering          4
SpatialDissim               4
dtype: int64
Code
silhouette_score(results.corr(), clust_labels)
0.6031665663470883

The interesting thing is how well these results map onto Massey and Denton’s original categories, even though clustering and exposure arguably make more sense collapsed into a single category. That tension is even more pronounced in these results: isolation and interaction are basically inverses of one another, yet they end up in different categories.
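
We can check the point about isolation and interaction directly from the results table; their correlation should be strongly negative.

Code
# isolation and interaction are (nearly) perfect inverses of one another
results[["Isolation", "Interaction"]].corr().round(3)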

Code
clust_labels = AgglomerativeClustering(n_clusters=3).fit_predict(results.corr())
clust_labels = pd.Series(dict(zip(results.columns.values, clust_labels)))
silhouette_score(results.corr(), clust_labels)
0.5582547485763226
Code
clust_labels.sort_values()
AbsoluteCentralization      0
SpatialDissim               0
AbsoluteConcentration       0
RelativeConcentration       0
RelativeClustering          0
BoundarySpatialDissim       0
RelativeCentralization      0
PARDissim                   0
Delta                       0
DensityCorrectedDissim      0
DistanceDecayInteraction    0
Interaction                 0
Isolation                   1
SpatialProximity            1
DistanceDecayIsolation      1
CorrelationR                1
ConProf                     1
AbsoluteClustering          1
SpatialProxProf             1
Gini                        2
Dissim                      2
MinMax                      2
ModifiedDissim              2
ModifiedGini                2
BiasCorrectedDissim         2
Atkinson                    2
Entropy                     2
dtype: int64
Code
klabels = KMeans(n_clusters=5).fit_predict(results.corr())
Code
klabels = pd.Series(dict(zip(results.columns.values, klabels)))
silhouette_score(results.corr(), klabels.values)
0.6031665663470883

The clustering algorithms are all quite stable in their assignments. When hierarchical clustering and k-means are tuned using the optimal silhouette score, they agree exactly on the assignments in the five-cluster solution.
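
One way to verify that agreement is to compare the label vectors directly. The sketch below uses scikit-learn’s adjusted_rand_score (not used elsewhere in this chapter); a value of 1 indicates identical partitions up to relabeling.

Code
# compare the affinity propagation and k-means partitions of the indices
from sklearn.metrics import adjusted_rand_score

adjusted_rand_score(ap.values, klabels.values)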

17.2.4 Multi-Group Measures

Following the same approach, we can extend Massey et al.’s analysis to multigroup segregation indices.

Code
if not os.path.exists(f"../data/multigroup_measures.csv"):
    dfs = []
    for metro in msa_fips:
        try:
            df = gio.get_acs(datasets, msa_fips=metro, level="bg", years=[2021])
            df = df.dropna(subset=["geometry"]).to_crs(df.estimate_utm_crs())
            seg = batch_compute_multigroup(df, groups=multi_groups, distance=2000)
            dfs.append(seg.Statistic.rename(metro))
        except Exception as e:  # PR (Puerto Rico) metros will fail
            print(e)
            df = None
            pass
    results = pd.concat(dfs, axis=1).T
    results.to_csv(f"../data/multigroup_measures.csv")
Code
results_multi = pd.read_csv("../data/multigroup_measures.csv", index_col=0)
results_multi
GlobalDistortion MultiDissim MultiDivergence MultiDiversity MultiGini MultiInfoTheory MultiNormExposure MultiRelativeDiversity MultiSquaredCoefVar SimpsonsConcentration SimpsonsInteraction
10180 3.0591 0.2673 0.0827 0.9546 0.3718 0.0867 0.0894 0.0847 0.0570 0.4510 0.5490
10420 17.1550 0.4556 0.1575 0.7908 0.6000 0.1992 0.2183 0.2032 0.1079 0.5714 0.4286
10500 6.0888 0.5272 0.2333 0.8154 0.6913 0.2861 0.3310 0.3163 0.1514 0.5043 0.4957
10540 0.7949 0.2326 0.0314 0.4983 0.3146 0.0630 0.0384 0.0369 0.0194 0.7323 0.2677
10580 14.5605 0.3502 0.1169 0.9452 0.4758 0.1236 0.1384 0.1170 0.0726 0.5073 0.4927
... ... ... ... ... ... ... ... ... ... ... ...
49420 8.1606 0.4097 0.1384 0.8083 0.5436 0.1713 0.2149 0.2098 0.0891 0.4848 0.5152
49620 10.1244 0.4475 0.1525 0.9006 0.5436 0.1694 0.1945 0.1630 0.0899 0.5057 0.4943
49660 12.2645 0.5026 0.1643 0.6844 0.6598 0.2400 0.2562 0.2309 0.1201 0.6336 0.3664
49700 2.2137 0.2239 0.0788 1.0918 0.3186 0.0721 0.0652 0.0652 0.0552 0.3738 0.6262
49740 8.5083 0.3735 0.1387 0.7294 0.5491 0.1901 0.2086 0.2006 0.0832 0.5565 0.4435

384 rows × 11 columns

Code
sns.clustermap(
    results_multi.corr(),
    cmap="RdBu_r",
    annot=True,
    fmt=".2f",
    annot_kws={"size": 8},
    figsize=(8, 8),
)

Code
famulti = FactorAnalyzer(rotation="oblimin", n_factors=results_multi.shape[1])
famulti.fit(results_multi.fillna(0))

ev, v = famulti.get_eigenvalues()
ev = pd.Series(ev)
ev.iloc[:10].plot(grid=True, style=".-")

Code
apmulti = AffinityPropagation().fit_predict(results_multi.corr())
apmulti = pd.Series(dict(zip(results_multi.columns.values, apmulti)), name="Index Type")
apmulti.nunique()
4
Code
silhouette_score(results_multi.corr(), apmulti)
0.604091489960529
Code
apmulti.sort_values()
GlobalDistortion          0
MultiDissim               1
MultiDivergence           1
MultiGini                 1
MultiInfoTheory           1
MultiNormExposure         1
MultiRelativeDiversity    1
MultiSquaredCoefVar       1
SimpsonsConcentration     2
MultiDiversity            3
SimpsonsInteraction       3
Name: Index Type, dtype: int64
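
As with the single-group indices, we can cross-check the affinity propagation solution with hierarchical clustering; the sketch below mirrors the earlier analysis but uses four clusters to match.

Code
# cross-check the four-cluster affinity propagation solution
multi_labels = AgglomerativeClustering(n_clusters=4).fit_predict(results_multi.corr())
multi_labels = pd.Series(dict(zip(results_multi.columns.values, multi_labels)))
silhouette_score(results_multi.corr(), multi_labels)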

17.3 Choosing a Segregation Measure

In practice, we have seen that the segregation package can compute more than two dozen indices, but many of them capture the same concept. This raises a simple question: which index is most appropriate for studying, for example, racial and ethnic segregation in a given city? There is no single answer to that question, but the literature does offer good suggestions about sensible defaults. In particular, Reardon & Firebaugh (2002) examine the mathematical properties of many multigroup measures, arguing that the information theory index (\(H\)) and the isolation index (\(I\)) have the most attractive features for capturing evenness and exposure, respectively. In the next section, we examine spatial segregation indices, which provide more explicit avenues for assessing centralization, concentration, and clustering.
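
As a concrete illustration of those defaults, the sketch below computes the multigroup information theory index and a single-group isolation index for the Southern California data. The Isolation import is an addition here; it lives in segregation.singlegroup but is not used elsewhere in this chapter.

Code
# sensible defaults following Reardon & Firebaugh (2002): information theory (H)
# for evenness and isolation for exposure
from segregation.singlegroup import Isolation

h_socal = MultiInfoTheory(socal, pop_groups)
iso_hisp = Isolation(socal, "n_hispanic_persons", "n_total_pop")

print(f"Information theory (H): {h_socal.statistic:.4f}")
print(f"Isolation (Hispanic): {iso_hisp.statistic:.4f}")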