[[16]{.chapter-number}  [Segregation Measures]{.chapter-title}]{#sec-segregation .quarto-section-identifier}

16 Segregation Measures

Code

import os

import geopandas as gpd
import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd
import seaborn as sns
import contextily as ctx
from factor_analyzer import FactorAnalyzer
from geosnap import DataStore
from geosnap import io as gio
from networkx.drawing.nx_agraph import graphviz_layout
from segregation.batch import batch_compute_multigroup, batch_compute_singlegroup
from segregation.local import LocalDistortion, MultiLocationQuotient
from segregation.multigroup import MultiInfoTheory, MultiGini, MultiDiversity
from segregation.singlegroup import Dissim, Gini, Entropy
from sklearn.cluster import AffinityPropagation, AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score
from scipy.stats import zscore

%load_ext watermark
%watermark -iv -a "eli knaap"

The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark
Author: eli knaap

networkx       : 3.4.2
factor_analyzer: 0.5.1
segregation    : 2.5.2.dev4+gea53e4c.d20250128
geosnap        : 0.14.1.dev14+g0443e2a.d20250103
sklearn        : 1.6.0
contextily     : 1.6.2
pandas         : 2.2.3
seaborn        : 0.13.2
scipy          : 1.14.1
geopandas      : 1.0.1
matplotlib     : 3.10.0

Segregation is in many ways the core of urban inequality research; any discussion of urban analytics is incomplete without decent coverage of segregation measurement. One of the most obvious and long-studied avenues for creating inequality between groups is to partition access to different resources. This is also one of the most important legacies of institutionalized racism in the United States, where the unequal distribution of public resources has repeatedly been shown to be illegal. In other words, there is an important legacy of urban inequality research closely related to legal challenges against federal policy; if policies governing public resources (like education, housing, transportation infrastructure, clean air and water, etc.) result in segregation (for some protected class), then they are inherently unequal (and ultimately unconstitutional, thanks to the equal protection clause).

Segregation of white and colored children in public schools has a detrimental effect upon the colored children. The impact is greater when it has the sanction of the law, for the policy of separating the races is usually interpreted as denoting the inferiority of the Negro group… Any language in contrary to this finding is rejected. We conclude that in the field of public education the doctrine of ‘separate but equal’ has no place. Separate educational facilities are inherently unequal.

— Warren (1954)

In concept, segregation is about separation; when we measure residential segregation, we are asking whether people belonging to different groups share the same space, often conceived as the same ‘neighborhood’. This is a more ambiguous task than measuring, for example, educational segregation, where the shared resource such as schools are very well-defined. Residential space can be measured at the scale of the room, housing unit, building, collection of buildings, neighborhood, city, region, and all the way on (we explore this variable concept of scale in the next section).

Thus in the case of residential segregation we are forced to rely on the fuzzy notion of neighborhoods. In practice, this means that researchers simply adopt the tract or blockgroup as a placeholder for the neighborhood, then examine segregation as defined by these units. Here, we’ll use PySAL’s segregation module to analyze residential segregation by race and ethnicity in Southern California, and we begin by collecting data for the entire region, then partitioning it into the coastal and inland sections.

Code

datasets = DataStore()

socal = gio.get_acs(
    datasets,
    county_fips=["06037", "06025", "06059", "06071", "06073", "06065", "06111"],
    years=[2018],
)
socal = socal.to_crs(socal.estimate_utm_crs())
socal[["p_hispanic_persons", "geometry"]].assign(geometry=socal.geometry.simplify(100)).explore(
    column="p_hispanic_persons",
    scheme="quantiles",
    cmap="Blues",
    k=8,
    tooltip=["p_hispanic_persons"],
    style_kwds={"weight": 0.5},
    tiles='CartoDB Positron'
)
socal["county"] = socal.geoid.str[:5]
county_names = [
    "Imperial",
    "Los Angeles",
    "Orange",
    "Riverside",
    "San Bernadino",
    "San Diego",
    "Ventura",
]
county_fips = ["06025", "06037", "06059", "06065", "06071", "06073", "06111"]
namer = dict(zip(county_fips, county_names))
socal['county'] = socal.county.replace(to_replace=namer)

coastal = socal[socal.county.isin(["Los Angeles", "Orange", "San Diego", "Ventura"])]
inland = socal[socal.county.isin(['Riverside', "San Bernadino", "Imperial"])]

f, ax = plt.subplots(1,2, figsize=(10, 5))
coastal.plot(column='county', ax=ax[0])
inland.plot(column='county', ax=ax[1])

/Users/knaaptime/Dropbox/projects/geosnap/geosnap/io/constructors.py:217: UserWarning: Currency columns unavailable at this resolution; not adjusting for inflation
  warn(

16.1 Residential Segregation Measures

The segregation package calculates dozens of segregation indices, each of which captures something different about the ways that population groups interact or remain separated in space. Most of the commonly-used statistics are global or aggregate measures, meaning they summarize the total level of segregation across all units in a study region. For an overview of segregation measure nomenclature, formulae, and canonical citations, see the tables at the end of Grannis (2002).

16.1.1 Single-Group Indices

Single-group indices measure the partitioning of one population group relative to everyone else. Early segregation work in the U.S. tends to focus on Black-white segregation, but it is also common to see work focused on a particular minority population versus the rest of the population (e.g. Black vs all other groups). To generate a single-group measure using the segregation package, you pass a dataframe holding population counts for each geographic unit, and the names of the columns for the focal population and the reference population (i.e. the minority the total population).

Code

dissim_hisp = Dissim(socal, "n_hispanic_persons", "n_total_pop")
dissim_black = Dissim(socal, "n_nonhisp_black_persons", "n_total_pop")

gini_hisp = Gini(socal, "n_hispanic_persons", "n_total_pop")
gini_black = Gini(socal, "n_nonhisp_black_persons", "n_total_pop")

entropy_hisp = Entropy(socal, "n_hispanic_persons", "n_total_pop")
entropy_black = Entropy(socal, "n_nonhisp_black_persons", "n_total_pop")

Here, we fit three segregation measures: Dissimilarity (\(D\)), Gini (\(G\)), and Entropy (also called the Information Theory index, sometimes denoted [Thiel’s] \(H\)) for the Black population and the Hispanic/Latino populations in the southern California region. Each class has a statistic attribute that holds the computed value for each segregation measure.

Code

dissim_hisp.statistic

0.49957776952346783

Code

dissim_black.statistic

0.547197680270968

Code

gini_hisp.statistic

0.6602166788700566

Code

gini_black.statistic

0.7234615852802052

Code

entropy_hisp.statistic

0.2714618709533524

Code

entropy_black.statistic

0.2616509031724341

According to the Dissimilarity and Gini indices, the black population in southern California is more segregated than the Latinx/Hispanic population, but the reverse is true according to the Entropy index.

16.1.1.1 Batch Computation

To examine several indices at once, segregation provides a set of “batch_compute” functions. Instead of a fitted Class, the batch_compute_singlegroup function returns a table of segregation indices and is a convenient way of collecting many statistics simultaneously.

Code

socal_all_singlegroup = batch_compute_singlegroup(socal, "n_hispanic_persons", "n_total_pop")
socal_all_singlegroup

	Statistic
Name
AbsoluteCentralization	0.7737
AbsoluteClustering	0.2704
AbsoluteConcentration	0.6548
Atkinson	0.3726
BiasCorrectedDissim	0.4993
BoundarySpatialDissim	NaN
ConProf	0.4794
CorrelationR	0.3272
Delta	0.8843
DensityCorrectedDissim	0.3856
Dissim	0.4996
DistanceDecayInteraction	0.4788
DistanceDecayIsolation	0.5741
Entropy	0.2715
Gini	0.6602
Interaction	0.3723
Isolation	0.6277
MinMax	0.6663
ModifiedDissim	0.4901
ModifiedGini	0.6509
PARDissim	0.4808
RelativeCentralization	0.0739
RelativeClustering	0.0667
RelativeConcentration	0.3460
SpatialDissim	0.3593
SpatialProxProf	0.8412
SpatialProximity	1.1710

16.1.2 Multi-group Indices

Multi-group measures capture the partitioning of several population groups simultaneously. Most multi-group measures are extensions of single-group measures and have a more recent history in the literature (Reardon & Firebaugh, 2002).

Code

pop_groups = ['n_asian_persons', 'n_hispanic_persons', 'n_nonhisp_black_persons', 'n_nonhisp_white_persons']

multi_div_coast = MultiDiversity(coastal, pop_groups)
multi_div_inland = MultiDiversity(inland, pop_groups)


multi_info_coast = MultiInfoTheory(coastal, pop_groups)
multi_info_inland = MultiInfoTheory(inland, pop_groups)

For multigroup diversity:

Code

print(f"coast: {multi_div_coast.statistic}")
print(f"inland: {multi_div_inland.statistic}")

coast: 1.1839984141407496
inland: 1.06835444467478

for multigroup information theory:

Code

print(f"coast: {multi_info_coast.statistic}")
print(f"inland: {multi_info_inland.statistic}")

coast: 0.31802561870773133
inland: 0.20407072911853946

Regardless which index is used, multigroup segregation is higher in the coastal region than the inland one

16.1.2.1 Batch Computation

Again, the measures can be “batch computed”

Code

socal_all_multigroup = batch_compute_multigroup(socal, groups=pop_groups)

Code

socal_all_multigroup

	Statistic
Name
GlobalDistortion	335.5186
MultiDissim	0.5131
MultiDivergence	0.3498
MultiDiversity	1.1664
MultiGini	0.6766
MultiInfoTheory	0.2999
MultiNormExposure	0.3048
MultiRelativeDiversity	0.2955
MultiSquaredCoefVar	0.2508
SimpsonsConcentration	0.3517
SimpsonsInteraction	0.6483

16.1.3 Local Segregation Measures

Unlike global measures, local segregation statistics measure segregation in each geographic unit rather than summarizing segregation across the region. For example the recently proposed Distortion index is designed to visualize how segregation changes over a region (Bézenac et al., 2022; Olteanu et al., 2019).

The use of trajectory convergence analysis provides a flexible way for capturing change across all scales from small spatial units and how the rate of convergence to the citywide average modifies over space. Thus, the method provides an analysis of how far, in spatial terms, any individual or neighborhood is from the citywide multigroup distribution.

Code

d = LocalDistortion(socal, groups=pop_groups)
ax = d.data.plot('distortion',  scheme='quantiles', cmap='RdBu_r', alpha=0.6, )
ctx.add_basemap(ax=ax, crs=socal.crs)

Code

d.data.assign(geometry=d.data.geometry.simplify(100)).explore(
    "distortion",
    cmap="RdBu_r",
    style_kwds={"weight": 0.5},
    scheme="quantiles",
    tiles="CartoDB Positron",
)

Make this Notebook Trusted to load map: File -> Trust Notebook

16.2 The Dimensions of Residential Segregation

With more than 20 segregation indices at our disposal, it can be confusing to understand which measure is ‘best’ or whether multiple indices really capture slightly different versions of ‘the same thing’. A seminal contribution in segregation measurement is given by Massey & Denton (1988) who were the first to examine quantitatively the wide variety of segregation indexing strategies and the information each provides. Their paper is important first because of the sheer volume of work: in the late 80s, it was seriously laborious to gather data for a large number of metropolitan regions and compute a dozen segregation indices in each one; the paper’s second major contribution is the way it helped clarify the relationships among various indices that had been proposed over the years.

Like personality theory and item-response theory in psychology, Massey and Denton proposed that there were multiple ways to measure segregation that capture different concepts of the term (i.e. unevenness versus clustering)–but there probably are not 20 different concepts like there are indices in use. Instead, those two dozen indices are different ways of capturing the five or so dimensions of segregation that really exist (they argue). This was a dramatic step forward in understanding what we’re actually measuring when we study residential segregation. To conduct their analysis, Massey and colleagues gathered metropolitan data from around the country and computed every segregation index they could manage, then they ran a factor analysis on the set of results.

Assessing the outcome, they argued that segregation indices could be divided into five categories: evenness, exposure, concentration, centralization, and clustering.

Minority members may be distributed so that they are overrepresented in some areas and underrepresented in others, varying on the characteristic of evenness. They may be distributed so that their exposure to majority members is limited by virtue of rarely sharing a neighborhood with them. They may be spatially concentrated within a very small area, occupying less physical space than majority members. They may be spatially centralized, congregating around the urban core, and occupying a more central location than the majority. Finally, areas of minority settlement may be tightly clustered to form one large contiguous enclave, or be scattered widely around the urban area. (Massey & Denton, 1988, p. 283)

Following Massey & Denton (1988), Massey et al. (1996), and Massey (2012) we can look at dimensionality in segregation measures by computing two-group indices for Black, Hispanic, and Asian groups (vs all others) for every metro region in the country. Here we will use the 2021 ACS data at the blockgroup level, and use a 2km radius for computing generalized spatial measures, which helps account for heterogeneity in BG size

Code

multi_groups = [
    "n_nonhisp_white_persons",
    "n_nonhisp_black_persons",
    "n_hispanic_persons",
    "n_asian_persons",
]
singles = ["n_nonhisp_black_persons", "n_hispanic_persons", "n_asian_persons"]

datasets = DataStore()
msas = datasets.msas()
msas = msas[msas["type"].str.startswith("Metro")]  # only metros not micros
msa_fips = msas.geoid.values

16.2.1 Two-Group Measures

To recreate Massey’s results, we will batch compute every segregation index available in a for-loop using data for every metropolitan region in the country. This gives us the raw data we need to ask whether the indices can be collapsed into a smaller set of factors.

Code

for group in singles:
    if not os.path.exists(f"../data/{group}_measures.csv"):
        dfs = []
        for metro in msa_fips:
            try:
                # get blockgroup-level data for the MSA
                df = gio.get_acs(datasets, msa_fips=metro, level="bg", years=[2021])
                # create a temporary 'total' population of white/other
                df["temp_total"] = df[group] + df["n_nonhisp_white_persons"]
                df = df.dropna(subset=["geometry"]).to_crs(df.estimate_utm_crs())
                # compute all seg measures with a 2km neighborhood
                seg = batch_compute_singlegroup(
                    df, group_pop_var=group, total_pop_var="temp_total", distance=2000
                )
                dfs.append(seg.Statistic.rename(metro))
            except Exception as e:  # PR will fail
                print(e)
                df = None
                pass
        results = pd.concat(dfs, axis=1).T
        results.to_csv(f"../data/{group}_measures.csv")
results = pd.concat(
    [pd.read_csv(f"../data/{group}_measures.csv", index_col=0, converters={0:str}) for group in singles]
)

results

	AbsoluteCentralization	AbsoluteClustering	AbsoluteConcentration	Atkinson	BiasCorrectedDissim	BoundarySpatialDissim	ConProf	CorrelationR	Delta	DensityCorrectedDissim	...	MinMax	ModifiedDissim	ModifiedGini	PARDissim	RelativeCentralization	RelativeClustering	RelativeConcentration	SpatialDissim	SpatialProxProf	SpatialProximity
10180	0.9042	0.2060	0.9826	0.1854	0.2700	0.4386	0.0912	0.0711	0.9389	0.2354	...	0.4268	0.2526	0.3842	0.5578	0.2732	2.8646	0.8783	0.4294	0.3103	1.1767
10420	0.6662	0.1589	0.9315	0.3920	0.5184	0.5253	0.3129	0.2655	0.7895	0.2338	...	0.6831	0.5113	0.6634	0.6148	0.3112	1.2762	0.6891	0.5175	0.3230	1.1463
10500	0.7307	0.1498	0.7150	0.4416	0.5449	0.3733	0.6500	0.3603	0.7642	0.3928	...	0.7058	0.5366	0.7039	0.5224	0.3250	-0.3211	0.6843	0.3549	1.2321	1.1838
10740	0.9356	0.0814	0.9715	0.1645	0.3076	NaN	0.0594	0.0435	0.9527	0.3065	...	0.4720	0.2865	0.4018	0.5727	0.1409	1.8258	0.5618	0.5001	0.1721	1.0728
10780	0.8175	0.2732	0.8932	0.4960	0.5688	0.4437	0.4854	0.3902	0.8486	0.3596	...	0.7253	0.5609	0.7229	0.6129	0.4747	0.6398	0.7614	0.4309	0.6865	1.2401
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
49420	0.8819	0.1295	0.9884	0.3293	0.4473	0.6066	0.0821	0.0729	0.9325	0.4331	...	0.6193	0.4248	0.5729	0.6500	0.3188	5.3510	0.7655	0.6027	0.1564	1.1197
49620	0.5282	0.0403	0.9106	0.2805	0.3691	0.6393	0.0256	0.0222	0.8133	0.3260	...	0.5422	0.3401	0.4777	0.6631	0.2947	3.7648	0.4683	0.6369	0.0929	1.0365
49660	0.5693	0.0525	0.9268	0.4412	0.4899	0.7501	0.0290	0.0272	0.8686	0.4154	...	0.6617	0.4498	0.6161	0.7616	0.1188	9.3811	0.4258	0.7487	0.0423	1.0503
49700	0.8876	0.2421	0.9240	0.1872	0.3156	0.3531	0.1736	0.1205	0.8744	0.2860	...	0.4801	0.3039	0.4390	0.4757	0.1219	1.8433	0.7137	0.3512	0.3706	1.1877
49740	0.9686	0.0656	0.9851	0.2637	0.3569	0.5350	0.0465	0.0397	0.9611	0.3272	...	0.5278	0.3238	0.4655	0.5713	0.2100	2.1184	0.7531	0.5346	0.1533	1.0529

906 rows × 27 columns

The results dataframe now stores every segregation index for every metropolitan region in the country, with each row representing a different metro region and each column storing a different segregation index. As such, we could use it to quickly glance at the most segregated places in the country in 2021. According to the Gini index, the top ten were essentially all in the South or the Midwest.

Code

results.index = results.index.astype(str)
results[["Gini"]].sort_values("Gini", ascending=False).merge(
    msas, left_index=True, right_on="geoid"
).head(10)[["Gini", "name"]]

	Gini	name
176	0.8751	Decatur, AL
278	0.8516	Chicago-Naperville-Elgin, IL-IN-WI
567	0.8500	Milwaukee-Waukesha, WI
29	0.8387	Beckley, WV
186	0.8327	Detroit-Warren-Dearborn, MI
739	0.8317	Rocky Mount, NC
371	0.8245	Hammond, LA
2	0.8240	Atlantic City-Hammonton, NJ
577	0.8199	Monroe, LA
87	0.8197	Cleveland-Elyria, OH

One of the easiest ways to understand the dimensionality in the dataset is a heatmap of the variable correlation matrix, which we can do using the clustermap function from seaborn. A good way to pick out groups of variables is to scan the diagonal for blocks of red.

Code

sns.clustermap(
    results.corr(),
    cmap="RdBu_r",
    annot=True,
    fmt=".2f",
    figsize=(10, 10),
    annot_kws={"size": 6},
)

The large contiguous blocks are groups of variables that are highly intercorrelated, and capture largely the “same thing”. Whether the underlying concept being measured is a factor or a component is a matter of debate. In the segregation context, these different indices are probably better treated as components that capture slightly different measurements of the same construct, rather than manifest outcomes of some underlying latent process of, e.g. “unevenness segregation”. That is, the Gini and Dissimilarity indices are probably different ways of measuring ‘unevenness’, as opposed to a social process of unevenness that manifests as these two different variables, “gini” and “dissimilarity”… As Rees (1971) describes, factor analysis applied to urban data more often adopts the alternative: “more modest is the claim that components or factors represent concise descriptions of patterns of associations of attributes across observations.”

All the same, the Massey and Denton approach that treats dimensions as factors is a useful way of tackling the problem, and we can adopt the factor view to recreate their analysis.

16.2.2 Factor Analysis

According to the documentation for the R psych (Revelle, 2024):

Factor analysis is an attempt to approximate a correlation or covariance matrix with one of lesser rank. The basic model is that \(_nR_n \approx _n F_{kk}F'_n + U^2\) where \(k\) is much less than \(n\). There are many ways to do factor analysis, and maximum likelihood procedures are probably the most commonly preferred (see factanal )… If variables are thought to represent a “true” or latent part then factor analysis provides an estimate of the correlations with the latent factor(s) representing the data. If variables are thought to be measured without error, then principal components provides the most parsimonious description of the data.

To assess the number of factors we need to extract, it is common to begin exploratory factor analyses with a ‘scree plot’. Here, we fit the same number of factors as variables in the data, then plot the amount of variance accounted by each factor. Moving to the right along the x-axis, we are looking for an “elbow” in the plot that shows where estimating additional factors does not explain much more variance.

Code

# first look for n_factors)
fa = FactorAnalyzer(rotation="oblimin", n_factors=results.shape[1])
fa.fit(results.fillna(0))
ev, v = fa.get_eigenvalues()
ev = pd.Series(ev)
ev.iloc[:10].plot(grid=True, style=".-", figsize=(6,6))

/Users/knaaptime/miniforge3/envs/urban_analysis/lib/python3.12/site-packages/sklearn/utils/deprecation.py:151: FutureWarning: 'force_all_finite' was renamed to 'ensure_all_finite' in 1.6 and will be removed in 1.8.
  warnings.warn(

The scree plot suggests an albow at about 3, maybe 4. Revelle says do not use the eigenvalue >1 rule to determine n_factors

Code

fa = FactorAnalyzer(rotation="oblimin", n_factors=5)
fa.fit(results)
factors = pd.DataFrame.from_records(
    fa.loadings_, index=results.columns, columns=["F1", "F2", "F3", "F4", "F5"]
)

/Users/knaaptime/miniforge3/envs/urban_analysis/lib/python3.12/site-packages/sklearn/utils/deprecation.py:151: FutureWarning: 'force_all_finite' was renamed to 'ensure_all_finite' in 1.6 and will be removed in 1.8.
  warnings.warn(

loadings less than .1 are considered unimportant (R and others suppress them). Revelle says ignore less than .3.

We will follow Massey and Denton and the clustering results above, and shoot for 5 factors

Code

factors = factors.mask(factors < 0.3)
factors

	F1	F2	F3	F4	F5
AbsoluteCentralization	NaN	NaN	NaN	NaN	0.717832
AbsoluteClustering	NaN	NaN	0.695828	0.475336	NaN
AbsoluteConcentration	NaN	NaN	NaN	NaN	0.314571
Atkinson	0.927950	NaN	NaN	NaN	NaN
BiasCorrectedDissim	0.975286	NaN	NaN	NaN	NaN
BoundarySpatialDissim	NaN	0.866075	NaN	NaN	NaN
ConProf	0.329183	NaN	0.560560	NaN	NaN
CorrelationR	0.481336	NaN	0.670719	NaN	NaN
Delta	NaN	0.626410	NaN	NaN	0.424750
DensityCorrectedDissim	0.526844	NaN	NaN	NaN	NaN
Dissim	0.974563	NaN	NaN	NaN	NaN
DistanceDecayInteraction	NaN	NaN	NaN	NaN	NaN
DistanceDecayIsolation	NaN	NaN	0.526268	0.528260	NaN
Entropy	0.815438	NaN	0.377583	NaN	NaN
Gini	0.971297	NaN	NaN	NaN	NaN
Interaction	NaN	NaN	NaN	NaN	NaN
Isolation	NaN	NaN	0.516079	0.360120	NaN
MinMax	0.974450	NaN	NaN	NaN	NaN
ModifiedDissim	0.968980	NaN	NaN	NaN	NaN
ModifiedGini	0.972560	NaN	NaN	NaN	NaN
PARDissim	0.311544	0.859431	NaN	NaN	NaN
RelativeCentralization	0.314008	NaN	NaN	NaN	0.725198
RelativeClustering	NaN	0.780695	NaN	NaN	NaN
RelativeConcentration	NaN	NaN	NaN	NaN	0.588542
SpatialDissim	NaN	0.857307	NaN	NaN	NaN
SpatialProxProf	NaN	NaN	NaN	0.864273	NaN
SpatialProximity	NaN	NaN	0.876189	NaN	NaN

The latent factors are on the columns and the loadings for each variable are on the rows

Code

for f in factors.columns:
    print(f"{f}:\n",factors.dropna(subset=[f])[f].sort_values(ascending=False),"\n")

F1:
 BiasCorrectedDissim       0.975286
Dissim                    0.974563
MinMax                    0.974450
ModifiedGini              0.972560
Gini                      0.971297
ModifiedDissim            0.968980
Atkinson                  0.927950
Entropy                   0.815438
DensityCorrectedDissim    0.526844
CorrelationR              0.481336
ConProf                   0.329183
RelativeCentralization    0.314008
PARDissim                 0.311544
Name: F1, dtype: float64 

F2:
 BoundarySpatialDissim    0.866075
PARDissim                0.859431
SpatialDissim            0.857307
RelativeClustering       0.780695
Delta                    0.626410
Name: F2, dtype: float64 

F3:
 SpatialProximity          0.876189
AbsoluteClustering        0.695828
CorrelationR              0.670719
ConProf                   0.560560
DistanceDecayIsolation    0.526268
Isolation                 0.516079
Entropy                   0.377583
Name: F3, dtype: float64 

F4:
 SpatialProxProf           0.864273
DistanceDecayIsolation    0.528260
AbsoluteClustering        0.475336
Isolation                 0.360120
Name: F4, dtype: float64 

F5:
 RelativeCentralization    0.725198
AbsoluteCentralization    0.717832
RelativeConcentration     0.588542
Delta                     0.424750
AbsoluteConcentration     0.314571
Name: F5, dtype: float64

first factor loads heavily on dissim, gini, minmax, entropy (evenness)

f2 loads on isolation and clustering, proximity (clustering)

this solution looks like there’s probably closer to 3 factors… 3 and 4 are almost collinear

Code

fa.phi_

array([[ 1.        ,  0.22549582,  0.04086467,  0.11374998,  0.31756587],
       [ 0.22549582,  1.        ,  0.09986776, -0.3944805 , -0.363344  ],
       [ 0.04086467,  0.09986776,  1.        , -0.03852471,  0.26115788],
       [ 0.11374998, -0.3944805 , -0.03852471,  1.        ,  0.46093543],
       [ 0.31756587, -0.363344  ,  0.26115788,  0.46093543,  1.        ]])

the phi_ attribute stores the factor correlation matrix, which looks pretty reasonable here. There is some correlation among the factors (as expected, since we used an oblique transform, intentionally), but nothing that looks overlapping. In other words, while the factors look like they may be related, each one is capturing unique information as well

A common visualization technique for factor analysis is to create path diagrams which show the relationship between input variables and the latent factors. The factor analysis ecosystem is a lot less developed in Python than it is in R, but we can recreate a typical visualization using networkx and pygraphviz, (specifically, by slightly tweaking the dot layout). First, we convert the factor loadings to a networkX network object, where the factors and variables are nodes in a hierarchical directed network, and the factor loadings represent the edge weights between factors and variables, then we use the graphviz dot layout to draw the graph

Code

G2 = nx.from_pandas_edgelist(
    factors.T.stack().rename("weight").reset_index().round(3),
    source="level_0",
    target="level_1",
    edge_attr="weight",
    edge_key="weight",
    create_using=nx.DiGraph,
)

f, ax = plt.subplots(figsize=(13, 15))

pos = graphviz_layout(G2, prog="dot", args='-Grankdir="LR"')
nx.draw_networkx(
    G2,
    pos=pos,
    with_labels=True,
    ax=ax,
    edge_cmap=plt.cm.Reds,
    edge_color=factors.T.stack().values,
    node_size=500,
    width=3,
    arrowsize=14,
)
labels = nx.get_edge_attributes(G2, "weight")
nx.draw_networkx_edge_labels(G2, pos, edge_labels=labels)
ax.margins(0.1, None)  # add some horizontal space to fit labels
ax.axis('off')

This plot shows the relationship between factors and variables as a kind of network graph. In this case, each of the factors is displayed as a node on the lefthand side and arrows point to each of the variables that load strongly onto the factor (with color and label denoting intensity). The direction of the arrow is important as it implies that the measured variables are partial outcomes from the latent factor.

This graphic can also be done in pure networkx (i.e. without pygraphviz) using a multipartite layout, but the result is not as nice because it doesnt group the manifest variables to avoid line crossings, and the latent variables all have the same spacing, which increases overlap

Code

f, ax = plt.subplots(figsize=(13, 14))

for layer, nodes in enumerate(reversed(tuple(nx.topological_generations(G2)))):
    # `multipartite_layout` expects the layer as a node attribute, so add the
    # numeric layer value as a node attribute
    for node in nodes:
        G2.nodes[node]["layer"] = layer

# Compute the multipartite_layout using the "layer" node attribute
pos = nx.multipartite_layout(
    G2,
    subset_key="layer",
    align="vertical",
)
nx.draw_networkx(
    G2,
    pos=pos,
    ax=ax,
    edge_cmap=plt.cm.Reds,
    edge_color=factors.T.stack().values,
    node_size=500,
    width=3,
    arrowsize=14,
)
nx.draw_networkx_edge_labels(G2, pos, edge_labels=labels)
ax.margins(0.1, None)  # add some horizontal space to fit labels
ax.axis('off')

(-0.027222222222222224,
 0.08277777777777777,
 -1.2100000000000002,
 1.2100000000000002)

16.2.3 Cluster Analysis

Instead of factor analysis, we could alternatively use cluster analysis to try and group segregation indices into categories.

Affinity propagation is a clustering algorithm where k is endogenous so it works well for exploratory analysis when we are not sure about the number of clusters. If we fit a cluster model to the correlation matrix of segregation indices, we’re looking for groups of variables that capture the same dimension. This works for our purposes here because we don’t really care about the factor loadings or measuring the latent construct per se (the segregation measures themselves are preferable for that). Instead, we’re asking whether these indices are providing unique information, and how many unique dimensions should we consider?

Since cluster assignments are discrete, and we think of cluster assignments as ‘best’ when they are unambiguous (i.e. clusters are well-separated and each observation belongs to only one cluster), this is like treating the dimensions as orthogonal in factor analysis… If we think the factors are correlated (oblique rotated), then the clusters wouldn’t be well-separated.

Code

ap = AffinityPropagation().fit_predict(results.corr())
ap = pd.Series(dict(zip(results.columns.values, ap)), name="Index Type")
ap.nunique()

Code

silhouette_score(results.corr(), ap)

0.6031665663470883

Code

ap.sort_values()

SpatialProximity            0
SpatialProxProf             0
DistanceDecayIsolation      0
CorrelationR                0
ConProf                     0
Isolation                   0
AbsoluteClustering          0
RelativeClustering          1
SpatialDissim               1
DensityCorrectedDissim      1
BoundarySpatialDissim       1
PARDissim                   1
Interaction                 2
DistanceDecayInteraction    2
AbsoluteConcentration       2
ModifiedGini                3
ModifiedDissim              3
Entropy                     3
Gini                        3
Dissim                      3
BiasCorrectedDissim         3
Atkinson                    3
MinMax                      3
Delta                       4
RelativeCentralization      4
RelativeConcentration       4
AbsoluteCentralization      4
Name: Index Type, dtype: int64

Code

clust_labels = AgglomerativeClustering(n_clusters=5).fit_predict(results.corr())
clust_labels = pd.Series(dict(zip(results.columns.values, clust_labels)))
clust_labels.sort_values()

SpatialProximity            0
AbsoluteClustering          0
ConProf                     0
CorrelationR                0
Isolation                   0
DistanceDecayIsolation      0
SpatialProxProf             0
AbsoluteConcentration       1
Interaction                 1
DistanceDecayInteraction    1
ModifiedGini                2
ModifiedDissim              2
MinMax                      2
Gini                        2
Entropy                     2
BiasCorrectedDissim         2
Atkinson                    2
Dissim                      2
Delta                       3
RelativeCentralization      3
RelativeConcentration       3
AbsoluteCentralization      3
DensityCorrectedDissim      4
BoundarySpatialDissim       4
PARDissim                   4
RelativeClustering          4
SpatialDissim               4
dtype: int64

Code

silhouette_score(results.corr(), clust_labels)

0.6031665663470883

The interesting thing is how well these results map onto Massey and Denton’s originals, despite the idea that clustering and exposure make more sense collapsed into a single category (that concept seems even more pronounced in these results, since isolation and interaction are basically inverses, but end up in different categories)

Code

clust_labels = AgglomerativeClustering(n_clusters=3).fit_predict(results.corr())
clust_labels = pd.Series(dict(zip(results.columns.values, clust_labels)))
silhouette_score(results.corr(), clust_labels)

0.5582547485763226

Code

clust_labels.sort_values()

AbsoluteCentralization      0
SpatialDissim               0
AbsoluteConcentration       0
RelativeConcentration       0
RelativeClustering          0
BoundarySpatialDissim       0
RelativeCentralization      0
PARDissim                   0
Delta                       0
DensityCorrectedDissim      0
DistanceDecayInteraction    0
Interaction                 0
Isolation                   1
SpatialProximity            1
DistanceDecayIsolation      1
CorrelationR                1
ConProf                     1
AbsoluteClustering          1
SpatialProxProf             1
Gini                        2
Dissim                      2
MinMax                      2
ModifiedDissim              2
ModifiedGini                2
BiasCorrectedDissim         2
Atkinson                    2
Entropy                     2
dtype: int64

Code

klabels = KMeans(n_clusters=5).fit_predict(results.corr())

Code

klabels = pd.Series(dict(zip(results.columns.values, klabels)))
silhouette_score(results.corr(), klabels.values)

0.6031665663470883

The clustering algorithms are all really stable in their assignments. When you tune hierarchical and kmeans using the optimal silhouette score, they agree on the exact assignments in the 5 cluster solution

16.2.4 Multi-Group Measures

Following, we can extend Massey et al’s analysis to multigroup segregation indices

Code

if not os.path.exists(f"../data/multigroup_measures.csv"):
    dfs = []
    for metro in msa_fips:
        try:
            df = gio.get_acs(datasets, msa_fips=metro, level="bg", years=[2021])
            df = df.dropna(subset=["geometry"]).to_crs(df.estimate_utm_crs())
            seg = batch_compute_multigroup(df, groups=multi_groups, distance=2000)
            dfs.append(seg.Statistic.rename(metro))
        except Exception as e:  # PR will fail
            print(e)
            df = None
            pass
    results = pd.concat(dfs, axis=1).T
    results.to_csv(f"../data/multigroup_measures.csv")

Code

results_multi = pd.read_csv("../data/multigroup_measures.csv", index_col=0)
results_multi

	GlobalDistortion	MultiDissim	MultiDivergence	MultiDiversity	MultiGini	MultiInfoTheory	MultiNormExposure	MultiRelativeDiversity	MultiSquaredCoefVar	SimpsonsConcentration	SimpsonsInteraction
10180	3.0591	0.2673	0.0827	0.9546	0.3718	0.0867	0.0894	0.0847	0.0570	0.4510	0.5490
10420	17.1550	0.4556	0.1575	0.7908	0.6000	0.1992	0.2183	0.2032	0.1079	0.5714	0.4286
10500	6.0888	0.5272	0.2333	0.8154	0.6913	0.2861	0.3310	0.3163	0.1514	0.5043	0.4957
10540	0.7949	0.2326	0.0314	0.4983	0.3146	0.0630	0.0384	0.0369	0.0194	0.7323	0.2677
10580	14.5605	0.3502	0.1169	0.9452	0.4758	0.1236	0.1384	0.1170	0.0726	0.5073	0.4927
...	...	...	...	...	...	...	...	...	...	...	...
49420	8.1606	0.4097	0.1384	0.8083	0.5436	0.1713	0.2149	0.2098	0.0891	0.4848	0.5152
49620	10.1244	0.4475	0.1525	0.9006	0.5436	0.1694	0.1945	0.1630	0.0899	0.5057	0.4943
49660	12.2645	0.5026	0.1643	0.6844	0.6598	0.2400	0.2562	0.2309	0.1201	0.6336	0.3664
49700	2.2137	0.2239	0.0788	1.0918	0.3186	0.0721	0.0652	0.0652	0.0552	0.3738	0.6262
49740	8.5083	0.3735	0.1387	0.7294	0.5491	0.1901	0.2086	0.2006	0.0832	0.5565	0.4435

384 rows × 11 columns

Code

sns.clustermap(
    results_multi.corr(),
    cmap="RdBu_r",
    annot=True,
    fmt=".2f",
    annot_kws={"size": 8},
    figsize=(8, 8),
)

Code

famulti = FactorAnalyzer(rotation="oblimin", n_factors=results_multi.shape[1])
famulti.fit(results_multi.fillna(0))

ev, v = famulti.get_eigenvalues()
ev = pd.Series(ev)
ev.iloc[:10].plot(grid=True, style=".-")

/Users/knaaptime/miniforge3/envs/urban_analysis/lib/python3.12/site-packages/sklearn/utils/deprecation.py:151: FutureWarning: 'force_all_finite' was renamed to 'ensure_all_finite' in 1.6 and will be removed in 1.8.
  warnings.warn(

Code

apmulti = AffinityPropagation().fit_predict(results_multi.corr())
apmulti = pd.Series(dict(zip(results_multi.columns.values, apmulti)), name="Index Type")
apmulti.nunique()

Code

silhouette_score(results_multi.corr(), apmulti)

0.604091489960529

Code

apmulti.sort_values()

GlobalDistortion          0
MultiDissim               1
MultiDivergence           1
MultiGini                 1
MultiInfoTheory           1
MultiNormExposure         1
MultiRelativeDiversity    1
MultiSquaredCoefVar       1
SimpsonsConcentration     2
MultiDiversity            3
SimpsonsInteraction       3
Name: Index Type, dtype: int64

16.3 Choosing a Segregation Measure

In practice, we have seen that segregation can compute two dozen segregation indices, but many of them capture the same concept. This yields a simple question, which index is most appropriate for studying e.g. racial and ethnic segregation in the city? There is no single answer for that question, but the literature has good suggestions about the best defaults. In particular, Reardon & Firebaugh (2002) examine the mathematical properties of many multigroup measures, arguing that the information theory index (\(H\)) and isolation index (what they term \(P\ast\)) have the most attractive features for capturing evenness and exposure/isolation, respectively.

the spatial exposure/isolation index \(\tilde{P\ast}\) which can be interpreted as a measure of the average composition of individuals’ local spatial environments–and the spatial information theory index \(\tilde{H}\) which can be interpreted as a measure of the variation in the diversity of the local spatial environments of each individual—are the most conceptually and mathematically satisfactory of the proposed spatial indices.

– Reardon & O’Sullivan (2004)

In the segregation package, the singlegroup and multigroup classes for these measures are Entropy and MultiInfoTheory for \(H\), and Isolation/Exposure and MultiNormExposure for \(P\ast\). In the next section we examine spatial segregation indices which provide more explicit avenues for assessing centralization, concentration, and clustering, as well as methods that incorporate space into the entropy and isolation measures in flexible and realistic ways.