23  Geodemographic Clustering

Code
import contextily as ctx
import geopandas as gpd
import matplotlib.pyplot as plt
from geosnap import DataStore
from geosnap import analyze as gaz
from geosnap import io as gio
from geosnap import visualize as gvz

%load_ext watermark
%watermark -a  'eli knaap' -iv -d -u
Author: eli knaap

Last updated: 2025-11-25

matplotlib: 3.10.8
geopandas : 1.1.1
geosnap   : 0.15.3
contextily: 1.6.2

Geodemographic analysis, which includes the application of unsupervised learning algorithms to demographic and socioeconomic data, is a widely used technique that falls under the broad umbrella of “spatial data science”. Technically, there is no formal spatial analysis in traditional geodemographics; however, given its emphasis on geographic units of analysis (and the subsequent mapping of the results), it is often viewed as a first (if not requisite) step in exploratory analyses of a particular study area.

The intellectual roots of geodemographics extend from analytical sociology and classic studies from Factorial Ecology and Social Area Analysis. Today, geodemographic analysis is routinely applied in academic studies of neighborhood segregation and neighborhood change, and it is used extremely frequently in industry, particularly in marketing, where products like Tapestry and Mosaic are sold for their predictive power. Whereas social scientists often look at the resulting map of neighborhood types and ask how these patterns came to be, practitioners often look at the map and ask how they can use the patterns to inform better strategic decisions.

In urban social science, our goal is often to understand the social composition of neighborhoods in a region, whether they have changed over time (and where), and whether these neighborhood types are consistent over time and across places. That requires a common pipeline: collecting the same variable sets, standardizing them (often within the same time period so they can be pooled with other time periods), then clustering the entire long-form dataset, followed by further analysis and visualization of the results. Most often, this process happens repeatedly using different combinations of variables, algorithms, or cluster outputs (and in different places at different times). Geosnap provides a set of tools to simplify this pipeline.

Code
datasets = DataStore()

atl = gio.get_acs(datasets, msa_fips="12060", years=2021, level="tract")

atl[["median_household_income", "geometry"]].explore(
    "median_household_income",
    scheme="quantiles",
    k=8,
    cmap="magma",
    tiles="CartoDB Positron",
)

columns = [
    "median_household_income",
    "median_home_value",
    "p_edu_college_greater",
    "p_edu_hs_less",
    "p_nonhisp_white_persons",
    "p_nonhisp_black_persons",
    "p_hispanic_persons",
    "p_asian_persons",
]

23.1 Cross-Sectional Clustering

To create a simple geodemographic typology, use the cluster function and pass a geodataframe, a set of columns to include in the cluster analysis, the algorithm to use, and the number of clusters to fit (though some algorithms require different arguments and/or discover the number of clusters endogenously). By default, this will z-standardize all the input columns, drop observations with missing values for input columns, realign the geometries for the input data, and return a geodataframe with the cluster labels as a new column (named after the clustering algorithm).

Code
atl_kmeans, atl_k_model = gaz.cluster(
    gdf=atl, method="kmeans", n_clusters=5, columns=columns, return_model=True
)

atl_kmeans[columns + ["geometry", "kmeans"]].explore(
    "kmeans", categorical=True, cmap="Accent", tiles="CartoDB Positron"
)
Code
ax = atl_kmeans[columns + ["geometry", "kmeans"]].plot(
    "kmeans", categorical=True, cmap="Accent", figsize=(6, 6), alpha=0.6
)
ax.axis("off")
ctx.add_basemap(ax=ax, source=ctx.providers.CartoDB.Positron, crs=atl.crs)
plt.show()

K-means Geodemographic Typology in the Atlanta Metro

There are two obvious patterns in this map. There's an urban-rural distinction, where type 1 dominates the periphery of the metro and is basically absent from the core. There is also a north-south distinction, where type 0 is located exclusively in the southern portion and type 3 exclusively in the north. Types 2 and 4 seem to have their own distinctive pockets. Even at this surface level, the map tells a pretty powerful story about multivariate segregation in the region. The kmeans clustering algorithm does not consider space at all, yet all of the types form distinctive spatial clusters that are apparent in the map. We can also conduct formal tests for whether these patterns diverge from spatial randomness using join count statistics, either individually for each cluster or jointly for all clusters (though it is pretty obvious here that a co-location statistic for these five variables would be wildly significant).
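As a minimal sketch of such a test (assuming the PySAL packages libpysal and esda are installed; neither is used elsewhere in this chapter), we could convert one cluster label into a binary indicator and compute join counts under queen contiguity:

Code
from esda import Join_Counts
from libpysal.weights import Queen

# binary indicator: is a tract assigned to cluster type 0?
labeled = atl_kmeans.dropna(subset=["kmeans"])
is_type0 = (labeled["kmeans"] == 0).astype(int)

# queen contiguity weights for the labeled tracts
w = Queen.from_dataframe(labeled, use_index=True)

# join counts with a permutation-based pseudo p-value for same-type (BB) joins
jc = Join_Counts(is_type0, w)
print(jc.bb, jc.p_sim_bb)

A small pseudo p-value for the BB joins would indicate that type 0 tracts neighbor one another more often than expected under spatial randomness.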

To understand what the clusters are identifying, another data visualization technique is useful. Here, we rely on the Tufte-ian principle of “small multiples” and create a set of violin plots that show how each variable is distributed across each of the clusters. Each of the inset boxes shows a different variable, with cluster assignments on the x-axis and the variable itself on the y-axis. The boxplot in the center shows the median and IQR, and the “body” of the “violin” is a kernel-density estimate reflected across the x-axis. In short, the fat parts of the violin show where the bulk of the observations are located, and the skinny “necks” show the long tails.

Code
gvz.plot_violins_by_cluster(atl_kmeans, columns, cluster_col="kmeans")
plt.show()

Violin Plots for Atlanta’s K-means Clusters

Note that in this case we are visualizing the same attributes used to create the cluster model, because these are what differentiate the clusters from one another, but the columns argument accepts a list of any variables present in the dataframe. This can be useful for examining the distribution of another variable across the clusters, for example if the clusters are based purely on sociodemographic data and an analyst wants to treat cluster assignments as “exogenous” and examine how another resource, like affordable housing or exposure to pollution, is spread across the different neighborhood types.
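For instance, a hypothetical sketch along those lines (the vacancy-rate column named here is an assumption about the ACS variable set, not a variable used above):

Code
# distribution of a variable *not* used to fit the model across the clusters
# (assumes a `p_vacant_housing_units` column exists in the ACS data)
gvz.plot_violins_by_cluster(
    atl_kmeans, ["p_vacant_housing_units"], cluster_col="kmeans"
)
plt.show()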

We can also use a statistic to tell us how well this model fits the data. To do so, we can use scikit-learn's silhouette score:

The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of. Note that Silhouette Coefficient is only defined if number of labels is 2 <= n_labels <= n_samples - 1.

This function returns the mean Silhouette Coefficient over all samples. To obtain the values for each sample, use silhouette_samples.

The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
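As a quick sketch, the same statistic can be computed directly with scikit-learn, re-standardizing the inputs to match what the cluster function did internally (this should reproduce the score geosnap stores on the model, though that equivalence is an assumption here):

Code
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# z-standardize the same input columns and score the k-means labels
labeled = atl_kmeans.dropna(subset=columns + ["kmeans"])
silhouette_score(
    StandardScaler().fit_transform(labeled[columns]), labeled["kmeans"]
)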

The geosnap ModelResults class that is returned with the geodataframe includes a set of methods for quickly computing and visualizing these fit statistics (and their distribution in space).

Code
atl_k_model.plot_silhouette()
plt.show()

Plot of Silhouette Scores
Code
atl_k_model.plot_silhouette_map(figsize=(6, 6))

Map of Silhouette Scores in Greater Atlanta

If there is spatial patterning in the silhouette scores for each observation, it suggests there are particular places where the cluster model does not fit well. That could mean a different model may work better, e.g., by increasing (or decreasing) the number of clusters, or by using a different clustering algorithm. There are many algorithms to choose from, and they differ in their ability to capture different kinds of structure in the set of attributes. For more details, see the scikit-learn clustering page, including this excellent graphic:

Alternative Cluster Types (from the scikit-learn documentation)

Code
cluster_types = [
    "kmeans",
    "ward",
    "affinity_propagation",
    "gaussian_mixture",
    "spectral",
]

To get a feel for how these different algorithms generate alternative results for geodemographic typologies, we can use the same input data for each of the clustering techniques and map the results.

Code
f, ax = plt.subplots(2, 3, figsize=(15, 10))
ax = ax.flatten()
clusters = {}

for i, k in enumerate(cluster_types):
    df, mod = gaz.cluster(
        gdf=atl, method=k, columns=columns, n_clusters=6, return_model=True
    )
    clusters[k] = (df, mod)
    df.to_crs(3857).sort_values(k).plot(
        k, categorical=True, ax=ax[i], alpha=0.6
    )
    ax[i].axis("off")
    ax[i].set_title(k + f" ({mod.silhouette_score.round(3)})", fontsize=14)
    ctx.add_basemap(ax=ax[i], source=ctx.providers.CartoDB.Positron)
plt.tight_layout()

Geodemographic Typologies using Alternative Clustering Methods

Note that the colors do not mean anything when comparing across maps; only the patterns matter. Thus, even though the kmeans, ward, and affinity propagation maps use different colors, the algorithms are largely capturing similar patterns, as the tracts tend to end up in similar groups across the maps. According to the silhouette score, the kmeans solution fits best (.284). Needless to say, these clustering algorithms pick up on different spatial patterns, but the most appropriate algorithm is a choice determined by the research question (or an analyst’s preferred fit metric).
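To put a number on that visual similarity, one option (a sketch, not part of the analysis above) is a label-permutation-invariant index like scikit-learn's adjusted Rand score, which compares two labelings while ignoring the arbitrary integers and colors:

Code
from sklearn.metrics import adjusted_rand_score

# agreement between the kmeans and ward solutions; 1 means identical
# partitions, 0 means no better than chance
km = clusters["kmeans"][0].dropna(subset=["kmeans"])
ward = clusters["ward"][0].dropna(subset=["ward"])
common = km.index.intersection(ward.index)
adjusted_rand_score(km.loc[common, "kmeans"], ward.loc[common, "ward"])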

To compare solutions directly, you can use holoviews to create a linked plot that will pan together (this requires hvplot and geoviews, which are not geosnap dependencies).

Code
import geoviews as gv
import hvplot.pandas

gv.extension("matplotlib", "bokeh")
gv.output(backend="bokeh")

Using a linked interactive plot, we can explore similarity in the kmeans and affinity propagation solutions, which have the highest silhouette scores. This is useful to see whether they discover the same spatial patterns, but in this case also because affinity propagation is an algorithm where k is endogenous (so the affinity propagation solution here only includes 4 clusters vs the k-means’ 6).

Code
clusters["kmeans"][0].dropna(subset=["kmeans"]).hvplot(
    c="kmeans",
    cmap="tab10",
    line_width=0.1,
    alpha=0.7,
    geo=True,
    tiles="CartoLight",
    xaxis=False,
    yaxis=False,
    frame_height=450,
    colorbar=False,
    title='K-Means'
) + clusters["affinity_propagation"][0].dropna(subset=["affinity_propagation"]).hvplot(
    c="affinity_propagation",
    cmap="tab10",
    line_width=0.1,
    alpha=0.7,
    geo=True,
    tiles="CartoLight",
    xaxis=False,
    yaxis=False,
    frame_height=450,
    colorbar=False,
    title='Affinity Propagation'
)

(Note: if you click the “scroll-zoom” button, you can zoom and pan on both maps simultaneously for a better view.) The maps show that the algorithms are picking up similar but not identical patterns. Since they have a different number of clusters, this also raises the question of how to determine the appropriate \(k\) parameter for the number of clusters. One method is to use a performance metric, like a silhouette score, to help select \(k\) based on some fit criterion (in the case of geodemographics, we always need to use unsupervised metrics, because there is no such thing as ‘ground truth’).

To help automate that process, geosnap provides functions that brute-force clustering algorithms over a range of n_clusters to help visualize which model returns the best score. We can do this for the kmeans algorithm to look at fits between 2 and 10 clusters like so:

Code
ks_best, ks_table = gaz.find_k(
    gdf=atl, method="kmeans", columns=columns, min_k=2, max_k=10, return_table=True
)
ks_best
                         best_k
silhouette_score              2
calinski_harabasz_score       2
davies_bouldin_score          6
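The same winners can be read directly off the returned table (a quick sketch, assuming ks_table is indexed by the number of clusters):

Code
# higher is better for silhouette and Calinski-Harabasz...
print(ks_table[["silhouette_score", "calinski_harabasz_score"]].idxmax())
# ...while lower is better for Davies-Bouldin
print(ks_table["davies_bouldin_score"].idxmin())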

The units here do not matter, so the values on the y-axis can be safely ignored; what matters for these metrics are the min/max values, which describe the best or worst fitting cluster models (depending on the metric) for these data. For the silhouette score and the Calinski-Harabasz score, the highest score fits best, whereas for the Davies-Bouldin score, the lowest score has the best fit.

According to the silhouette and Calinski-Harabasz scores, 2 clusters provide the best fit for this collection of variables, whereas the Davies-Bouldin score suggests 6 clusters. But these metrics don’t tell the full story, so it can be useful to graph the score for each \(k\).

Code
ks_table.columns
Index(['silhouette_score', 'calinski_harabasz_score', 'davies_bouldin_score'], dtype='object')
Code
f, ax = plt.subplots(1, 3, figsize=(15, 5))
ax = ax.flatten()

for i, metric in enumerate(ks_table.columns.values):
    ks_table[metric].plot(ax=ax[i])
    ax[i].set_title(metric)
f

Plot of Fit Metrics for Increasing Values of k

The graphs make clear that the silhouette score at \(k=5\) is nearly as high as the maximum, which is achieved at \(k=2\). Since the goal is to understand what differentiates places, there are natural reasons to choose the 5-cluster solution over the 2-cluster solution. The Davies-Bouldin graph also suggests that 5 would not be a bad choice, so using a collection of these measures to help decide on a preferred solution is generally a good idea (and geosnap tries to make that easy by providing all of them at once).