9  The Spatial Graph

In formal spatial models, we need a way of defining relationships between observations. In the spatial analysis and econometrics literature, the data structure used to relate observations to one another is canonically called a spatial weights matrix, or \(W\), and is used to specify the hypothesized interaction structure that may exist between spatial units. These relationships between units form a special kind of network graph embedded in geographical space, and using the libpysal.graph module, we can explore these relationships in depth.

Code
%load_ext jupyter_black
%load_ext watermark
%watermark -v -a "author: eli knaap" -d -u -p libpysal
Author: author: eli knaap

Last updated: 2024-01-31

Python implementation: CPython
Python version       : 3.10.8
IPython version      : 8.9.0

libpysal: 4.9.2

Note: the Graph class is only available in versions of libpysal >=4.9 (on versions prior to that, you need to use the W class, which works a little differently)

Code
import contextily as ctx
import matplotlib.pyplot as plt
import pandas as pd
import geopandas as gpd
import networkx as nx
import osmnx as ox
import pandana as pdna

from geosnap import DataStore
from geosnap import io as gio
from libpysal.graph import Graph

datasets = DataStore()

The graph will use the geodataframe index to identify unique observations, so when we have something meaningful that serves as an “id variable” (like a FIPS code), we should set it as the index.

Code
dc = gio.get_acs(datasets, state_fips="11", years=2021)
dc = dc.to_crs(dc.estimate_utm_crs())
dc = dc.set_index("geoid")
/Users/knaaptime/Dropbox/projects/geosnap/geosnap/io/util.py:275: UserWarning: Unable to find local adjustment year for 2021. Attempting from online data
  warn(
/Users/knaaptime/Dropbox/projects/geosnap/geosnap/io/constructors.py:215: UserWarning: Currency columns unavailable at this resolution; not adjusting for inflation
  warn(
Code
dc.plot()

The most common way of creating a spatial interaction graph is by “building” one from an existing geodataframe. The libpysal.graph.Graph class has several methods available to generate these relationships, all of which follow the Graph.build_* convention.

9.1 Contiguity Graphs

A contiguity graph assumes observations have a relationship when they share an edge and/or vertex.

9.1.1 Rook

Code
g_rook = Graph.build_contiguity(dc)
Code
g_rook.n
571

The number of observations in our graph should match the number of rows in the dataframe from which it was constructed.

Code
dc.shape[0]
571

One way to interact with the graph structure directly is to use the adjacency attribute which returns the neighbor relationships as a multi-indexed pandas series

Code
g_rook.adjacency
focal         neighbor    
110010001011  110010001021    1
              110010001022    1
              110010001023    1
              110010041003    1
              110010055032    1
                             ..
110019800001  110010108002    1
              110010108003    1
              110010108004    1
              110010108005    1
              110010110021    1
Name: weight, Length: 2948, dtype: int64
Code
g_rook.adjacency.reset_index()
focal neighbor weight
0 110010001011 110010001021 1
1 110010001011 110010001022 1
2 110010001011 110010001023 1
3 110010001011 110010041003 1
4 110010001011 110010055032 1
... ... ... ...
2943 110019800001 110010108002 1
2944 110019800001 110010108003 1
2945 110019800001 110010108004 1
2946 110019800001 110010108005 1
2947 110019800001 110010110021 1

2948 rows × 3 columns

“Slicing” into a Graph object returns a pandas series where the index is the neighbor’s id/key and the value is the “spatial weight” that encodes the strength of the connection between observations.

Code
g_rook["110019800001"]
neighbor
110010001023    1
110010047021    1
110010056022    1
110010056023    1
110010058011    1
110010058022    1
110010058023    1
110010058024    1
110010059002    1
110010065001    1
110010065002    1
110010066001    1
110010073013    1
110010082002    1
110010082004    1
110010083012    1
110010101001    1
110010101003    1
110010102021    1
110010102023    1
110010102025    1
110010105001    1
110010105004    1
110010106031    1
110010107001    1
110010108002    1
110010108003    1
110010108004    1
110010108005    1
110010110021    1
Name: weight, dtype: int64

Note this is shorthand for slicing into the adjacency list itself

Code
g_rook.adjacency.loc["110019800001"]
neighbor
110010001023    1
110010047021    1
110010056022    1
110010056023    1
110010058011    1
110010058022    1
110010058023    1
110010058024    1
110010059002    1
110010065001    1
110010065002    1
110010066001    1
110010073013    1
110010082002    1
110010082004    1
110010083012    1
110010101001    1
110010101003    1
110010102021    1
110010102023    1
110010102025    1
110010105001    1
110010105004    1
110010106031    1
110010107001    1
110010108002    1
110010108003    1
110010108004    1
110010108005    1
110010110021    1
Name: weight, dtype: int64

If you prefer, the neighbors and weights values are also encoded as dictionaries on the Graph, available under the corresponding attributes

Code
g_rook.neighbors["110019800001"]
('110010001023',
 '110010047021',
 '110010056022',
 '110010056023',
 '110010058011',
 '110010058022',
 '110010058023',
 '110010058024',
 '110010059002',
 '110010065001',
 '110010065002',
 '110010066001',
 '110010073013',
 '110010082002',
 '110010082004',
 '110010083012',
 '110010101001',
 '110010101003',
 '110010102021',
 '110010102023',
 '110010102025',
 '110010105001',
 '110010105004',
 '110010106031',
 '110010107001',
 '110010108002',
 '110010108003',
 '110010108004',
 '110010108005',
 '110010110021')
Code
g_rook.weights["110019800001"]
(1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1)

The cardinalities attribute stores the number of connections/relationships for each observation. In graph terminology, that would be the number of edges for each node, and in simple terms, this is the number of neighbors for each unit of analysis

Code
g_rook.cardinalities
focal
110010001011     5
110010001021     8
110010001022     5
110010001023     8
110010002011     3
                ..
110010110022     5
110010111001     4
110010111002     5
110010111003    14
110019800001    30
Name: cardinalities, Length: 571, dtype: int64
Code
g_rook.cardinalities.describe()
count    571.000000
mean       5.162872
std        1.919363
min        2.000000
25%        4.000000
50%        5.000000
75%        6.000000
max       30.000000
Name: cardinalities, dtype: float64

We can get a sense for what the neighbor distribution looks like by plotting a histogram of the cardinalities

Code
g_rook.cardinalities.hist()

The attribute pct_nonzero stores the percentage of entries in the \(n \times n\) connectivity matrix that are non-zero. If every observation were connected to every other observation, this attribute would equal 100.

Code
g_rook.pct_nonzero
0.9041807625421343

In this graph, less than 1% of the entries in that matrix are non-zero, showing that this is a very sparse connectivity graph indeed. Note that to compute the pct_nonzero measure ourselves, we could alternatively rely on properties of the graph:

Code
(g_rook.n_edges / g_rook.n_nodes**2) * 100
0.9041807625421342

To see the full \(n \times n\) matrix representation of the graph, the sparse (scipy) representation is available under the sparse attribute

Code
g_rook.sparse
<571x571 sparse array of type '<class 'numpy.float64'>'
    with 2948 stored elements in COOrdinate format>
Code
g_rook.sparse.todense()
array([[0., 1., 1., ..., 0., 0., 0.],
       [1., 0., 1., ..., 0., 0., 0.],
       [1., 1., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

To see the dense version, just convert from sparse to dense using scipy conventions. Both rows and columns of the matrix representation are ordered as found in the geodataframe from which the Graph was constructed (or in the order given, when using a from_* method). Currently this order is stored under the unique_ids attribute.

Code
g_rook.unique_ids
Index(['110010001011', '110010001021', '110010001022', '110010001023',
       '110010002011', '110010002012', '110010002021', '110010002022',
       '110010002023', '110010002024',
       ...
       '110010109002', '110010110011', '110010110012', '110010110013',
       '110010110021', '110010110022', '110010111001', '110010111002',
       '110010111003', '110019800001'],
      dtype='object', name='focal', length=571)
Code
g_rook.unique_ids.equals(dc.index)
True

Summing across rows of the matrix is another way to count neighbors for each observation

Code
g_rook.sparse.todense().sum(axis=1)[:20].astype(int)
array([5, 8, 5, 8, 3, 6, 6, 6, 4, 6, 5, 3, 9, 6, 6, 7, 8, 3, 6, 7])
Code
g_rook.cardinalities.values[:20]
array([5, 8, 5, 8, 3, 6, 6, 6, 4, 6, 5, 3, 9, 6, 6, 7, 8, 3, 6, 7])

Classic spatial connectivity graphs are typically very sparse. That is, we generally do not consider every observation to have a relationship with every other observation. Rather, we tend to encode relationships such that observations are influenced directly by other observations that are nearby in physical space. But because nearby observations also interact with their neighbors, it is possible for an influence from one observation to influence another observation far away by traversing the graph.

One useful way to understand how the Graph encodes spatial relationships is to use the plot method, which embeds the graph in planar space showing how observations are connected to one another. In the default plot, each observation (or its centroid, if the observations are polygons) is shown as a dot (or node in graph terminology), and a line (or edge) is drawn between nodes that are connected. Since we are using a rook contiguity rule, the edge, here, indicates that the observations share a common border.

Code
ax = g_rook.plot(dc, node_kws=dict(alpha=0.4), edge_kws=dict(alpha=0.4), figsize=(8, 8))
ctx.add_basemap(ax, source=ctx.providers.CartoDB.Positron, crs=dc.crs)
ax.axis("off")

The explore method creates an interactive webmap version of the same plot, though the arguments are a bit different.

Code
m = dc.explore(tiles="CartoDB Positron", tooltip=["geoid"])
g_rook.explore(
    dc,
    m=m,
)
Code
cap = gpd.tools.geocode("the national mall, washington d.c.").to_crs(dc.crs)
Code
cap
geometry address
0 POINT (324817.891 4306654.613) National Gallery of Art, Madison Drive Northwe...

The National Gallery of Art is not exactly the National Mall, but it is close enough.

Code
cap.explore(tiles="CartoDB Positron", marker_kwds=dict(radius=20), zoom_start=15)
Code
mall = gpd.overlay(cap, dc.reset_index())

Using an overlay operation from geopandas, we can collect the geoid we need to select from the Graph. For each observation in cap, the overlay returns the observations from dc that meet the spatial predicate (the default is ‘intersects’). (We reset the index here because we need to get the ‘geoid’ from the dataframe.)

Code
mall
address geoid n_total_housing_units n_vacant_housing_units n_occupied_housing_units n_owner_occupied_housing_units n_renter_occupied_housing_units n_housing_units_multiunit_structures_denom n_housing_units_multiunit_structures n_total_housing_units_sample ... p_hispanic_persons p_native_persons p_asian_persons p_hawaiian_persons p_asian_indian_persons p_edu_hs_less p_edu_college_greater p_veterans year geometry
0 National Gallery of Art, Madison Drive Northwe... 110019800001 5.0 0.0 5.0 0.0 5.0 5.0 5.0 5.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 100.0 0.0 2021 POINT (324817.891 4306654.613)

1 rows × 59 columns

Code
mall_id = mall.geoid.tolist()[0]
Code
mall_id
'110019800001'

Since the graph encodes the notion of a “neighborhood” for each unit, it can be a useful way of selecting (and visualizing) “neighborhoods” around a particular observation. For example, here are all the blockgroups that neighbor the National Mall (well, the blockgroup that contains the National Gallery of Art).

Code
m = dc.loc[[mall_id]].explore(color="red", tooltip=None, tiles="CartoDB Positron")
dc.loc[g_rook[mall_id].index].explore(m=m, tooltip=None)

Since blockgroups vary in size, and the national mall is in an oddly shaped blockgroup, its neighbor set includes blockgroups from Georgetown (Northwest), to Capitol Hill (Northeast), to Anacostia (Southeast)…

When plotting the graph, it is also possible to select subsets and/or style nodes and edges differently. Here we can contrast the neighborhood structure for the blockgroup containing the national mall versus another blockgroup near rock creek park. Passing those two indices (in a list) to the focal argument will plot only the subgraph for those two nodes.

Code
m = dc.explore(tiles="CartoDB Positron", tooltip=["geoid"])
g_rook.explore(
    dc,
    m=m,
    focal=["110019800001", "110010020012"],
    edge_kws=dict(color="red"),
    node_kws=dict(style_kwds=dict(radius=4, color="yellow")),
)

Since the relationship graph is just a specific kind of network, we can also export directly to NetworkX, e.g. to leverage other graph theoretic measures. For example we could compute the degree of each node in the network and plot its distribution

Code
nxg = g_rook.to_networkx()
Code
import pandas as pd

pd.Series([i[1] for i in nxg.degree()]).hist()

And here it is easy to see that the degree of each node is the number of neighbors (which is the same as the cardinalities attribute of the original libpysal graph)

9.1.2 Queen

Code
g_queen = Graph.build_contiguity(dc, rook=False)

Since the ‘queen’ rule is more liberal and allows connections through the vertices as well as the edges, we would expect that each observation would have more neighbors, shifting our cardinality histogram to the right, and densifying the graph.

Code
g_queen.pct_nonzero
1.1096763903926192
Code
g_rook.pct_nonzero
0.9041807625421343
Code
g_queen.cardinalities.rename("queen").plot(
    kind="hist", legend=True, alpha=0.6, bins=range(32)
)
g_rook.cardinalities.rename("rook").plot(
    kind="hist", legend=True, alpha=0.6, bins=range(32)
)

Or the kernel density continuous estimate

Code
g_queen.cardinalities.rename("queen").plot(kind="density", legend=True)
g_rook.cardinalities.rename("rook").plot(kind="density", legend=True)

Ultimately, though, the structure of the graph has not changed that substantially. If we plot both figures and scrutinize them side-by-side, a handful of differences emerge, but not many.

Code
f, ax = plt.subplots(1, 2, figsize=(12, 9))
ax = ax.flatten()

g_rook.plot(
    dc, ax=ax[0], node_kws=dict(alpha=0.4, s=3), edge_kws=dict(alpha=0.4)
).set_title("rook", fontsize=20)

g_queen.plot(
    dc, ax=ax[1], node_kws=dict(alpha=0.4, s=3), edge_kws=dict(alpha=0.4)
).set_title("queen", fontsize=20)

for ax in ax:
    ax.axis("off")
    ctx.add_basemap(ax, source=ctx.providers.CartoDB.Positron, crs=dc.crs)
plt.tight_layout()

The Queen graph subsumes the Rook graph, so if we plot the queen relationships beneath the rook relationships in a different color, the edges still visible underneath show the difference between the queen and rook graphs.

Code
ax = dc.plot(alpha=0.4, color="lightgrey", figsize=(8, 8), edgecolor="black")
ax = g_queen.plot(dc, ax=ax, nodes=False, edge_kws=dict(color="blue"), figsize=(8, 8))

g_rook.plot(dc, ax=ax, nodes=False, edge_kws=dict(color="red"))

ax.axis("off")
ctx.add_basemap(ax, source=ctx.providers.CartoDB.Positron, crs=dc.crs)

Anything visible in blue is a connection through a vertex. But there is also a better way to do this…

9.1.3 Bishop Weights

In theory, a “Bishop” weighting scheme is one that arises when only polygons that share vertices are considered to be neighboring. But, since Queen contiguity requires either an edge or a vertex and Rook contiguity requires only shared edges, the following relationship is true:

\[ \mathcal{Q} = \mathcal{R} \cup \mathcal{B} \]

where \(\mathcal{Q}\) is the set of neighbor pairs via queen contiguity, \(\mathcal{R}\) is the set of neighbor pairs via Rook contiguity, and \(\mathcal{B}\) via Bishop contiguity. Thus:

\[ \mathcal{Q} \setminus \mathcal{R} = \mathcal{B}\]

Bishop weights entail all Queen neighbor pairs that are not also Rook neighbors.

PySAL does not have a dedicated bishop weights constructor, but you can construct one very easily using the difference method. Other types of set operations between Graph objects are also available. This is the more important point than actually constructing Bishop weights (which encode an odd interaction structure in the social sciences and are essentially never used in practice): combining Graph objects using set operations allows flexible composability.

Code
bishop = g_queen.difference(g_rook)

These are the connections that are shown in blue above

Code
m = dc.explore(
    tooltip=None,
    style_kwds=dict(color="black", fill=False),
    tiles="CartoDB Positron",
)
bishop.explore(
    dc, m=m, nodes=True, edge_kws=dict(color="purple", style_kwds=dict(weight=5))
)

9.2 Distance Graphs

9.2.1 Distance-Band

When building a graph based on distance relationships, we need to make a simple assumption: since the distance between two polygons is undefined (is it the distance between the nearest point of each polygon? the furthest points? the average?), we use the polygon centroid as its representation. The centroid is the geometric center of each polygon, so it serves as a reasonable approximation for the “average” distance of the polygon.

Note that when building a distance band, the threshold argument is measured in units defined by the coordinate reference system (CRS) of the input dataframe. Here, we are using a UTM CRS, which means our units are measured in meters, so these graphs correspond to thresholds from 500 m to 3 km.

Code
dc.crs
<Derived Projected CRS: EPSG:32618>
Name: WGS 84 / UTM zone 18N
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: Between 78°W and 72°W, northern hemisphere between equator and 84°N, onshore and offshore. Bahamas. Canada - Nunavut; Ontario; Quebec. Colombia. Cuba. Ecuador. Greenland. Haiti. Jamaica. Panama. Turks and Caicos Islands. United States (USA). Venezuela.
- bounds: (-78.0, 0.0, -72.0, 84.0)
Coordinate Operation:
- name: UTM zone 18N
- method: Transverse Mercator
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
Code
g_dist_500 = Graph.build_distance_band(dc.centroid, threshold=500)
g_dist_1000 = Graph.build_distance_band(dc.centroid, threshold=1000)
g_dist_2000 = Graph.build_distance_band(dc.centroid, threshold=2000)
g_dist_3000 = Graph.build_distance_band(dc.centroid, threshold=3000)
Code
f, ax = plt.subplots(2, 2, figsize=(9, 10))
gs = [g_dist_500, g_dist_1000, g_dist_2000, g_dist_3000]
labels = ["500 m", "1 km", "2 km", "3 km"]
ax = ax.flatten()

for i in range(len(ax)):
    gs[i].plot(
        dc, ax=ax[i], node_kws=dict(alpha=0.4, s=2), edge_kws=dict(alpha=0.4)
    ).set_title(labels[i], fontsize=14)

    ax[i].axis("off")
    ctx.add_basemap(ax[i], source=ctx.providers.CartoDB.Positron, crs=dc.crs)
plt.tight_layout()

Naturally, distance-based relationships are sensitive to the density of observations in space. Since the distance threshold is fixed, denser locations will necessarily have more neighbors, whereas sparse locations may have few or no neighbors (e.g. the three observations in southeast DC in the 1km graph). This can be a useful property of distance-band graphs, because it keeps the notion of ‘neighborhood’ fixed across observations, and in many cases it makes sense for density to increase the neighbor set.

Code
bins = range(0, 120)
g_dist_500.cardinalities.rename("500m").hist(alpha=0.6, legend=True, bins=bins)
g_dist_1000.cardinalities.rename("1km").hist(alpha=0.5, legend=True, bins=bins)
g_dist_2000.cardinalities.rename("2km").hist(alpha=0.4, legend=True, bins=bins)

As the distance band increases, the neighbor distribution becomes more dispersed. In graphs with the smallest bandwidths, there are places in the study region with observations that fail to capture any neighbors (within the 500m and 1km thresholds), which creates disconnections in the graph. The different connected regions of the graph are called components, and it can be useful to know about them for a few reasons.

First, if there is more than one component, it implies that the interaction structure in the study region is segmented. Disconnected components in the graph make it impossible, by definition, for a shock (innovation, message, etc) to traverse the graph from a node in one component to a node in another component. This situation has implications for certain applications such as spatial econometric modeling because certain conditions (like higher-order neighbors) may not exist.

If a study region contains multiple disconnected components, this information can also be used to partition or inspect the dataset. For example, if the region has multiple portions (e.g. Los Angeles and the Channel Islands, or the Hawaiian Islands, or Japan), then the component labels can be used to select groups of observations that belong to each component

Code
g_dist_1000.n_components
5

The 1km distance-band graph has 5 components (look back at the map): a large central component, and four lonely observations in the southern portion along the waterfront, all of which have no neighbors. When an observation has no neighbors, it is called an isolate.

Code
g_dist_1000.isolates
Index(['110010073011', '110010073012', '110010073013', '110010109002'], dtype='object', name='focal')
Code
g_dist_1000.component_labels
focal
110010001011    0
110010001021    0
110010001022    0
110010001023    0
110010002011    0
               ..
110010110022    0
110010111001    0
110010111002    0
110010111003    0
110019800001    0
Name: component labels, Length: 571, dtype: int32

Since the distance-band is fixed but the density of observations varies across the study area, distance-band weights can result in high variance in the neighbor cardinalities, as described above (e.g. some observations have many neighbors, others have few or none). This can be a nuisance in some applications, but a useful feature in others. Moving back to a graph-theoretic framework, we can look at how certain nodes (census blockgroups) influence others by virtue of their position in the network. That is, some blockgroups are more central or more connected than others, and thus can exert a more influential role in the network as a whole.

Code
nxg_dist = g_dist_1000.to_networkx()
Code
pd.Series(nx.degree_centrality(nxg_dist)).hist()

Three graph-theoretic measures of centrality that capture the concept of connectedness are closeness centrality, betweenness centrality, and degree centrality, all of which (and more) can be computed for each node. By converting the graph to a NetworkX object, we can compute these three measures, then plot the most central node in the network of DC blockgroups according to each concept.

Code
close = (
    pd.Series(nx.closeness_centrality(nxg_dist)).sort_values(ascending=False).index[0]
)

between = (
    pd.Series(nx.betweenness_centrality(nxg_dist)).sort_values(ascending=False).index[0]
)

degree = pd.Series(nx.degree_centrality(nxg_dist)).sort_values(ascending=False).index[0]
Code
focal_kws = dict(marker_kwds=dict(radius=10), style_kwds=dict(fillOpacity=1))
centralities = zip(["red", "yellow", "blue"], [close, between, degree])

m = g_dist_1000.explore(
    dc,
    color="gray",
    tiles="CartoDB Darkmatter",
    edge_kws=dict(style_kwds=dict(opacity=0.4)),
)

for color, measure in centralities:
    m = g_dist_1000.explore(
        dc,
        focal=measure,
        color=color,
        focal_kws=focal_kws,
        m=m,
    )
m

The map helps clarify the different concepts of centrality. The nodes with the highest closeness and degree centrality are in the densest parts of the graph near downtown DC. The blue subgraph shows the node with the highest degree centrality (i.e., the node with the greatest number of connections within a 1km radius), which is arguably the node with the greatest local connectivity, and the red subgraph shows the node with the highest closeness centrality, which is the node with the greatest global connectivity. The yellow subgraph shows the node with the highest betweenness centrality. Unlike the others, the yellow node does not have many connections (neighbors) of its own; instead, it connects three very dense subgraphs and thus serves as a pivotal bridge between these pieces of space. In this case, the blockgroup is literally a polygon that contains several bridges across the Anacostia river, which helps demonstrate the importance of considering infrastructure and the built environment when traversing space–and how much power you wield if, say, you control a bridge for political gain.

9.2.2 K-Nearest Neighbors (KNN)

Unlike the distance-band concept of a neighborhood, which defines neighbors according to a proximity criterion, regardless of how many neighbors meet that definition, the K-nearest neighbors (KNN) concept defines the neighbor set according to a minimum criterion, regardless of the distance necessary to travel to meet it. This is more akin to a relative concept of neighborhood, in that both dense and sparse places will have exactly the same number of neighbors, but the distance necessary to travel to interact with each neighbor can adapt to local conditions.

This concept has a different set of benefits relative to the distance-band approach. For example, there is no longer any variation in the neighbor cardinalities (which can be useful, and can ensure a much more connected graph in cases with observations of varying density). But the KNN relationships can also induce asymmetry, which may be a problem in certain spatial econometric applications.

Code
g_knn5 = Graph.build_knn(dc.centroid, k=5)
g_knn10 = Graph.build_knn(dc.centroid, k=10)
Code
f, ax = plt.subplots(1, 2, figsize=(12, 9))
ax = ax.flatten()

g_knn5.plot(
    dc, ax=ax[0], node_kws=dict(alpha=0.4, s=3), edge_kws=dict(alpha=0.4)
).set_title("k=5", fontsize=18)

g_knn10.plot(
    dc, ax=ax[1], node_kws=dict(alpha=0.4, s=3), edge_kws=dict(alpha=0.4)
).set_title("k=10", fontsize=18)

for ax in ax:
    ax.axis("off")
    ctx.add_basemap(ax, source=ctx.providers.CartoDB.Positron, crs=dc.crs)
plt.tight_layout()

Unlike distance relationships and contiguity relationships, nearest-neighbor relationships are often asymmetric. That is, the closest observation to me may have several other neighbors closer than I am (i.e., in this scenario, I am not my neighbor’s neighbor).

Code
g_dist_1000.asymmetry()
Series([], Name: neighbor, dtype: object)
Code
g_queen.asymmetry()
Series([], Name: neighbor, dtype: object)
Code
g_knn10.asymmetry().head(10)
focal
110010001011    110010001021
110010001011    110010002022
110010001011    110010002023
110010001011    110010055012
110010001011    110010055023
110010001011    110010056011
110010001021    110010001011
110010001021    110010002012
110010001021    110010005013
110010001021    110010041002
Name: neighbor, dtype: object
Code
g_knn10.asymmetry().groupby("focal").first().shape[0]
558
Code
g_knn10.pct_nonzero
1.7513134851138354

In the DC blockgroup data, the vast majority of observations in a KNN-10 graph contain an asymmetric relationship (nearly 98% of the data). Note, this does not mean that 98% of all relationships are asymmetric, only that 98% of the observations have at least one neighbor which does not consider the reciprocal observation a neighbor.

Code
g_knn10.asymmetry().groupby("focal").first().shape[0] / g_knn10.n
0.9772329246935202
Code
symmetric = g_knn10.unique_ids[
    ~g_knn10.unique_ids.isin(g_knn10.asymmetry().index.values)
]
Code
symmetric
Index(['110010013012', '110010013042', '110010024002', '110010024003',
       '110010030001', '110010043002', '110010044021', '110010052023',
       '110010080012', '110010092011', '110010093021', '110010095081',
       '110010108003'],
      dtype='object', name='focal')
Code
ax = g_knn10.plot(
    dc,
    node_kws=dict(alpha=0.4),
    edge_kws=dict(alpha=0.4),
    focal=symmetric,
    focal_kws=dict(color="blue"),
    figsize=(5, 5),
)
ax.set_title("k=10 symmetric communties")
ctx.add_basemap(ax=ax, source=ctx.providers.CartoDB.Positron, crs=dc.crs)
ax.axis("off")

All of the focal nodes plotted here are also neighbors with each of their neighbors (that sounds strange…)

Code
g_knn10.cardinalities.hist()

9.2.3 Kernel Functions

In the general cases of the contiguity, distance-band, and KNN graphs considered so far, the relationships between observations are treated as binary: a relationship either exists or does not, and is encoded as a 1 if it exists, and as a 0 otherwise. But they can also all be extended to allow for the relationship to be weighted continuously by some additional measure. For example, a contiguity graph can be weighted by the length of the border shared between observations, which would allow observations to have a “stronger” relationship when they have larger sections of common borders.

In the KNN or distance-band cases, it can alternatively make sense to weight the observations according to distance from the focal observation (hence the longstanding terminology for the spatial graph as a spatial weights matrix). Applying a weighting function (often called a distance decay function) in this way helps encode the first law of geography, allowing nearer observations to have a greater influence than further observations. A distance-based weight with a decay function often takes the form of a [kernel](https://en.wikipedia.org/wiki/Kernel_(statistics)) weight, where the kernel refers to the shape of the decay function (the commonly-used distance-band weight is actually just a uniform kernel, but sometimes it makes sense to allow variation within the band).
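To illustrate the decay shape only (this is not libpysal's internal implementation), a triangular kernel falls linearly from 1 at distance 0 to 0 at the bandwidth:

```python
import numpy as np

def triangular(d, bandwidth):
    """Linear decay: weight = 1 - d / bandwidth, floored at 0."""
    return np.clip(1 - np.abs(d) / bandwidth, 0, 1)

distances = np.array([0, 250, 500, 1000, 1500])
print(triangular(distances, bandwidth=1000))
# → weights 1.0, 0.75, 0.5, 0.0, 0.0
```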

Code
g_kernel_linear = Graph.build_kernel(dc.centroid, kernel="triangular", bandwidth=1000)
g_kernel_gaussian = Graph.build_kernel(dc.centroid, kernel="gaussian", bandwidth=1000)

As with the distance-band graph, the kernel graph is only defined for point geometries, so we use the centroid of each polygon as our representation of each unit, and the bandwidth is measured in units of the CRS (so defined as 1km here). Providing different arguments to the kernel parameter changes the weighting function applied at each distance.

Code
g_kernel_linear.adjacency.head(5)
focal         neighbor    
110010001011  110010001021    0.153964
              110010001022    0.470890
              110010041001    0.070222
              110010041003    0.387909
              110010055011    0.449958
Name: weight, dtype: float64
Code
g_kernel_gaussian.adjacency.head(5)
focal         neighbor    
110010001011  110010001021    0.188200
              110010001022    0.209865
              110010001023    0.156752
              110010002011    0.091724
              110010002012    0.119690
Name: weight, dtype: float64

The explore method can actually plot the value of the weight for each edge in the graph, although the visualization can be a bit tough to take in. It is easier to visualize one node at a time. Here, the shorter lines will be encoded with lighter colors and longer lines (representing longer distances) are darker as they get assigned lower weights. Note you can re-run the cell below to visualize a different node (or change the focal argument to a particular node instead of a random selection)

Code
g_kernel_linear.explore(
    dc,
    # pick a random node in the graph to visualize
    focal=dc.reset_index().geoid.sample(),
    edge_kws=dict(
        column="weight",
        style_kwds=dict(weight=5),
    ),
    tiles="CartoDB Darkmatter",
)

9.3 Transformations

In many empirical applications of the Graph, it is useful to normalize or standardize the weight structure for each unit (i.e. to rescale the weight value for each unit such that the sum equals 1). In the spatial econometrics literature, this operation is canonically called a transform of the spatial weights matrix. The Graph inherits this terminology; thus, it is extremely common to transform the Graph prior to using it.

Code
g_rook.adjacency.head(10)
focal         neighbor    
110010001011  110010001021    1
              110010001022    1
              110010001023    1
              110010041003    1
              110010055032    1
110010001021  110010001011    1
              110010001022    1
              110010002021    1
              110010002022    1
              110010004002    1
Name: weight, dtype: int64

The Graph’s transform method allows each observation to be standardized using different techniques

Code
?g_rook.transform
Signature: g_rook.transform(transformation)
Docstring:
Transformation of weights
Parameters
----------
transformation : str
    Transformation method. The following are
    valid transformations.
    - **B** -- Binary
    - **R** -- Row-standardization (global sum :math:`=n`)
    - **D** -- Double-standardization (global sum :math:`=1`)
    - **V** -- Variance stabilizing
Returns
-------
Graph
    transformed weights
Raises
------
ValueError
    Value error for unsupported transformation
File:      ~/mambaforge/envs/urban_analysis/lib/python3.10/site-packages/libpysal/graph/base.py
Type:      method
Code
g_rook.transform("r").adjacency.head(10)
focal         neighbor    
110010001011  110010001021    0.200
              110010001022    0.200
              110010001023    0.200
              110010041003    0.200
              110010055032    0.200
110010001021  110010001011    0.125
              110010001022    0.125
              110010002021    0.125
              110010002022    0.125
              110010004002    0.125
Name: weight, dtype: float64

Now the weight values for the first observation (110010001011) are rescaled to 0.2 for each neighbor (since 110010001011 has 5 neighbors). Note that unlike its predecessor (the libpysal.weights.W class), the Graph is immutable, which means that to store the transformation permanently, it is necessary to assign the result back to a variable!
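The row transformation itself is just a groupwise division: each weight is divided by the sum of the weights for its focal observation. A toy pandas sketch (with made-up ids, not the DC data) replicating the arithmetic:

```python
import pandas as pd

# toy binary adjacency list indexed by (focal, neighbor)
adj = pd.Series(
    [1, 1, 1, 1, 1],
    index=pd.MultiIndex.from_tuples(
        [("a", "b"), ("a", "c"), ("b", "a"), ("b", "c"), ("b", "d")],
        names=["focal", "neighbor"],
    ),
    name="weight",
    dtype=float,
)

# row standardization: divide each weight by its focal observation's total
row_standardized = adj / adj.groupby(level="focal").transform("sum")
print(row_standardized)
# "a" has 2 neighbors, so each weight becomes 0.5; "b" has 3, so each becomes 1/3
```

After the division, the weights within each focal observation's row sum to 1, which is exactly the property the "r" transform guarantees.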

Non-constant weights can also be transformed

Code
g_kernel_linear.adjacency
focal         neighbor    
110010001011  110010001021    0.153964
              110010001022    0.470890
              110010041001    0.070222
              110010041003    0.387909
              110010055011    0.449958
                                ...   
110010111002  110010094003    0.246456
              110010111001    0.189410
110010111003  110010089031    0.013635
110019800001  110010102025    0.276095
              110010102026    0.031204
Name: weight, Length: 8732, dtype: float64
Code
g_kernel_linear.transform("r").adjacency
focal         neighbor    
110010001011  110010001021    0.032531
              110010001022    0.099494
              110010041001    0.014837
              110010041003    0.081961
              110010055011    0.095071
                                ...   
110010111002  110010094003    0.297544
              110010111001    0.228674
110010111003  110010089031    1.000000
110019800001  110010102025    0.898457
              110010102026    0.101543
Name: weight, Length: 8732, dtype: float64

Using a row-transformation, the weight values for each observation sum to 1

Code
g_kernel_linear.transform("r").adjacency.groupby(level=0).sum()
focal
110010001011    1.0
110010001021    1.0
110010001022    1.0
110010001023    1.0
110010002011    1.0
               ... 
110010110022    1.0
110010111001    1.0
110010111002    1.0
110010111003    1.0
110019800001    1.0
Name: weight, Length: 571, dtype: float64

If we doubly standardize, the weight values for all observations combined sum to 1

Code
g_kernel_linear.transform("d").adjacency.groupby(level=0).sum().sum()
1.0

9.4 Advanced Graphs

9.4.1 Transport Network-Based Graphs

The Graph class is designed to be flexible, and it can be constructed from a variety of formats. For example, you could create a full pairwise distance matrix from a set of input points and build a graph from that matrix. Here, we will create a new graph based on walk-network distances from OpenStreetMap using the pandana library.

This function will download a “walk” network from OpenStreetMap using osmnx and return the node and edge tables used to construct a pandana.Network.

Code
def pdna_network_from_gdf(gdf, network_type="walk"):
    graph = ox.graph_from_polygon(gdf.unary_union, network_type=network_type)
    n, e = ox.utils_graph.graph_to_gdfs(graph)
    e = e.reset_index()

    return pdna.Network(
        edge_from=e["u"],
        edge_to=e["v"],
        edge_weights=e[["length"]],
        node_x=n["x"],
        node_y=n["y"],
    )
Code
dc_wgs = dc.to_crs(4326)
Code
dc_walk_net = pdna_network_from_gdf(dc_wgs)
Generating contraction hierarchies with 10 threads.
Setting CH node vector of size 67676
Setting CH edge vector of size 212726
Range graph removed 215286 edges of 425452
. 10% . 20% . 30% . 40% . 50% . 60% . 70% . 80% . 90% . 100%

Then, we get the nearest node in the OSM network (i.e. the nearest street intersection) to the centroid of each observation and attach its id to the dataframe.

Code
nodes = dc_walk_net.get_node_ids(dc_wgs.geometry.centroid.x, dc_wgs.geometry.centroid.y)
/var/folders/79/cknfb1sx2pv16rztkpg6wzlw0000gn/T/ipykernel_16959/2680508403.py:1: UserWarning: Geometry is in a geographic CRS. Results from 'centroid' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  nodes = dc_walk_net.get_node_ids(dc_wgs.geometry.centroid.x, dc_wgs.geometry.centroid.y)
Code
dc["node_ids"] = nodes
Code
adj = dc_walk_net.nodes_in_range(nodes, 2000)
Code
sparse_graph = (
    adj.set_index(["source", "destination"])["length"]
    .astype("Sparse[float]")
    .reindex(nodes, level=0)
    .reindex(nodes, level=1)
    .sparse.to_coo(sort_labels=True)[0]
)

When using metric="precomputed", the first argument of the builder function is a sparse matrix that encodes relationships between each pair of observations (here, the shortest walking path between each pair of observation centroids, provided the path distance is less than 2 km).
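To make the precomputed input format concrete, here is a toy sketch (hypothetical coordinates, independent of the walk-network example) of assembling such a sparse pairwise distance matrix with scipy:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.spatial import distance_matrix

# hypothetical point coordinates in projected units
pts = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# dense pairwise Euclidean distances
d = distance_matrix(pts, pts)

# sparsify: keep only off-diagonal pairs within a 6-unit threshold
mask = (d > 0) & (d <= 6)
sparse_d = csr_matrix(np.where(mask, d, 0.0))

# a matrix of this shape is what the builder receives when metric="precomputed"
print(sparse_d.toarray())
```

Any distance metric can be substituted here (network distance, travel time, and so on), since the builder only sees the resulting matrix, not how it was computed.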

Code
g_network = Graph.build_kernel(sparse_graph, metric="precomputed", bandwidth=2000)
Code
f, ax = plt.subplots(1, 2, figsize=(12, 9))
ax = ax.flatten()

g_network.plot(
    dc.reset_index(), ax=ax[0], node_kws=dict(alpha=0.4, s=2), edge_kws=dict(alpha=0.2)
).set_title("Network-based 2km", fontsize=16)

g_dist_2000.plot(
    dc, ax=ax[1], node_kws=dict(alpha=0.4, s=2), edge_kws=dict(alpha=0.2)
).set_title("Euclidean-based 2km", fontsize=16)

for a in ax:
    a.axis("off")
    ctx.add_basemap(a, source=ctx.providers.CartoDB.Positron, crs=dc.crs)
plt.tight_layout()

Including the constraint that distance must be measured along the walking network between observations (rather than as Euclidean distance) results in some notable distinctions between the two resulting Graphs. In particular, the observations in Northwest DC are no longer connected through Rock Creek Park, and the blockgroups in Southeast are no longer connected across the Anacostia River. This is a more realistic depiction of the potential interaction structure based on infrastructure throughput, and it helps show how the built environment can be used to develop alternative concepts of spatial connectivity.

Code
g_dist_2000.pct_nonzero
9.264479007241421
Code
g_network.pct_nonzero
6.473112277290279

The graph based on network-constrained 2 km distances is sparser than the graph based on Euclidean distances, showing that transport networks (which are not straight lines) make travel distances longer than they sometimes appear. Using network distances to construct the graph therefore results in fewer connections, because most observations are farther apart when measured along the walk network.
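The pct_nonzero values reported above are simply the share of nonzero cells in the n × n adjacency matrix, expressed as a percentage. A quick sketch of the arithmetic on a toy matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix

# toy 3x3 binary adjacency matrix with 4 nonzero entries
A = csr_matrix(np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]]))

# percentage of cells that are nonzero: 4 of 9 here
pct_nonzero = 100 * A.nnz / (A.shape[0] * A.shape[1])
print(pct_nonzero)
```

Lower values mean a sparser graph, so the network-constrained graph's smaller percentage directly reflects its fewer connections.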

9.4.2 Flow-based Graphs

Instead of spatial proximity, the relationship between observations could be represented by other types of flow data, such as trade, transactions, or observed commutes. For example, the Census LODES data contain information on flows between residential locations and job locations at the block level. This example uses LODES V7, which is based on 2010 census block definitions.

Code
dc_flows = pd.read_csv(
    "https://lehd.ces.census.gov/data/lodes/LODES7/dc/od/dc_od_main_JT00_2019.csv.gz"
)

The data are formatted as an adjacency list, with the first three columns containing all the necessary information:

  • “w_geocode”: workplace block fips code
  • “h_geocode”: home block fips code
  • “S000”: the total number of jobs (i.e. flows) between home and work
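A small sketch of this origin–destination layout (with made-up block codes and flow counts, not real LODES values):

```python
import pandas as pd

# hypothetical OD flows in the LODES adjacency-list layout
flows = pd.DataFrame(
    {
        "w_geocode": ["110010001011000", "110010001011000", "110010002011000"],
        "h_geocode": ["110010002011000", "110010003011000", "110010003011000"],
        "S000": [12, 3, 7],
    }
)

# each row is one home->work pair; aggregating gives jobs per workplace block
print(flows.groupby("w_geocode")["S000"].sum())
```

Each row is one directed edge in a flow graph, which is why the table can be pivoted into a sparse matrix in the steps that follow.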
Code
gio.store_blocks_2010()
Loading manifest: 100%|██████████| 56/56 [00:00<00:00, 25.2k/s]
Successfully installed package 'census/blocks_2010', tophash=e703fea from s3://spatial-ucr
Code
dc_blks = datasets.blocks_2010(states="11")

Once again, we pivot this adjacency list into a properly-ordered sparse matrix

Code
dc_flows[["w_geocode", "h_geocode"]] = dc_flows[["w_geocode", "h_geocode"]].astype(str)
dc_flows["S000"] = dc_flows["S000"].astype(float)
Code
dc_blks = dc_blks[dc_blks.geoid.isin(dc_flows.w_geocode.unique())]
ids = dc_blks.geoid.values
Code
dc_flows_sparse = (
    dc_flows.set_index(["w_geocode", "h_geocode"])["S000"]
    .astype("Sparse[float]")
    .reindex(ids, level=0)
    .reindex(ids, level=1)
    .sparse.to_coo(sort_labels=True)[0]
)
Code
g_flow = Graph.from_sparse(dc_flows_sparse, ids=ids)
Code
g_flow.cardinalities.hist()

Code
g_flow.pct_nonzero
1.28891109652
Code
g_flow.adjacency.reset_index().describe()
              weight
count  128950.000000
mean        1.268313
std         0.960348
min         0.000000
25%         1.000000
50%         1.000000
75%         1.000000
max       144.000000
Code
dc_blks.plot()

Code
g_flow.adjacency
focal            neighbor       
110010011004003  110010106001001    1.0
110010014023001  110010050021002    1.0
                 110010021011002    1.0
110010012001008  110010027021000    1.0
110010012004020  110010095051004    1.0
                                   ... 
110010002023010  110010025023004    1.0
                 110010058002004    1.0
                 110010007013001    1.0
                 110010043002005    1.0
110010097001012  110010044001017    1.0
Name: weight, Length: 128950, dtype: float64
Code
g_flow.n
3163
Code
dc_blks.shape[0]
3163
Code
ax = g_flow.plot(
    dc_blks.set_index("geoid").to_crs(3857),
    nodes=False,
    edge_kws=dict(alpha=0.004, linestyle="dotted"),
    figsize=(9, 7),
)
ctx.add_basemap(ax=ax, source=ctx.providers.CartoDB.Positron)
ax.axis("off")

Here, the plot shows flows between origins and destinations. The strong central tendency shows how commutes tend to flow toward the center of the city

9.4.3 Coincident Nodes

Classic spatial graphs are built from polygon data, which are typically exhaustive and planar-enforced. The Census block and blockgroup data we have seen until now are typical in this way; the relationships expressed by polygon contiguity (or distance) are unambiguous and straightforward to model using a simple set of rules, but these spatial representations are often over-simplified or aggregate summaries of real phenomena. Blockgroups, for example, are zones created from aggregated household data, and the relationships can become more complex when working with disaggregate microdata recorded at the household level.

Continuing with the household example, in urban areas it is common for many observations to be located at the same “place”, for example an apartment building with several units located at the same physical address (in some cases additional information, such as floor level, may be available, allowing 3D distance computations, but this unusually detailed case also requires the actual floor layout of each building to be realistic…). In these cases (i.e. with several coincident observations in space), certain graph-building rules like triangulation or k-nearest neighbors can become invalid because the relationships between points are undefined, yielding, e.g., 10 nearest neighbors at exactly the same location when the user has requested a graph with k=5. How do we choose which 5 among the coincident 10 should be neighbors?

…We don’t. Instead, we offer two solutions that make the rule valid:

  • jitter introduces a small amount of random noise into each location, dispersing a set of coincident points into a small point cloud. This generates a valid topology that can be operated upon using the original construction rule. The tradeoff here is that the neighbor set is stochastic, and any two versions of a jittered Graph will likely have very different neighbors.

  • clique expands the graph internally, as necessary, so that the construction rule applies to each unique location rather than each unique observation. This effectively makes every coincident observation a neighbor of all other coincident observations, then proceeds with the normal construction rule. In other words, we operate on the subset of points for which the original construction rule is valid, then retroactively break the rule to reinsert the omitted observations back into their neighborhoods. In many cases, this can result in a “conceptually accurate” representation of the neighborhood (e.g. where units in the same apartment building are all considered neighbors); however, the construction ‘rule’ will ultimately end up broken, which removes some of the desirable properties of certain graphs. For example, a KNN graph with 10 neighbors will now have variation in cardinalities, as some locations will have more than 10 neighbors.
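The jitter approach, for example, can be sketched directly with numpy (hypothetical coordinates; libpysal's own implementation may scale the noise differently):

```python
import numpy as np

rng = np.random.default_rng(42)

# five observations, three of which share the same coordinates
# (e.g. units in one apartment building)
coords = np.array(
    [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [10.0, 10.0], [20.0, 5.0]]
)


def jitter(points, scale=1e-6):
    # disperse coincident points into a tiny cloud so that distance-based
    # construction rules (e.g. KNN) become well defined
    return points + rng.normal(0, scale, size=points.shape)


jittered = jitter(coords)
# after jittering, no two points are exactly coincident anymore
print(len(np.unique(jittered, axis=0)))  # 5
```

Because the noise scale is tiny relative to the coordinate units, the jittered points remain effectively at their original locations while giving the construction rule distinct positions to work with.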

:::