Visualizing Flows with `libpysal` and `lonboard`

the Graph goes great with the GPU

Published

July 25, 2024

import geodatasets
import geopandas as gpd
import mapclassify
import pandas as pd
from mapclassify.util import get_color_array

%load_ext watermark
%watermark -a 'eli knaap' -iv

Author: eli knaap

geopandas  : 0.14.2
geodatasets: 2024.7.0
mapclassify: 2.7.1.dev0+gaf62513092fd.d20240723
pandas     : 2.1.4

TL DR

the upcoming release of the PySAL metapackage is the first to include the new Graph class, our next-gen (ugh) data structure for encoding spatial relationships
plotting a Graph object is straightforward, and the interactive version works nicely for reasonably-sized graphs.
- but large graphs, e.g. a Graph representing commute flows are stressful for the browser, especially considering that a good O-D travel visualization needs to be (at least) regional in scale.
lonboard works great for large graphs (though I haven’t quite perfected it). To show the whole graph, remove your pointer from the map window entirely. Sometimes it is easier to start zoomed into a node and carefully zoom out
- See an example of a simple rook graph here
- See a much more complex example of daily commute flows in the Washington DC region here
- (yes, I always use DC as my first example. It’s the place I lived the longest; easist to ground-truth because i have a reasonable handle on the economic and political geography)
- this isnt anywhere close to production-ready, but given the flexibility of Graph, it’s still very fast to go from zero to viz:

Plotting the `Graph`

The new libpysal Graph class has a builtin method for interactively visualizing spatial interaction graphs. This works great, even for relatively large graphs–as long as the contiguity structure is sparse.

from libpysal.graph import Graph

df = gpd.read_file(geodatasets.get_path("geoda cincinnati")).to_crs(4326)

g = Graph.build_contiguity(df)
g.explore(df, tiles='CartoDB Positron')

Make this Notebook Trusted to load map: File -> Trust Notebook

When each observation has many connections, the visualization gets messy really quickly (see, for example, the three-kilometer graph here). In these cases, it can be useful to leverage another dimension (like arcing the edges connecting observations in 3D), or by using another interactivity method, like mousing over an observation to reveal its neighbors (as opposed to revealing the entire graph all at once). Both of those things are tough or impossible to do with traditional web mapping tools (thank you leaflet!(!) …you’re just getting old, like the rest of us 😕).

So what if we wanted to visualize a much larger and more complicated Graph (without crashing the browser)? For example a Graph that visualizes transport flows, but at a regional scale?

Using DeckGL as a backend can solve some of the issues above (if you’re an R person, check out rdeck or Kyle Walker’s webgl package). In the past I have done that with PyDeck, deck.gl’s own python API. But the PyDeck interface to DeckGL is not actually all that scalable (when working with geodataframes anyway) because it resorts to storing data in GeoJSON which is wickedly inefficient. Lately I’ve been really impressed with lonboard, which uses DeckGL as a backend but passes all of its data via arrow, which results in a nice-to-use API and lovely scalable maps rendered by the GPU (sans GeoJSON bloat). Exploring their docs yesterday, the county-level migration example struck me as extremely similar to my example with LODES–and thus should be pretty easy to replicate with a Graph.

So I decided to kick the tires with lonboard a bit yesterday, and this was pleasantly easier than I anticipated. The Graph class works great for this and props to Kyle Barron for lonboard. Very impressive.

Proof of Concept

adapting from lonboard’s county migration example, it should be straightforward to use the arclayer as a 3D vizualization for the Graph

First create a Graph indexed by geometries so we can visualize point to point

g = Graph.build_contiguity(df.assign(idx=df.representative_point()).set_index("idx"))

Here’s a neat trick with a Graph… it will use the geodataframe index to identify observations. So if we set the geometry column as the index, we get a point-to-point adjacency list.

Two aligned arrays of XY coords, origin and destination:

origins = gpd.GeoSeries(g.adjacency.reset_index().focal).get_coordinates().values

destinations = (
    gpd.GeoSeries(g.adjacency.reset_index().neighbor).get_coordinates().values
)

destinations

array([[-84.48188467,  39.12086   ],
       [-84.47872758,  39.12541   ],
       [-84.47710758,  39.12521   ],
       ...,
       [-84.49431244,  39.14766   ],
       [-84.49648467,  39.14526   ],
       [-84.49791194,  39.14791   ]])

from lonboard import Map, PolygonLayer, ScatterplotLayer, SolidPolygonLayer
from lonboard.experimental import ArcLayer

from lonboard.layer_extension import BrushingExtension

brushing_extension = BrushingExtension()
brushing_radius = 30

import pyarrow as pa

table = pa.table({"value": g.adjacency.values})

Then we create an arc layer (which basically just creates edges between our two sets of aligned nodes), and add the brushing extension so only the subgraph for the observation being moused-over shows up

arc_layer = ArcLayer(
    table=table,
    get_source_position=origins,
    get_target_position=destinations,
    get_width=4,
    opacity=0.4,
    pickable=False,
    extensions=[brushing_extension],
    brushing_radius=brushing_radius,
)

Turn source and target points into their own layers (this is following the example from lonboard, we might just be able to pass the original set of centroids once. The scatterplot layers dont care about order or linking, they just draw the OD points)

source_gdf = gpd.GeoDataFrame(geometry=g.adjacency.reset_index()["focal"], crs=4326)
target_gdf = gpd.GeoDataFrame(geometry=g.adjacency.reset_index()["neighbor"], crs=4326)

tgt = ScatterplotLayer.from_geopandas(
    target_gdf,
    radius_scale=30,
    pickable=True,
    stroked=False,
    filled=True,
    line_width_min_pixels=2,
    extensions=[brushing_extension],
    brushing_radius=brushing_radius,
)

src = ScatterplotLayer.from_geopandas(
    source_gdf,
    radius_scale=15,
    pickable=False,
    stroked=False,
    filled=True,
    line_width_min_pixels=2,
    extensions=[brushing_extension],
    brushing_radius=brushing_radius,
)

Also show the original boundaries for context

bounds = PolygonLayer.from_geopandas(
    df,
    get_fill_color=[255, 255, 255, 200],
    stroked=True,
    line_width_min_pixels=0.5,
    pickable=False,
)

And make the representative points visible on the map even when the graph is not shown. This makes it easier to find and mouseover the nodes

ogb = ScatterplotLayer.from_geopandas(
    source_gdf,
    get_fill_color=[255, 255, 255, 200],
    stroked=True,
    line_width_min_pixels=0.5,
    pickable=False,
)

gmap = Map(layers=[bounds, src, tgt, arc_layer,ogb], picking_radius=500)

gmap.set_view_state(**dict(pitch=55))

gmap.to_html('rook_map.html')

# the map has trouble rendering inside a jupyter widget when published via quarto
#gmap.as_html()

That works!

See the full-sized map here note: take you cursor off the map entirely to show all the links

(option + click to change the orientation).

You have to mouseover the centroids not the polygons, so it can be hard to find the node you want. Also not sure if we need layers for both source and target if its ultimately the same point set. Brushing scale factor is hard to set blindly

Flow Graph

That works, so lets do something more intense

from geosnap import DataStore
from geosnap import io as gio

datasets = DataStore()

defs = datasets.msa_definitions()

First lets get tract boundaries and LODES data for the DC region. That’s a pain because LODES is provided in CSV format by state and the DC region spans four states. It’s provided at the block level too, which is very high resolution, so this is a lot of data to read in once then cut down

stcos = defs[defs["CBSA Title"].str.contains("DC")].stcofips.astype(str)

states = stcos.str[:2].unique()

states

array(['24', '11', '51', '54'], dtype=object)

# need the main and aux tables. Main includes only intra-state flows.
# aux includes flows into the state from other states.
ods = []
for state in ["md", "dc", "va", "wv"]:
    ods.append(
        pd.read_csv(
            f"https://lehd.ces.census.gov/data/lodes/LODES7/{state}/od/{state}_od_main_JT00_2019.csv.gz"
        )
    )
    ods.append(
        pd.read_csv(
            f"https://lehd.ces.census.gov/data/lodes/LODES7/{state}/od/{state}_od_aux_JT00_2019.csv.gz"
        )
    )

dfs = pd.concat(ods)

dfs[["w_geocode", "h_geocode"]] = dfs[["w_geocode", "h_geocode"]].astype(str)
dfs["S000"] = dfs["S000"].astype(float)

trim this down to tracts for ease of presentation. I think lonboard can handle the size of the block-level OD matrix, but there are so many arcs the map becomes indecipherable

dfs["w_tract"] = dfs["w_geocode"].str[:11]
dfs["h_tract"] = dfs["h_geocode"].str[:11]

# S000 is now flows at the tract level
dfs = dfs.groupby(["w_tract", "h_tract"])["S000"].sum().reset_index()

Now we keep only flows that begin and end inside the DC region (alternatively we could add a single node that represents “external flows” or something), then pivot that into a sparse matrix and instantiate a Graph (indexed by the tract IDs)

# only keep flows inside the region
dfs = dfs[
    (dfs.w_tract.astype(str).str[:5].isin(stcos))
    & (dfs.h_tract.astype(str).str[:5].isin(stcos))
]

dfs = dfs.rename(columns={"w_tract": "w_geocode", "h_tract": "h_geocode"})

tracts = gio.get_acs(datasets, years=2019, level="tract", msa_fips="47900")

ids = tracts.geoid.values

dc_flows_sparse = (
    dfs.set_index(["w_geocode", "h_geocode"])["S000"]
    .astype("Sparse[float]")
    .reindex(ids, level=0)
    .reindex(ids, level=1)
    .sparse.to_coo(sort_labels=True)[0]
)

flow_g = Graph.from_sparse(dc_flows_sparse, ids=ids)

flow_g

<Graph of 1361 nodes and 498263 nonzero edges indexed by
 ['11001000100', '11001000201', '11001000202', '11001000300', '1100100040...]>

flow g represents the daily commutes between blocks. The weight holds the count of commuters. Now we need to replace the focal and neighbor IDs with their geometries

adj = flow_g.adjacency.reset_index()

Now we swap the tract geoids for their geometries and recreate the Graph object (getting us back to the point-to-point matrix representation)

adj = adj.merge(
    tracts.set_index("geoid")["geometry"].representative_point().rename("focal_geom"),
    left_on="focal",
    right_index=True,
)
adj = adj.merge(
    tracts.set_index("geoid")["geometry"]
    .representative_point()
    .rename("neighbor_geom"),
    left_on="neighbor",
    right_index=True,
)

flow_g = Graph.from_adjacency(adj, focal_col="focal_geom", neighbor_col="neighbor_geom")

flow_g.adjacency.sum()

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[1], line 1
----> 1 flow_g.adjacency.sum()

NameError: name 'flow_g' is not defined

flow_g.pct_nonzero

26.899495281865292

27% nonzero is a very dense graph

flow_g.adjacency

focal                       neighbor                  
POINT (-77.05916 38.90522)  POINT (-77.05916 38.90522)    131.0
                            POINT (-77.07334 38.90902)      2.0
                            POINT (-77.06668 38.90913)     49.0
                            POINT (-77.07558 38.91751)     85.0
                            POINT (-77.06624 38.92443)      1.0
                                                          ...  
POINT (-77.72566 38.30944)  POINT (-77.61238 38.25595)      2.0
                            POINT (-77.87730 38.45920)      2.0
                            POINT (-77.72566 38.30944)      6.0
POINT (-76.46551 38.51178)  POINT (-76.46551 38.51178)      0.0
POINT (-77.29484 38.43915)  POINT (-77.29484 38.43915)      0.0
Name: weight, Length: 498265, dtype: float64

Now we just follow the same steps above to create a vizualization of the Graph

table = pa.table({"value": flow_g.adjacency.values**0.75})

origins = gpd.GeoSeries(flow_g.adjacency.reset_index().focal).get_coordinates().values

destinations = (
    gpd.GeoSeries(flow_g.adjacency.reset_index().neighbor).get_coordinates().values
)

origins.shape

(498265, 2)

destinations.shape

(498265, 2)

import numpy as np
from lonboard import basemap

brushing_extension = BrushingExtension()
brushing_radius = 250

The difference here is we have weighted edges in this Graph, so we will color the arcs proportional to the level of flow between O and D (or i and j if you prefer)

edge_colors = get_color_array(flow_g.adjacency.values, cmap='YlOrBr', scheme='quantiles', k=8)

/Users/knaaptime/Dropbox/projects/mapclassify/mapclassify/classifiers.py:1653: UserWarning: Not enough unique values in array to form 8 classes. Setting k to 6.
  self.bins = quantile(y, k=k)

# color and width should scale to reflect volume of the flow

arc_layer = ArcLayer(
    table=table,
    get_source_position=origins,
    get_target_position=destinations,
    get_source_color=edge_colors,
    get_target_color=edge_colors,
    get_width=table["value"],
    opacity=0.2,
    pickable=True,
    extensions=[brushing_extension],
    brushing_radius=brushing_radius,
)

origin_bounds = tracts[tracts.geoid.isin(adj.focal)]

source_gdf = gpd.GeoDataFrame(geometry=flow_g.adjacency.reset_index()["focal"], crs=4326)
target_gdf = gpd.GeoDataFrame(geometry=flow_g.adjacency.reset_index()["neighbor"], crs=4326)

ogb = ScatterplotLayer.from_geopandas(
    origin_bounds.set_geometry(origin_bounds.representative_point())[['geometry']],
    radius_min_pixels=1.5,
    radius_max_pixels=10,
    get_fill_color=[0, 0, 0, 255],
    stroked=True,
    line_width_min_pixels=0.5,
    pickable=False,
)

/Users/knaaptime/mambaforge/envs/geosnap/lib/python3.11/site-packages/lonboard/_geoarrow/ops/reproject.py:97: UserWarning: Input being reprojected to EPSG:4326 CRS
  warnings.warn("Input being reprojected to EPSG:4326 CRS")

bounds = PolygonLayer.from_geopandas(
    tracts[['geometry']],
    get_fill_color=[255, 255, 255, 100],
    stroked=True,
    line_width_min_pixels=0.5,
    pickable=False,
)

/Users/knaaptime/mambaforge/envs/geosnap/lib/python3.11/site-packages/lonboard/_geoarrow/ops/reproject.py:97: UserWarning: Input being reprojected to EPSG:4326 CRS
  warnings.warn("Input being reprojected to EPSG:4326 CRS")

tgt = ScatterplotLayer.from_geopandas(
    target_gdf,
    radius_scale=60,
    pickable=True,
    stroked=False,
    filled=True,
    line_width_min_pixels=2,
    extensions=[brushing_extension],
    brushing_radius=brushing_radius,
)

/Users/knaaptime/mambaforge/envs/geosnap/lib/python3.11/site-packages/lonboard/_geoarrow/ops/reproject.py:97: UserWarning: Input being reprojected to EPSG:4326 CRS
  warnings.warn("Input being reprojected to EPSG:4326 CRS")

src = ScatterplotLayer.from_geopandas(
    source_gdf,
    radius_scale=60,
    pickable=True,
    stroked=True,
    filled=True,
    line_width_min_pixels=2,
    extensions=[brushing_extension],
    brushing_radius=brushing_radius,
)

/Users/knaaptime/mambaforge/envs/geosnap/lib/python3.11/site-packages/lonboard/_geoarrow/ops/reproject.py:97: UserWarning: Input being reprojected to EPSG:4326 CRS
  warnings.warn("Input being reprojected to EPSG:4326 CRS")

This map shows flows into each tract (though, like, this commute goes both directions each day. This shows the morning commute). …wait, that’s not the right way to say that either… My point is the mouseover is keyed on workplace, so we’re looking at imports to job centers, not exodus from population centers.

The dot shows where you need to mouseover to show the arcs for that observation. It can be hard to nail down the exact spot

gmap = Map(
    layers=[bounds, ogb, tgt, src, arc_layer],
    picking_radius=1000,
    basemap_style=basemap.CartoBasemap.Positron,
)

gmap.set_view_state(**dict(pitch=25, bearing=-40,))

gmap.to_html('dc_commutes.html')

# again, the map has trouble loading inside a cell when published
#gmap

There we go.

See the full sized map here (again, remove your cursor from the entire map to show all the links)

Tract-level flows in the DC region with color and size scaled by commute volume (…assuming every paid of O-D generates a single trip…)

The map is a lot to take in. Here are a few screenshots of places:

Fredericksburg, VA

Fredericksburg is an old and pretty self-contained place. It gets large flows from nearby, and a bunch of little flows from relatively close. Basically nothing from Maryland except a smattering from inner Prince George’s County

Alexandria, VA

Alexandria, interestingly, draws lots of little trips from all over the place (hence all the little yellow lines and no thick ones), though they’re all fairly local

College Park, MD

College Park (go Terps) pulls really heavily from MD, especially nearby in College Park, but has enough gravity to pull a lot from Frederick and even a decent amount from VA

Capitol Hill

The largest employment center in the region, downtown DC pulls large flows from everywhere (this is that node in the harbor)

Frederick, MD

Similar to Fredericksburg on the other side of the metro, downtown Frederick Maryland is pretty self contained but also gets lots of small trips from both VA and northern Montgomery and PG counties

Bethesda, MD

Screenshot 2024-07-25 at 10.03.48 AM.png

Home to NIH and several other large employers, Bethesda pulls from all over the region as well, but particularly dstrongly from the I-270 corridor (unsurprisingly)

(…i have no idea if this is the tract with NIH in it–probably not. But that’s part of the agglomeration economy that makes Bethesda pop out as an employment center)

Conclusion

There’s still a lot of work to do if we want to encapsulate this into something like a re-useable method (ArcLayer isn’t even stable over in lonboard)–but the bones are there! More importantly, this was really easy to do with the Graph. Although we designed it for exactly this kind of flexibility, it’s great to see that intention pan out in practice. I hope this also helps demonstrate how useful the spatial Graph is, as a general data structure. This used to be a \(W\), that was not used terribly widely, outside of spatial econometrics (or the odd Moran’s I computation), so as part of developing the new Graph class, we’ve tried to demonstrate the utility of a first-class object that encodes observed or hypothesized spatial interactions e.g. to study things like segregation measurement, economic development, neighborhood change, or urban morphology among many others.

And now, visualizing O-D travel matrices in three dimensions :) (obviously you dont need a Graph to do that, but I sure found it an easy and efficient way to get the data into the proper structure)

All of the data collection and computation in this notebook is pretty quick, all things told. It would be pretty simple to wrap this up into an interface usign something like streamlit so you could look at flows in any metro in the country by selecting from a dropdown…

Reuse

CC BY-NC-ND 4.0

Plotting the Graph