Areal interpolation is an exceedingly useful technique in spatial science used to transfer data collected for one geographic representation into another representation. This process is sometimes called “change of support” because the analyst is altering the boundaries (support) of the dataset referring to a particular place. A common example is converting between different polygonal representations of the same study area, for example moving between administrative zones (such as from 2020 census tracts to 2010 census tracts), or moving from an irregular polygon representation (like a county boundary) to a standardized one (like a regular grid).
We typically call this process ‘areal interpolation’ as opposed to ‘spatial interpolation’ because we are moving data between discrete entities rather than approximating a continuous surface. This distinction is critically important in geography because physical geographers often examine continuous spatial processes (like air temperature, elevation, or water pressure), whereas human geographers face a different constraint: people are not distributed continuously across a plane, but rather cluster together in discrete locations. Thus, our estimates often refer to very specific zones, and we need a way to move data into another representation while maintaining the integrity of the original measurement.
“The ability of some GIS to reaggregate data from one set of areal units to another makes the modifiable areal unit problem even more important”
import contextily as ctx
import geopandas as gpd
import geoviews as gv
import pandas as pd
import hvplot.pandas
import matplotlib.pyplot as plt
import numpy as np
import pydeck as pdk
from geosnap import DataStore
from geosnap import io as gio

# from lonboard import Map, PolygonLayer, viz
from mapclassify.util import get_color_array
from tobler.area_weighted import area_interpolate
from tobler.dasymetric import masked_area_interpolate, extract_raster_features
from tobler.pycno import pycno_interpolate
from tobler.util import h3fy
from matplotlib.colors import to_hex
from matplotlib.lines import Line2D
from matplotlib.patches import Patch

%load_ext jupyter_black
%load_ext watermark

%watermark -a 'eli knaap' -iv
Author: eli knaap
tobler : 0.12.1.dev2+g4eba45f.d20250123
geopandas : 1.0.1
pydeck : 0.8.0b4
contextily : 1.6.2
geosnap : 0.14.1.dev14+g0443e2a.d20250103
numpy : 1.26.4
geoviews : 1.14.0
mapclassify: 2.8.2.dev2+ga421b4b.d20250107
pandas : 2.2.3
matplotlib : 3.10.0
hvplot : 0.11.2
Let’s say we want to represent median income and total population using a regular hexgrid, but the only data available is provided at the census-tract level. Since tract boundaries do not nest inside the hexgrid cells, we can’t simply aggregate the data upward; we need some way to transfer information between the two sets of geographic units. The tobler package provides several different ways to estimate data collected at one geography using the boundaries of a different geography.
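The hexgrid itself can be generated directly from the source data. As a minimal sketch, assuming `dc` is the tract-level GeoDataFrame loaded earlier in the notebook, tobler’s `h3fy` helper builds an H3 hexgrid covering its extent:

# a sketch: generate an H3 hexgrid covering the extent of the tract data;
# `dc` is assumed to be the tract-level GeoDataFrame loaded earlier, and
# `resolution` controls cell size (higher values give smaller hexagons)
hexes = h3fy(dc, resolution=8)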
The simplest technique available in tobler is simple areal interpolation, in which variables from the source data are weighted according to the overlap between source and target polygons, then reaggregated to fit the target polygon geometries. The critical assumption in area-weighted interpolation is that each variable of interest is distributed uniformly across the source geometry. In practice, that is rarely true, but there are ways to relax that assumption (discussed in the next section).
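Formally, for an extensive variable \(y\) (a count or total) and an intensive variable \(x\) (a rate or average), the standard area-weighted estimates for target zone \(j\) can be written as

\[
\hat{y}_j = \sum_i y_i \frac{A_{ij}}{A_i}, \qquad \hat{x}_j = \sum_i x_i \frac{A_{ij}}{A_j},
\]

where \(A_{ij}\) is the area of overlap between source zone \(i\) and target zone \(j\), \(A_i\) is the area of source zone \(i\), and \(A_j\) is the area of target zone \(j\). An extensive total is apportioned by each overlap’s share of the source zone, while an intensive value is averaged using each overlap’s share of the target zone.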
A good way to think about this is with the figure from a classic Tobler paper on the subject. Imagine you are trying to transfer data collected at the state level into some other meso-geography, like a PUMA or a large hexgrid. The assumption we make with area-weighted interpolation is that each variable is distributed uniformly across space within each state.
Population Density by State as shown in Tobler (1979)
California Population by County
That is problematic because, first, we are encoding both the ecological fallacy (Openshaw, 1984) and the modifiable areal unit problem (Manley, 2014): we are literally making estimates about small areas based only on data collected at a different resolution. And second, we know the source data is not actually uniform. Do we really think Texas or California or Florida has a constant population density across the entire state? Of course not. But all we know for certain is the population density collected at the state level, so that is the information we use to distribute values to the other geographic units. It is important to think critically about when the uniformity assumption is likely to be badly violated.
Another important consideration is the difference in size between the source and target geometries.
If source \(\gg\) target, then the interpolation operation is effectively downscaling the original information, which will induce a lot of spatial autocorrelation: target units falling inside a single source geometry all get assigned the same value (a target polygon nested entirely within one source polygon simply inherits that polygon’s value). This is one way ‘spatial dependence’ can be induced as a consequence of measurement error (Anselin, 1988).
If target \(\gg\) source, then the operation is effectively aggregating source polygons, in which case the estimates should be fairly accurate, but at the cost of lost information: you are throwing away variation from the higher-resolution data.
Code
results = area_interpolate(
    source_df=dc,
    target_df=hexes,
    intensive_variables=["median_household_income"],
    extensive_variables=["n_total_pop"],
)

fig, ax = plt.subplots(1, 2, figsize=(6, 4))
dc.to_crs(3857).plot("n_total_pop", scheme="quantiles", ax=ax[0], alpha=0.7)
results.to_crs(3857).plot("n_total_pop", scheme="quantiles", ax=ax[1], alpha=0.7)
ax[0].set_title("original")
ax[1].set_title("interpolated")
for a in ax:
    a.axis("off")
    ctx.add_basemap(ax=a, source=ctx.providers.CartoDB.Positron)
fig.suptitle("Total Population (extensive)")
plt.tight_layout()
/var/folders/j8/5bgcw6hs7cqcbbz48d6bsftw0000gp/T/ipykernel_44098/3164303058.py:1: UserWarning: Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.
results = area_interpolate(
/Users/knaaptime/Dropbox/projects/tobler/tobler/area_weighted/area_interpolate.py:325: UserWarning: Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.
den = source_df.area.values
/Users/knaaptime/Dropbox/projects/tobler/tobler/util/util.py:60: UserWarning: nan values in variable: median_household_income, replacing with 0
warn(f"nan values in variable: {column}, replacing with 0")
fig, ax = plt.subplots(1, 2, figsize=(8, 6))
dc.to_crs(3857).plot(
    "median_household_income", scheme="quantiles", cmap="magma", ax=ax[0]
)
results.to_crs(3857).plot(
    "median_household_income", scheme="quantiles", cmap="magma", ax=ax[1]
)
ax[0].set_title("original")
ax[1].set_title("interpolated")
for a in ax:
    a.axis("off")
    ctx.add_basemap(ax=a, source=ctx.providers.CartoDB.Positron)
fig.suptitle("Median Income (intensive)")
plt.tight_layout()
It can be useful to create linked interactive maps to see the change from one geometry to the other. To use hvplot effectively, we also need to classify the data manually using mapclassify.
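A minimal sketch of that workflow, assuming the `dc` and `results` frames from above: compute quantile bins once with mapclassify so both maps share the same classification, then place the two hvplot maps in a layout, which links panning and zooming between them.

import mapclassify

# classify the source data once so both maps share identical bins
classifier = mapclassify.Quantiles(dc["median_household_income"].dropna(), k=5)
dc["income_bin"] = classifier.find_bin(dc["median_household_income"].fillna(0))
results["income_bin"] = classifier.find_bin(results["median_household_income"])

# `+` places the maps side by side; holoviews links pan/zoom across the layout
(
    dc.hvplot.polygons(c="income_bin", cmap="magma", geo=True, tiles="CartoLight")
    + results.hvplot.polygons(c="income_bin", cmap="magma", geo=True, tiles="CartoLight")
)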
The National Land Cover Database (NLCD) provides land cover data for the U.S. at a 30m grid resolution. Sometimes we want to know how much land of each cover type lies inside some other geographic unit, like a census block.
Code
vals = datasets.nlcd_definitions()
vals
|    | code | color (R,G,B) | classification | description |
|---:|-----:|---------------|----------------|-------------|
| 0  | 11 | 70,107,159  | Open Water | areas of open water, generally with less than ... |
| 1  | 12 | 209,222,248 | Perennial Snow/Ice | areas characterized by a perennial cover of ic... |
| 2  | 21 | 222,197,197 | Developed, Open Space | areas with a mixture of some constructed mater... |
| 3  | 22 | 217,146,130 | Developed, Low Intensity | areas with a mixture of constructed materials ... |
| 4  | 23 | 235,0,0     | Developed, Medium Intensity | areas with a mixture of constructed materials ... |
| 5  | 24 | 171,0,0     | Developed High Intensity | highly developed areas where people reside or ... |
| 6  | 31 | 179,172,159 | Barren Land (Rock/Sand/Clay) | areas of bedrock, desert pavement, scarps, tal... |
| 7  | 41 | 104,171,95  | Deciduous Forest | areas dominated by trees generally greater tha... |
| 8  | 42 | 28,95,44    | Evergreen Forest | areas dominated by trees generally greater tha... |
| 9  | 43 | 181,197,143 | Mixed Forest | areas dominated by trees generally greater tha... |
| 10 | 52 | 204,184,121 | Shrub/Scrub | areas dominated by shrubs; less than 5 meters ... |
| 11 | 71 | 223,223,194 | Grassland/Herbaceous | areas dominated by gramanoid or herbaceous veg... |
| 12 | 81 | 220,217,57  | Pasture/Hay | areas of grasses, legumes, or grass-legume mix... |
| 13 | 82 | 171,108,40  | Cultivated Crops | areas used for the production of annual crops,... |
| 14 | 90 | 184,217,235 | Woody Wetlands | areas where forest or shrubland vegetation acc... |
| 15 | 95 | 108,159,184 | Emergent Herbaceous Wetlands | areas where perennial herbaceous vegetation ac... |
Code
nlcd_path = (
    "https://s3-us-west-2.amazonaws.com/mrlc/Annual_NLCD_LndCov_2023_CU_C1V0.tif"
)
dclu = extract_raster_features(
    dc_blocks, nlcd_path, pixel_values=vals["code"].tolist(), collapse_values=False
)
dclu["value"] = dclu["value"].replace(
    vals.set_index("code")["classification"].to_dict()
)
dclu = dclu.to_crs(dc_blocks.crs)

# convert the column of "R,G,B" color strings to hex colors
vals["hexcolor"] = vals["color"].apply(
    lambda x: to_hex(np.fromstring(x, sep=",") / 255)
)

# align colors with values using a left-join
hexcolors = dclu.merge(
    vals, left_on="value", right_on="classification", how="left"
)["hexcolor"].values

# plot the map, passing the colors manually
ax = dclu.plot(color=hexcolors)

# create a color patch/label for each entry in the nlcd definitions table
legend_elements = [
    Patch(
        facecolor=row[1]["hexcolor"],
        edgecolor="gray",
        label=row[1]["classification"],
    )
    for row in vals.iterrows()
]

# create legend and move it to the right of the map
ax.legend(handles=legend_elements).set_bbox_to_anchor([2, 1, 0, 0])
ax
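With the land cover polygons in hand, tabulating how much of each class falls inside each census block is itself a change-of-support problem. The sketch below uses `area_interpolate` with its `categorical_variables` argument, which returns one column per land cover class containing that class’s areal share of each target polygon (the `lu_by_block` name is just illustrative):

# tabulate the areal share of each land cover class within each block;
# `value` holds the class labels assigned above, and dclu is already in
# the same CRS as dc_blocks
lu_by_block = area_interpolate(
    source_df=dclu,
    target_df=dc_blocks,
    categorical_variables=["value"],
)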