5  Areal Interpolation

Areal interpolation is an exceedingly useful technique in spatial science for transferring data collected under one geographic representation to another. This process is sometimes called “change of support” because the analyst is altering the boundaries (the support) of the dataset referring to a particular place. A common example is converting between different polygonal representations of the same study area, for example moving between administrative zones (such as from 2020 census tracts to 2010 census tracts), or moving from an irregular polygon representation (like a county boundary) to a standardized one (like a regular grid).

We typically call this process ‘areal interpolation’ rather than ‘spatial interpolation’ because we are moving data between discrete entities rather than approximating a continuous surface. This distinction is critically important in geography: physical geographers often examine continuous spatial processes (like air temperature, elevation, or water pressure), whereas human geographers face a discrete reality. We know that people are not distributed continuously across a plane, but instead cluster together in discrete locations. Thus, our estimates often refer to very specific zones, and we need a way to move data to another representation while maintaining the integrity of the original measurement.

“The ability of some GIS to reaggregate data from one set of areal units to another makes the modifiable areal unit problem even more important”

Fotheringham et al. (2007, p. 29)

Code
import contextily as ctx
import geopandas as gpd
import geoviews as gv
import pandas as pd
import hvplot.pandas
import matplotlib.pyplot as plt
import numpy as np
import pydeck as pdk
from geosnap import DataStore
from geosnap import io as gio
#from lonboard import Map, PolygonLayer, viz
from mapclassify.util import get_color_array
from tobler.area_weighted import area_interpolate
from tobler.dasymetric import masked_area_interpolate, extract_raster_features
from tobler.pycno import pycno_interpolate
from tobler.util import h3fy
from matplotlib.colors import to_hex
from matplotlib.lines import Line2D
from matplotlib.patches import Patch

%load_ext jupyter_black
%load_ext watermark
%watermark -a 'eli knaap' -iv
Author: eli knaap

tobler     : 0.12.1.dev2+g4eba45f.d20250123
geopandas  : 1.0.1
pydeck     : 0.8.0b4
contextily : 1.6.2
geosnap    : 0.14.1.dev14+g0443e2a.d20250103
numpy      : 1.26.4
geoviews   : 1.14.0
mapclassify: 2.8.2.dev2+ga421b4b.d20250107
pandas     : 2.2.3
matplotlib : 3.10.0
hvplot     : 0.11.2

Let’s say we want to represent median income and total population using a regular hexgrid, but the only data available are provided at the census-tract level. Since the hexgrid does not nest within the tracts, we can’t simply aggregate the data upward into our new boundaries; we need some way to transfer information between the two sets of geographic units. The tobler package provides several different ways to estimate data collected at one geography using the boundaries of a different geography.

Code
datasets = DataStore()

dc = gio.get_acs(datasets, state_fips="11", years=[2019], level="tract")
hexes = h3fy(dc, resolution=8)

f, ax = plt.subplots(1, 2, figsize=(8, 4))

dc.plot("n_total_pop", ax=ax[0])
hexes.plot(ax=ax[1])

5.1 Area-Weighted Interpolation

The simplest technique available in tobler is area-weighted interpolation, in which variables from the source data are weighted according to the overlap between source and target polygons, then reaggregated to fit the target polygon geometries. The critical assumption of area-weighted interpolation is that the variable of interest is distributed uniformly across each source geometry. In practice, that is rarely true, but there are some ways to relax that assumption (covered in the next section).
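
Concretely, let \(|A_s \cap A_t|\) denote the area of overlap between source zone \(s\) and target zone \(t\). An extensive variable \(y\) (a count, like total population) is apportioned by the share of each source zone falling inside the target, while an intensive variable \(x\) (a rate or average, like median income) is a weighted average using the share of the target zone contributed by each source:

\[
\hat{y}_t = \sum_s y_s \frac{|A_s \cap A_t|}{|A_s|}, \qquad \hat{x}_t = \sum_s x_s \frac{|A_s \cap A_t|}{|A_t|}
\]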

A good way to think about this is with the figure from a classic Tobler paper on the subject. Imagine you were trying to transfer data collected at the state level to some other meso-geography, like a PUMA or a large hexgrid. The assumption we make with area-weighted interpolation is that each variable is distributed uniformly across space within each state.

Population Density by State as shown in Tobler (1979)

California Population by County

That is problematic for two reasons. First, we are encoding both the ecological fallacy (Openshaw, 1984) and the modifiable areal unit problem (Manley, 2014), because we are making estimates about small areas based only on data collected at a different resolution. Second, we know the source data are not actually uniform. Do we really think all of Texas or California or Florida has a constant population density across the entire state? Of course not. But all we know for certain is the population density collected at the state level, so that is the information we use to distribute values to the other geographic units. It is important to be critical about when the uniformity assumption is likely to be badly violated.

Another important consideration is the difference in size between the source and target geometries (a quick way to check which regime applies is sketched after this list).

  • If source \(\gg\) target, then the interpolation operation is effectively downscaling the original information, which will induce a lot of autocorrelation (because target units falling inside a single source geometry all get assigned the same value). This is one way you can induce ‘spatial dependence’ as a consequence of measurement error (Anselin, 1988).
  • If target \(\gg\) source, then the operation is effectively aggregating source polygons, in which case the estimates should be pretty accurate, but at the cost of lost information: you are throwing away variation from the higher-resolution data.
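
A minimal sketch of that check, using the dc and hexes frames created above (areas must be computed in a projected CRS, not in degrees):

Code
# compare typical unit sizes to see whether we are downscaling or aggregating
utm = dc.estimate_utm_crs()
ratio = dc.to_crs(utm).area.mean() / hexes.to_crs(utm).area.mean()
print(f"average source unit is {ratio:.1f}x the size of the average target unit")
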
Code
# project both layers to a shared projected CRS so overlap areas are
# computed in meters rather than degrees
dc = dc.to_crs(dc.estimate_utm_crs())
hexes = hexes.to_crs(dc.crs)

results = area_interpolate(
    source_df=dc,
    target_df=hexes,
    intensive_variables=["median_household_income"],
    extensive_variables=["n_total_pop"],
)

fig, ax = plt.subplots(1, 2, figsize=(6, 4))

dc.to_crs(3857).plot("n_total_pop", scheme="quantiles", ax=ax[0], alpha=0.7)
results.to_crs(3857).plot("n_total_pop", scheme="quantiles", ax=ax[1], alpha=0.7)

ax[0].set_title("original")
ax[1].set_title("interpolated")
for a in ax:
    a.axis("off")
    ctx.add_basemap(ax=a, source=ctx.providers.CartoDB.Positron)

fig.suptitle("Total Population (extensive)")
plt.tight_layout()
/Users/knaaptime/Dropbox/projects/tobler/tobler/util/util.py:60: UserWarning: nan values in variable: median_household_income, replacing with 0
  warn(f"nan values in variable: {column}, replacing with 0")

Code
dc.to_crs(4326).hvplot(
    c="n_total_pop",
    cmap="viridis",
    line_width=0.1,
    alpha=0.7,
    geo=True,
    tiles="CartoLight",
    xaxis=False,
    yaxis=False,
    colorbar=False,
    frame_height=500,
) + results.hvplot(
    c="n_total_pop",
    cmap="viridis",
    line_width=0.1,
    alpha=0.7,
    geo=True,
    tiles="CartoLight",
    xaxis=False,
    yaxis=False,
    colorbar=False,
    frame_height=500,
).opts(
    active_tools=["wheel_zoom"]
)
Code
fig, ax = plt.subplots(1, 2, figsize=(8, 6))

dc.to_crs(3857).plot(
    "median_household_income", scheme="quantiles", cmap="magma", ax=ax[0]
)
results.to_crs(3857).plot(
    "median_household_income", scheme="quantiles", cmap="magma", ax=ax[1]
)

ax[0].set_title("original")
ax[1].set_title("interpolated")
for a in ax:
    a.axis("off")
    ctx.add_basemap(ax=a, source=ctx.providers.CartoDB.Positron)

fig.suptitle("Median Income (intensive)")
plt.tight_layout()

It can be useful to create linked interactive maps to see the change from one geometry to the other. To use hvplot effectively, we also need to classify the data manually using mapclassify.

Code
from mapclassify import Quantiles

qmi_orig = Quantiles(dc.median_household_income.dropna().values)
qmi_interp = Quantiles(results.median_household_income.values)
dc.to_crs(4326).dropna(subset="median_household_income").assign(
    label=qmi_orig.yb
).hvplot(
    c="label",
    cmap="magma",
    line_width=0.1,
    alpha=0.7,
    geo=True,
    tiles="CartoLight",
    xaxis=False,
    yaxis=False,
    colorbar=False,
    frame_height=500,
) + results.assign(
    label=qmi_interp.yb
).hvplot(
    c="label",
    cmap="magma",
    line_width=0.1,
    alpha=0.7,
    geo=True,
    tiles="CartoLight",
    xaxis=False,
    yaxis=False,
    colorbar=False,
    frame_height=500,
).opts(
    active_tools=["wheel_zoom"]
)

5.2 Categorical Variables

5.2.1 Land-Use by Census Block

Code
dc_blocks = datasets.blocks_2020(states=["11"])
dc_blocks = dc_blocks.to_crs(dc_blocks.estimate_utm_crs())
dc_blocks.plot()

The National Land Cover Database (NLCD) provides land-cover data for the U.S. at a 30m grid resolution. Sometimes we want to know how much of each land-cover class falls inside some other geographic unit, like a census block. The NLCD classification scheme is shown below.

Code
vals = datasets.nlcd_definitions()
vals
code color classification description
0 11 70,107,159 Open Water areas of open water, generally with less than ...
1 12 209,222,248 Perennial Snow/Ice areas characterized by a perennial cover of ic...
2 21 222,197,197 Developed, Open Space areas with a mixture of some constructed mater...
3 22 217,146,130 Developed, Low Intensity areas with a mixture of constructed materials ...
4 23 235,0,0 Developed, Medium Intensity areas with a mixture of constructed materials ...
5 24 171,0,0 Developed High Intensity highly developed areas where people reside or ...
6 31 179,172,159 Barren Land (Rock/Sand/Clay) areas of bedrock, desert pavement, scarps, tal...
7 41 104,171,95 Deciduous Forest areas dominated by trees generally greater tha...
8 42 28,95,44 Evergreen Forest areas dominated by trees generally greater tha...
9 43 181,197,143 Mixed Forest areas dominated by trees generally greater tha...
10 52 204,184,121 Shrub/Scrub areas dominated by shrubs; less than 5 meters ...
11 71 223,223,194 Grassland/Herbaceous areas dominated by gramanoid or herbaceous veg...
12 81 220,217,57 Pasture/Hay areas of grasses, legumes, or grass-legume mix...
13 82 171,108,40 Cultivated Crops areas used for the production of annual crops,...
14 90 184,217,235 Woody Wetlands areas where forest or shrubland vegetation acc...
15 95 108,159,184 Emergent Herbaceous Wetlands areas where perennial herbaceous vegetation ac...
Code
nlcd_path = (
    "https://s3-us-west-2.amazonaws.com/mrlc/Annual_NLCD_LndCov_2023_CU_C1V0.tif"
)
dclu = extract_raster_features(
    dc_blocks, nlcd_path, pixel_values=vals["code"].tolist(), collapse_values=False
)
dclu["value"] = dclu["value"].replace(
    vals.set_index("code")["classification"].to_dict()
)
dclu = dclu.to_crs(dc_blocks.crs)

# convert the column of "R,G,B" color strings to hex colors
vals["hexcolor"] = vals["color"].apply(
    lambda x: to_hex(np.array(x.split(","), dtype=float) / 255)
)
# align colors with values using a left-join
hexcolors = dclu.merge(vals, left_on="value", right_on="classification", how="left")[
    "hexcolor"
].values
# plot the map, passing the colors manually
ax = dclu.plot(color=hexcolors)
# create a colorpatch/label for each entry in the nlcd definitions table
legend_elements = [
    Patch(
        facecolor=row[1]["hexcolor"], edgecolor="gray", label=row[1]["classification"]
    )
    for row in vals.iterrows()
]
# create the legend and move it to the right of the map
ax.legend(
    handles=legend_elements,
    bbox_to_anchor=(1.05, 1),
    loc="upper left",
)
ax

Code
dclu.head(10)
value geometry
0 Developed, Open Space POLYGON ((323226.647 4318276.105, 323256.195 4...
1 Developed, Open Space POLYGON ((323183.792 4318224.692, 323213.34 43...
2 Developed, Low Intensity POLYGON ((323331.53 4318191.928, 323390.626 43...
3 Developed, Open Space POLYGON ((323140.937 4318173.279, 323170.484 4...
4 Developed, Open Space POLYGON ((323436.413 4318107.752, 323465.961 4...
5 Developed, Medium Intensity POLYGON ((323242.887 4318211.586, 323331.53 43...
6 Developed, Low Intensity POLYGON ((323282.021 4318111.533, 323311.569 4...
7 Developed, Open Space POLYGON ((323459.307 4318072.217, 323488.855 4...
8 Developed, Open Space POLYGON ((323104.735 4318150.849, 323134.283 4...
9 Developed, Open Space POLYGON ((323061.88 4318099.437, 323091.427 43...
Code
lu_blks = area_interpolate(
    source_df=dclu,
    target_df=dc_blocks.set_index("geoid"),
    categorical_variables=["value"],
)
lu_blks.columns = lu_blks.columns.str.replace("value_", "")
Code
lu_blks.head()
Developed, Open Space Developed, Low Intensity Developed, Medium Intensity Woody Wetlands Developed High Intensity Mixed Forest Deciduous Forest Shrub/Scrub Pasture/Hay Open Water Emergent Herbaceous Wetlands Grassland/Herbaceous Evergreen Forest Barren Land (Rock/Sand/Clay) geometry
geoid
110019800001089 0.207453 0.603459 0.135434 0.000000 0.053653 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 POLYGON ((322969.039 4306546.708, 322970.142 4...
110010056012005 0.000000 0.000000 0.502513 0.000000 0.497487 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 POLYGON ((321580.797 4307997.329, 321596.402 4...
110010005013003 0.062934 0.131362 0.476541 0.000000 0.328051 0.0 0.001112 0.0 0.0 0.0 0.0 0.0 0.0 0.0 POLYGON ((321758.637 4310250.787, 321759.503 4...
110010005011000 0.384194 0.402416 0.136406 0.000000 0.003791 0.0 0.073193 0.0 0.0 0.0 0.0 0.0 0.0 0.0 POLYGON ((321810.7 4311101.634, 321883.434 431...
110010111003035 0.347988 0.319033 0.147562 0.063681 0.021227 0.0 0.100508 0.0 0.0 0.0 0.0 0.0 0.0 0.0 POLYGON ((329084.877 4308909.414, 329085.096 4...
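
Each row now holds the share of a block covered by each land-cover class. As a quick sanity check (a minimal sketch), the shares should sum to roughly one wherever a block is fully covered by classified pixels:

Code
# categorical shares in each block should sum to ~1 where coverage is complete
lu_blks.drop(columns=["geometry"]).sum(axis=1).describe()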
Code
lu_blks.explore("Developed High Intensity", tiles="CartoDB Positron", style_kwds=dict(weight=0.5))

5.3 Pycnophylactic Interpolation

Pycnophylactic interpolation is based on Tobler’s (1979) method and does not distinguish between extensive and intensive variables.

“Census enumerations are usually packaged in irregularly shaped geographical regions. Interior values can be interpolated for such regions, without specification of “control points,” by using an analogy to elliptical partial differential equations. A solution procedure is suggested, using finite difference methods with classical boundary conditions. In order to estimate densities, an additional nonnegativity condition is required. Smooth contour maps, which satisfy the volume preserving and nonnegativity constraints, illustrate the method using actual geographical data. It is suggested that the procedure may be used to convert observations from one bureaucratic partitioning of a geographical area to another.”

Tobler (1979)

Tobler’s approach does not assume a uniform distribution within the source zones. Instead, it estimates a smooth density surface from the centroids of the input geometries, then uses the source boundaries as a constraint, iteratively updating the surface so that each source zone’s total volume is respected (i.e., the pycnophylactic property is preserved). This can have negative consequences when the surface is poorly estimated from the input centroids.
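
Formally, if \(\hat{z}_c\) is the estimated density in raster cell \(c\) and \(a\) is the cell area, the pycnophylactic (volume-preserving) constraint requires that the smoothed surface integrate back to the observed total \(y_s\) in every source zone \(s\):

\[
\sum_{c \in s} \hat{z}_c \, a = y_s
\]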

Code
hexes = hexes.to_crs(hexes.estimate_utm_crs())

dc = dc.to_crs(dc.estimate_utm_crs())
Code
dc_pycno = pycno_interpolate(
    source_df=dc,
    target_df=hexes,
    variables=["n_total_pop", "median_household_income"],
    cellsize=100,
    handle_null=True,
)
WARNING: nan_treatment='interpolate', however, NaN values detected post convolution. A contiguous region of NaN values, larger than the kernel size, are present in the input array. Increase the kernel size to avoid this. [astropy.convolution.convolve]
WARNING:astropy:nan_treatment='interpolate', however, NaN values detected post convolution. A contiguous region of NaN values, larger than the kernel size, are present in the input array. Increase the kernel size to avoid this.
WARNING: nan_treatment='interpolate', however, NaN values detected post convolution. A contiguous region of NaN values, larger than the kernel size, are present in the input array. Increase the kernel size to avoid this. [astropy.convolution.convolve]
WARNING:astropy:nan_treatment='interpolate', however, NaN values detected post convolution. A contiguous region of NaN values, larger than the kernel size, are present in the input array. Increase the kernel size to avoid this.
Code
fig, ax = plt.subplots(1, 2, figsize=(8, 6))

dc_pycno.plot("n_total_pop", scheme="quantiles", ax=ax[0])
dc.plot("n_total_pop", scheme="quantiles", ax=ax[1])

ax[0].set_title("interpolated")
ax[1].set_title("original")
for a in ax:
    a.axis("off")
fig.suptitle("Total Population (extensive)")
plt.tight_layout()
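
Because the method preserves volume within each source zone, the interpolated total should come out very close to the original total (up to raster discretization at the study-area edges). A quick check using the frames above:

Code
# the pycnophylactic property implies the citywide total is (nearly) preserved
print(dc.n_total_pop.sum(), dc_pycno.n_total_pop.sum())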

Code
fig, ax = plt.subplots(1, 2, figsize=(8, 6))

dc_pycno.plot("median_household_income", scheme="quantiles", cmap="magma", ax=ax[0])
dc.plot("median_household_income", scheme="quantiles", cmap="magma", ax=ax[1])

ax[0].set_title("interpolated")
ax[1].set_title("original")
for a in ax:
    a.axis("off")
fig.suptitle("Median Income (intensive)")
plt.tight_layout()