Areal interpolation is an exceedingly useful technique in spatial science for transferring data collected for one geographic representation into another. This process is sometimes called “change of support” because the analyst is altering the boundaries (support) of the dataset referring to a particular place. A common example is converting between different polygonal representations of the same study area, for example to move between administrative zones, such as from census tracts in 2020 to census tracts in 2010, or to move from an irregular polygon representation (like a county boundary) into a standardized one (like a regular grid).
We typically call this process ‘areal interpolation’ as opposed to ‘spatial interpolation’ because we are moving data between discrete entities rather than approximating a continuous surface. This distinction is critically important in geography because physical geographers often examine [unconstrained] continuous spatial processes (like air temperature, elevation, or water pressure), whereas human geographers face greater restrictions. That is, we know that people are not distributed continuously across a plane, but rather cluster together into discrete locations. Thus, our estimates often refer to very specific zones, and we need a way to move data to another representation while maintaining the integrity of the original measurement.
In demographic applications of areal interpolation, the absence of empirical data concerning the actual distribution of population at the target spatial resolution is a major obstacle for evaluating any model of the population density surface (Martin et al., 2000). Owing to the lack of point-level data, original data reproduction—whether a surface model reproduces the original areal data when the predicted population surface is reaggregated over the spatial units used to collect population data—becomes an essential requirement for accurate interpolation (Tobler, 1979; Lam, 1983). This property of areal interpolation is also known as the mass-preserving or pycnophylactic condition in the literature.
This is not to say that processes in human geography are not continuous in space; population density is a continuous function, central to most of what we do. The distinction I’m trying to draw here is that measurements in this arena are typically collected and provided in zonal form, and the integrity of the estimate depends on preserving the pycnophylactic property: that the estimated volume in the target zones matches the volume in the source zones (Tobler, 1979). This property (which generally does not apply to things like mineral deposits) means that techniques developed for the geosciences (like kriging) typically do not work very well because they are based on models like variograms rather than volume preservation. The focus on estimating values inside collections of small polygons means this literature is also sometimes called “small area estimation”.
Suppose we want to represent median income and total population using a regular hexgrid, but the only data available are provided at the census-tract level. Since the hexgrid does not nest within the tract boundaries, we cannot simply aggregate data upward; we need some way to transfer information between the two sets of geographic units. The tobler package provides several different ways to estimate data collected at one geography using the boundaries of a different geography.
The simplest technique available in tobler is simple areal interpolation, in which variables from the source data are weighted according to the overlap between source and target polygons, then reaggregated to fit the target polygon geometries. The critical assumption in area-weighted interpolation is that the variable(s) of interest are distributed uniformly across the source geometry. In practice, that is rarely true, but there are some ways to relax that assumption (discussed in the next section). Following Mennis & Hultgren (2006), we have
The most basic method for areal interpolation is areal weighting, in which a homogeneous distribution of the data throughout each source zone is assumed. Each source zone therefore contributes to the target zone a portion of its data proportional to the percentage of its area that the target zone occupies. In the case of the alternate geography problem, if we denote a choropleth source zone s and an area-class map zone z, then the target zone t = z. The estimation of the count for the target zone is:

\[
y_t = \sum_{s=1}^{n} y_s \frac{A_{s \cap z}}{A_s}
\]

where: \(y_t\) = the estimated count of the target zone; \(y_s\) = the count of the source zone; \(A_{s\cap z}\) = the area of the intersection between the source and target zone; \(A_s\) = the area of the source zone; and \(n\) = the number of source zones with which \(z\) overlaps (Goodchild & Lam, 1980)
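To make the formula concrete, here is a minimal from-scratch sketch of the same computation in geopandas; the `sources`/`targets` frames, the "pop" column, and the "target_id" key are hypothetical, and in practice tobler's area_interpolate handles this (plus intensive variables) for you.

Code
import geopandas as gpd

def areal_weighted_count(sources, targets, column="pop", key="target_id"):
    # record each source polygon's full area before intersecting
    sources = sources.assign(_src_area=sources.area)
    # intersect source and target polygons into atomic pieces
    pieces = gpd.overlay(sources, targets, how="intersection")
    # scale each source count by the share of the source area falling in the piece
    pieces[column] = pieces[column] * pieces.area / pieces["_src_area"]
    # reaggregate the pieces into the target zones
    return pieces.groupby(key)[column].sum()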
A good way to think about this is with the figure from a classic Tobler paper on the subject. Imagine you were trying to transfer data collected at the state level into some other meso-geography, like a PUMA or a large hexgrid. The assumption we make with area-weighted interpolation is that each variable is distributed uniformly across space within each state.
Population Density by State as shown in Tobler (1979)
California Population by County
That is problematic for two reasons. First, we are technically encoding both the ecological fallacy (Openshaw, 1984) and the modifiable areal unit problem (Manley, 2014), because we are literally making estimates about small areas based only on data collected at a different resolution. Second, we know the source data is not actually uniform. Do we really think all of Texas or California or Florida has a constant population density that high across the entire state? Of course not. But all we know for certain is the population density collected at the state level, so that is the information we use to distribute information to another geographic unit. It is important to be critical about when the uniformity condition is likely to be badly violated.
Another important consideration is the difference in size between the source and target geometries.
If source\(\gg\)target, then the interpolation operation is effectively downscaling the original information, which will induce a lot of autocorrelation (because target units falling inside a single source geometry all get assigned the same value). This is one way you can induce ‘spatial dependence’ as a consequence of measurement error (Anselin, 1988).
If target\(\gg\)source, then the operation is effectively aggregating source polygons, in which case the estimates should be pretty accurate, but at the cost of lost information: you are throwing away variation from the higher-resolution data.
4.1.1 Extensive and Intensive Variables
To allocate variables from source zones to target zones accurately, we first need to distinguish between two types of quantitative variables, intensive and extensive, where an “intensive property or intensive quantity is one whose magnitude is independent of the size of the system”, whereas an extensive property is one whose magnitude is additive for subsystems. In the context of areal interpolation, the ‘system’ refers to the collection of units, and the subsystems refer to the units themselves. Thus, extensive quantities are things like counts or areas, which are additive, whereas intensive variables include things like rates, ratios, densities, and averages. During the process of interpolation, we take the weighted sum of extensive variables at each target location, whereas we take the weighted average for intensive variables.
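For area-weighted interpolation this means an intensive variable is estimated as a weighted average whose weights are shares of the target zone’s own area (so the weights sum to one):

\[
y_t = \sum_{s=1}^{n} y_s \frac{A_{s \cap t}}{A_t}
\]

where \(A_t\) is the area of the target zone \(t\).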
Code
from tobler.area_weighted import area_interpolate
import contextily as ctx
import matplotlib.pyplot as plt

results = area_interpolate(
    source_df=dc,
    target_df=hexes,
    intensive_variables=["median_household_income"],
    extensive_variables=["n_total_pop"],
)

fig, ax = plt.subplots(1, 2, figsize=(7, 5))
dc.to_crs(3857).plot("n_total_pop", scheme="quantiles", ax=ax[0], alpha=0.7)
results.to_crs(3857).plot("n_total_pop", scheme="quantiles", ax=ax[1], alpha=0.7)
ax[0].set_title("original")
ax[1].set_title("interpolated")
for a in ax:
    a.axis("off")
    ctx.add_basemap(ax=a, source=ctx.providers.CartoDB.Positron)
fig.suptitle("Total Population (extensive)")
plt.tight_layout()
UserWarning: Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.
UserWarning: nan values in variable: median_household_income, replacing with 0
Code
fig, ax = plt.subplots(1, 2, figsize=(7, 5))
dc.to_crs(3857).plot("median_household_income", scheme="quantiles", cmap="magma", ax=ax[0])
results.to_crs(3857).plot("median_household_income", scheme="quantiles", cmap="magma", ax=ax[1])
ax[0].set_title("original")
ax[1].set_title("interpolated")
for a in ax:
    a.axis("off")
    ctx.add_basemap(ax=a, source=ctx.providers.CartoDB.Positron)
fig.suptitle("Median Income (intensive)")
plt.tight_layout()
It can be useful to create linked interactive maps to see the change from one geometry to the other. To use hvplot effectively, we also need to classify the data manually using mapclassify so the colors take on the same values in the side-by-side maps.
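A minimal sketch of that workflow, assuming the `dc` and `results` frames from above and that geoviews is installed for hvplot’s geographic plotting (the `income_class` column name is made up for the example):

Code
import hvplot.pandas  # noqa: F401 -- registers the .hvplot accessor
import mapclassify

# classify the source data, then reuse its breaks on the interpolated data
# so both maps share a single color scale
breaks = mapclassify.Quantiles(dc["median_household_income"].dropna(), k=6).bins
dc["income_class"] = mapclassify.UserDefined(
    dc["median_household_income"], bins=breaks
).yb
results["income_class"] = mapclassify.UserDefined(
    results["median_household_income"], bins=breaks
).yb

# placing the maps side-by-side in one layout links their pan/zoom state
left = dc.hvplot(c="income_class", geo=True, cmap="magma", title="original")
right = results.hvplot(c="income_class", geo=True, cmap="magma", title="interpolated")
left + right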
4.1.2 Categorical Variables

Thus far we have relied on extensive and intensive quantitative variables in the interpolation process, but sometimes it is also useful to transfer categorical data between zonal representations. In these cases, we generally compute the share of land in each target zone covered by each category in the source polygons. A common case is having land-use data that you want to summarize inside an administrative boundary, for example to look at the share of developed land inside a given municipality (i.e. as a measure of growth or land consumption over time).
4.1.2.1 Land-Use by Census Block
The National Land Cover Database (NLCD) provides land cover data for the U.S. at a 30m grid resolution. If we want to summarize each census block in D.C. by the share of land use it contains, the following workflow is a straightforward way to do so. We first collect blocks for D.C., then use the extract_raster_features function from tobler to convert the NLCD data into a geodataframe, which is much easier to manipulate for this use case. For ease of presentation, we also convert the pixel values from integers into descriptions so each row in the dataframe belongs to a readable category.
Code
vals = datasets.nlcd_definitions()
vals
| code | color | classification | description |
|---|---|---|---|
| 11 | 70,107,159 | Open Water | areas of open water, generally with less than ... |
| 12 | 209,222,248 | Perennial Snow/Ice | areas characterized by a perennial cover of ic... |
| 21 | 222,197,197 | Developed, Open Space | areas with a mixture of some constructed mater... |
| 22 | 217,146,130 | Developed, Low Intensity | areas with a mixture of constructed materials ... |
| 23 | 235,0,0 | Developed, Medium Intensity | areas with a mixture of constructed materials ... |
| 24 | 171,0,0 | Developed High Intensity | highly developed areas where people reside or ... |
| 31 | 179,172,159 | Barren Land (Rock/Sand/Clay) | areas of bedrock, desert pavement, scarps, tal... |
| 41 | 104,171,95 | Deciduous Forest | areas dominated by trees generally greater tha... |
| 42 | 28,95,44 | Evergreen Forest | areas dominated by trees generally greater tha... |
| 43 | 181,197,143 | Mixed Forest | areas dominated by trees generally greater tha... |
| 52 | 204,184,121 | Shrub/Scrub | areas dominated by shrubs; less than 5 meters ... |
| 71 | 223,223,194 | Grassland/Herbaceous | areas dominated by gramanoid or herbaceous veg... |
| 81 | 220,217,57 | Pasture/Hay | areas of grasses, legumes, or grass-legume mix... |
| 82 | 171,108,40 | Cultivated Crops | areas used for the production of annual crops,... |
| 90 | 184,217,235 | Woody Wetlands | areas where forest or shrubland vegetation acc... |
| 95 | 108,159,184 | Emergent Herbaceous Wetlands | areas where perennial herbaceous vegetation ac... |
Code
# get census block boundary data for Washington DC
dc_blocks = datasets.blocks_2020(states=["11"])
dc_blocks = dc_blocks.to_crs(dc_blocks.estimate_utm_crs())

nlcd_path = (
    "https://s3-us-west-2.amazonaws.com/mrlc/Annual_NLCD_LndCov_2023_CU_C1V0.tif"
)

# convert the NLCD into vector representation
dclu = extract_raster_features(
    dc_blocks, nlcd_path, pixel_values=vals["code"].tolist(), collapse_values=False
)

# replace the integer pixel values with their string descriptions for readability
dclu["value"] = dclu["value"].replace(
    vals.set_index("code")["classification"].to_dict()
)
dclu = dclu.to_crs(dc_blocks.crs)
Now the dclu variable holds our land-use data in a geodataframe of square polygons rather than a raster. While it is straightforward to plot this new data, with a little bit of extra work we can create a ‘proper’ land-use map that matches the colors of the NLCD. These colors are already present in the vals table shown above, so our task is to join the colors to their appropriate rows, convert the comma-separated RGB values into hexcode representation, then create a legend to match.
Code
import numpy as np
from matplotlib.colors import to_hex
from matplotlib.patches import Patch

# convert the column of "R,G,B" colors to hex colors
vals["hexcolor"] = vals["color"].apply(
    lambda x: to_hex(np.fromstring(x, sep=",") / 255)
)

# align colors with values using a left-join
hexcolors = dclu.merge(
    vals, left_on="value", right_on="classification", how="left"
)["hexcolor"].values

# plot the map, passing the colors manually
ax = dclu.plot(color=hexcolors)

# create a color patch/label for each entry in the nlcd definitions table
legend_elements = [
    Patch(
        facecolor=row[1]["hexcolor"],
        edgecolor="gray",
        label=row[1]["classification"],
    )
    for row in vals.iterrows()
]

# create the legend and move it to the right of the map
ax.legend(handles=legend_elements).set_bbox_to_anchor([2, 1, 0, 0])
ax
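With the land-use polygons in hand, we can interpolate the category shares into census blocks. The cell below is a sketch of that step using area_interpolate’s categorical_variables option; some tobler versions prefix the resulting columns with the variable name, so the rename at the end is an assumption as well.

Code
# interpolate the share of each land-use category into each census block
lu_blks = area_interpolate(
    source_df=dclu,
    target_df=dc_blocks.set_index("geoid"),
    categorical_variables=["value"],
)
# strip the variable-name prefix that some versions add to category columns
lu_blks.columns = [c.replace("value_", "") for c in lu_blks.columns]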
In the resulting dataframe lu_blks, we now have separate columns for each category of land use in our original source, and the values in each column hold the share of that land-use in each census block.
Code
lu_blks.head()
| geoid | Developed, Open Space | Developed, Low Intensity | Developed, Medium Intensity | Woody Wetlands | Developed High Intensity | Mixed Forest | Deciduous Forest | Shrub/Scrub | Pasture/Hay | Open Water | Emergent Herbaceous Wetlands | Grassland/Herbaceous | Evergreen Forest | Barren Land (Rock/Sand/Clay) | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 110019800001089 | 0.207453 | 0.603459 | 0.135434 | 0.000000 | 0.053653 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | POLYGON ((322969.039 4306546.708, 322970.142 4... |
| 110010056012005 | 0.000000 | 0.000000 | 0.502513 | 0.000000 | 0.497487 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | POLYGON ((321580.797 4307997.329, 321596.402 4... |
| 110010005013003 | 0.062934 | 0.131362 | 0.476541 | 0.000000 | 0.328051 | 0.0 | 0.001112 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | POLYGON ((321758.637 4310250.787, 321759.503 4... |
| 110010005011000 | 0.384194 | 0.402416 | 0.136406 | 0.000000 | 0.003791 | 0.0 | 0.073193 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | POLYGON ((321810.7 4311101.634, 321883.434 431... |
| 110010111003035 | 0.347988 | 0.319033 | 0.147562 | 0.063681 | 0.021227 | 0.0 | 0.100508 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | POLYGON ((329084.877 4308909.414, 329085.096 4... |
Code
lu_blks.explore(
    "Developed High Intensity",
    tiles="CartoDB Positron",
    style_kwds=dict(weight=0.5),
)