Areal interpolation is an exceedingly useful technique in spatial science for transferring data collected for one geographic representation into another. This process is sometimes called “change of support” because the analyst is altering the boundaries (support) of the dataset referring to a particular place. A common example is converting between different polygonal representations of the same study area, for example to move between administrative zones, such as from census tracts in 2020 to census tracts in 2010, or to move from an irregular polygon representation (like a county boundary) into a standardized one (like a regular grid).
We typically call this process ‘areal interpolation’ as opposed to ‘spatial interpolation’ because we are moving data between discrete entities rather than approximating a continuous surface. This distinction is critically important in geography because physical geographers often examine [unconstrained] continuous spatial processes (like air temperature, elevation, or water pressure), whereas human geographers face greater restrictions. That is, we know that people are not distributed continuously across a plane, but rather cluster together into discrete locations. Thus, our estimates often refer to very specific zones, and we need a way to move data to another representation while maintaining the integrity of the original measurement.
In demographic applications of areal interpolation, the absence of empirical data concerning the actual distribution of population at the target spatial resolution is a major obstacle for evaluating any model of the population density surface (Martin et al., 2000). Owing to the lack of point-level data, original data reproduction—whether a surface model reproduces the original areal data when the predicted population surface is reaggregated over the spatial units used to collect population data—becomes an essential requirement for accurate interpolation (Tobler, 1979; Lam, 1983). This property of areal interpolation is also known as the mass-preserving or pycnophylactic condition in the literature.
This is not to say that processes in human geography are not continuous in space; population density is a continuous function, central to most of what we do. The distinction I’m trying to draw here is that measurements in this arena are typically collected and provided in zonal form, and the integrity of the estimate depends on preserving the pycnophylactic property: that the estimated volume in the target zones matches the volume in the source zones (Tobler, 1979). This property (which generally does not apply to things like mineral deposits) means that techniques developed for the geosciences (like kriging) typically do not work very well because they are based on models like variograms rather than volume preservation. The focus on estimating values inside collections of small polygons means this literature is also sometimes called “small area estimation”.
Suppose we want to represent median income and total population using a regular hexgrid, but the only data available are provided at the census-tract level. Since the hexgrid does not nest within the tract boundaries, we cannot simply aggregate data upward; we need some way to transfer information between the two sets of geographic units. The tobler package provides several different ways to estimate data collected at one geography using the boundaries of a different geography.
The simplest technique available in tobler is simple areal interpolation, in which variables from the source data are weighted according to the overlap between source and target polygons, then reaggregated to fit the target polygon geometries. The critical assumption in area-weighted interpolation is that the variable(s) of interest are distributed uniformly across the source geometry. In practice, that is rarely true, but there are some ways to relax that assumption (discussed in the next section). Following Mennis & Hultgren (2006), we have
The most basic method for areal interpolation is areal weighting, in which a homogeneous distribution of the data throughout each source zone is assumed. Each source zone therefore contributes to the target zone a portion of its data proportional to the percentage of its area that the target zone occupies. In the case of the alternate geography problem, if we denote a choropleth source zone s and an area-class map zone z, then the target zone t = z. The estimation of the count for the target zone is:

\[
y_t = \sum_{s=1}^{n} y_s \frac{A_{s \cap z}}{A_s}
\]

where: \(y_t\) = the estimated count of the target zone; \(y_s\) = the count of the source zone; \(A_{s\cap z}\) = the area of the intersection between the source and target zone; \(A_s\) = the area of the source zone; and \(n\) = the number of source zones with which \(z\) overlaps (Goodchild & Lam, 1980)
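To make the formula concrete, here is a minimal from-scratch sketch of the same computation in geopandas; the `sources`/`targets` frames, the "pop" column, and the "target_id" key are hypothetical, and in practice tobler's area_interpolate handles this (plus intensive variables) for you.

Code
import geopandas as gpd

def areal_weighted_count(sources, targets, column="pop", key="target_id"):
    # record each source polygon's full area before intersecting
    sources = sources.assign(_src_area=sources.area)
    # intersect source and target polygons into atomic pieces
    pieces = gpd.overlay(sources, targets, how="intersection")
    # scale each source count by the share of the source area falling in the piece
    pieces[column] = pieces[column] * pieces.area / pieces["_src_area"]
    # reaggregate the pieces into the target zones
    return pieces.groupby(key)[column].sum()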
A good way to think about this is with the figure from a classic Tobler paper on the subject. Imagine you were trying to transfer data collected at the state level into some other meso-geography, like a PUMA or a large hexgrid. The assumption we make with area-weighted interpolation is that each variable is distributed uniformly across space within each state.
Population Density by State as shown in Tobler (1979)
California Population by County
That is problematic for two reasons. First, we are technically encoding both the ecological fallacy (Openshaw, 1984) and the modifiable areal unit problem (Manley, 2014), because we are literally making estimates about small areas based only on data collected at a different resolution. Second, we know the source data is not actually uniform. Do we really think all of Texas or California or Florida has a constant population density that high across the entire state? Of course not. But all we know for certain is the population density collected at the state level, so that is the information we use to distribute information to another geographic unit. It is important to be critical about when the uniformity condition is likely to be badly violated.
Another important consideration is the difference in size between the source and target geometries.
If source\(\gg\)target, then the interpolation operation is effectively downscaling the original information, which will induce a lot of autocorrelation (because target units falling inside a single source geometry all get assigned the same value). This is one way you can induce ‘spatial dependence’ as a consequence of measurement error (Anselin, 1988).
If target\(\gg\)source, then the operation is effectively aggregating source polygons, in which case the estimates should be pretty accurate, but at the cost of lost information: you are throwing away variation from the higher-resolution data.
4.1.1 Extensive and Intensive Variables
To allocate variables from source zones to target zones accurately, we first need to distinguish between two types of quantitative variables, intensive and extensive, where an “intensive property or intensive quantity is one whose magnitude is independent of the size of the system”, whereas an extensive property is one whose magnitude is additive for subsystems. In the context of areal interpolation, the ‘system’ refers to the collection of units, and the subsystems refer to the units themselves. Thus, extensive quantities are things like counts or areas, which are additive, whereas intensive variables include things like rates, ratios, densities, and averages. During the process of interpolation, we take the weighted sum of extensive variables at each target location, whereas we take the weighted average for intensive variables.
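For area-weighted interpolation this means an intensive variable is estimated as a weighted average whose weights are shares of the target zone’s own area (so the weights sum to one):

\[
y_t = \sum_{s=1}^{n} y_s \frac{A_{s \cap t}}{A_t}
\]

where \(A_t\) is the area of the target zone \(t\).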
Code
from tobler.area_weighted import area_interpolate
import contextily as ctx
import matplotlib.pyplot as plt

results = area_interpolate(
    source_df=dc,
    target_df=hexes,
    intensive_variables=["median_household_income"],
    extensive_variables=["n_total_pop"],
)

fig, ax = plt.subplots(1, 2, figsize=(7, 5))
dc.to_crs(3857).plot("n_total_pop", scheme="quantiles", ax=ax[0], alpha=0.7)
results.to_crs(3857).plot("n_total_pop", scheme="quantiles", ax=ax[1], alpha=0.7)
ax[0].set_title("original")
ax[1].set_title("interpolated")
for a in ax:
    a.axis("off")
    ctx.add_basemap(ax=a, source=ctx.providers.CartoDB.Positron)
fig.suptitle("Total Population (extensive)")
plt.tight_layout()
UserWarning: Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.
UserWarning: nan values in variable: median_household_income, replacing with 0
Code
fig, ax = plt.subplots(1, 2, figsize=(7, 5))
dc.to_crs(3857).plot("median_household_income", scheme="quantiles", cmap="magma", ax=ax[0])
results.to_crs(3857).plot("median_household_income", scheme="quantiles", cmap="magma", ax=ax[1])
ax[0].set_title("original")
ax[1].set_title("interpolated")
for a in ax:
    a.axis("off")
    ctx.add_basemap(ax=a, source=ctx.providers.CartoDB.Positron)
fig.suptitle("Median Income (intensive)")
plt.tight_layout()
It can be useful to create linked interactive maps to see the change from one geometry to the other. To use hvplot effectively, we also need to classify the data manually using mapclassify so the colors take on the same values in the side-by-side maps.
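A minimal sketch of that workflow, assuming the `dc` and `results` frames from above and that geoviews is installed for hvplot’s geographic plotting (the `income_class` column name is made up for the example):

Code
import hvplot.pandas  # noqa: F401 -- registers the .hvplot accessor
import mapclassify

# classify the source data, then reuse its breaks on the interpolated data
# so both maps share a single color scale
breaks = mapclassify.Quantiles(dc["median_household_income"].dropna(), k=6).bins
dc["income_class"] = mapclassify.UserDefined(
    dc["median_household_income"], bins=breaks
).yb
results["income_class"] = mapclassify.UserDefined(
    results["median_household_income"], bins=breaks
).yb

# placing the maps side-by-side in one layout links their pan/zoom state
left = dc.hvplot(c="income_class", geo=True, cmap="magma", title="original")
right = results.hvplot(c="income_class", geo=True, cmap="magma", title="interpolated")
left + right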
4.1.2 Categorical Variables

Thus far we have relied on extensive and intensive quantitative variables in the interpolation process, but sometimes it is also useful to transfer categorical data between zonal representations. In these cases, we generally compute the share of land in each target zone covered by each category in the source polygons. A common case is having land-use data that you want to summarize inside an administrative boundary, for example to look at the share of developed land inside a given municipality (i.e. as a measure of growth or land consumption over time).
4.1.2.1 Land-Use by Census Block
The National Land Cover Database (NLCD) provides land cover data for the U.S. at a 30m grid resolution. If we want to summarize each census block in D.C. by the share of land use it contains, the following workflow is a straightforward way to do so. We first collect blocks for D.C., then use the extract_raster_features function from tobler to convert the NLCD data into a geodataframe, which is much easier to manipulate for this use case. For ease of presentation, we also convert the pixel values from integers into descriptions so each row in the dataframe belongs to a readable category.
Code
vals = datasets.nlcd_definitions()
vals
| code | color | classification | description |
|---|---|---|---|
| 11 | 70,107,159 | Open Water | areas of open water, generally with less than ... |
| 12 | 209,222,248 | Perennial Snow/Ice | areas characterized by a perennial cover of ic... |
| 21 | 222,197,197 | Developed, Open Space | areas with a mixture of some constructed mater... |
| 22 | 217,146,130 | Developed, Low Intensity | areas with a mixture of constructed materials ... |
| 23 | 235,0,0 | Developed, Medium Intensity | areas with a mixture of constructed materials ... |
| 24 | 171,0,0 | Developed High Intensity | highly developed areas where people reside or ... |
| 31 | 179,172,159 | Barren Land (Rock/Sand/Clay) | areas of bedrock, desert pavement, scarps, tal... |
| 41 | 104,171,95 | Deciduous Forest | areas dominated by trees generally greater tha... |
| 42 | 28,95,44 | Evergreen Forest | areas dominated by trees generally greater tha... |
| 43 | 181,197,143 | Mixed Forest | areas dominated by trees generally greater tha... |
| 52 | 204,184,121 | Shrub/Scrub | areas dominated by shrubs; less than 5 meters ... |
| 71 | 223,223,194 | Grassland/Herbaceous | areas dominated by gramanoid or herbaceous veg... |
| 81 | 220,217,57 | Pasture/Hay | areas of grasses, legumes, or grass-legume mix... |
| 82 | 171,108,40 | Cultivated Crops | areas used for the production of annual crops,... |
| 90 | 184,217,235 | Woody Wetlands | areas where forest or shrubland vegetation acc... |
| 95 | 108,159,184 | Emergent Herbaceous Wetlands | areas where perennial herbaceous vegetation ac... |
Code
# get census block boundary data for Washington DC
dc_blocks = datasets.blocks_2020(states=["11"])
dc_blocks = dc_blocks.to_crs(dc_blocks.estimate_utm_crs())

nlcd_path = (
    "https://s3-us-west-2.amazonaws.com/mrlc/Annual_NLCD_LndCov_2023_CU_C1V0.tif"
)

# convert the NLCD into vector representation
dclu = extract_raster_features(
    dc_blocks, nlcd_path, pixel_values=vals["code"].tolist(), collapse_values=False
)

# replace the integer pixel values with their string descriptions for readability
dclu["value"] = dclu["value"].replace(
    vals.set_index("code")["classification"].to_dict()
)
dclu = dclu.to_crs(dc_blocks.crs)
Now the dclu variable holds our land-use data in a geodataframe of square polygons rather than a raster. While it is straightforward to plot this new data, with a little bit of extra work we can create a ‘proper’ land-use map that matches the colors of the NLCD. These colors are already present in the vals table shown above, so our task is to join the colors to their appropriate rows, convert the comma-separated RGB values into hexcode representation, then create a legend to match.
Code
import numpy as np
from matplotlib.colors import to_hex
from matplotlib.patches import Patch

# convert the column of "R,G,B" colors to hex colors
vals["hexcolor"] = vals["color"].apply(
    lambda x: to_hex(np.fromstring(x, sep=",") / 255)
)

# align colors with values using a left-join
hexcolors = dclu.merge(
    vals, left_on="value", right_on="classification", how="left"
)["hexcolor"].values

# plot the map, passing the colors manually
ax = dclu.plot(color=hexcolors)

# create a color patch/label for each entry in the nlcd definitions table
legend_elements = [
    Patch(
        facecolor=row[1]["hexcolor"],
        edgecolor="gray",
        label=row[1]["classification"],
    )
    for row in vals.iterrows()
]

# create the legend and move it to the right of the map
ax.legend(handles=legend_elements).set_bbox_to_anchor([2, 1, 0, 0])
ax
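With the land-use polygons in hand, we can interpolate the category shares into census blocks. The cell below is a sketch of that step using area_interpolate’s categorical_variables option; some tobler versions prefix the resulting columns with the variable name, so the rename at the end is an assumption as well.

Code
# interpolate the share of each land-use category into each census block
lu_blks = area_interpolate(
    source_df=dclu,
    target_df=dc_blocks.set_index("geoid"),
    categorical_variables=["value"],
)
# strip the variable-name prefix that some versions add to category columns
lu_blks.columns = [c.replace("value_", "") for c in lu_blks.columns]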
In the resulting dataframe lu_blks, we now have separate columns for each category of land use in our original source, and the values in each column hold the share of that land-use in each census block.
Code
lu_blks.head()
| geoid | Developed, Open Space | Developed, Low Intensity | Developed, Medium Intensity | Woody Wetlands | Developed High Intensity | Mixed Forest | Deciduous Forest | Shrub/Scrub | Pasture/Hay | Open Water | Emergent Herbaceous Wetlands | Grassland/Herbaceous | Evergreen Forest | Barren Land (Rock/Sand/Clay) | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 110019800001089 | 0.207453 | 0.603459 | 0.135434 | 0.000000 | 0.053653 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | POLYGON ((322969.039 4306546.708, 322970.142 4... |
| 110010056012005 | 0.000000 | 0.000000 | 0.502513 | 0.000000 | 0.497487 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | POLYGON ((321580.797 4307997.329, 321596.402 4... |
| 110010005013003 | 0.062934 | 0.131362 | 0.476541 | 0.000000 | 0.328051 | 0.0 | 0.001112 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | POLYGON ((321758.637 4310250.787, 321759.503 4... |
| 110010005011000 | 0.384194 | 0.402416 | 0.136406 | 0.000000 | 0.003791 | 0.0 | 0.073193 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | POLYGON ((321810.7 4311101.634, 321883.434 431... |
| 110010111003035 | 0.347988 | 0.319033 | 0.147562 | 0.063681 | 0.021227 | 0.0 | 0.100508 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | POLYGON ((329084.877 4308909.414, 329085.096 4... |
Code
lu_blks.explore(
    "Developed High Intensity",
    tiles="CartoDB Positron",
    style_kwds=dict(weight=0.5),
)