import contextily as ctx
import geopandas as gpd
import geoviews as gv
import hvplot.pandas
import matplotlib.pyplot as plt
import pydeck as pdk
from geosnap import DataStore
from geosnap import io as gio
from lonboard import Map, PolygonLayer, viz
from mapclassify.util import get_color_array
from tobler.area_weighted import area_interpolate
from tobler.dasymetric import masked_area_interpolate
from tobler.pycno import pycno_interpolate
from tobler.util import h3fy

%load_ext jupyter_black
%load_ext watermark

%watermark -a 'eli knaap' -iv
Let’s say we want to represent the median income and total population using a regular hexgrid, but the only data available is provided at the census-tract level. Since we can’t aggregate data upward into our new boundaries, we need some way to transfer information between the two sets of geographic units.
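With tobler’s `h3fy` utility we can generate such a hexgrid directly from the source data. A minimal sketch, assuming `dc` is a GeoDataFrame of census tracts (e.g. loaded with geosnap) and that H3 resolution 8 suits the study area:

# generate a regular H3 hexgrid covering the tract data;
# `dc` is assumed to be a GeoDataFrame of census tracts loaded earlier,
# and resolution 8 is an assumption to adjust for your study area
hexes = h3fy(dc, resolution=8)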
The tobler package provides several different ways to estimate data collected at one geography using the boundaries of a different geography.
The simplest technique available in tobler is simple areal interpolation, in which variables from the source data are weighted according to the areal overlap between source and target polygons, then reaggregated to fit the target polygon geometries.
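In code this is a single call to `area_interpolate`. A minimal sketch, continuing with the hypothetical `dc` tracts and `hexes` grid from above (the column names mirror the geosnap ACS variables used later in this section; reprojecting to an estimated UTM zone avoids computing areas in a geographic CRS):

# interpolate tract-level data to the hexgrid. Extensive variables
# (counts) are apportioned by share of overlapping area; intensive
# variables (rates, medians) are an area-weighted average
utm = dc.estimate_utm_crs()
results = area_interpolate(
    source_df=dc.to_crs(utm),
    target_df=hexes.to_crs(utm),
    extensive_variables=["n_total_pop"],
    intensive_variables=["median_household_income"],
)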
The critical assumption in area-weighted interpolation is that the variable of interest is distributed uniformly across the source geometry. In practice, that is rarely true, but there are some ways to relax that assumption (covered in the next section).
A good way to think about this is with the figure from a classic Tobler paper on the subject. Imagine you were trying to transfer data collected at the state level and convert it to some other meso-geography, like a PUMA or a large hexgrid. The assumption we make with area-weighted interpolation is that each variable is distributed uniformly across space within each state.
That is problematic for two reasons. First, we are encoding both the ecological fallacy (Openshaw, 1984) and the modifiable areal unit problem (Manley, 2014), because we are making estimates about small areas based only on data collected at a coarser resolution. Second, we know the source data is not actually uniform: do we really think Texas, California, or Florida has constant population density across the entire state? Of course not. But the population density collected at the state level is all we know for certain, so that is the information we use to distribute values to the other geographic units. It is important to be critical about when the uniformity assumption is likely to be badly violated.
Another important consideration is the difference in size between the source and target geometries.
If source \(\gg\) target, then the interpolation operation is effectively downscaling the original information, which will induce strong spatial autocorrelation (because all target units falling inside a single source geometry are assigned the same value). This is one way you can induce ‘spatial dependence’ as a consequence of measurement error (Anselin, 1988).
If target \(\gg\) source, then the operation is effectively aggregating the source polygons, in which case the estimates should be quite accurate, but at the cost of lost information: you are throwing away variation from the higher-resolution data.
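A quick way to check which regime applies is to compare the typical unit size on each side. A sketch, again assuming the hypothetical `dc` and `hexes` frames from above:

# compare average unit size in a projected CRS; a ratio far above 1
# signals downscaling (source >> target), far below 1 signals
# aggregation (target >> source)
utm = dc.estimate_utm_crs()
ratio = dc.to_crs(utm).area.mean() / hexes.to_crs(utm).area.mean()
print(f"mean source area / mean target area: {ratio:.2f}")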
Code
fig, ax = plt.subplots(1, 2, figsize=(6, 4))
dc.to_crs(3857).plot("n_total_pop", scheme="quantiles", ax=ax[0], alpha=0.7)
results.to_crs(3857).plot("n_total_pop", scheme="quantiles", ax=ax[1], alpha=0.7)
ax[0].set_title("original")
ax[1].set_title("interpolated")
for a in ax:
    a.axis("off")
    ctx.add_basemap(ax=a, source=ctx.providers.CartoDB.Positron)
fig.suptitle("Total Population (extensive)")
plt.tight_layout()
Pycnophylactic interpolation is based on Tobler’s (1979) method and does not distinguish between extensive and intensive variables.
“Census enumerations are usually packaged in irregularly shaped geographical regions. Interior values can be interpolated for such regions, without specification of “control points,” by using an analogy to elliptical partial differential equations. A solution procedure is suggested, using finite difference methods with classical boundary conditions. In order to estimate densities, an additional nonnegativity condition is required. Smooth contour maps, which satisfy the volume preserving and nonnegativity constraints, illustrate the method using actual geographical data. It is suggested that the procedure may be used to convert observations from one bureaucratic partitioning of a geographical area to another.”
Tobler’s approach does not assume a uniform distribution in the source zones but instead estimates a smooth density surface based on the centroids of the input geometries; then, it uses the source boundaries as a constraint and iteratively updates the density surface estimate to ensure that the source volume is respected (i.e. the pycnophylactic property is preserved). This can have negative consequences when the surface is poorly estimated from the input centroids.
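In tobler this method is exposed as `pycno_interpolate`. A minimal sketch with assumed inputs (the hypothetical `dc` and `hexes` from above; the `cellsize` value is an arbitrary choice to tune for your data):

# estimate a smooth, volume-preserving density surface from the tracts,
# then aggregate the surface back to the hexgrid
utm = dc.estimate_utm_crs()
dc_pycno = pycno_interpolate(
    source_df=dc.to_crs(utm),
    target_df=hexes.to_crs(utm),
    variables=["n_total_pop", "median_household_income"],
    cellsize=100,  # raster cell size in CRS units (meters here); an assumption
)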
Code
fig, ax = plt.subplots(1, 2, figsize=(10, 7))
dc_pycno.plot("n_total_pop", scheme="quantiles", ax=ax[0])
dc.plot("n_total_pop", scheme="quantiles", ax=ax[1])
ax[0].set_title("interpolated")
ax[1].set_title("original")
for a in ax:
    a.axis("off")
fig.suptitle("Total Population (extensive)")
plt.tight_layout()
Code
fig, ax = plt.subplots(1, 2, figsize=(10, 7))
dc_pycno.plot("median_household_income", scheme="quantiles", cmap="magma", ax=ax[0])
dc.plot("median_household_income", scheme="quantiles", cmap="magma", ax=ax[1])
ax[0].set_title("interpolated")
ax[1].set_title("original")
for a in ax:
    a.axis("off")
fig.suptitle("Median Income (intensive)")
plt.tight_layout()
Anselin, L. (1988). Spatial Econometrics: Methods and Models. In Operational Regional Science Series (Vol. 4). Springer Netherlands. https://doi.org/10.1007/978-94-015-7799-1
Manley, D. (2014). Scale, Aggregation, and the Modifiable Areal Unit Problem. In Handbook of Regional Science (pp. 1157–1171). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-23430-9_69
Openshaw, S. (1984). Ecological Fallacies and the Analysis of Areal Census Data. Environment and Planning A: Economy and Space, 16(1), 17–31. https://doi.org/10.1068/a160017
Tobler, W. R. (1979). Smooth pycnophylactic interpolation for geographical regions. Journal of the American Statistical Association, 74(367), 519–529. https://doi.org/10.1080/01621459.1979.10481647