3  Geovisualization

Code
import contextily as cx
import geopandas as gpd
import geoviews as gv
import hvplot.pandas
import mapclassify as mc
import matplotlib.font_manager as font_manager
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pydeck as pdk
import seaborn as sns
from geosnap import DataStore
from geosnap import io as gio
from matplotlib import cm
from matplotlib.colors import Normalize
from matplotlib_scalebar.scalebar import ScaleBar

%load_ext watermark
%watermark -a 'eli knaap'  -iv
Author: eli knaap

contextily : 1.6.2
mapclassify: 2.8.0
geosnap    : 0.14.0
geopandas  : 1.0.1
numpy      : 1.26.4
geoviews   : 1.12.0
pandas     : 2.2.2
pydeck     : 0.8.0b4
seaborn    : 0.13.2
matplotlib : 3.9.2
hvplot     : 0.10.0

Mapping is a cornerstone of exploratory spatial analysis. Indeed, Tufte (1983) notes that maps are among the oldest data visualization techniques ever devised. While many of the uninitiated pejoratively reduce spatial analysis (and Geography, in general) to “mapping”, many spatial analysts revere cartographers, who are actual map designers. Regardless of the perspective, maps are a staple of good data visualization. For that reason, there are important lessons from both data visualization in general (Cairo, 2016, 2019; Tufte, 1983, 2006) and cartography in particular (Brath & Banissi, 2018; Brewer et al., 1997; Brewer & Pickle, 2002; Eicher & Brewer, 2001; Harvey & Losang, 2018; Jenny et al., 2018; Minard, 1862; Olson & Brewer, 1997; Roth et al., 2010; Tobler, 2004).

3.1 Visual Variables

When we make maps, we combine ideas from Stevens (1946) and Bertin (1983). Stevens (1946) provides a taxonomy of measurement scales: nominal, ordinal, interval, and ratio (NOIR). Bertin (1983) maps those scales onto different “visual variables”, which have since been extended to include location, size, shape, value, hue, orientation, grain (pattern spacing), transparency, fuzziness, resolution, and volume/height. Data visualization experts have spent a long time thinking about the best ways to use each variable to represent each measurement scale; for example, the lovely Table 3.1, reproduced from the Wikipedia article on visual variables, describes the recommended ways to encode measurement scales into visual variables.

Table 3.1: Measurement Scales and Visual Variables
| Level | Distinction | Preferred Variables | Marginal Variables | Examples |
|---|---|---|---|---|
| Nominal | Same or different | Color hue, Shape | Pattern arrangement, Orientation | Owner, Facility type |
| Hierarchical | Degree of qualitative difference | Color hue | Shape, Arrangement | Languages, Geologic formation |
| Ordinal | Order | Color value, Color saturation, Transparency, Crispness | Size, Height, Color hue, Pattern spacing | Socioeconomic status (rich, middle class, poor) |
| Interval | Amount of quantitative difference | Color value | Size, Color saturation, Opacity, Hue | Temperature, Year |
| Ratio | Proportional difference | Size, Height, Color value | Transparency, Pattern spacing | Population growth rate, Population density |
| Cyclical | Angular difference | Color hue, Orientation | | Day of the year, Aspect of terrain |

So our primitive units for visualization are points, lines, and polygons, each of which may be associated with a NOIR measurement that can be encoded via a visual variable. That seems like a lot, but since we’re mapping, almost all the features will already be represented by arrangement and orientation. We need to make critical decisions about the rest, though. If you are coming from R and are used to thinking about this as a Grammar of Graphics, check out plotnine, which provides a ggplot API in Python.
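To make these ideas concrete before we load any data, the sketch below (using a purely hypothetical sites point layer with a nominal facility_type column and a ratio-scaled emissions column) shows how the NOIR scales typically map onto geopandas styling arguments; the same patterns reappear throughout this chapter.

Code
# hypothetical example: how measurement scales map onto geopandas styling arguments
import geopandas as gpd

sites = gpd.read_file("sites.geojson")  # hypothetical point dataset

# nominal data: vary color hue with a categorical colormap
sites.plot(column="facility_type", categorical=True, cmap="tab10", legend=True)

# ordinal/interval/ratio data: vary color value via a mapclassify scheme
sites.plot(column="emissions", scheme="quantiles", k=5, cmap="YlOrBr", legend=True)

# ratio data: vary size by passing an array of marker sizes
sites.plot(markersize=sites["emissions"] / sites["emissions"].max() * 100, color="red", alpha=0.5)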

Code
datasets = DataStore()

counties = datasets.counties()

gdf = gio.get_acs(
    datasets, msa_fips="47900", years=[2021], level="tract", constant_dollars=False
)

Since we’re only doing visualization in this notebook, not spatial analysis, it is convenient to reproject the dataset into a coordinate system ready for cartography. The most useful CRS in this case is “web mercator”, also known as EPSG:3857, which is the system used by Google Maps and adopted by essentially every web map tile provider since.

Code
gdf = gdf.to_crs(3857)

The dead simple way to create a map is to use the plot method on the geodataframe, which returns an outline of the study area.

Code
gdf.plot()

As a first cut, geopandas makes it very easy to plot a map quickly. If you know the area well, this may do fine for quick exploration. If you do not know a place extremely well (or you want to make a figure easy to understand for those who don’t), it is often a good idea to add a basemap for context. We can do that easily using the contextily package, which uses the xyzservices package to collect a basemap from a web-based tile provider.
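Any tile provider exposed through cx.providers can serve as the basemap source; as a quick sketch (using the study-area polygons loaded above and OpenStreetMap’s default tiles rather than the CARTO tiles used below):

Code
# sketch: add a basemap from a different tile provider under the tract polygons
ax = gdf.plot(alpha=0.3, figsize=(6, 6))
ax.set_axis_off()
cx.add_basemap(ax=ax, source=cx.providers.OpenStreetMap.Mapnik)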

Imagine we wanted to visualize toxic release emissions from point-source polluters in the U.S. That only takes a few lines: we start by collecting data from the Environmental Protection Agency (EPA), then use the included lat/long columns to turn the CSV into a geodataframe.

Code
epa_tri = pd.read_csv(
    "https://data.epa.gov/efservice/downloads/tri/mv_tri_basic_download/2022_US/csv/2022_us.csv",
    low_memory=False,
)
Code
# print a few columns to see what we need
epa_tri.columns[:20]
Index(['1. YEAR', '2. TRIFD', '3. FRS ID', '4. FACILITY NAME',
       '5. STREET ADDRESS', '6. CITY', '7. COUNTY', '8. ST', '9. ZIP',
       '10. BIA', '11. TRIBE', '12. LATITUDE', '13. LONGITUDE',
       '14. HORIZONTAL DATUM', '15. PARENT CO NAME', '16. PARENT CO DB NUM',
       '17. STANDARD PARENT CO NAME', '18. FOREIGN PARENT CO NAME',
       '19. FOREIGN PARENT CO DB NUM', '20. STANDARD FOREIGN PARENT CO NAME'],
      dtype='object')
Code
# convert lat/long to Point geoms in a geodataframe
epa_tri = gpd.GeoDataFrame(
    epa_tri,
    geometry=gpd.points_from_xy(epa_tri["13. LONGITUDE"], epa_tri["12. LATITUDE"]),
    crs=4326,
)

# keep only the sites in the DC region
epa_tri = epa_tri[epa_tri.intersects(gdf.to_crs(4326).union_all())]

epa_tri = epa_tri.to_crs(gdf.crs)
Code
# create a plot and store resulting `ax`
ax = gdf.plot(alpha=0.3, figsize=(9, 9))

# plot the other dataframe to the same ax
epa_tri.plot(ax=ax, color="red")
ax.axis("off")

# add a basemap with contextily
cx.add_basemap(ax=ax, source=cx.providers.CartoDB.Positron)

Now we have shown the locations of each toxic release site along with a nice, unobtrusive basemap from CARTO. We also added some transparency to the polygons because we only want to highlight the extent of the region. Presenting all the TRI sites in exactly the same style implies that they are all the same, but they probably differ in many ways, so we could style them using a variety of variables:

  • the name of the site (nominal)
  • the type of the largest emission (nominal)
  • the volume of emissions (ratio)
  • the rank in the distribution of emitters (ordinal)

One measure that seems important is '65. ON-SITE RELEASE TOTAL', so let's visualize that below.

Code
epa_tri['65. ON-SITE RELEASE TOTAL']
23         0.0000
366        0.0000
643        0.0000
822        1.3700
1458     250.0000
           ...   
78257      0.0000
78312      0.0097
78395      0.0000
78458      0.0000
78877     81.0000
Name: 65. ON-SITE RELEASE TOTAL, Length: 329, dtype: float64
Code
# re-order so smaller dots draw on top
epa_tri = epa_tri.dropna(subset=['65. ON-SITE RELEASE TOTAL']).sort_values(
    '65. ON-SITE RELEASE TOTAL', ascending=False
)

# only show emitting facilities
epa_tri = epa_tri[epa_tri['65. ON-SITE RELEASE TOTAL'] > 0]

ax = gdf.plot(figsize=(9, 9), alpha=0.3)

epa_tri.plot(
    ax=ax,
    color="red",
    markersize=epa_tri['65. ON-SITE RELEASE TOTAL'].apply(lambda x: 10 + (x**0.6)),
    legend=True,
    alpha=0.3,
)
ax.axis("off")
ax.set_title("Toxic Release Sites in the D.C. Metro\nby Total Release Volume")
cx.add_basemap(ax=ax, source=cx.providers.CartoDB.Positron)
Figure 3.1: Encoding Emission Volume with Size

Here, we scale the size of each point to reflect the amount of emissions given off. This works well (though we will eventually need to create the legend by hand). We could also treat this like an interval variable, but because the origin (zero emissions) is meaningful and constant across units, the total is a ratio measure. We also use transparency because it reveals that some facilities are in exactly the same location (and, as an additional effect, we have induced an intensity gradient where points overlap, corresponding to the increased exposure).
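Since geopandas does not build a legend for marker sizes automatically, one way to construct it by hand is to add proxy artists to the same ax before rendering the figure. The sketch below uses a few illustrative release volumes (not computed from the data) and mirrors the size transform used above:

Code
from matplotlib.lines import Line2D

# illustrative breakpoints for the size legend (hypothetical values)
example_volumes = [10, 1_000, 100_000]

handles = [
    Line2D(
        [], [],
        marker="o",
        linestyle="none",
        color="red",
        alpha=0.3,
        # Line2D's markersize is a diameter in points, while plot()'s markersize is an
        # area, so take the square root to keep the legend roughly consistent with the map
        markersize=(10 + v**0.6) ** 0.5,
        label=f"{v:,.0f}",
    )
    for v in example_volumes
]
ax.legend(handles=handles, title="On-site release total", loc="lower right")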

But we could also use color (or size and color) to convey the same information

Code
ax = gdf.plot(figsize=(8, 8), alpha=0.3)

epa_tri.plot(
    '65. ON-SITE RELEASE TOTAL',
    scheme="quantiles",
    k=8,
    cmap="YlOrBr",
    ax=ax,
    linewidth=0.3,
    # color="red",
    markersize=epa_tri['65. ON-SITE RELEASE TOTAL'].apply(lambda x: 10 + (x**0.6)),
    legend=True,
    alpha=0.3,
)
ax.axis("off")
ax.set_title("Toxic Release Sites in the D.C. Metro\nby Total Release Volume")
cx.add_basemap(ax=ax, source=cx.providers.CartoDB.Positron)
Figure 3.2: Encoding Emission Volume with Size and Color

This does not work as well. We have used two visual variables (size and color) to encode the same measurement (total release volume), which is less effective than using size alone. In this case, the eye is drawn to the lighter-colored points, which are actually the smaller emitters. Including multiple hues also disrupts the intensity gradient created by the transparency. As an alternative, we could use color to encode a different variable (say, dollars spent on sequestration), which would let us look at bivariate correlation in space.

3.2 Static Choropleths

For a review on choropleth maps, particularly map classifiers, see Rey et al. (2023)

Usually, we want to convey more information, e.g. by creating a choropleth map which shades each polygon according to some value associated with it. Choropleths are easy with geopandas; all we need to do is provide the column we want to visualize, the binning scheme used to assign colors, and any additional parameters that make the map more aesthetically pleasing.

Code
gdf.plot(
    column="median_home_value",
    # scheme="quantiles",
    edgecolor="white",
    legend=True,
    linewidth=0.3,
)

Gross. This map is uninformative for several reasons: the thick white borders obscure most of the visualization, there is hardly any color variation in the map, and the scientific notation in the colorbar makes it impossible to read. We also do not know what variable we are visualizing or even where in the world we are. With a bit more code, we can make this visualization much better.

Code
# set a different font for the legend
legend_font = font_manager.FontProperties(family="Arial", style="normal", size=10)

# options for a better legend
legend_kwds = {
    # add a dollar sign and suppress decimals
    "fmt": "\${:,.0f}",
    "title": "2021 Dollars",
    "prop": legend_font,
    "loc": "upper right",
}

# options for displaying missing (NaN) data
missing_kwds = {
    "color": "lightgray",
    "label": "No Data",
}

# select out counties in the DC metro
dc_counties = counties[counties.geoid.str[:5].isin(gdf.geoid.str[:5].unique())]

# instantiate an interface onto which we'll plot
f, ax = plt.subplots(1, figsize=(12, 8))

# plot the map into the interface we created using the `ax` argument
gdf.to_crs(gdf.estimate_utm_crs()).plot(
    column="median_home_value",
    scheme="quantiles",
    k=5,
    alpha=0.5,
    ax=ax,
    legend=True,
    legend_kwds=legend_kwds,
    missing_kwds=missing_kwds,
)

# layer on county boundaries for context
dc_counties.to_crs(dc_counties.estimate_utm_crs()).plot(
    ax=ax, facecolor="none", linewidth=0.2, edgecolor="black"
)

# add a basemap to the same `ax`
cx.add_basemap(ax, source=cx.providers.CartoDB.Positron, crs=gdf.estimate_utm_crs())

# add a title
ax.set_title("Median Home Values\nin the D.C. Metro Region", font="Arial", fontsize=15)

# add a scalebar
ax.add_artist(ScaleBar(1, location="lower right"))

# turn the axis/labels off because they're not useful
ax.set_axis_off()
Figure 3.3: Home Values using Quantiles

This map is not perfect, but it has most of the things cartographers like to see:

  • an informative title for the figure and legend,
  • a nicely formatted legend with explicit units of measurement and no trailing decimals
  • a clean basemap underneath (appropriately credited) and county boundaries overlaid on top for context
  • a scalebar for additional reference

There are also obvious ways it could be improved. For example, there is no north arrow (though there are simple solutions for that), and it is not immediately intuitive (to me, at least) that yellow means more expensive. You might also wonder why the size of the last bracket is more than 6x the size of the first.
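A bare-bones north arrow, for instance, can be added with a plain matplotlib annotation drawn in axes coordinates (a sketch, with the position and arrow styling chosen arbitrarily):

Code
# sketch: draw a simple north arrow on the previous map's ax in axes coordinates
x, y, arrow_length = 0.95, 0.95, 0.08
ax.annotate(
    "N",
    xy=(x, y),
    xytext=(x, y - arrow_length),
    xycoords="axes fraction",
    textcoords="axes fraction",
    arrowprops=dict(facecolor="black", width=4, headwidth=12),
    ha="center",
    va="center",
    fontsize=14,
)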

3.2.1 Map Classifiers

In the map above, the scheme argument defines the classification scheme used to assign each tract’s median home value to the appropriate color. In this example we used the quantile classifier (scheme='quantiles') with five classes (k=5), i.e. quintiles. Under the hood, geopandas generates this binning scheme using the mapclassify package. To see what it is doing, we can classify the median_home_value variable directly. For convenience, we will assign that Series from the geodataframe to the variable hv.

Code
pd.set_option("display.float_format", lambda x: "%.2f" % x)
gdf.median_home_value.describe()
count      1442.00
mean     504914.22
std      233140.10
min       93900.00
25%      337950.00
50%      444600.00
75%      609225.00
max     1938800.00
Name: median_home_value, dtype: float64

We use a map classification scheme to partition this distribution into categorical bins (classes), then assign each class to a color.

Code
gdf.median_home_value.plot(kind="hist", bins=60)
sns.despine()

Code
mc.classifiers.CLASSIFIERS
('BoxPlot',
 'EqualInterval',
 'FisherJenks',
 'FisherJenksSampled',
 'HeadTailBreaks',
 'JenksCaspall',
 'JenksCaspallForced',
 'JenksCaspallSampled',
 'MaxP',
 'MaximumBreaks',
 'NaturalBreaks',
 'Quantiles',
 'Percentiles',
 'PrettyBreaks',
 'StdMean',
 'UserDefined')
Code
hv = gdf["median_home_value"].dropna()

Generating a quantile classification with five classes (again, quintiles) produces the following output, which shows the minimum and maximum values for each class (also called a “bin”) and the number of observations (count) assigned to each class:

Code
mc.Quantiles(hv, k=5)
Quantiles

        Interval           Count
--------------------------------
[  93900.00,  318720.00] |   289
( 318720.00,  397480.00] |   288
( 397480.00,  499200.00] |   288
( 499200.00,  656200.00] |   288
( 656200.00, 1938800.00] |   289

To create deciles (i.e. ten classes with a quantile classification), we use k=10. Note that although we are using quantiles, the number of observations assigned to each bin is not strictly equal since there are a number of duplicate values in the dataset.

Code
mc.Quantiles(hv, k=10)
Quantiles

        Interval           Count
--------------------------------
[  93900.00,  275810.00] |   145
( 275810.00,  318720.00] |   144
( 318720.00,  356350.00] |   144
( 356350.00,  397480.00] |   144
( 397480.00,  444600.00] |   145
( 444600.00,  499200.00] |   143
( 499200.00,  575760.00] |   144
( 575760.00,  656200.00] |   144
( 656200.00,  826450.00] |   144
( 826450.00, 1938800.00] |   145

Assigning the classifier object to a variable gives convenient access to several of its attributes.

Code
q10 = mc.Quantiles(hv, k=10)

For example, the bins attribute stores the upper boundary of each class, the counts attribute stores the number of observations in each class, and the yb attribute stores the class label (bin index) assigned to each observation.

Code
q10.bins
array([ 275810.,  318720.,  356350.,  397480.,  444600.,  499200.,
        575760.,  656200.,  826450., 1938800.])
Code
q10.counts
array([145, 144, 144, 144, 145, 143, 144, 144, 144, 145])
Code
q10.yb
array([7, 9, 9, ..., 0, 0, 2])
Code
labels = pd.Series(q10.yb, index=hv.index, name='homeval_decile')
gdf = gdf.join(labels)

These are the class intervals (the minimum and maximum values of median home value) that define the range of each decile bin:

Code
gdf.groupby('homeval_decile')['median_home_value'].agg(['min', 'max', 'count'])
homeval_decile        min         max   count
0.00             93900.00   275800.00     145
1.00            275900.00   318600.00     144
2.00            319200.00   356200.00     144
3.00            356700.00   397200.00     144
4.00            397900.00   444600.00     145
5.00            444700.00   498900.00     143
6.00            499400.00   575200.00     144
7.00            576000.00   655800.00     144
8.00            656300.00   824200.00     144
9.00            826700.00  1938800.00     145

Alternatively, we could have used the Fisher Jenks classifier instead of the quantile classifier. To compare the difference, we can classify the hv data again using FisherJenks, also with 10 classes.

Code
fj10 = mc.FisherJenks(hv, k=10)
Code
fj10
FisherJenks

        Interval           Count
--------------------------------
[  93900.00,  305000.00] |   246
( 305000.00,  388900.00] |   308
( 388900.00,  477000.00] |   252
( 477000.00,  571300.00] |   197
( 571300.00,  681300.00] |   188
( 681300.00,  809700.00] |   100
( 809700.00,  956000.00] |    81
( 956000.00, 1119700.00] |    40
(1119700.00, 1383400.00] |    21
(1383400.00, 1938800.00] |     9
Code
f, ax = plt.subplots(1, figsize=(9, 9))
gdf.plot(
    column="median_home_value",
    scheme="fisher_jenks",
    ax=ax,
    alpha=0.5,
    legend=True,
    legend_kwds=legend_kwds,
    missing_kwds=missing_kwds,
)
dc_counties.to_crs(gdf.crs).plot(
    ax=ax, facecolor="none", linewidth=0.1, edgecolor="black"
)
cx.add_basemap(ax, source=cx.providers.CartoDB.Positron)
ax.set_axis_off()
ax.set_title("Median Home Values\nin the D.C. Metro Region", font="Arial", fontsize=15)

plt.show()
Figure 3.4: Home Values using Fisher Jenks

The Fisher Jenks classifier highlights different patterns than the quantile classifier in Figure 3.3. What really sticks out in this version is the region of concentrated affluence along the Potomac River, where median home prices range from roughly $1.1M to $2M.

Using the classifier objects, we can also plot how each classification scheme breaks the distribution into pieces.

Code
f, ax = plt.subplots(1, 2, figsize=(12, 4), sharey=True)

q10.plot_histogram(bins=60, ax=ax[0])
ax[0].set_title("Quantile Breaks")

fj10.plot_histogram(bins=60, ax=ax[1])
ax[1].set_title("Fisher Jenks Breaks")
Figure 3.5: Histograms Classified by Quantiles and Fisher Jenks

As we know from Rey et al. (2023), Fisher Jenks is an optimal classifier that prioritizes homogeneity within each class. Thus, for the same value of k, Fisher Jenks should always have a better “fit” score than Quantiles. For example, the ‘absolute deviation around class medians’ (ADCM) should always be lower for the Fisher Jenks classifier, and the ‘goodness of absolute deviation of fit’ (GADF) should always be higher.

Code
fj10.adcm
37204900.0
Code
fj10.gadf
0.8481319216904772
Code
q10.adcm
43187300.0
Code
q10.gadf
0.8237121384985082
Code
fj10.adcm < q10.adcm
True
Code
fj10.gadf > q10.gadf
True

In some cases, existing classification schemes just are not adequate, for example if you want to show distinctions between critical cutoff values or generate breaks with nicely rounded numbers that resonate more clearly with an audience. In those cases, it is simple to create a classifier using manually specified breaks.

Code
bins = [100000, 500000, 1000000, 1500000]
Code
ud4 = mc.UserDefined(hv, bins=bins)
ud4
UserDefined

        Interval           Count
--------------------------------
[  93900.00,  100000.00] |     1
( 100000.00,  500000.00] |   869
( 500000.00, 1000000.00] |   513
(1000000.00, 1500000.00] |    54
(1500000.00, 1938800.00] |     5
We can also compare the full set of classifiers at once by looping over them and plotting each scheme as a small multiple.

Code
# ignore user-defined
f, ax = plt.subplots(5, 3, figsize=(10, 18))
ax = ax.flatten()

for i, classifier in enumerate(mc.classifiers.CLASSIFIERS[:14]):

    gdf.plot(
        column="median_home_value",
        scheme=classifier,
        ax=ax[i],
        alpha=0.6,
        legend=False,
        missing_kwds=missing_kwds,
    )

    cx.add_basemap(ax[i], source=cx.providers.CartoDB.Positron, attribution="")
    ax[i].set_title(classifier)
    ax[i].set_axis_off()
ax[14].remove()

Using the Tuftian principle of small multiples (Tufte, 1983), we can also look at the intra-county variation in home prices by plotting each map independently but presenting them all together. First, we create a list of county FIPS codes, which are the first five characters of the geoid field.

Code
county_list = set([geoid[:5] for geoid in gdf.geoid])
Code
len(county_list)
25

Then we need to know how many counties we need to plot so we can create enough axes. With 25 different counties we can create a nice 5×5 square, so those are the dimensions passed off to matplotlib. Then, we iterate through the set of counties and plot each to its respective axes (and set the title appropriately).

Code
f, ax = plt.subplots(5, 5, figsize=(16, 18))
ax = ax.flatten()
for i, county in enumerate(sorted(county_list)):
    cgdf = gdf[gdf["geoid"].str.match(f"^{county}")]
    cgdf.plot(
        column="median_home_value",
        scheme="quantiles",
        ax=ax[i],
        edgecolor="none",
        legend=False,
        cmap="YlOrBr",
        alpha=0.4,
        missing_kwds=missing_kwds,
    )
    cx.add_basemap(ax[i], source=cx.providers.CartoDB.Positron, attribution="")
    ax[i].set_title(county)
    ax[i].set_axis_off()
plt.suptitle("Median Home Values by County\n", fontsize=24)
plt.tight_layout()

Here, each subpanel is independent, so the color scheme is relative to the county itself rather than the MSA as a whole; this gives us a sense of how home values vary within each county (rather than being washed out by inter-county variation). As a person from the D.C. region, I have to admit that I have a different impression of this figure now that it has been produced. Technically, there is only one ‘county or county equivalent’ in Washington, D.C. (11001), so it shows up in the first row, first column. And D.C. is tiny, about the same size as many counties in its metropolitan region, so it would be an easy thing to overlook. But in the D.C. metro, equating the District with a county would be wildly disrespectful… D.C. has all the powers of a state except representation (and self-determination), so out of respect, we treat it as a state until it achieves statehood. Apologies for the title.

3.2.2 Colormaps

Choosing a colormap for shading a choropleth map is a science unto itself (Brewer, 1997, 2003; Harrower & Brewer, 2003).

The map exemplifies the “first rule of color composition” of the illustrious Swiss cartographer, Eduard Imhof:

Pure, bright or very strong colors have loud, unbearable effects when they stand unrelieved over large areas adjacent to each other, but extraordinary effects can be achieved when they are used sparingly on or between dull background tones. “Noise is not music . . . only on a quiet background can a colorful theme be constructed,” claims Windisch.

Tufte (2013), p. 58

Code
cmaps = ["Greens", "plasma", "coolwarm"]
descriptions = ["Sequential\nSingle-Hue", "Sequential\nMulti-Hue", "Diverging"]

f, ax = plt.subplots(1, 3, figsize=(16, 7))
ax = ax.flatten()
for i, cmap in enumerate(cmaps):
    gdf.plot(
        column="median_home_value",
        scheme="quantiles",
        ax=ax[i],
        edgecolor="none",
        legend=True,
        legend_kwds={
            # add a dollar sign and suppress decimals
            "fmt": "\${:,.0f}",
            "prop": legend_font,
            "loc": "lower right",
        },
        missing_kwds=missing_kwds,
        cmap=cmap,
        alpha=0.4,
    )
    dc_counties.to_crs(gdf.crs).plot(
        ax=ax[i], facecolor="none", linewidth=0.5, edgecolor="gray"
    )
    cx.add_basemap(ax[i], source=cx.providers.CartoDB.Positron)
    ax[i].set_title(descriptions[i])
    ax[i].set_axis_off()
plt.suptitle("Median Home Value\nby Colormap", fontsize=16)
plt.tight_layout()

These maps all show the same variable (home values) with the same classification scheme (quintiles) but use different colormaps to achieve different effects.

See the matplotlib documentation for more information on built-in colormaps. For an extended set of colors, see also palettable and colorcet.

3.2.2.1 Sequential

Sequential colormaps are ordered from light to dark (or the reverse; in matplotlib, you can reverse most colormaps with the _r suffix, e.g. YlGnBu_r). Sequential maps come in two varieties: single-hue maps, which step through the range of a single color from light to dark, and multi-hue maps, which step through a range of two or more colors that vary in intensity.
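For example, appending _r reverses the direction of a colormap, which helps when the default ordering implies the wrong end of the scale. A quick sketch using the home value quintiles from above:

Code
# sketch: the same quintile map with a sequential colormap and its reversed twin
f, ax = plt.subplots(1, 2, figsize=(12, 6))
for a, cmap in zip(ax, ["YlGnBu", "YlGnBu_r"]):
    gdf.plot(
        column="median_home_value",
        scheme="quantiles",
        k=5,
        cmap=cmap,
        alpha=0.6,
        ax=a,
        missing_kwds=missing_kwds,
    )
    a.set_title(cmap)
    a.set_axis_off()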

Apart from pleasing aesthetics, one of the central concerns when choosing a colormap is how humans perceive differences between colors. That is, does the move from light green to dark green mean the same thing as the move from yellow to purple? Sometimes it can also be difficult to intuit immediately which end of the color palette maps onto higher values: does the brighter color represent high values, or does the darker color?

Another question is how many classes (k) to use. The short answer is that the human eye cannot distinguish ultra-fine differences in color, so choropleth maps with more than about 10 classes have rapidly diminishing returns to larger k.
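One way to see those diminishing returns is to plot the same variable with progressively larger k and notice how quickly the classes become hard to tell apart (a sketch):

Code
# sketch: quantile maps of the same variable with an increasing number of classes
f, ax = plt.subplots(1, 3, figsize=(15, 6))
for a, k in zip(ax, [3, 7, 12]):
    gdf.plot(
        column="median_home_value",
        scheme="quantiles",
        k=k,
        cmap="YlOrBr",
        alpha=0.6,
        ax=a,
        missing_kwds=missing_kwds,
    )
    a.set_title(f"k={k}")
    a.set_axis_off()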

Perceptually Uniform Colormaps

Sequential Colormaps

More Sequential Colormaps

3.2.2.2 Diverging

Diverging colormaps are useful for displaying departure from a central tendency. A diverging map is essentially a pair of complementary sequential maps with one map inverted then stuck on the end of the other. For that reason it is important to choose a classification scheme that centers on a meaningful value (an odd number of quantiles is a good choice), otherwise the color scaling can give a skewed perception of the high and low values. It is also important to ensure that the paired colormaps (a) work well together, and (b) are equally-accessible for all audiences.
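A quick sketch of that idea: classify home values into an odd number of quantiles so the middle class brackets the median, then shade them with a diverging colormap (RdBu_r here, so red reads as above the median):

Code
# sketch: a diverging colormap centered (approximately) on the median via an odd k
ax = gdf.plot(
    column="median_home_value",
    scheme="quantiles",
    k=5,  # odd number of classes, so the middle class straddles the median
    cmap="RdBu_r",
    alpha=0.6,
    figsize=(8, 8),
    legend=True,
    missing_kwds=missing_kwds,
)
ax.set_axis_off()
cx.add_basemap(ax, source=cx.providers.CartoDB.Positron)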

The following pairs of hues [are recommended] for use on thematic maps suited to diverging color schemes: red/blue, orange/blue, orange/purple, and yellow/purple. These recommendations are robust because they may be communicated verbally and require no particular color charts or color measurements for use. These color pairs will aid in producing maps of differences and of changes that can be read accurately by all map users.

Brewer (1996)

Diverging Colormaps

3.2.2.3 Categorical

Categorical colormaps are used to show discrete or nominal data and are designed to emphasize differences in kind rather than degree. There is no natural ordering to a categorical set, so when displaying discrete data it is quite important to choose a categorical colormap to avoid the perception that a color gradient implies a numerical change. One other important concern is ensuring that there are enough colors in the colormap to exhaust the unique discrete values in the dataset. Otherwise, the same color may be assigned to two different categories that end up adjacent on the map, making them impossible to distinguish from one another.

Categorical Colormaps
Code
ax = gdf.assign(county=gdf.geoid.str[:5]).plot(
    "county", cmap="tab20b", categorical=True, figsize=(7, 7), alpha=0.6
)
ax.axis("off")
ax.set_title("Tracts by County in the D.C. MSA")
cx.add_basemap(ax, source=cx.providers.CartoDB.Positron)

3.2.3 Hatching

You can also use matplotlib’s hatching to denote different categories, particularly if you need to create a choropleth with an additional categorical overlay.

Code
import matplotlib.patches as mpatches
Code
gdf.geoid.str[:2].unique()
array(['11', '24', '51', '54'], dtype=object)
Code
states = gdf.geoid.str[:2].unique()[:6]
hatches = ["..", "+", "/", "O"]
handles = []

# store the county variable back onto the gdf
gdf = gdf.assign(county=gdf.geoid.str[:5])

f, ax = plt.subplots(figsize=(9, 14))
gdf.plot(
    "county",
    cmap="tab20b",
    categorical=True,
    figsize=(7, 7),
    alpha=0.6,
    ax=ax,
    legend=True,
)
for i, state in enumerate(states):
    # treat each county as separate but plot to same ax
    gdf[gdf.geoid.str[:2] == state].plot(
        "county",
        hatch=hatches[i],
        facecolor="none",
        linewidth=0,
        ax=ax,
    )
    # create list of patches for legend
    handles.append(
        mpatches.Patch(facecolor="white", alpha=0.4, hatch=hatches[i], label=state)
    )

# store the county legend for later
county_legend = ax.get_legend()

# create a new state legend (but overwrites counties)
state_legend = plt.legend(
    handles=handles,
    handleheight=3,
    handlelength=4,
    loc=4,
)


# add the county legend back
plt.gca().add_artist(county_legend)

cx.add_basemap(ax, source=cx.providers.CartoDB.Positron)
ax.axis("off")
ax.set_title("Counties by State\nin the D.C. Metro Region")

This map is not very pretty, but it conveys a lot of information. You can change the density of the hatching pattern by adding more symbols (e.g. xxxx instead of x), and you can introduce variation by mixing symbols, as in the sketch below.
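A minimal sketch of both tricks, overlaying a few counties from the dataset with hatch patterns of different density and composition:

Code
# sketch: vary hatch density ('x' vs 'xxxx') and mix symbols ('x..') for three counties
f, ax = plt.subplots(1, figsize=(6, 6))
gdf.plot(ax=ax, facecolor="lightgray", edgecolor="white", linewidth=0.2)
for county, hatch in zip(sorted(gdf.county.unique())[:3], ["x", "xxxx", "x.."]):
    gdf[gdf.county == county].plot(ax=ax, facecolor="none", hatch=hatch, linewidth=0)
ax.set_axis_off()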

3.3 Interactive Visualization

Interactive visualization gives an experience closer to a traditional desktop GIS, where you can pan, zoom, inspect, and mouse over different features of the map for more information. For exploratory work (and many other cases) this is extremely useful. One downside is that interactive visualization can be resource-intensive, especially when it is done in a web browser with limited access to computational power. For that reason, we will trim the example dataset down to just D.C.

Code
dc = gio.get_acs(
    datasets, state_fips="11", level="bg", years=["2021"], constant_dollars=False
)
dc = dc.to_crs(4326)

3.3.1 Folium & explore

The handiest (and therefore one of the best) ways to do interactive visualization is the explore method built into the geopandas GeoDataFrame. Internally, this works almost identically to the plot method, except that it builds a folium-based webmap instead of a static plot using matplotlib. To render the map, folium passes the data off to Leaflet.js, which has been the juggernaut powering open-source webmaps for the last decade or so.

Code
dc[["median_home_value", "geometry"]].explore(
    column="median_home_value",
    tooltip=["median_home_value"],
    scheme="FisherJenks",
    tiles="CartoDB Positron",
    style_kwds={'weight':0.5}
)

The downside to folium/explore is that Leaflet, having been built several years ago, is not the most performant rendering engine, and GeoJSON is ultimately the data transfer mechanism. That means that when a map is rendered, every vertex of every polygon in the dataset is stored in the webpage as text. For complicated features or a large number of features (let alone both), this can slow a browser down significantly, and you can easily run out of memory entirely and crash the page.
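One pragmatic workaround is to simplify the geometries before handing them to explore, which shrinks the GeoJSON payload at the cost of some boundary fidelity. A sketch (the tolerance is in degrees because dc is in EPSG:4326, and the value here is only a guess you would tune for your data):

Code
# sketch: shrink the GeoJSON payload by simplifying geometries before explore()
dc_simple = dc[["median_home_value", "geometry"]].copy()
dc_simple["geometry"] = dc_simple.geometry.simplify(0.0005)  # tolerance in degrees

dc_simple.explore(
    column="median_home_value",
    scheme="FisherJenks",
    tiles="CartoDB Positron",
    style_kwds={"weight": 0.5},
)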

3.3.2 HoloViews and GeoViews

The holoviews and geoviews libraries (and hvplot, which gives us a nice API) are part of the PyViz suite and use Bokeh as a backend for rendering interactive maps. The great thing about these libraries is (obviously) that they plug into the rest of the PyViz ecosystem to include things like callbacks and interlinking with other ipywidgets. This can make it extremely fast to put together an interactive dashboard for exploring a new dataset or testing a new method at scale.

One handy thing with hvplot is the ability to link two interactive maps together so they pan and zoom at the same time, which can be especially useful for comparative work.

Code
dc.hvplot(
    c="median_home_value",
    cmap="viridis",
    line_width=0.1,
    alpha=0.7,
    geo=True,
    tiles="CartoLight",
    xaxis=False,
    yaxis=False,
    frame_height=450,
    colorbar=False,
    title="Median Home Value",
) + dc.hvplot(
    c="median_household_income",
    cmap="viridis",
    line_width=0.1,
    alpha=0.7,
    geo=True,
    tiles="CartoLight",
    xaxis=False,
    yaxis=False,
    frame_height=450,
    colorbar=False,
    title="Median Household Income",
)

The hvplot syntax uses the + symbol to link maps together, so when you pan and zoom in one map, it has the same effect in the other. One downside to holoviews and hvplot is that there is no passing off to mapclassify under the hood; every variable is shaded on a continuous linear scale. To generate a nice-looking choropleth, that means you need to create a temporary column on the geodataframe that stores the bin values, then use that temporary variable to color the features with hvplot.

Below we will use the FisherJenks classifier to create binning columns for median home value and median household income (using only non-NaN values).

Code
dc.loc[~dc.median_home_value.isna(), "mhv_bins"] = mc.FisherJenks(
    dc.median_home_value.dropna()
).yb
dc.loc[~dc.median_household_income.isna(), "mhi_bins"] = mc.FisherJenks(
    dc.median_household_income.dropna()
).yb
Code
hover_cols = ["geoid", "median_home_value", "median_household_income"]

dc.hvplot(
    c="mhv_bins",
    cmap="viridis",
    line_width=0.1,
    alpha=0.7,
    geo=True,
    tiles="CartoLight",
    xaxis=False,
    yaxis=False,
    frame_height=450,
    colorbar=False,
    title="Median Home Value",
    hover_cols=hover_cols,
    hover_fill_color="red",
    selection_fill_color="blue",
    tools=["tap"],
) + dc.hvplot(
    c="mhi_bins",
    cmap="viridis",
    line_width=0.1,
    alpha=0.7,
    geo=True,
    tiles="CartoLight",
    xaxis=False,
    yaxis=False,
    frame_height=450,
    colorbar=False,
    title="Median Household Income",
    hover_cols=hover_cols,
    tools=["tap", "box_select"],
)

3.3.3 PyDeck

PyDeck and Kepler.gl are based on deck.gl, a high-performance graphics library built by the Uber team. Using deck.gl can be great in some situations because it can handle larger datasets than Leaflet (in theory, anyway) and it can render objects in three dimensions, allowing the use of height (the z-dimension) as an additional visual variable (Bertin, 1983). This can be purely stylistic, like extruding buildings by their height attribute to create a more realistic visualization, or it can be analytic, like using height (in addition to color) as a way of exploring variation in spatial relationships.

But PyDeck’s additional features come at the (relatively small) cost of additional parameterization. Specifically, we need to create a color mapping ourselves and create an extrusion factor used to visualize the height of each feature (i.e. a variable on the dataset or some function thereof). Both of these can be read as columns from the geodataframe that provides the feature geometry, so we just need to create some simple functions. First, we will set a variable to visualize (still median home value) and use the Fisher Jenks classifier with six categories, but with a new colormap (diverging this time, which is more appropriate for home values).

Code
var = "median_home_value"
cmap = "PRGn"
classifier = "fisherjenks"
k = 6

Then we create two functions: one that creates the extrusion values and one that generates the color for each observation. The extrusion factor (the height of each polygon) depends on the scale of the input variable and may take some experimentation to work out. The color mapping requires us to define a mapclassify classifier (and k) and collect the appropriate color from matplotlib. That function is short, but because matplotlib can be arcane, the function can be too. All we need to do is create a mapping between values of the given variable and colors from the given colormap, using the classifier as our translator (the caveat is that we need to translate the [r, g, b, a] color attributes, that is red, green, blue, and alpha (transparency), onto the 0-255 scale).

Code
def get_height(val):
    return val**0.6
Code
from mapclassify.util import get_color_array
Code
def pure_get_color_array(
    values,
    scheme="quantiles",
    cmap="viridis",
    alpha=1,
    nan_color=[255, 255, 255, 255],
    as_hex=False,
    **kwargs,
):
    """Convert array of values into RGBA or hex colors using a colormap and classifier.
    This function is useful for visualization libraries that require users to provide
    an array of colors for each object (like pydeck or lonboard) but can also be used
    to create a manual column of colors passed to matplotlib.

    Parameters
    ----------
    values : list-like
        array of input values
    scheme : str, optional
        string description of a mapclassify classifier, by default `"quantiles"`
    cmap : str, optional
        name of matplotlib colormap to use, by default "viridis"
    alpha : float
        alpha parameter that defines transparency. Should be in the range [0,1]
    nan_color : list, optional
        RGBA color to fill NaN values, by default [255, 255, 255, 255]
    as_hex: bool, optional
        if True, return a (n,1)-dimensional array of hexcolors instead of a (n,4)
        dimensional array of RGBA values.
    kwargs : dict
        additional keyword arguments are passed to `mapclassify.classify`

    Returns
    -------
    numpy.array
        numpy array (aligned with the input array) defining a color for each row. If
        `as_hex` is False, the array is :math:`(n,4)` holding an array of RGBA values in
        each row. If `as_hex` is True, the array is :math:`(n,1)` holding a hexcolor in
        each row.

    """

    # only operate on non-NaN values
    v = pd.Series(values, dtype=object)
    legit_indices = v[~v.isna()].index.values
    legit_vals = v.dropna().values
    # stash these to fill later
    bogus_indices = v[v.isna()].index.values
    # transform (non-NaN) values into class bins
    bins = _classify(legit_vals, scheme=scheme, **kwargs).yb

    # normalize using the data's range (not strictly 1-k if classifier is degenerate)
    norm = Normalize(min(bins), max(bins))
    normalized_vals = norm(bins)

    # generate RBGA array and convert to series
    rgbas = colormaps[cmap](normalized_vals, bytes=True, alpha=alpha)
    colors = pd.Series(list(rgbas), index=legit_indices).apply(np.array)
    nan_colors = pd.Series([nan_color for i in range(len(bogus_indices))], index=bogus_indices).apply(lambda x: np.array(x).astype(np.uint8))

    # put colors in their correct places and fill empty with specified color
    v.update(colors)
    v.update(nan_colors)

    # convert to hexcolors if preferred
    if as_hex:
        colors = v.apply(lambda x: to_hex(x / 255.0))
        return colors.values
    return np.stack(v.values)
Code
from mapclassify import classify as _classify
from matplotlib import colormaps
from matplotlib.colors import Normalize, to_hex

With those functions in hand, we can create two temporary ‘visualization variables’ that can be passed to the PyDeck plot function.

Code
pd.Series(list(pure_get_color_array(dc[var], scheme=classifier, k=k, cmap=cmap, as_hex=False))).apply(list)
0      [231, 212, 232, 255]
1          [0, 68, 27, 255]
2          [0, 68, 27, 255]
3      [217, 240, 211, 255]
4      [255, 255, 255, 255]
               ...         
566    [153, 112, 171, 255]
567    [153, 112, 171, 255]
568    [231, 212, 232, 255]
569        [64, 0, 75, 255]
570    [255, 255, 255, 255]
Length: 571, dtype: object
Code
dc["fill_color"] = list(pure_get_color_array(dc[var].values, scheme=classifier, k=k, cmap=cmap, as_hex=False))
dc["height"] = dc[var].apply(lambda x: get_height(x))

To show that it worked, we can look at the fill_color column and see that it now holds a list of r, g, b, a values for each observation.

Code
dc['fill_color'] = dc.fill_color.apply(lambda x: x.astype(float)).apply(list).values
Code
layers = [
    pdk.Layer(
        "GeoJsonLayer",
        data=dc[["geometry", "fill_color", "height"]],
        get_fill_color='fill_color',
        opacity=0.8,
        extruded=True,
        get_elevation="height",
        auto_highlight=True,
        pickable=True,
    ),
]

view_state = pdk.ViewState(
    **{
        "latitude": dc.unary_union.centroid.y,
        "longitude": dc.unary_union.centroid.x,
        "zoom": 10,
        "maxZoom": 16,
        "pitch": 25,
        "bearing": -40,
    }
)

D = pdk.Deck(
    layers,
    map_provider="carto",
    map_style=pdk.map_styles.LIGHT,
    height=700,
    initial_view_state=view_state,
)

Use option+click to change the camera orientation rather than panning the map.

Code
D

In this map we are using two visual variables to encode the same measurement (home values) in two different ways (like Figure 3.2, except here we use height/extrusion rather than size). Color is used to discretize the distribution into categories, and height is used to show the full variation. In this case, color effectively re-encodes home values from an interval measure into an ordinal measure, while height encodes the original (albeit rescaled) interval measure. This is a bit like visualizing Figure 3.5 on the map itself, because the height displays the full range of the data in the histogram and the colors show the binning scheme. The variation in height within a single color band represents the variation inside each pair of black bars in the histogram. A more effective example of extrusion is shown in Section 14.3.1.
