Neighborhood Opportunity Types in R


Neighborhood Opportunity, Spatial Data Science, and Equitable Urban Policy

This notebook walks through the creation of neighborhood typologies using R. In the parlance of geography and spatial analysis, this is known as geodemographic analysis. In our case, we’ll be developing typologies that can be used to represent the different “opportunity structures” available to people in different neighborhoods, which can then be used to support the design of more equitable housing, transportation, and other urban policies.

This example will use Census data from the 2015 ACS, with tracts serving as (weak) proxies for neighborhoods. The first section gives a quick demonstration of the brand new (and excellent) tidycensus package developed by Kyle Walker, which is useful for starting completely from scratch and collecting Census data programmatically.

The next section will use external data to demonstrate how we can develop composite neighborhood indices based on the literature on neighborhood effects along with statistical techniques for modeling latent variables (what is often called measurement modeling in the structural equation modeling literature). Note: the rationale behind the methodology outlined here is detailed in a forthcoming paper in Housing Policy Debate.

Finally, we will look at how neighborhood typologies can be created by performing a cluster analysis on (a) the large set of variables, (b) the composite indices, or (c) a collection of composite indices and other policy-relevant indicators.

As part of the analysis we will use:

  • tidycensus and tigris for programmatically downloading data from the Census
  • lavaan for building a measurement model used to estimate our composite opportunity indices
  • mclust for performing a model-based cluster analysis
  • leaflet for making interactive maps
  • ggplot and ggbeeswarm for plotting and analyzing the cluster results
library(tidycensus)
library(tidyverse)
## + ggplot2 2.2.1.9000        Date: 2017-05-31
## + tibble  1.3.3                R: 3.4.0
## + tidyr   0.6.3               OS: macOS Sierra 10.12.5
## + readr   1.1.1              GUI: X11
## + purrr   0.2.2.2         Locale: en_US.UTF-8
## + dplyr   0.5.0               TZ: America/New_York
## + stringr 1.2.0           
## + forcats 0.2.0
## ── Conflicts ────────────────────────────────────────────────────
## * filter(),  from dplyr, masks stats::filter()
## * lag(),     from dplyr, masks stats::lag()
library(sf)
## Linking to GEOS 3.6.1, GDAL 2.1.3, proj.4 4.9.3, lwgeom 2.3.2 r15302
library(tigris)
## As of version 0.5.1, tigris does not cache downloaded data by default. To enable caching of data, set `options(tigris_use_cache = TRUE)` in your R script or .Rprofile.
## 
## Attaching package: 'tigris'
## The following object is masked from 'package:graphics':
## 
##     plot
options(tigris_class = "sf")
options(tigris_use_cache = TRUE)

To download data straight from the Census, we’ll first need to set the API key. If you don’t have a Census API key, you can register for one (free) here. Replace the text in the following cell with your own key. Alternatively, you can simply load your own data from a CSV or other external file. In that case, I’d recommend checking out Kyle’s tigris package (which is used internally by tidycensus) to download shapefiles directly from the Census.

Part I: Collecting Demographic Data from the Census

First, set your Census API key; if you don’t have one, you can get one (free) here.

# census_key <- ""  # set your own Census API key here
census_api_key(census_key)  

First, we need to define a list of variables we might want to include in our analysis. The list of variables available from the Census API is accessible here. In this simple example, let’s just say we’re interested in the neighborhood poverty rate. From the Census API site, we can see that the variable B17001_002 refers to the total number of people in poverty and B17001_001 refers to the total number of people for whom poverty status is determined. Dividing the former by the latter gives us the poverty rate, and using the get_acs function from tidycensus we can get both variables at once in a tidy dataset, and we can include the tract boundaries for good measure. Here’s what that might look like for Chicago:

chi_data <- get_acs(geography = "tract", variables = 'B17001_002', summary_var = 'B17001_001',
state = "IL", county='Cook', geometry=TRUE)

Set the row names to the tract GEOID so that we can drop that column but retain the identifiers:

row.names(chi_data) <- chi_data$GEOID

Now we can just calculate the poverty rate and keep only the columns we need:

chi_data$poverty_rate <- (chi_data$estimate / chi_data$summary_est) * 100
chi_data <- subset(chi_data, select=c('geometry', 'poverty_rate'))

Then, we can easily create a choropleth map by piping the data into ggplot:

library(viridis)
## Loading required package: viridisLite
library(ggthemes)

chi_data %>%
  ggplot(aes(fill = poverty_rate, color = poverty_rate)) +
  geom_sf() +
  coord_sf(crs = 26971) + # set the Illinois East State Plane CRS, thanks to epsg.io
  scale_fill_viridis() +
  scale_color_viridis() +
  theme_pander()

Following these basic commands, you could easily build a large dataset of useful Census data. You can call get_acs with a list of variables to download several at once, but since tidycensus returns total counts rather than percentages, you’d need to make separate calls for each variable that requires a new denominator. There’s a good example of this workflow in the demo on Kyle’s website, where he calculates the percentage of each racial category. Since all of the racial categories use the same denominator (total population), you can grab all the necessary data in a single call to get_acs. If you wanted to create a larger dataset that included race, poverty rate, and unemployment rate, for example, you’d make three separate calls and then join the three dataframes together using GEOID, as sketched below.
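For instance, a rough sketch of that multi-call workflow might look like the following (B03002 is the race/ethnicity table Kyle uses in his demo; double-check any variable IDs against the Census variable list before relying on them):

# Race/ethnicity shares: several variables, one shared denominator (B03002_001)
race <- get_acs(geography = "tract",
                variables = c("B03002_003", "B03002_004", "B03002_012"),
                summary_var = "B03002_001",
                state = "IL", county = "Cook") %>%
  mutate(pct = 100 * estimate / summary_est) %>%
  select(GEOID, variable, pct) %>%
  spread(variable, pct)

# Poverty rate: its own denominator, so it gets its own call
poverty <- get_acs(geography = "tract", variables = "B17001_002",
                   summary_var = "B17001_001",
                   state = "IL", county = "Cook") %>%
  mutate(poverty_rate = 100 * estimate / summary_est) %>%
  select(GEOID, poverty_rate)

# An analogous third call would handle unemployment; then join everything on GEOID
chi_wide <- inner_join(race, poverty, by = "GEOID")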


Part II: Evidence-based “Opportunity Indices”

For a few years now, there have been lots of people trying to measure the geography of opportunity. I wrote a good chunk of my dissertation on why most of those attempts are misguided and how it might be done better. The root of the problem is that there has always been a weak connection between the practice of “opportunity mapping” and the long history of research on neighborhood effects. Social scientists know a lot about the ways that spatial context affects a wide variety of individual outcomes (like cognitive development, educational attainment, mental & physical health, employment, and several others) but until Chetty, the public often found it easy to ignore. If we want to develop good measures of spatial opportunity we need to:

  1. look at what the research tells us is most important about spatial context
  2. make sure we’re not conflating neighborhood influences with individual vulnerability characteristics
  3. find a way to use available data to model those important neighborhood influences

Starting with step one

We’ll leverage the work of George Galster, the extraordinary urban economist who coined the term “Geography of Opportunity” in the mid-1990s. Galster has done an enormous amount of work trying to understand how neighborhoods affect various socioeconomic outcomes, both in his own empirical studies and in reviews of others’ work. In a recent review of the literature, he argues that there’s strong evidence for four basic dimensions of spatial opportunity:

Social effects accrue through the people you interact with in your neighborhood (like knowing people who can give you a job referral)

Environmental effects accrue through ambient exposure (like pollution)

Geographic effects accrue through access to goods and services (like jobs)

Institutional effects accrue through participation in organizations (like school quality)

Together, these comprise a socio-spatial landscape called the geography of opportunity.

In simpler terms, we can say that for people to succeed in life, they need (1) safe and healthy environments, (2) access to jobs, goods, and services, (3) supportive, beneficial institutions like good schools, community centers, houses of worship, etc. and (4) efficacious and supportive social networks with knowledge and willingness to help out.

People can make do without some or all of those things, but they are severely disadvantaged compared to others who do have them. Individual grit and hard work matter, but spatial context matters too. A lot.

In step two

we need to make sure that there’s a plausible causal connection between each of the things we measure and its spatial influence on the people who live there. The way to think about this is through the economic principle of ceteris paribus: all else equal, could I expect this variable to affect my life in some way regardless of my personal circumstances? In other words, if you were to move into a neighborhood tomorrow, how would the conditions in that neighborhood affect your life, even if we knew nothing about you personally? This means we’re not measuring things like housing affordability as a measure of opportunity. That’s not to say that housing affordability doesn’t matter (it does), but if I move into a new neighborhood tomorrow it won’t affect me if my neighbors have a tough time paying their rent. On the other hand, if I move into a neighborhood where unemployment is at 50%, there’s a very small chance that my neighbors will be able to help me find work or recommend my application to their bosses, etc.

It’s important to keep spatial opportunity and individual vulnerability distinct.

In step three

We’ll use quantitative methods to try to isolate the four factors we identified above. If we can do that well, we should have a good handle on the spatial structure of opportunity.

Above we saw how to collect Census data we might use in this analysis. Census data is extremely important, but it can only take us so far when looking to represent the spatial structure of opportunity. To adequately capture the four dimensions outlined above, we also need:

  • health and safety data, like exposure to ambient pollution or crime which might come from the EPA or your local Socrata platform or other open data portal, respectively
  • data on access to jobs, goods, and services which you might get from the Census LEHD dataset or OpenStreetMap, ideally processed through a network analysis tool like Pandana or a regional travel demand model. If those aren’t available, you might turn to the job accessibility measures provided by the Smart Location Database
  • information on institutions (namely, school quality) which you might get from the local department of education

In my dissertation at UMD, I did a lot of this work for the Baltimore region, so I’ll use my dataset to show how to build reasonable measures of “opportunity” once you have these data on hand. I’ll skip the data collection and processing pieces which could easily fill another notebook.

library(lavaan)
## This is lavaan 0.5-23.1097
## lavaan is BETA software! Please report any bugs.
opportunity_data <- read.csv('opportunity_data.csv', row.names='geoid')

We begin with the theoretical model outlined above: The Geography of Opportunity can be represented parsimoniously by sound measurements of four key dimensions that we’ll call social, environmental, institutional and geographic.

We can use confirmatory factor analysis (CFA) to test whether the data support this theoretical model. If the model holds up to scrutiny, we can use the latent variables estimated by the model as measures of the four dimensions of spatial opportunity.

To create a factor model in lavaan, we specify a series of equations like you might do for a regression, where your unobserved variable (which does not actually appear in your dataset) is given by a linear combination of observed variables (which are in the dataset). In this case, we specify the four dimensions outlined above: social, environmental, institutional, and geographic. We also specify two additional variables called ES and HS, which correspond to elementary and high school values respectively. The logic is that we would expect a relationship to exist among multiple measures from the same high school (a high-performing high school will have lots of students who pass their final exams and lots of students who do well on their AP exams), and similarly for elementary school. We don’t have enough indicators to estimate a separate middle school factor, so the institutional factor is essentially comprised of all the school data along with the additional hierarchy just described.

Finally, although each of these measurements is designed to capture a different dimension of the importance of place, they’re also likely to be related to one another. For instance, wealthy people are likely to have jobs and to live in places with good school districts and low crime. To account for that relationship, we specify that each pair of factors should be allowed to covary.

opportunity_factors <- '

social =~ income + edu_diploma + owner_occupied_housing + poverty + unemployment + welfare

geographic =~ walk_score + density_public_institutions + density_social_orgs + jobs_transit + access_healthcare

HS =~ hs_performance + ap_scores + teachers_high

ES =~ reading_3rd + math_3rd + reading_5th + math_5th

institutional =~ HS + ES + ms_performance + teachers_middle

environmental =~ crime + vacancy + toxic

social ~~ environmental
social ~~ institutional
social ~~ geographic
institutional ~~ environmental
institutional ~~ geographic
geographic ~~ environmental
'

opportunity_fit <- cfa(opportunity_factors, data=opportunity_data,  missing='ML',  std.all=TRUE, information="observed")
summary(opportunity_fit, standardize=TRUE)
## lavaan (0.5-23.1097) converged normally after  67 iterations
## 
##   Number of observations                           660
## 
##   Number of missing patterns                         3
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic             1934.344
##   Degrees of freedom                               222
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Information                                 Observed
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   social =~                                                             
##     income            1.000                               0.882    0.883
##     edu_diploma       0.915    0.034   26.826    0.000    0.807    0.807
##     ownr_ccpd_hsng    0.843    0.036   23.710    0.000    0.744    0.744
##     poverty          -0.997    0.031  -31.998    0.000   -0.880   -0.880
##     unemployment     -0.798    0.037  -21.300    0.000   -0.704   -0.705
##     welfare          -0.764    0.038  -20.014    0.000   -0.674   -0.675
##   geographic =~                                                         
##     walk_score        1.000                               0.869    0.870
##     dnsty_pblc_nst    1.113    0.028   40.199    0.000    0.967    0.968
##     densty_scl_rgs    1.103    0.028   39.490    0.000    0.959    0.960
##     jobs_transit      0.907    0.034   26.294    0.000    0.789    0.789
##     access_helthcr    1.011    0.031   32.103    0.000    0.879    0.879
##   HS =~                                                                 
##     hs_performance    1.000                               0.990    0.992
##     ap_scores         0.908    0.019   47.477    0.000    0.899    0.901
##     teachers_high     0.756    0.027   27.713    0.000    0.748    0.750
##   ES =~                                                                 
##     reading_3rd       1.000                               0.928    0.929
##     math_3rd          0.958    0.025   38.801    0.000    0.889    0.890
##     reading_5th       1.003    0.024   42.556    0.000    0.931    0.932
##     math_5th          0.976    0.025   39.057    0.000    0.905    0.906
##   institutional =~                                                      
##     HS                1.000                               0.925    0.925
##     ES                0.908    0.029   31.127    0.000    0.895    0.895
##     ms_performance    0.964    0.027   35.402    0.000    0.882    0.883
##     teachers_middl    0.861    0.031   27.821    0.000    0.789    0.789
##   environmental =~                                                      
##     crime             1.000                               0.720    0.721
##     vacancy          -1.010    0.056  -18.157    0.000   -0.727   -0.728
##     toxic             0.789    0.055   14.272    0.000    0.568    0.569
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   social ~~                                                             
##     environmental     0.529    0.040   13.135    0.000    0.833    0.833
##     institutional     0.689    0.046   15.000    0.000    0.853    0.853
##     geographic       -0.571    0.042  -13.631    0.000   -0.744   -0.744
##   institutional ~~                                                      
##     environmental     0.622    0.045   13.923    0.000    0.943    0.943
##   geographic ~~                                                         
##     institutional    -0.637    0.044  -14.412    0.000   -0.800   -0.800
##     environmental    -0.581    0.043  -13.563    0.000   -0.928   -0.928
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .income           -0.000    0.039   -0.000    1.000   -0.000   -0.000
##    .edu_diploma      -0.000    0.039   -0.000    1.000   -0.000   -0.000
##    .ownr_ccpd_hsng   -0.000    0.039   -0.000    1.000   -0.000   -0.000
##    .poverty           0.000    0.039    0.000    1.000    0.000    0.000
##    .unemployment      0.000    0.039    0.000    1.000    0.000    0.000
##    .welfare           0.000    0.039    0.000    1.000    0.000    0.000
##    .walk_score        0.000    0.039    0.000    1.000    0.000    0.000
##    .dnsty_pblc_nst    0.000    0.039    0.000    1.000    0.000    0.000
##    .densty_scl_rgs    0.000    0.039    0.000    1.000    0.000    0.000
##    .jobs_transit      0.000    0.039    0.000    1.000    0.000    0.000
##    .access_helthcr    0.000    0.039    0.000    1.000    0.000    0.000
##    .hs_performance    0.001    0.039    0.027    0.979    0.001    0.001
##    .ap_scores         0.001    0.039    0.024    0.981    0.001    0.001
##    .teachers_high     0.001    0.039    0.020    0.984    0.001    0.001
##    .reading_3rd      -0.000    0.039   -0.000    1.000   -0.000   -0.000
##    .math_3rd         -0.000    0.039   -0.000    1.000   -0.000   -0.000
##    .reading_5th      -0.000    0.039   -0.000    1.000   -0.000   -0.000
##    .math_5th         -0.000    0.039   -0.000    1.000   -0.000   -0.000
##    .ms_performance    0.001    0.039    0.018    0.985    0.001    0.001
##    .teachers_middl    0.001    0.039    0.016    0.987    0.001    0.001
##    .crime            -0.000    0.039   -0.000    1.000   -0.000   -0.000
##    .vacancy           0.000    0.039    0.000    1.000    0.000    0.000
##    .toxic            -0.000    0.039   -0.000    1.000   -0.000   -0.000
##     social            0.000                               0.000    0.000
##     geographic        0.000                               0.000    0.000
##     HS                0.000                               0.000    0.000
##     ES                0.000                               0.000    0.000
##     institutional     0.000                               0.000    0.000
##     environmental     0.000                               0.000    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .income            0.220    0.016   13.381    0.000    0.220    0.221
##    .edu_diploma       0.348    0.022   15.540    0.000    0.348    0.348
##    .ownr_ccpd_hsng    0.445    0.027   16.376    0.000    0.445    0.446
##    .poverty           0.225    0.016   13.639    0.000    0.225    0.225
##    .unemployment      0.503    0.030   16.763    0.000    0.503    0.504
##    .welfare           0.544    0.032   17.017    0.000    0.544    0.545
##    .walk_score        0.242    0.015   16.565    0.000    0.242    0.243
##    .dnsty_pblc_nst    0.063    0.007    9.225    0.000    0.063    0.063
##    .densty_scl_rgs    0.079    0.007   10.754    0.000    0.079    0.079
##    .jobs_transit      0.377    0.022   17.152    0.000    0.377    0.377
##    .access_helthcr    0.226    0.014   16.008    0.000    0.226    0.227
##    .hs_performance    0.015    0.008    1.885    0.059    0.015    0.015
##    .ap_scores         0.187    0.013   14.848    0.000    0.187    0.188
##    .teachers_high     0.437    0.025   17.396    0.000    0.437    0.438
##    .reading_3rd       0.138    0.011   11.962    0.000    0.138    0.138
##    .math_3rd          0.208    0.014   14.368    0.000    0.208    0.208
##    .reading_5th       0.132    0.011   11.963    0.000    0.132    0.132
##    .math_5th          0.179    0.013   13.780    0.000    0.179    0.179
##    .ms_performance    0.219    0.015   14.319    0.000    0.219    0.220
##    .teachers_middl    0.376    0.023   16.380    0.000    0.376    0.377
##    .crime             0.480    0.032   15.031    0.000    0.480    0.481
##    .vacancy           0.469    0.031   15.271    0.000    0.469    0.470
##    .toxic             0.676    0.039   17.529    0.000    0.676    0.677
##     social            0.778    0.055   14.211    0.000    1.000    1.000
##     geographic        0.756    0.054   14.090    0.000    1.000    1.000
##     HS                0.142    0.014   10.150    0.000    0.145    0.145
##     ES                0.171    0.015   11.166    0.000    0.198    0.198
##     institutional     0.838    0.055   15.251    0.000    1.000    1.000
##     environmental     0.519    0.051   10.085    0.000    1.000    1.000
# Store a few different model fit statistics in a dataframe
fitm <- data.frame(round(fitmeasures(opportunity_fit, c('srmr', 'ifi', 'cfi',  'nfi', 'tli', 'rmsea')),2))
fitm
##       round.fitmeasures.opportunity_fit..c..srmr....ifi....cfi....nfi...
## srmr                                                                0.05
## ifi                                                                 0.90
## cfi                                                                 0.90
## nfi                                                                 0.89
## tli                                                                 0.88
## rmsea                                                               0.11

The fit measures suggest this is a reasonable model, though certainly not a perfect one. There are many opinions and lots of advice in the SEM literature about what constitutes a good CFA fit (which is partly why there are so many different measures). In general, the literature suggests that a model is adequate if:

  • SRMR <= 0.08
  • IFI >= 0.9 (ideally >= 0.95)
  • CFI >= 0.9 (ideally >= 0.95)
  • NFI >= 0.9 (ideally >= 0.95)
  • TLI >= 0.9 (ideally >= 0.95)
  • RMSEA <= 0.1

Following these guidelines, I would argue that this is an acceptable model, especially since these methods were not developed in the context of urban/spatial data, so this is almost entirely new territory. Furthermore, Marsh, Hau, and Wen argued that the oft-cited cutoffs listed above can be too conservative, so we’ll follow their advice ;)
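If you want to see where the remaining misfit comes from, lavaan’s modification indices are one diagnostic. A quick sketch (they’re a guide for inspection, not a license to keep re-specifying the model until it fits):

# The largest modification indices point to the constrained parameters the data most dislike
mi <- modificationindices(opportunity_fit)
head(mi[order(mi$mi, decreasing = TRUE), ], 10)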

Since we’ve established that the data support our (Galster’s) theoretical model, we can use the estimated variables as our composite ‘opportunity’ measures, which we can extract from the fitted model using the predict function.

opportunity_dimensions <- data.frame(geoid=row.names(opportunity_data), predict(opportunity_fit), row.names=row.names(opportunity_data))
# We don't really need the HS and ES variables, so lets drop those
opportunity_dimensions <- subset(opportunity_dimensions, select=-c(HS, ES))
head(opportunity_dimensions)
##                   geoid    social geographic institutional environmental
## 24003701101 24003701101 0.9589206 -0.5939411     0.9851466     0.6339893
## 24003701102 24003701102 0.4958195 -0.4596392     0.8858705     0.5323079
## 24003701200 24003701200 0.6249499 -0.9318697     0.9478635     0.7466184
## 24003701300 24003701300 1.1124505 -1.5147904     0.9982983     1.0119755
## 24003701400 24003701400 0.8624903 -1.1202646     0.8861376     0.8052137
## 24003702100 24003702100 0.8055959 -0.8491362     0.2624387     0.4535719

So far, we’ve only been working with tabular data to build the factor model, but now that we have good estimates of the opportunity dimensions, let’s grab the tract shapefiles and map each of the dimensions.

We’ll use tigris and borrow Kyle’s metro_tracts function to download all the census tracts in the Baltimore region:

metro_tracts <- function(metro_name) {

  # First, identify which states intersect the metro area using the
  # states function in tigris
  st <- states(cb = TRUE)
  cb <- core_based_statistical_areas(cb = TRUE)
  metro <- filter(cb, grepl(metro_name, NAME))

  stcodes <- st[metro,]$STATEFP

  # Then, fetch the tracts, using rbind_tigris if there is more
  # than one state
  if (length(stcodes) > 1) {
    tr <- rbind_tigris(
      map(stcodes, function(x) {
        tracts(x, cb = TRUE)
      })
    )
  } else {
    tr <- tracts(stcodes, cb = TRUE)
  }

  # Now, find out which tracts are within the metro area
  within <- st_within(tr, metro)

  within_lgl <- map_lgl(within, function(x) {
    if (length(x) == 1) {
      return(TRUE)
    } else {
      return(FALSE)
    }
  })

  # Finally, subset and return the output
  output <- tr[within_lgl,]

  return(output)

}

balt <- metro_tracts("Baltimore")
## although coordinates are longitude/latitude, it is assumed that they are planar
## although coordinates are longitude/latitude, it is assumed that they are planar
# Keep only the GEOID and the geometry columns, rename GEOID to lowercase
balt <- select(balt, GEOID, geometry)
balt <- rename(balt, geoid=GEOID)
balt_data <- inner_join(balt, opportunity_dimensions)
## Joining, by = "geoid"
library(RColorBrewer)
ggplot(balt_data, aes(fill = cut_number(social, 5), color=cut_number(social, 5))) +
geom_sf() +
coord_sf(crs = 2248) +
scale_fill_brewer(palette="BrBG") + 
scale_color_brewer(palette="BrBG")  

  #theme_solarized(light = FALSE)

That map isn’t terribly pretty, but it looks plausible. Support for sf in ggplot hasn’t been finalized yet and my ggplot-fu is rusty, so I’m not sure how to adjust the linewidths around polygons.
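For what it’s worth, one approach that may help (assuming this development build of ggplot2 behaves like later releases, where geom_sf takes a constant size for polygon borders) is to stop mapping color to the data and set the border color and width as constants inside geom_sf():

# A sketch: thin fixed white borders instead of data-driven outline colors
ggplot(balt_data) +
  geom_sf(aes(fill = cut_number(social, 5)), color = "white", size = 0.1) +
  coord_sf(crs = 2248) +
  scale_fill_brewer(palette = "BrBG")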

Static maps are only so useful anyway. With the leaflet library, we can pipe the data into a slippy map so that we can pan & zoom over a CARTO basemap.

library(leaflet)
balt_wgs <- balt_data %>% st_set_crs(4326) # make sure the dataframe CRS is set to WGS84
## Warning: st_crs<- : replacing crs does not reproject data; use st_transform
## for that
m <- leaflet(balt_wgs)

m %>%
  addTiles('http://{s}.basemaps.cartocdn.com/light_all/{z}/{x}/{y}.png',
           attribution = paste(
             '&copy; <a href="http://openstreetmap.org">OpenStreetMap</a> contributors',
             '&copy; <a href="http://cartodb.com/attributions">CartoDB</a>'),
           group = 'CARTO Positron') %>%
  addPolygons(color = "#444444", weight = 0.2, smoothFactor = 0.5,
              opacity = 1.0, fillOpacity = 0.5, group = 'Social',
              fillColor = ~colorQuantile("BrBG", n = 5, social)(social),
              highlightOptions = highlightOptions(color = "white", weight = 2,
                                                  bringToFront = TRUE)) %>%
  addPolygons(color = "#444444", weight = 0.2, smoothFactor = 0.5,
              opacity = 1.0, fillOpacity = 0.5, group = 'Institutional',
              fillColor = ~colorQuantile("BrBG", n = 5, institutional)(institutional),
              highlightOptions = highlightOptions(color = "white", weight = 2,
                                                  bringToFront = TRUE)) %>%
  addPolygons(color = "#444444", weight = 0.2, smoothFactor = 0.5,
              opacity = 1.0, fillOpacity = 0.5, group = 'Environmental',
              fillColor = ~colorQuantile("BrBG", n = 5, environmental)(environmental),
              highlightOptions = highlightOptions(color = "white", weight = 2,
                                                  bringToFront = TRUE)) %>%
  addPolygons(color = "#444444", weight = 0.2, smoothFactor = 0.5,
              opacity = 1.0, fillOpacity = 0.5, group = 'Geographic',
              fillColor = ~colorQuantile("BrBG", n = 5, geographic)(geographic),
              highlightOptions = highlightOptions(color = "white", weight = 2,
                                                  bringToFront = TRUE)) %>%
  addLegend(pal = colorQuantile("BrBG", balt_wgs$social, n = 5), values = ~social,
            position = 'bottomleft', title = 'Opportunity Score') %>%
  addLayersControl(baseGroups = c("CARTO Positron"),
                   overlayGroups = c("Social", "Environmental", "Institutional", "Geographic")) %>%
  hideGroup(c("Environmental", "Institutional", "Geographic"))

Part III: Model-Based Opportunity Typologies

Thanks to the CFA in the previous section, we have good reason to believe that we’ve captured the four dimensions of opportunity reasonably well. But how do we make sense of that? We’ve isolated four different geographies of opportunity, all of which matter, but they matter differently depending on which outcome we’re most interested in (climbing the economic ladder or maintaining good mental health?) and which population we’re most interested in (children or seniors?).

All of the factors matter, and they matter interdependently but we can’t just assume everything matters the same way all the time. Institutional measures (school quality) are really important for children but they mean little or nothing for seniors. Safety matters for everyone but developing children might be particularly susceptible to toxic emissions. Accessibility matters to everyone, but it might be more important for people who don’t have a car or for those who are mobility impaired.

How do we make sense of these four variables together, then? One useful method is a cluster analysis, where we can group neighborhoods into types based on the four different opportunity dimensions. In this case, the output is a set of neighborhood categories that vary by kind rather than degree. We might find places that are strong in three measures and weak in another, or neighborhoods that have two very strong measures and two average measures, or maybe even places that have everything going well.

We can try to find those neighborhood opportunity types by mining the structure of the data through a cluster analysis. There are lots of different clustering methods, and there’s no perfect answer about which is best. In this case, I’m going to use a Gaussian Mixture Model, which is a probabilistic clustering technique, because it helps give us some sense of what the right number of clusters is (as opposed to an algorithm like k-means, where you need to specify the number of clusters ahead of time). You could also use a more complicated algorithm like Affinity Propagation (which I’ve come to like quite a bit), which also helps determine the appropriate number of clusters subject to a couple of hyperparameters (you can use the apcluster package in R, as sketched below). I’m not going to demo all of the different options, but the essential workflow is the same.
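For reference only (we won’t pursue it here), a minimal affinity propagation run with apcluster might look like this; q sets the preference quantile, and lower values generally yield fewer clusters:

# Affinity propagation on the same four dimensions (illustrative sketch only)
library(apcluster)
ap_input <- as.matrix(subset(opportunity_dimensions, select = -geoid))
ap_fit <- apcluster(negDistMat(r = 2), ap_input, q = 0)
length(ap_fit@clusters)  # number of clusters found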

In this case we’ll let the data speak for themselves, using the mclust package to find the best model according to the BIC statistic, which will tell us how many clusters the data support.

Note: As Jake VanderPlas points out, the implications for using GMMs for clustering aren’t quite as straightforward as it might seem and BIC tells you “how well GMM works as a density estimator, not how well it works as a clustering algorithm.” Nevertheless, I think GMMs are a good option for starting to explore the data.

library(mclust)
## Package 'mclust' version 5.3
## Type 'citation("mclust")' for citing this R package in publications.
## 
## Attaching package: 'mclust'
## The following object is masked from 'package:purrr':
## 
##     map
opportunity_dimensions <- subset(opportunity_dimensions, select=-geoid)

cluster_fit <- Mclust(opportunity_dimensions)
summary(cluster_fit, parameters = TRUE)

## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm 
## ----------------------------------------------------
## 
## Mclust EVE (ellipsoidal, equal volume and orientation) model with 5 components:
## 
##  log.likelihood   n df       BIC       ICL
##        75.56779 660 46 -147.5074 -315.8739
## 
## Clustering table:
##   1   2   3   4   5 
## 176 142 153 104  85 
## 
## Mixing probabilities:
##         1         2         3         4         5 
## 0.2565678 0.2343560 0.2208285 0.1609654 0.1272822 
## 
## Means:
##                     [,1]       [,2]        [,3]       [,4]      [,5]
## social         0.8231648  0.4886442 -0.18128107 -0.6585279 -1.411640
## geographic    -0.7600691 -0.5401749  0.12580627  0.7168748  1.401796
## institutional  0.9509052  0.4678987 -0.01897986 -0.9784446 -1.507938
## environmental  0.7197355  0.4178240 -0.04525995 -0.7125996 -1.240372
## 
## Variances:
## [,,1]
##                    social  geographic institutional environmental
## social         0.16018067 -0.04003883    0.06447085    0.04139197
## geographic    -0.04003883  0.37922670   -0.02507230   -0.15845820
## institutional  0.06447085 -0.02507230    0.03781717    0.02493386
## environmental  0.04139197 -0.15845820    0.02493386    0.07294539
## [,,2]
##                     social   geographic institutional environmental
## social         0.149165375 -0.009029619   0.041883940    0.01971034
## geographic    -0.009029619  0.213425107  -0.001567914   -0.08400964
## institutional  0.041883940 -0.001567914   0.062631088    0.02679129
## environmental  0.019710335 -0.084009642   0.026791291    0.04443017
## [,,3]
##                    social  geographic institutional environmental
## social        0.152841837  0.01702849    0.02216166   0.001027023
## geographic    0.017028487  0.08455828    0.02006781  -0.024549304
## institutional 0.022161657  0.02006781    0.09598161   0.032764946
## environmental 0.001027023 -0.02454930    0.03276495   0.024057672
## [,,4]
##                   social  geographic institutional environmental
## social        0.22545843  0.03993459    0.07553359    0.01404405
## geographic    0.03993459  0.10874560    0.02104363   -0.03385748
## institutional 0.07553359  0.02104363    0.07143304    0.02123786
## environmental 0.01404405 -0.03385748    0.02123786    0.02265116
## [,,5]
##                    social   geographic institutional environmental
## social        0.115648319  0.010513457   0.029917657   0.007341131
## geographic    0.010513457  0.088524253   0.008823473  -0.030568933
## institutional 0.029917657  0.008823473   0.051656707   0.017817279
## environmental 0.007341131 -0.030568933   0.017817279   0.020629255

According to the Bayesian Information Criterion, the best model is given by the 5-cluster solution. The four opportunity dimensions are distributed similarly to Z-scores (i.e. positive scores are above average, scores near 0 are close to average, and negative numbers are below average), so we can get some sense of how the clusters are distributed from the mclust summary, but not a very detailed one. To get a better sense of how the clusters are separated, we can write out the classification of each tract, join it back to the original dataset, and then compute some statistics and make some plots.

opportunity_types <- data.frame(cluster_fit$classification)
opportunity_types$geoid <- as.character(row.names(opportunity_types))
opportunity_types <- rename(opportunity_types, neighborhood_type = cluster_fit.classification)
head(opportunity_types)
##             neighborhood_type       geoid
## 24003701101                 1 24003701101
## 24003701102                 1 24003701102
## 24003701200                 1 24003701200
## 24003701300                 1 24003701300
## 24003701400                 1 24003701400
## 24003702100                 2 24003702100
# Read in some of the raw (untransformed) indicator data so we can examine how it looks by cluster
raw_data <- read.csv('opportunity_raw.csv')
raw_data$geoid <- as.character(raw_data$geoid)
head(raw_data)
##         geoid reading_3rd math_3rd reading_5th math_5th teachers_elem
## 1 24003701101       95.88    97.25       98.00    95.90         68.13
## 2 24003701102       96.86    98.22       95.36    95.71         71.08
## 3 24003701200       95.16    94.90       98.90    94.60         72.36
## 4 24003701300       95.73    96.30       94.67    88.49         71.36
## 5 24003701400       94.50    95.21       93.92    85.62         75.98
## 6 24003702100       93.93    94.89       95.13    92.90         63.97
##   ms_performance teachers_middle hs_performance ap_enroll ap_scores
## 1          93.01           69.20          94.86     35.80     57.80
## 2          93.01           69.20          94.86     35.80     57.80
## 3          93.01           69.20          94.87     35.80     57.80
## 4          89.65           70.60          93.56     32.30     49.04
## 5          89.00           70.86          93.31     31.63     47.36
## 6          55.00           39.00          91.70     23.00     39.70
##   sat_score hs_dropout teachers_high access_training housing_value
## 1   1589.93       2.00         76.40           23.82        495300
## 2   1589.94       2.00         76.40           26.91        348700
## 3   1590.06       2.00         76.40            2.21        394600
## 4   1519.58       2.63         74.49            1.56        614200
## 5   1506.11       2.75         74.12            0.07        647200
## 6   1461.04       3.40         68.00          226.51        453600
##   gross_rent delta_housing_units smocapi grapi affordability_index
## 1       1484                0.10    19.5  34.2              114.26
## 2       1750                0.18    27.1  36.3              121.80
## 3       1795                0.19    23.0  29.5              121.80
## 4       1764                0.12    22.4  30.5              112.65
## 5        633                0.27    21.3  21.9              103.81
## 6       1536                0.04    23.9  29.4              125.63
##   ht_index high_cost_loan foreclosure vancancy density_social_orgs
## 1    59.36           9.62        0.80     0.04               17.24
## 2    57.10          20.93        2.18     0.07               18.26
## 3    54.86          12.76        1.22     0.08               17.10
## 4    67.18           8.30        0.70     0.04                9.31
## 5    57.61          12.57        1.19     0.06               12.13
## 6    60.65          10.37        1.00     0.07               19.81
##   density_public_institutions racial_diversity edu_diploma edu_bachelor
## 1                       15.03             0.13        0.93         0.51
## 2                       21.37             0.22        0.90         0.35
## 3                        7.45             0.12        0.95         0.35
## 4                        1.90             0.22        0.93         0.35
## 5                        3.95             0.30        0.90         0.45
## 6                        7.86             0.15        0.96         0.43
##   income poverty unemployment density_population owner_occupied_housing
## 1 110650     3.5         0.03               1.91                   0.87
## 2  81008     4.5         0.07               4.39                   0.84
## 3  86318     4.6         0.06               1.43                   0.88
## 4 114727     2.1         0.05               0.30                   0.92
## 5 120987     5.2         0.07               0.22                   0.86
## 6  91908     1.9         0.05               0.93                   0.88
##   single_parents crime jobs_auto jobs_transit walk_score
## 1           0.06   100 136799.52        17347      14.62
## 2           0.11   102 110400.91        16425      14.17
## 3           0.09    96  18074.28        13272       3.61
## 4           0.05    67  71540.86         4310       0.82
## 5           0.04    75  15815.40         1134      14.17
## 6           0.05    60 316858.54        16525      10.39
raw_opportunity <- inner_join(raw_data, opportunity_types)
## Joining, by = "geoid"

To understand how the neighborhood typologies pan out, we need some visualizations that help compare the variation both within and between each cluster for a variety of useful variables. One good way to accomplish that is to take a variable (or handful of variables) and plot its distribution for each cluster on the same graph.

ggplot(raw_opportunity) + geom_density(aes(x=income, color=factor(neighborhood_type))) 

That’s a start, but it’s still difficult to see what’s happening. Instead, we can use parallel beeswarm plots, which are kind of like scatterplots spliced with density plots (it’s a smarter way to jitter). The beeswarms give us a way to assess each cluster’s distribution and to compare it side-by-side with the other clusters.

library(ggbeeswarm)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
ggplot(raw_opportunity, aes(neighborhood_type, income)) + geom_beeswarm()

I find the beeswarms much more intuitive than the overlapping kernel density plots. We can also superimpose the beeswarm over a violin plot so that we can see both the individual observations and the smooth distribution, and we can compare different variables against one another, like income and school quality.

(note: you could also compare two different variables on the same graph by splitting the violin by a particular variable, but I think that’s a little less useful in this particular case so I won’t show it)

# income
p1 <- ggplot(raw_opportunity, aes(neighborhood_type, income))+ geom_violin(aes(fill=factor(neighborhood_type) )) + geom_beeswarm(pch = 1, col='white', cex=0.8, alpha=0.6)

# 3rd grade reading scores
p2 <- ggplot(raw_opportunity, aes(factor(neighborhood_type), reading_3rd)) + geom_violin(aes(fill=factor(neighborhood_type) )) + geom_beeswarm(pch = 1, col='white', cex=0.8, alpha=0.6)

grid.arrange(p1, p2, ncol=1)

Together, these plots show that Type 1 has mostly high-income residents and schools with relatively strong elementary reading performance. Type 5, on the other hand, has mostly poor residents and poorly performing schools. Given how much (vertical) distance there is between the violins for Type 1 and Type 5, we can infer that wealthy and poor children in the Baltimore region attend categorically different schools and face wildly different educational prospects. What about access to jobs?

p1 <- ggplot(raw_opportunity, aes(factor(neighborhood_type), jobs_transit))  + geom_violin(aes(fill=factor(neighborhood_type) )) + geom_beeswarm(pch = 1, col='white', cex=0.8, alpha=0.6)

p2 <- ggplot(raw_opportunity, aes(factor(neighborhood_type), jobs_auto)) + geom_violin(aes(fill=factor(neighborhood_type) )) + geom_beeswarm(pch = 1, col='white', cex=0.8, alpha=0.6)

grid.arrange(p1, p2, ncol=1)

With job access, things are reversed; Types 4 and 5 have the best access to jobs by both transit and automobile commute. Type 1, meanwhile, is lacking, especially when it comes to transit. Interestingly, Types 2 and 3 are all over the place. This begins to suggest another important policy concern: we saw in the previous graphs that Type 1 has good schools but few poor residents. Why? Housing cost could have a lot to do with it, but transportation could be the other issue. If poor residents don’t have a car, they’re unlikely to find a place to live in Type 1 because they can’t access any jobs by transit, effectively eliminating Type 1 neighborhoods from their choice set. Policy solution? Maybe think about bus lines in these places that would allow poorer residents to live there and still commute to their jobs.

What about housing costs? There are two ways to think about housing affordability: in absolute and in relative terms. Gross rent tells us how expensive a particular place is to live in (typically we compare it with the median income of a region). Gross Rent as a Percentage of Income (GRAPI) tells us how cost-burdened people are. That is, regardless of the absolute cost of housing, how much of the family’s income needs to be devoted to rent? A household spending more than 30% of its income on housing is considered cost-burdened, and a household paying more than 50% is considered extremely cost-burdened.
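Before plotting, a quick and rough tabulation puts numbers on that framing; note that this uses tract-level median GRAPI from the raw data, so it’s only a coarse proxy for household-level cost burden:

# Share of tracts in each neighborhood type whose median GRAPI exceeds the 30% line
raw_opportunity %>%
  group_by(neighborhood_type) %>%
  summarise(pct_above_30 = 100 * mean(grapi > 30, na.rm = TRUE))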

p1 <- ggplot(raw_opportunity, aes(factor(neighborhood_type), gross_rent))  + geom_violin(aes(fill=factor(neighborhood_type) )) + geom_beeswarm(pch = 1, col='white', cex=0.8, alpha=0.6)

p2 <- ggplot(raw_opportunity, aes(factor(neighborhood_type), grapi)) + geom_violin(aes(fill=factor(neighborhood_type) )) + geom_beeswarm(pch = 1, col='white', cex=0.8, alpha=0.6)

grid.arrange(p1, p2, ncol=1)

## Warning: Removed 18 rows containing non-finite values (stat_ydensity).
## Warning: Removed 18 rows containing missing values (position_beeswarm).
## Warning: Removed 17 rows containing non-finite values (stat_ydensity).
## Warning: Removed 17 rows containing missing values (position_beeswarm).

These graphs tell an interesting story. Neighborhood Types 1 and 2 tend to have high rents, and in absolute terms they might be unaffordable to many residents, but by and large the people who already live there can do so within their means (the widest parts of the violin are below the 30% threshold). Meanwhile, Neighborhood Type 5 has some of the cheapest rents in the region, most under $1000/month, but almost all the people in those neighborhoods are cost-burdened!

The implications for equitable policy development here are nuanced. In Neighborhood Types 1 and 2, we might look for a place-based solution: build more subsidized housing so that more people can live there (though, as we saw in the previous graphs, we might also need to invest in transportation infrastructure if that strategy is going to be effective). In Neighborhood Type 5, we likely need a person-based solution. The people in these neighborhoods live in the cheapest housing in the region, and it’s still wildly unaffordable to them. Without getting deeply into the Baltimore story (this is a general demo of the methodology, not a discussion of Baltimore), this picture begins to look bleak. The people who live in Neighborhood Type 5, most of whom are very poor, face underperforming school districts, unaffordable housing, and toxic environments. These conditions have been shown time and again to perpetuate poverty, health problems, and joblessness (even though the jobs are right there!). In short, these residents don’t have a fair shot. Meanwhile, they are prevented from moving to places with better opportunity by the high cost of housing and the lack of a regional transportation system (the Tiebout hypothesis be damned).

There are two important things to take away from this workflow. The first is that the neighborhood typologies can be defined using any set of variables you think are appropriate. In this case, we’re using the four “opportunity dimensions” as input to the cluster analysis to create the typologies, but you could use different data. Instead, you could do a more typical geodemographic analysis by clustering on, say, race, age, educational attainment, and income, which would yield different types. The other important piece is that the information used to describe each neighborhood type can be entirely different from the information used to create the types themselves. In this case, we created the types based on the opportunity dimensions, but we described the types using some indicators (like income) that went into those dimensions at the CFA stage as well as some indicators (like gross rent and GRAPI) that are policy-relevant but don’t necessarily have a causal pathway associated with opportunity. This means that however the typologies are created, they can be used to summarize and explain any other information that could be relevant to policymaking, which is a very useful property. This abstraction allows us, for example, to create typologies based on the opportunity dimensions that are well supported by the social scientific literature (and, it turns out, by the data) but then describe those neighborhood types using much more useful policy data (like the presence of affordable housing). A quick sketch of that kind of summary follows below.
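As a minimal illustration of that last point, any indicator sitting in the raw data (whether or not it fed the CFA) can be summarized by neighborhood type; the columns below are just examples:

# Describe the types with indicators that were not used to build them
raw_opportunity %>%
  group_by(neighborhood_type) %>%
  summarise(median_rent  = median(gross_rent, na.rm = TRUE),
            median_grapi = median(grapi, na.rm = TRUE),
            n_tracts     = n())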

You could also mix and match different indicators, like creating typologies based on the opportunity dimensions and the presence of affordable housing. It just depends on the type of output you’re looking for and the way you can use it for equitable policy analysis.


To Conclude

All told, the workflow here attempts to synthesize quantitative traditions from different social sciences. Confirmatory factor analysis leverages theory from ecometrics, a branch of methods developed in sociology and applied often in public health that tries to develop empirically valid measurements of ecological conditions. Cluster analysis draws from geodemographics, a longstanding framework in geography that is often applied in marketing and the private sector for ‘spatial segmentation’.

In this workflow, we’re combining those techniques to assess neighborhood conditions and to provide policymakers with tools that lead to better data-driven decisions and better, more socially equitable housing, transportation, and development policies that benefit everyone.

Side note: interestingly, this follows a very typical process in data science and machine learning, where we take a large dataset, reduce its complexity using decomposition (a type of unsupervised learning), then build a model that helps make decisions (in this case a cluster model, another type of unsupervised learning). The terminology is a bit contested, though, because the underlying theory is a crucially important aspect of this analysis, especially in the CFA portion… we’re not simply learning from the structure of the data; we are imposing a theoretical framework drawn from decades of research, testing whether our data support that theory, and building a useful model based on that analysis. It’s machine learning for social science and policy analysis.

balt_types <- inner_join(balt, opportunity_types)
## Joining, by = "geoid"
# Finally, you can save the shapefiles, data, and cluster classifications in external files 
# for further analysis or visualization in other programs like a desktop GIS

# Normally I would avoid Excel formats, but saving as an xlsx will preserve the string formatting
# on the geoids (unlike CSVs), which is useful for those tracts that start with a leading 0

library(openxlsx)

write.xlsx(balt_types, 'baltimore_opportunity_types.xlsx', row.names=FALSE)

In the next post, I’ll look at how you could do a comparable analysis in Python with geopandas, scikit-learn, seaborn, and mplleaflet.