import os
import geopandas as gpd
import matplotlib.pyplot as plt
import pandana as pdna
import pandas as pd
import quilt3 as q3
import requests
from esda import Moran_Local
from factor_analyzer import ConfirmatoryFactorAnalyzer, ModelSpecificationParser
from folium import LayerControl
from geosnap import DataStore
from geosnap import analyze as gaz
from geosnap import visualize as gvz
from geosnap import io as gio
from libpysal.graph import Graph
from mapclassify import classify
from IPython.display import Image
from scipy.stats import zscore
from semopy import Model, calc_stats, semplot
from segregation.local import LocalDistortion, MultiLocalEntropy
from tobler.area_weighted import area_interpolate
from zipfile import ZipFile
It is well established in the social sciences that neighborhoods and social contexts influence a wide variety of individual outcomes (e.g. in education, health, socioeconomic status, etc.), and that these spatial influences are multidimensional (Galster, 2019). Thus a large literature in the U.S. focuses on outcomes driven by the notion of “concentrated disadvantage”, and similarly in the U.K. on “multiple deprivation”, both of which represent multivariate indices of social and environmental context. In their general form, these indices are designed to support researchers’ efforts to quantify the “geography of opportunity” (Galster & Killen, 1995), and a growing literature explores the benefits and drawbacks of developing various opportunity indices, particularly for use in policy evaluation efforts (Balachandran & Greenlee, 2022; Brazil et al., 2023).
One branch of the literature treats the issue of composite index construction as a latent variable problem, drawing from the tradition of “psychometrics” in psychology. This approach is especially valuable because it allows researchers to assess the validity of a particular index, and test hypotheses about whether the selected collection of variables combine to measure what the researcher supposes they measure. In a seminal contribution to this literature, Raudenbush & Sampson (1999) outline a theory of “ecometrics” designed to capture the social-ecological context as a way to help measure the latent construct of collective efficacy. A key to the original methodology is that it relies on actual observations of social interaction as measured by an intentionally-designed survey. These data are combined with others gathered via systematic social observation.
This idea has been expanded to include new forms of data such as Google Street View imagery and volunteered geographic information (VGI). We usually don’t have data from systematic social observation (SSO), but we do have many other sources, such as 311 reports, Google Street View, and satellite imagery, and we may be able to substitute these for SSO if we believe they accurately capture the social process under investigation (i.e. if we believe that some social process like “disorder” is the underlying driver of the observed data) (Friche et al., 2013; O’Brien et al., 2015; O’Brien & Montgomery, 2015).
The key distinction between ecometrics and earlier methods like factor ecology is the reliance on a formal theoretical model underneath. We’re not allowing the data to speak for themselves; instead we are specifying a set of theoretical social processes which are not directly observable, but might be inferred to exist if we treat them as latent variables. We then fit a model and test whether these latent constructs appear as specified. Using this approach we can try to capture the geography of opportunity following the theory outlined by Galster (2008), who argues that individual-level outcomes are a function of individual characteristics as well as spatial characteristics (at multiple scales):
\[O_{it} = \alpha + \beta[P_{it}] + \gamma[P_i] + \phi[UP_{it}] + \delta[UP_i] + \theta[N_{jt}] + \mu[M_{kt}] + \epsilon\]
where
\(O_{it}\) = employment status or income (model dependent) for individual \(i\) at time \(t\)
\([P_{it}]\) = observed personal characteristics that can vary over time (e.g., marital or fertility status, educational attainment)
\([P_i]\) = observed personal characteristics that do not vary over time (e.g., year and country of birth)
\([UP_{it}]\) = unobserved personal characteristics that can vary over time (e.g., psychological states, interpersonal networks and relationships)
\([UP_i]\) = time-invariant unobserved personal characteristics (e.g., IQ, prior experiences, certain values and beliefs)
\([N_{jt}]\) = observed characteristics of the neighborhood where the individual resides during \(t\)
\([M_{kt}]\) = observed characteristics of the metropolitan area in which the individual resides during \(t\) (e.g., area unemployment rates)
\(\epsilon\) = a random error term
\(i\) = individual
\(j\) = neighborhood
\(k\) = metropolitan area
\(t\) = time period (typically a year)
The strength of this framework is its ability to distinguish between individual characteristics and (multiscalar) location characteristics, each of which has static and time-variant components. In the effort to understand the influence of place, then, it is critical to distinguish between individual-level factors and those population-level attributes conceived as components of the local community (Knaap, 2017). A place with a large share of structurally-disadvantaged people is not the same as a structurally-disadvantaged place, and it’s important to care about each distinctly. Places with high risk of wildfire (a spatial disadvantage) are especially dangerous for low-income populations (a population disadvantage), but only the former transmits risk by virtue of location. Although it is tempting to include population vulnerability measures in composite indices of spatial (dis)advantage, it’s important to keep the sources distinct (even though they are correlated!) to maintain the conceptual integrity of the measurement.
To capture the geography of opportunity, then, our focus is on the \(M\) and \(N\) components of the equation, and according to Galster (2013), the key vectors of these terms are composed of four categories:
social-interactive: social processes endogenous to neighborhoods, including examples like contagion, collective socialization, collective efficacy, relative deprivation, social cohesion, competition or parental mediation
environmental: “natural and human-made attributes of the local space that may affect directly the mental and/or physical health of residents without affecting their behaviors”. Examples include things like exposure to violence, pollution, or persistent psychological stress.
geographic: “aspects of spaces that may affect residents’ life courses yet do not arise within the neighborhood but rather purely because of the neighborhood’s location relative to larger-scale political and economic forces,” e.g. spatial mismatch (access to skill-appropriate jobs) or quality of public services.
institutional: “actions by those typically not residing in the given neighborhood who control important institutional resources located there and/or points of interface between neighborhood residents and vital markets”. Examples in this category include things like spatial stigmatization or access/quality of schools, daycare, charities, food markets, or drug markets.
These categories are well supported by the empirical literature, and are theoretically grounded in causal processes that generate socioeconomic outcomes, yet each can also be measured in multiple ways, with each measurement providing additional useful information. Following Knaap (2017), we can combine Galster’s theoretical framework with the ecometric technique to (a) test whether the proposed structure holds, and (b) develop composite indices that represent each of the dimensions.
One way to address this problem is to treat the quantification of opportunity as a measurement error problem. Through a liberal interpretation, this may be viewed as an extension of ecometrics, a methodology concerned with developing measures of neighborhood social ecology (Mujahid et al., 2007; O’Brien et al., 2015; Raudenbush & Sampson, 1999). In this framework, opportunity and its subdimensions are viewed as latent variables that cannot be measured directly, but can be estimated by modeling the covariation among the indicators through which they manifest.
This effectively places the \(N_t\) and \(M_t\) terms on the other side of the equation, and we use observable variables like poverty rates and pollution exposure to estimate their values. Ignoring the time dimension (and focusing on the neighborhood scale), our framework says that the relevant \(N\) is composed of mechanistic pathways:
\[ N = \{S, E, G, I\}, \]
where \(S\), \(E\), \(G\), and \(I\) are the neighborhood mechanism categories specified above, and a single composite index such as the environmental component would be estimated as
\[\mathbf{x} = \tau + \Lambda\text{E}+ \Psi,\]
where \(\mathbf{x}\) is the set of measurable environmental characteristics (particulate matter, ozone concentration, proximity to Superfund sites, lead paint, contaminated water, etc.), \(\tau\) is a vector of intercepts, \(\Lambda\) is a vector of factor loadings, \(\text{E}\) is the latent Environmental mechanism, and \(\Psi\) is a vector of error terms (Levy & Mislevy, 2016). While we will not have access to all the ideal measurements, we can try to estimate a theoretically-sound index using the best available information.
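To make the measurement equation concrete, the following is a minimal simulation sketch, not the estimation approach used in this chapter. It assumes three hypothetical indicators, made-up values for the intercepts, loadings, and error variances, a latent \(\text{E}\) scaled to variance 1, and mutually uncorrelated errors. Under those assumptions the covariance between any two indicators is the product of their loadings, so with three indicators each loading can be recovered (up to sign) from the pairwise covariances alone. A full confirmatory factor analysis would instead fit all parameters jointly.

```python
import random
from math import sqrt

random.seed(0)

# Hypothetical "true" parameters for three environmental indicators
# (e.g. particulate matter, ozone, proximity to Superfund sites).
tau = [10.0, 40.0, 2.0]   # intercepts
lam = [0.8, 0.6, 0.7]     # factor loadings (Lambda)
err_sd = [0.6, 0.8, 0.7]  # standard deviations of the error terms (Psi)

# Simulate x = tau + Lambda * E + Psi for n neighborhoods,
# with the latent E scaled to mean 0 and variance 1.
n = 20_000
rows = []
for _ in range(n):
    e = random.gauss(0, 1)
    rows.append([t + l * e + random.gauss(0, s)
                 for t, l, s in zip(tau, lam, err_sd)])


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


x1, x2, x3 = zip(*rows)
s12, s13, s23 = cov(x1, x2), cov(x1, x3), cov(x2, x3)

# With Var(E) = 1 and uncorrelated errors, cov(x_i, x_j) = lam_i * lam_j,
# so each loading follows from the three pairwise covariances.
lam_hat = [
    sqrt(s12 * s13 / s23),
    sqrt(s12 * s23 / s13),
    sqrt(s13 * s23 / s12),
]
print([round(v, 2) for v in lam_hat])
```

The recovered values should sit close to the simulated loadings, which illustrates why the latent variable is identified by the covariation among indicators and not by any single measurement.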
21.1 Datasets & Geoprocessing
It would be best to use the ecometric framework with hyperlocal and bespoke data (e.g. 311 calls or local crime data); in this case, however, we can demonstrate the approach using the datasets we have worked with in the prior examples. To get the data into the correct shape, we also need to leverage the same processing and analysis skills developed earlier. For the sake of presentation, we will use census blockgroups as proxies for neighborhoods, then convert all of our measurements to that geography.
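Converting measurements from one geography to another is commonly done with area-weighted interpolation, which is what the imported `area_interpolate` from tobler provides for real polygons. The toy sketch below illustrates only the underlying idea, using hypothetical axis-aligned rectangles in place of census geometries; the function names and values are invented for illustration and are not tobler's implementation. Counts (extensive variables) are split in proportion to the share of each source area that falls in the target, while rates (intensive variables) are averaged with overlap areas as weights.

```python
def overlap_area(a, b):
    """Intersection area of two rectangles given as (xmin, ymin, xmax, ymax)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0) * max(height, 0)


def rect_area(r):
    """Area of a rectangle given as (xmin, ymin, xmax, ymax)."""
    return (r[2] - r[0]) * (r[3] - r[1])


def interpolate_extensive(sources, targets):
    """Allocate source counts to targets by the share of source area covered."""
    return [
        sum(value * overlap_area(rect, t) / rect_area(rect) for rect, value in sources)
        for t in targets
    ]


def interpolate_intensive(sources, targets):
    """Average source rates in each target, weighted by overlap area."""
    results = []
    for t in targets:
        weights = [overlap_area(rect, t) for rect, _ in sources]
        weighted = sum(w * value for w, (_, value) in zip(weights, sources))
        results.append(weighted / sum(weights))
    return results


# Two hypothetical source zones and three hypothetical target blockgroups.
targets = [(0, 0, 1, 2), (1, 0, 3, 2), (3, 0, 4, 2)]
population = [((0, 0, 2, 2), 400), ((2, 0, 4, 2), 100)]    # counts
pollution = [((0, 0, 2, 2), 10.0), ((2, 0, 4, 2), 20.0)]   # concentrations

pop_by_target = interpolate_extensive(population, targets)
pollution_by_target = interpolate_intensive(pollution, targets)
print(pop_by_target, pollution_by_target)
```

Both versions assume the variable is spread uniformly within each source zone, which is the main limitation to keep in mind when the source zones are much larger than blockgroups.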
21.1.1 social data
As usual, our only option for social data at this scale comes from the Census, and we will use blockgroup geometries to narrow down the spatial scale as much as possible. This will cost us a few variables that are available only at the tract level, so you could also redo this analysis by moving up a level. The variables in this category are designed to capture the ways that social interactions help you get ahead. Historically, the focus has been on the negative consequences of being at the bottom of the distribution (the negative effects of concentrated poverty), though some have also called for increasing attention to the other end, and the importance of concentrated affluence, which helps those in privilege remain there.
Here, we will say that socioeconomic status increases local opportunity, with SES measured by income and educational attainment for adults. We will also say that racial integration promotes opportunity, so we include the local distortion segregation measure.
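Because income is measured in dollars and educational attainment as a share, indicators like these are often standardized before they enter a measurement model. The sketch below shows that step with hypothetical values for five blockgroups; the variable names and numbers are invented, and whether to standardize is an analyst's choice, not a requirement of the framework. It uses the population standard deviation, which matches the default behavior of `scipy.stats.zscore`.

```python
from statistics import fmean, pstdev


def zscore(values):
    """Center on the mean and scale by the population standard deviation."""
    mu = fmean(values)
    sd = pstdev(values)
    return [(v - mu) / sd for v in values]


# Hypothetical blockgroup values (not real Census data).
median_income = [32000, 54000, 71000, 45000, 98000]   # dollars
share_college = [0.12, 0.31, 0.48, 0.22, 0.67]        # share of adults

income_z = zscore(median_income)
college_z = zscore(share_college)

for inc, col in zip(income_z, college_z):
    print(f"{inc:6.2f} {col:6.2f}")
```

After standardization both indicators have mean 0 and unit variance, so neither dominates simply because of its units.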