import geopandas as gpd
import pandas as pd
import numpy as np
import esda
import libpysal as lps
import matplotlib.pyplot as plt
from geosnap import DataStore
from geosnap.io import get_acs
from splot import esda as esdplt

%load_ext watermark
%watermark -a 'eli knaap' -v -d -u -p spreg,geopandas,geosnap,statsmodels
Author: eli knaap
Last updated: 2024-03-26
Python implementation: CPython
Python version : 3.10.8
IPython version : 8.9.0
spreg : 1.4.2
geopandas : 0.14.1
geosnap : 0.12.1.dev9+g3a1cb0f6de61.d20240110
statsmodels: 0.13.5
35.1 Understanding and Interpreting Spatial Models
In the last chapter, several commonly-used model specifications were defined, but the choice among them depends on the research question. Naturally, if spillover effects are of substantive interest, then it is necessary to choose a model that includes (at least) \(\rho WY\) or \(\theta WX\), but beyond that, the choice is more complex.
The literature basically offers two different strategies; Anselin & Rey (2014) argue that a “bottom-up” strategy is most effective, where we begin with simple models (e.g., OLS), then conduct spatial diagnostic tests to decide whether more complex specifications are necessary. LeSage (2014) and LeSage & Pace (2014) argue for a “top-down” strategy where we begin with a ‘fully specified’ conceptual model. LeSage (2014, p. 19) goes so far as to say that “there are only two model specifications worth considering for applied work.” When there are theoretical reasons to consider global spillover processes (such as home prices in a hedonic model), then the SDM model is appropriate. In other cases that consider only local spillover effects, the SDEM specification will provide unbiased estimates of all parameters.
Their logic (to always prefer the Durbin specifications) is simple: the inclusion of the \(\theta WX\) term ensures that if there is an exogenous spillover process, then it will be measured accurately with only a small efficiency penalty, but omitting the term would lead to bias.
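That omitted-variable logic is easy to demonstrate with a small simulation. The sketch below is entirely synthetic (the ring-graph \(W\), parameter values, and variable names are arbitrary illustrations, not this chapter's data): it generates \(y = \beta x + \theta Wx + \epsilon\) with a spatially autocorrelated \(x\), then fits the model with and without the \(Wx\) term.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Row-standardized ring graph: each unit averages its two neighbors
W = np.zeros((n, n))
idx = np.arange(n)
W[idx, (idx + 1) % n] = 0.5
W[idx, (idx - 1) % n] = 0.5

# A spatially autocorrelated regressor, so that x and Wx are correlated
x = np.linalg.solve(np.eye(n) - 0.7 * W, rng.normal(size=n))
wx = W @ x
beta, theta = 1.0, 0.5
y = beta * x + theta * wx + rng.normal(size=n)

# Omitting the exogenous spillover term biases the coefficient on x upward,
# because x proxies for the omitted (correlated) Wx
b_short = np.polyfit(x, y, 1)[0]

# Including Wx recovers both parameters (small efficiency cost, no bias)
XX = np.column_stack([np.ones(n), x, wx])
b_full, *_ = np.linalg.lstsq(XX, y, rcond=None)
print(b_short, b_full[1], b_full[2])
```

Here `b_short` lands well above the true \(\beta = 1\) because it absorbs part of \(\theta\), while `b_full` recovers both coefficients.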
The distinction between the SDM versus the SDEM, however, is largely a theoretical choice for a given research question rather than an empirical matter. As LeSage & Pace (2014, p. 1551) describe:
“Global spillovers arise when changes in a characteristic of one region impact outcomes in more than just immediately neighboring regions. Global impacts fall on neighbors to the neighbors, neighbors to the neighbors to the neighbors, and so on…
Local spillovers represent a situation where the impacts fall only on nearby or immediate neighbors, dying out before they impact regions that are neighbors to the neighbors.
Another distinction between local and global spillovers is that feedback effects arise in the case of global spillovers, but not in the case of local spillovers.”
It follows “that practitioners cannot simply estimate an SDM and SDEM model and draw meaningful conclusions about the correct model specification from point estimates of \(\rho\) or \(\lambda\)… the SDM specification implies global spatial spillovers, while the SDEM is consistent with local spillovers. These have very different policy implications in applied work” (LeSage, 2014, p. 22), because indirect (and total) effects will be underestimated when the SDM is the appropriate model but the SDEM is fit instead, since the SDEM does not include feedback effects.
The SAR and SDM models which contain the spatially-lagged dependent variable \(\rho Wy\) specify global spillovers; the SLX, SDM, and SDEM models which contain the \(\theta WX\) term specify local spillovers (SDM specifies both). The difference between global and local spillovers pertains to the ways that effects propagate through the system and the ways that model coefficients may be interpreted.
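The “global” label comes from the SAR reduced form \(y = (I - \rho W)^{-1}(X\beta + \epsilon)\): the multiplier \((I - \rho W)^{-1}\) expands into the series \(I + \rho W + \rho^2 W^2 + \dots\), so an effect reaches neighbors of neighbors at every order, and its nonzero diagonal terms are the feedback effects. A minimal numpy check on a hypothetical toy graph (nothing here comes from the chapter's data):

```python
import numpy as np

# Hypothetical 4-node row-standardized ring graph (two neighbors each)
W = np.array([
    [0.0, 0.5, 0.0, 0.5],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.5, 0.0, 0.5, 0.0],
])
rho = 0.6

# Global spillover multiplier from the SAR reduced form
multiplier = np.linalg.inv(np.eye(4) - rho * W)

# The truncated Neumann series I + rho*W + rho^2*W^2 + ... converges to it,
# showing effects travel through the whole graph, not just direct neighbors
series = sum(np.linalg.matrix_power(rho * W, k) for k in range(50))
print(np.allclose(multiplier, series, atol=1e-8))

# Feedback: diagonal terms exceed 1 because a shock to region i
# loops back to i through its neighbors (and their neighbors, ...)
print(np.diag(multiplier))
```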
35.1.1 Model selection
Different specifications imply different data-generating processes (DGPs) and reflect what you want to study.
SAR assumes an explicit spillover process, whereas SEM assumes spatially correlated omitted variables. You should have theory guiding which specification you use, even though most applied work takes a fit-based model selection approach.
A sobering critique of applied spatial econometric modeling is given by Gibbons & Overman (2012), who argue compellingly that it
does not provide a toolbox that gives satisfactory solutions to these problems for applied researchers interested in causality and the economic processes at work. It is in this sense that we call into question the approach of the burgeoning body of applied economic research that proceeds with mechanical application of spatial econometric techniques to causal questions of “spillovers” and other forms of spatial interaction, or that estimates spatial lag models as a quick fix for endogeneity issues, or that blindly applies spatial econometric models by default without serious consideration of whether this is necessary given the research question in hand.
To overcome these issues, they argue that applied work should adopt an experimentalist design, where some form of exogenous variation is induced to provide insight into a causal mechanism at play. Models that incorporate both spatial and temporal dimensions (explored in the next section) provide an important avenue forward to address this issue (Bardaka et al., 2018; Delgado & Florax, 2015).
Alternatively, structural models should proceed by considering the SLX model as the point of departure, or use strong theory to guide the choice of a particular specification as described above (Halleck Vega & Elhorst, 2015). Although there remains debate in urban economics, we believe there is reasonable justification for assuming a potential autoregressive process in the specific context of hedonic housing price models.
Direct and indirect effects of spillovers
35.1.2 Graph (\(W\)) Selection
There are conflicting opinions on the importance of the neighbor graph.
The state of Maryland makes available a great set of parcel-level data called PropertyView. They also provide sales listings broken down by quarter or aggregated by year.
Code
# read in the points (not the polys)
gdf = gpd.read_file('data/BACIparcels0622/BACIMDPV.shp')
We generally start with ESDA (following the bottom-up strategy) to ensure we even care about spatial econometrics. We begin by mapping the dependent variable (housing price, here) and examining any global and local autocorrelation.
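Global autocorrelation is typically measured with Moran's I (`esda.Moran` computes it along with permutation-based inference). The statistic itself is just a spatially-weighted cross-product of deviations, \(I = \frac{n}{S_0}\frac{z'Wz}{z'z}\); a hand-rolled sketch on toy data (the arrays and graph below are made up for illustration):

```python
import numpy as np

def morans_i(y, W):
    """Global Moran's I: (n / S0) * (z' W z) / (z' z), with z = y - mean(y)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    z = y - y.mean()
    S0 = W.sum()  # sum of all weights
    return (n / S0) * (z @ W @ z) / (z @ z)

# Binary contiguity on a 10-node line graph
W = np.diag(np.ones(9), 1) + np.diag(np.ones(9), -1)

# A smooth gradient is positively autocorrelated; an alternating
# high/low sequence is negatively autocorrelated (checkerboard)
smooth = np.arange(10, dtype=float)
noisy = np.array([0, 9, 1, 8, 2, 7, 3, 6, 4, 5], dtype=float)
print(morans_i(smooth, W))  # strongly positive
print(morans_i(noisy, W))   # strongly negative
```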
There are definitely spatial patterns, but they’re also not uniform. There’s probably autocorrelation in the sales prices, but it seems to be particularly strong in some pockets. We should be sure to check both global and local statistics.
35.2.1 Global Stats
Code
# prices are probably influenced by a greater 'neighborhood' than just the
# adjacent units, so let's use distance band weights (3/4 km; ~10 min walk)
W = lps.weights.DistanceBand.from_dataframe(balt_sales, threshold=750)
W.transform = 'r'
/Users/knaaptime/mambaforge/envs/urban_analysis/lib/python3.10/site-packages/libpysal/weights/weights.py:172: UserWarning: The weights matrix is not fully connected:
There are 5 disconnected components.
warnings.warn(message)
Code
# but we can also look at hyperlocal trends
w_knn = lps.weights.KNN.from_dataframe(balt_sales, k=10)
w_knn.transform = 'r'
/Users/knaaptime/mambaforge/envs/urban_analysis/lib/python3.10/site-packages/libpysal/weights/weights.py:172: UserWarning: The weights matrix is not fully connected:
There are 10 disconnected components.
warnings.warn(message)
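The “not fully connected” warnings above mean the graph splits into islands of observations that never reach one another, which matters for several estimators. If you want to inspect this yourself, the component structure of any adjacency matrix can be checked with scipy; a sketch on synthetic coordinates (the two-cluster layout is contrived to trigger the issue):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(0)
# Two synthetic clusters of points farther apart than the threshold
pts = np.vstack([rng.uniform(0, 500, (20, 2)),
                 rng.uniform(2000, 2500, (20, 2))])

# Distance-band adjacency: neighbors within 750 map units
d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
A = (d <= 750) & (d > 0)

# Count the islands in the resulting graph
n_comp, labels = connected_components(csr_matrix(A), directed=False)
print(n_comp)  # 2: the two clusters never connect
```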
Taking different modeling strategies, we can vary the way we operationalize \(p(h)\) to best identify the coefficients of interest, including the potential for spatial spillover (e.g. the price of one unit affecting the prices of those nearby).
Code
import statsmodels.formula.api as smf
Code
# Treat the structure grade variable like it's continuous. whatever.
balt_sales.STRUGRAD = balt_sales.STRUGRAD.astype(int)
Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 8.94e+09. This might indicate that there are strong multicollinearity or other numerical problems.
The residuals are pretty heteroskedastic (and the Jarque-Bera test is highly significant in the statsmodels results table above), so we will need to handle that if we plan to make valid inferences.
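One standard remedy is heteroskedasticity-consistent (White-type) standard errors, available in statsmodels via `fit(cov_type="HC1")`. The sandwich estimator behind that option is short enough to sketch directly with synthetic data (HC1 is the White estimator with a degrees-of-freedom correction; everything below is illustrative, not the chapter's model):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
x = rng.uniform(0, 1, n)
X = np.column_stack([np.ones(n), x])
# Heteroskedastic errors: the variance grows with x
y = 1.0 + 2.0 * x + rng.normal(0, 0.1 + x, n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta

XtX_inv = np.linalg.inv(X.T @ X)
# Classical covariance assumes a constant error variance
cov_ols = (e @ e) / (n - 2) * XtX_inv
# White sandwich: (X'X)^{-1} X' diag(e^2) X (X'X)^{-1}, HC1 dof scaling
meat = X.T @ (X * e[:, None] ** 2)
cov_hc1 = n / (n - 2) * XtX_inv @ meat @ XtX_inv

print(np.sqrt(np.diag(cov_ols)))  # classical standard errors
print(np.sqrt(np.diag(cov_hc1)))  # robust standard errors
```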
There’s still significant autocorrelation in the residual, so we’re either failing to account for a spatial spillover process (implying a SAR model), or omitting crucial autocorrelated explanatory variables (implying an SEM), or both (implying a SARAR model).
The autocorrelation is even stronger at smaller scales (Tobler’s First Law, and all).
We could fit the same model with spreg.
We can use patsy to generate the X and y inputs spreg requires, using the same formula as before.
Code
from patsy import dmatrices
Code
# spreg will add an intercept so redefine the formula without one
#formula_spreg = "log_price ~ C(DESCBLDG) + C(RESITYP) + STRUGRAD + SQFTSTRC + LANDAREA + YEARBLT + I(YEARBLT**2) + p_nonhisp_black_persons + p_hispanic_persons + p_edu_college_greater + median_household_income + p_housing_units_multiunit_structures + p_vacant_housing_units"
Code
y, X = dmatrices(formula, balt_sales)
Code
# need to drop the intercept. Note you can't just define the formula without one
# because then patsy won't drop a level from the first categorical variable,
# so you need this fix anyway

# slice off the intercept
name_x = X.design_info.column_names[1:]
name_y = y.design_info.column_names[0]
y = np.asarray(y).flatten()
X = np.asarray(X)[:, 1:]
REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set : unknown
Weights matrix : unknown
Dependent Variable : log_price Number of Observations: 5284
Mean dependent var : 12.1935 Number of Variables : 16
S.D. dependent var : 0.5906 Degrees of Freedom : 5268
R-squared : 0.7195
Adjusted R-squared : 0.7187
Sum squared residual: 517.042 F-statistic : 900.6752
Sigma-square        :       0.098                Prob(F-statistic)     :           0
S.E. of regression : 0.313 Log likelihood : -1356.831
Sigma-square ML : 0.098 Akaike info criterion : 2745.661
S.E of regression ML: 0.3128 Schwarz criterion : 2850.820
------------------------------------------------------------------------------------
Variable Coefficient Std.Error t-Statistic Probability
------------------------------------------------------------------------------------
CONSTANT                               76.8635679      10.3521089       7.4249188       0.0000000
C(RESITYP)[T.TH]                       -0.2046220       0.0135583     -15.0920117       0.0000000
C(DESCCNST)[T.CNST Frame]              -0.0117367       0.0123539      -0.9500417       0.3421347
C(DESCCNST)[T.CNST Shingle Asbestos]    0.0285631       0.0272525       1.0480936       0.2946436
C(DESCCNST)[T.CNST Shingle Wood]       -0.0418795       0.0337687      -1.2401878       0.2149612
C(DESCCNST)[T.CNST Siding]              0.0373530       0.0130997       2.8514383       0.0043691
C(DESCCNST)[T.CNST Stucco]             -0.0445163       0.0257458      -1.7290690       0.0838553
STRUGRAD                                0.1719022       0.0055669      30.8794260       0.0000000
np.log(SQFTSTRC)                        0.5049943       0.0150705      33.5088712       0.0000000
YEARBLT                                -0.0747352       0.0106498      -7.0175411       0.0000000
I(YEARBLT ** 2)                         0.0000196       0.0000027       7.1389532       0.0000000
p_nonhisp_white_persons                 0.0014392       0.0002365       6.0855999       0.0000000
p_edu_college_greater                   0.0052480       0.0003877      13.5378368       0.0000000
median_household_income                 0.1500605       0.0170682       8.7918043       0.0000000
p_housing_units_multiunit_structures    0.0019057       0.0002444       7.7963177       0.0000000
p_owner_occupied_units                  0.0046954       0.0004970       9.4466947       0.0000000
------------------------------------------------------------------------------------
REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER 18745.447
TEST ON NORMALITY OF ERRORS
TEST DF VALUE PROB
Jarque-Bera                2        1236.193        0.0000
DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST DF VALUE PROB
Breusch-Pagan test        15         553.881        0.0000
Koenker-Bassett test      15         339.085        0.0000
================================ END OF REPORT =====================================
Most of the variables are significant, except for the construction materials, among which only siding has a significant effect. Otherwise, the coefficients are as expected: housing prices increase with larger and newer homes (though historic homes also get a bump), townhomes are less expensive than single-family homes, and there are some small effects from the neighborhood’s demographic composition, the one exception being median income, whose effect is relatively large.
Code
np.set_printoptions(suppress=True) #prevent scientific format
The SEM specification removes all spatial autocorrelation from the residual, as expected. But it gives us no sense of how prices spill over from one unit to another. To estimate spillover effects, we need a spatial autoregressive (SAR) model.
Again, the spatially-filtered residuals take care of the unobserved spatial effects, but they also mask out any substantively interesting spillover effects.
The graph doesn’t look crazy, but compared to the distribution, it’s clear there’s still significant autocorrelation in the residual. To keep the spillover term and also account for autocorrelation in the residual, we need to use a ‘spatial combo’ model.
35.3.2.3 Spatial Combo
The spatial autoregressive combination (SAC) or “spatial combo” model includes both a lagged dependent variable to capture endogenous spillover effects and a spatial error term that can capture any residual autocorrelation from omitted variables.
\[y = \rho W y + \beta X + u\]
\[u = \lambda W u + \epsilon\]
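The two equations combine into the reduced form \(y = (I - \rho W)^{-1}\left(\beta X + (I - \lambda W)^{-1}\epsilon\right)\), which makes the data-generating process easy to simulate. A sketch of the SAC DGP on a toy ring graph (all parameter values and data below are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Row-standardized ring graph as a simple W (two neighbors per unit)
W = np.zeros((n, n))
idx = np.arange(n)
W[idx, (idx + 1) % n] = 0.5
W[idx, (idx - 1) % n] = 0.5

rho, lam, beta = 0.5, 0.4, 2.0
I = np.eye(n)
X = rng.normal(size=(n, 1))
eps = rng.normal(size=n)

# u = (I - lambda*W)^{-1} eps : autocorrelated disturbance
u = np.linalg.solve(I - lam * W, eps)
# y = (I - rho*W)^{-1} (X beta + u) : endogenous spillover on top
y = np.linalg.solve(I - rho * W, X[:, 0] * beta + u)

# Sanity check: y satisfies the structural equation y = rho*W y + beta*X + u
print(np.allclose(y - rho * W @ y, X[:, 0] * beta + u))
```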
The residuals still aren’t perfectly random, and we can’t get rid of that skew, but these are real data, and they’re imperfect. It is clear that we get closer to normal residuals once we account for autocorrelation
Interesting… if we fit a model with heteroskedastic errors, we can reintroduce some significant spatial dependence in the residuals, even if it’s small. In this case, the difference between the models with and without heteroskedasticity is negligible; their results are basically identical.
Combining the coefficients from the different models into a single table lets us compare how the estimated coefficients change under each specification.
Comparing the estimated coefficients shows that median income was doing a lot of heavy lifting in the OLS model. That is, when we ignore spatial effects, the median income variable absorbs a lot of the spatial variance, inflating its coefficient and biasing our interpretation of how much residents value higher-income neighbors. In the spatial models, the estimated coefficient for income is less than half as large. That’s basically true of all the other neighborhood characteristics as well.
There are a few things going on here. First, it’s important to remember that our data are measured at two different spatial levels. Our housing sales are measured at the unit level, but the ‘neighborhood’ variables are measured at the census blockgroup level. Because we have several sales nested inside each census unit, we’re building spatial autocorrelation into the model by design. That is, because we don’t know the income, education, race, etc. of the home’s previous owner, the neighborhood variables are essentially \(Wx\) instead of just \(x\), where \(W\) is defined by discrete membership inside a census unit. That implies we’re actually fitting something closer to a spatial Durbin model (SDM), since our preferred models have spatially lagged \(Y\) variables, and some of the \(X\) variables are “lagged” themselves. [note: I think I would say it’s not technically Durbin because we don’t have \(x\) and \(Wx\) for all inputs, but a mix of the two… I know you don’t have to lag all the \(X\) in a Durbin specification, but I don’t think I’ve seen one where you have only \(Wx\) for some \(x\) (as opposed to only \(x\), not \(Wx\))]
Accounting for space, there are two things going on. In the SAR and spatial combo models, the autoregressive term \(\rho\) is positive, significant, and pretty large in magnitude, showing pretty strong evidence of a spillover effect in housing prices. But the combo models also have a large and significant spatial error term \(\lambda\), so we also have a big unmeasured spatial effect. There are several important variables we might be missing, like perceptions of crime, school quality, and access to amenities. Indeed, according to neoclassical urban theory, we have not included the determining factor in land rents: access to the CBD (or jobs) (Alonso; Muth; Mills).
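With a significant \(\rho\), the \(\beta\) coefficients are no longer marginal effects: a change in \(x_k\) in one unit ripples through \((I - \rho W)^{-1}\). LeSage & Pace (2014) summarize the resulting matrix of partial derivatives as average direct, indirect, and total effects, which are simple to compute once \(\rho\) and \(\beta_k\) are estimated. A sketch with made-up estimates on a toy graph (the values of `rho` and `beta_k` are hypothetical, not from the models above):

```python
import numpy as np

# Hypothetical estimates from a SAR-type model
rho, beta_k = 0.5, 0.15   # autoregressive parameter, coefficient on x_k

n = 50
W = np.zeros((n, n))
idx = np.arange(n)
W[idx, (idx + 1) % n] = 0.5   # row-standardized ring graph
W[idx, (idx - 1) % n] = 0.5

# Matrix of partial derivatives dy/dx_k' = (I - rho*W)^{-1} * beta_k
S = np.linalg.inv(np.eye(n) - rho * W) * beta_k

ade = np.diag(S).mean()   # average direct effect (includes feedback, so > beta_k)
ate = S.sum() / n         # average total effect (= beta_k / (1 - rho) here)
aie = ate - ade           # average indirect (spillover) effect
print(ade, aie, ate)
```

With a row-standardized \(W\), the average total effect collapses to \(\beta_k / (1 - \rho)\), which is why a large \(\rho\) implies total effects much bigger than the raw coefficient.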
35.4 Estimation Techniques
Classically, spatial econometric models are estimated using Maximum Likelihood (ML), but modern approaches prefer the Generalized Method of Moments (GMM). GMM estimation is much faster and applicable to much larger datasets, but it also depends on instrumental variables. You can provide these instruments yourself, or you can use spatial lags of the \(X\) variables, which is the standard technique. This is important because when estimating a spatial Durbin model using GMM, you will need either to use higher-order neighbors as instruments, or to provide your own.
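For intuition, the default Kelejian–Prucha-style instrument set for \(Wy\) consists of spatial lags of the exogenous regressors. A sketch of building such an instrument matrix by hand (toy data; spreg constructs the equivalent internally, with the lag order controlled by arguments like `w_lags`):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 20, 2
X = rng.normal(size=(n, k))   # exogenous regressors (synthetic)

# Row-standardized ring graph standing in for the weights matrix
W = np.zeros((n, n))
idx = np.arange(n)
W[idx, (idx + 1) % n] = 0.5
W[idx, (idx - 1) % n] = 0.5

# Instruments for the endogenous Wy term: [X, WX, W^2 X]
Q = np.hstack([X, W @ X, W @ W @ X])
print(Q.shape)  # (20, 6)
```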
Anselin, L., & Rey, S. J. (2014). Modern spatial econometrics in practice: A guide to GeoDa, GeoDaSpace and PySAL. GeoDa Press.
Bardaka, E., Delgado, M. S., & Florax, R. J. G. M. (2018). Causal identification of transit-induced gentrification and spatial spillover effects: The case of the Denver light rail. Journal of Transport Geography, 71(June), 15–31. https://doi.org/10.1016/j.jtrangeo.2018.06.025
Delgado, M. S., & Florax, R. J. G. M. (2015). Difference-in-differences techniques for spatial data: Local autocorrelation and spatial interaction dynamics. Economics Letters, 137, 123–126.
Halleck Vega, S., & Elhorst, J. P. (2015). The SLX model. Journal of Regional Science, 55(3), 339–363. https://doi.org/10.1111/jors.12188
LeSage, J. P. (2014). What Regional Scientists Need to Know About Spatial Econometrics. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.2420725
LeSage, J. P., & Pace, R. K. (2014). Interpreting Spatial Econometric Models. In Handbook of Regional Science (pp. 1535–1552). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-23430-9_91