35  Spatial Hedonics in Python

Code
import geopandas as gpd
import pandas as pd
import numpy as np
import esda
import libpysal as lps
import matplotlib.pyplot as plt
from geosnap import DataStore
from geosnap.io import get_acs
from splot import esda as esdplt
%load_ext watermark
%watermark -a 'eli knaap' -v -d -u -p spreg,geopandas,geosnap,statsmodels
Author: eli knaap

Last updated: 2024-03-26

Python implementation: CPython
Python version       : 3.10.8
IPython version      : 8.9.0

spreg      : 1.4.2
geopandas  : 0.14.1
geosnap    : 0.12.1.dev9+g3a1cb0f6de61.d20240110
statsmodels: 0.13.5

35.1 Understanding and Interpreting Spatial Models

In the last chapter, several commonly-used model specifications were defined, but the choice among them depends on the research question. Naturally, if spillover effects are of substantive interest, then it is necessary to choose a model that includes (at least) \(\rho WY\) or \(\theta WX\), but beyond that, the choice is more complex.

The literature basically offers two different strategies. Anselin & Rey (2014) argue that a “bottom-up” strategy is most effective, where we begin with simple models (e.g., OLS), then conduct spatial diagnostic tests to decide whether more complex specifications are necessary. LeSage (2014) and LeSage & Pace (2014) argue for a “top-down” strategy where we begin with a ‘fully specified’ conceptual model. LeSage (2014, p. 19) goes so far as to say that “there are only two model specifications worth considering for applied work.” When there are theoretical reasons to consider global spillover processes (such as home prices in a hedonic model), the SDM is appropriate. In other cases that consider only local spillover effects, the SDEM specification will provide unbiased estimates of all parameters.

Their logic (to always prefer the Durbin specifications) is simple: the inclusion of the \(\theta WX\) term ensures that if there is an exogenous spillover process, then it will be measured accurately with only a small efficiency penalty, but omitting the term would lead to bias.

The distinction between the SDM versus the SDEM, however, is largely a theoretical choice for a given research question rather than an empirical matter. As LeSage & Pace (2014, p. 1551) describe:


“Global spillovers arise when changes in a characteristic of one region impact outcomes in more than just immediately neighboring regions. Global impacts fall on neighbors to the neighbors, neighbors to the neighbors to the neighbors, and so on…

Local spillovers represent a situation where the impacts fall only on nearby or immediate neighbors, dying out before they impact regions that are neighbors to the neighbors.

Another distinction between local and global spillovers is that feedback effects arise in the case of global spillovers, but not in the case of local spillovers.”


It follows “that practitioners cannot simply estimate an SDM and SDEM model and draw meaningful conclusions about the correct model specification from point estimates of \(\rho\) or \(\lambda\)… the SDM specification implies global spatial spillovers, while the SDEM is consistent with local spillovers. These have very different policy implications in applied work” (LeSage, 2014, p. 22), because indirect (and total) effects will be underestimated when the SDM is the appropriate model but the SDEM is fit, since the SDEM does not include feedback effects.

The SAR and SDM models, which contain the spatially-lagged dependent variable \(\rho Wy\), specify global spillovers; the SLX, SDM, and SDEM models, which contain the \(\theta WX\) term, specify local spillovers (the SDM specifies both). The difference between global and local spillovers pertains to the ways that effects propagate through the system and the ways that model coefficients may be interpreted.

35.1.1 Model selection

Different specifications imply different data-generating processes (DGPs) and reflect different research questions.

The SAR model assumes an explicit spillover process, whereas the SEM assumes spatially correlated omitted variables. Theory should guide the choice of specification, even though most applied work takes a fit-based model-selection approach.

A sobering critique of applied spatial econometric modeling is given by Gibbons & Overman (2012), who argue compellingly that it

does not provide a toolbox that gives satisfactory solutions to these problems for applied researchers interested in causality and the economic processes at work. It is in this sense that we call into question the approach of the burgeoning body of applied economic research that proceeds with mechanical application of spatial econometric techniques to causal questions of “spillovers” and other forms of spatial interaction, or that estimates spatial lag models as a quick fix for endogeneity issues, or that blindly applies spatial econometric models by default without serious consideration of whether this is necessary given the research question in hand.

To overcome these issues, they argue that applied work should adopt an experimentalist design, where some form of exogenous variation is induced to provide insight into a causal mechanism at play. Models that incorporate both spatial and temporal dimensions (explored in the next section) provide an important avenue forward to address this issue (Bardaka et al., 2018; Delgado & Florax, 2015).

Alternatively, structural models should proceed by considering the SLX model as the point of departure, or use strong theory to guide the choice of a particular specification as described above (Halleck Vega & Elhorst, 2015). Although debate remains in urban economics, we believe there is reasonable justification for assuming a potential autoregressive process in the specific context of hedonic housing price models.

Direct and indirect effects of spillovers
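These direct and indirect effects can be computed by hand: in a SAR model the matrix of partial derivatives is \(\partial y / \partial x_k = (I - \rho W)^{-1}\beta_k\); its diagonal mean is the average direct effect (which includes feedback), and its average row sum is the average total effect, which equals \(\beta_k/(1-\rho)\) for row-standardized \(W\). A minimal numpy sketch with a toy ring-lattice \(W\) and illustrative parameter values (not estimates from the models below):

```python
import numpy as np

# toy ring-lattice W: each unit has two neighbors, row-standardized
n = 50
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = 0.5
    W[i, (i + 1) % n] = 0.5

rho, beta = 0.5, 2.0  # illustrative SAR parameters

# matrix of partial derivatives dy/dx_k = (I - rho W)^{-1} beta_k
M = np.linalg.inv(np.eye(n) - rho * W) * beta

direct = M.diagonal().mean()   # average direct effect (incl. feedback loops)
total = M.sum(axis=1).mean()   # average total effect = beta / (1 - rho) = 4.0
indirect = total - direct      # average indirect (spillover) effect

print(direct, indirect, total)
```

Note that the direct effect exceeds \(\beta\) itself because of feedback: a change in unit \(i\) affects its neighbors, which in turn affect \(i\) again.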

35.1.2 Graph (\(W\)) Selection

There are conflicting opinions on the importance of the neighbor graph.

Code
datasets = DataStore()
Code
balt_acs = get_acs(datasets, county_fips='24510', years=2019)

The state of Maryland publishes a rich set of parcel-level data called PropertyView. The state also provides sales listings broken down by quarter or aggregated by year.

Code
# read in the points (not the polys)

gdf = gpd.read_file('data/BACIparcels0622/BACIMDPV.shp')
Code
sales = gpd.read_file('data/Statewide-Res-Sales-CY2018/mdst_sale18.shp')
Code
gdf = gdf.set_index('ACCTID')
Code
sales = sales.set_index('acctid')
Code
balt_sales = gdf.drop(columns=['geometry']).join(sales, how='inner')
Code
balt_sales = gpd.GeoDataFrame(balt_sales)
Code
# this probably provides some interesting info 
balt_sales.DESCCNST.value_counts()
CNST Brick               3559
CNST Siding              1108
CNST Frame               1053
CNST Stucco               182
CNST Shingle Asbestos     177
CNST Shingle Wood         117
CNST Block                 47
CNST 1/2 Brick Frame       44
CNST 1/2 Brick Siding      35
CNST Stone                 35
CNST 1/2 Stone Frame       10
CNST 1/2 Stone Siding       2
Name: DESCCNST, dtype: int64
Code
balt_sales = balt_sales[~balt_sales.DESCCNST.isin(['CNST 1/2 Stone Frame', 'CNST 1/2 Stone Siding',
                                                  'CNST Block', 'CNST 1/2 Brick Frame', 'CNST Stone',
                                                  'CNST 1/2 Brick Siding'])]
Code
balt_sales = balt_sales[balt_sales.RESIDENT==1]
Code
balt_sales.STRUBLDG.value_counts()
0003    3071
0013    1400
0001    1263
0002    1192
0009     179
0008     137
0007      70
0005      13
0010       5
0004       4
Name: STRUBLDG, dtype: int64
Code
balt_sales.DESCBLDG.value_counts()
DWEL Center Unit        3071
DWEL Rental Dwelling    1400
DWEL Standard Unit      1263
DWEL End Unit           1192
DWEL Condo Hi Rise       179
DWEL Condo Garden        137
DWEL Condo Townhouse      70
DWEL Split Level          13
DWEL Penthouse             5
DWEL Split Foyer           4
Name: DESCBLDG, dtype: int64
Code
balt_sales = balt_sales[~balt_sales.STRUBLDG.isin(['0005', '0004'])]
Code
# parse the transaction date strings (astype('datetime64') is deprecated in recent pandas)
balt_sales['date'] = pd.to_datetime(balt_sales.tradate)
Code
balt_sales['date']
0301011738 014   2018-01-12
0301011738 017   2018-09-06
0301011738 022   2018-01-08
0301011738 024   2018-06-13
0301011738 026   2018-01-16
                    ...    
0328058102I060   2018-04-18
0328058102J016   2018-11-21
0328058102L031   2018-11-21
0328058102M013   2018-10-30
0328058127A010   2018-08-22
Name: date, Length: 7335, dtype: datetime64[ns]
Code
balt_sales['month']=balt_sales.tradate.str[4:6]
Code
balt_sales.DESCBLDG.value_counts()
DWEL Center Unit        3071
DWEL Rental Dwelling    1400
DWEL Standard Unit      1263
DWEL End Unit           1192
DWEL Condo Hi Rise       179
DWEL Condo Garden        137
DWEL Condo Townhouse      70
DWEL Penthouse             5
Name: DESCBLDG, dtype: int64
Code
balt_sales.RESITYP.value_counts()
TH    4330
AP    1403
SF    1266
CN     336
Name: RESITYP, dtype: int64
Code
balt_sales.RESITYP.value_counts().plot(kind='bar')

Code
balt_sales = balt_sales[balt_sales.RESITYP != 'AP']
Code
balt_sales.SQFTSTRC.describe()
count    5932.000000
mean     1480.061362
std       652.125825
min         0.000000
25%      1112.000000
50%      1302.000000
75%      1652.000000
max      7816.000000
Name: SQFTSTRC, dtype: float64

Mostly townhomes sold in 2018, which is not surprising given Baltimore’s housing stock.

Code
# landarea is the size of the parcel, sqftstrc is the size of the structure
balt_sales[['SQFTSTRC', 'LANDAREA']].corr()
SQFTSTRC LANDAREA
SQFTSTRC 1.000000 0.339829
LANDAREA 0.339829 1.000000
Code
balt_sales = balt_sales.dropna(subset=['STRUGRAD', 'YEARBLT'])
Code
balt_sales = balt_sales[balt_sales.DESCLU =='Residential']
Code
balt_sales[balt_sales.considr1> 1000000].considr1.plot(kind='density')

Code
# exclude nominal transfers; a sale doesn't count if it was (nearly) free
balt_sales = balt_sales[balt_sales.considr1 >20000]
Code
balt_sales = balt_sales[balt_sales.considr1 <1500000]
Code
balt_sales.considr1.plot(kind='density')

Code
balt_sales.considr1.plot(kind='box')

Code
from scipy.stats import zscore
Code
balt_sales.considr1.apply(np.log).plot(kind='density')

The log transform gets the distribution close to a normal shape, though with heavy tails.

Code
balt_sales.considr1.apply(np.log).plot(kind='box')

Code
balt_sales['log_price'] = balt_sales.considr1.apply(np.log)
Code
balt_sales.log_price.hist()

Code
pd.Series(zscore(balt_sales.considr1)).plot(kind='density')

Code
balt_sales[ (balt_sales.log_price <15) & (balt_sales.log_price >10)].log_price.hist()

Code
# trim outliers
balt_sales = balt_sales[ (balt_sales.log_price <15) & (balt_sales.log_price >10)]
Code
balt_sales = balt_sales.merge(balt_acs.drop(columns=['geometry']), left_on='bg2010', right_on='geoid')
Code
balt_sales.median_household_income = balt_sales.median_household_income.apply(np.log)
Code
balt_sales.STRUGRAD.astype(int).plot(kind='hist')

Code
COLS = ['DESCBLDG','RESITYP','DESCCNST','SQFTSTRC', 'LANDAREA','YEARBLT','p_nonhisp_black_persons','p_hispanic_persons','p_edu_college_greater','median_household_income','p_housing_units_multiunit_structures','p_vacant_housing_units']
Code
balt_sales = balt_sales.dropna(subset=COLS)

35.2 ESDA

Following the bottom-up strategy, we generally start with ESDA to establish whether spatial econometrics is warranted at all. We begin by mapping the dependent variable (housing price, here) and examining global and local autocorrelation.

Code
balt_sales[['considr1', 'log_price', 'geometry']].explore('log_price', scheme='fisher_jenks', k=10, cmap='plasma',tiles='Stamen Toner Lite')

There are definitely spatial patterns, but they are not uniform. There is probably autocorrelation in the sales prices, but it seems particularly strong in some pockets, so we should check both global and local statistics.

35.2.1 Global Stats

Code
# prices are probably influenced by a broader 'neighborhood' than just the adjacent units,
# so let's use distance-band weights (3/4 km; ~10 min walk)

W = lps.weights.DistanceBand.from_dataframe(balt_sales, threshold=750)
W.transform = 'r'
/Users/knaaptime/mambaforge/envs/urban_analysis/lib/python3.10/site-packages/libpysal/weights/weights.py:172: UserWarning: The weights matrix is not fully connected: 
 There are 5 disconnected components.
  warnings.warn(message)
Code
# but we can also look at hyperlocal trends
w_knn = lps.weights.KNN.from_dataframe(balt_sales, k=10)
w_knn.transform = 'r'
/Users/knaaptime/mambaforge/envs/urban_analysis/lib/python3.10/site-packages/libpysal/weights/weights.py:172: UserWarning: The weights matrix is not fully connected: 
 There are 10 disconnected components.
  warnings.warn(message)
Code
moran = esda.Moran(balt_sales['considr1'], W)
Code
moran_log = esda.Moran(balt_sales['log_price'], W)
Code
moran_knn = esda.Moran(balt_sales['log_price'], w_knn)
Code
esdplt.plot_moran(moran)
(<Figure size 720x288 with 2 Axes>,
 array([<AxesSubplot:title={'center':'Reference Distribution'}, xlabel='Moran I: 0.48', ylabel='Density'>,
        <AxesSubplot:title={'center':'Moran Scatterplot (0.48)'}, xlabel='Attribute', ylabel='Spatial Lag'>],
       dtype=object))

Code
esdplt.plot_moran(moran_log)
(<Figure size 720x288 with 2 Axes>,
 array([<AxesSubplot:title={'center':'Reference Distribution'}, xlabel='Moran I: 0.52', ylabel='Density'>,
        <AxesSubplot:title={'center':'Moran Scatterplot (0.52)'}, xlabel='Attribute', ylabel='Spatial Lag'>],
       dtype=object))

Code
esdplt.plot_moran(moran_knn)
(<Figure size 720x288 with 2 Axes>,
 array([<AxesSubplot:title={'center':'Reference Distribution'}, xlabel='Moran I: 0.66', ylabel='Density'>,
        <AxesSubplot:title={'center':'Moran Scatterplot (0.66)'}, xlabel='Attribute', ylabel='Spatial Lag'>],
       dtype=object))
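Under the hood, Moran’s I is a simple cross-product statistic, \(I = \frac{n}{S_0}\frac{z'Wz}{z'z}\), where \(z\) holds deviations from the mean and \(S_0\) is the sum of all weights. A hand-rolled numpy sketch on a toy grid (rook contiguity built manually; the gradient variable is purely illustrative) shows why a smooth spatial trend produces a strongly positive value:

```python
import numpy as np

# toy 10x10 grid with rook-contiguity (binary) weights, built by hand
side = 10
n = side * side
W = np.zeros((n, n))
for r in range(side):
    for c in range(side):
        i = r * side + c
        if r > 0:        W[i, i - side] = 1  # north neighbor
        if r < side - 1: W[i, i + side] = 1  # south neighbor
        if c > 0:        W[i, i - 1] = 1     # west neighbor
        if c < side - 1: W[i, i + 1] = 1     # east neighbor

# a smooth west-east gradient (e.g., prices rising across the city)
y = np.tile(np.arange(side, dtype=float), side)

z = y - y.mean()
S0 = W.sum()
I = (n / S0) * (z @ W @ z) / (z @ z)
print(round(I, 3))  # -> 0.889, far above E[I] = -1/(n-1)
```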

35.2.2 Local Stats

Code
lisa = esda.Moran_Local(balt_sales.log_price, W, n_jobs=-1, )
Code
esdplt.plot_local_autocorrelation
<function splot._viz_esda_mpl.plot_local_autocorrelation(moran_loc, gdf, attribute, p=0.05, region_column=None, mask=None, mask_color='#636363', quadrant=None, aspect_equal=True, legend=True, scheme='Quantiles', cmap='YlGnBu', figsize=(15, 4), scatter_kwds=None, fitline_kwds=None)>
Code
esdplt.plot_local_autocorrelation(lisa, balt_sales, 'log_price')
(<Figure size 1080x288 with 3 Axes>,
 array([<AxesSubplot:title={'center':'Moran Local Scatterplot'}, xlabel='Attribute', ylabel='Spatial Lag'>,
        <AxesSubplot:>, <AxesSubplot:>], dtype=object))

Code
balt_sales.YEARBLT = balt_sales.YEARBLT.astype(int)

35.3 Models

Taking different modeling strategies, we can vary the way we operationalize \(p(h)\) to best identify the coefficients of interest, including the potential for spatial spillover (e.g., the price of one unit affecting the prices of those nearby).

Code
import statsmodels.formula.api as smf
Code
# treat the (ordinal) structure grade variable as continuous for simplicity
balt_sales.STRUGRAD = balt_sales.STRUGRAD.astype(int)
Code
formula = 'log_price ~ STRUGRAD + C(RESITYP) + C(DESCCNST) + np.log(SQFTSTRC) + YEARBLT + I(YEARBLT**2) + p_nonhisp_white_persons + p_edu_college_greater + median_household_income + p_housing_units_multiunit_structures + p_owner_occupied_units'

35.3.1 Aspatial (OLS)

\[y = \beta X + \epsilon\]

Code
model_ols = smf.ols(formula,balt_sales).fit()
Code
model_ols.summary()
OLS Regression Results
Dep. Variable: log_price R-squared: 0.719
Model: OLS Adj. R-squared: 0.719
Method: Least Squares F-statistic: 900.7
Date: Wed, 06 Jul 2022 Prob (F-statistic): 0.00
Time: 17:18:31 Log-Likelihood: -1356.8
No. Observations: 5284 AIC: 2746.
Df Residuals: 5268 BIC: 2851.
Df Model: 15
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 76.8636 10.352 7.425 0.000 56.569 97.158
C(RESITYP)[T.TH] -0.2046 0.014 -15.092 0.000 -0.231 -0.178
C(DESCCNST)[T.CNST Frame] -0.0117 0.012 -0.950 0.342 -0.036 0.012
C(DESCCNST)[T.CNST Shingle Asbestos] 0.0286 0.027 1.048 0.295 -0.025 0.082
C(DESCCNST)[T.CNST Shingle Wood] -0.0419 0.034 -1.240 0.215 -0.108 0.024
C(DESCCNST)[T.CNST Siding] 0.0374 0.013 2.851 0.004 0.012 0.063
C(DESCCNST)[T.CNST Stucco] -0.0445 0.026 -1.729 0.084 -0.095 0.006
STRUGRAD 0.1719 0.006 30.879 0.000 0.161 0.183
np.log(SQFTSTRC) 0.5050 0.015 33.509 0.000 0.475 0.535
YEARBLT -0.0747 0.011 -7.018 0.000 -0.096 -0.054
I(YEARBLT ** 2) 1.958e-05 2.74e-06 7.139 0.000 1.42e-05 2.5e-05
p_nonhisp_white_persons 0.0014 0.000 6.086 0.000 0.001 0.002
p_edu_college_greater 0.0052 0.000 13.538 0.000 0.004 0.006
median_household_income 0.1501 0.017 8.792 0.000 0.117 0.184
p_housing_units_multiunit_structures 0.0019 0.000 7.796 0.000 0.001 0.002
p_owner_occupied_units 0.0047 0.000 9.447 0.000 0.004 0.006
Omnibus: 771.083 Durbin-Watson: 1.436
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1236.192
Skew: -1.001 Prob(JB): 3.67e-269
Kurtosis: 4.267 Cond. No. 8.94e+09


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.94e+09. This might indicate that there are
strong multicollinearity or other numerical problems.
Code
model_ols.resid
0       0.366840
1       0.078007
2       0.166342
3       0.089591
4      -0.011663
          ...   
5489    0.052329
5490   -0.629540
5491    0.028368
5492   -0.630452
5493    0.030852
Length: 5284, dtype: float64
Code
ax = pd.DataFrame({'predy': zscore(model_ols.fittedvalues), 'u': zscore(model_ols.resid)}).plot(x='predy', y='u', kind='scatter')
ax.hlines(0, -5, 5, 'red')
ax.vlines(0,-5, 5, 'red')

Code
import seaborn as sns

ax = sns.kdeplot(y=zscore(model_ols.fittedvalues), x=zscore(model_ols.resid))
ax.hlines(0, -5, 5, 'red')
ax.vlines(0,-5, 5, 'red')

The residuals are pretty heteroskedastic (and the Jarque-Bera test is highly significant in the statsmodels results table above), so we will need to handle that if we plan to make valid inferences.
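One standard remedy is a heteroskedasticity-consistent (“sandwich”) covariance estimator, which replaces \(\sigma^2(X'X)^{-1}\) with \((X'X)^{-1}X'\mathrm{diag}(e^2)X(X'X)^{-1}\); statsmodels exposes this via `fit(cov_type='HC3')` and related variants. A self-contained numpy sketch of the White/HC0 version on simulated heteroskedastic data (all names and values here are illustrative, not drawn from the Baltimore data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.uniform(1, 10, n)
X = np.column_stack([np.ones(n), x])
# error variance grows with x -> heteroskedastic
y = 2.0 + 0.5 * x + rng.normal(scale=0.3 * x)

beta = np.linalg.solve(X.T @ X, X.T @ y)  # OLS coefficients
e = y - X @ beta                          # residuals

XtX_inv = np.linalg.inv(X.T @ X)
se_naive = np.sqrt(np.diag(XtX_inv) * (e @ e) / (n - 2))

# White/HC0 sandwich estimator: bread * meat * bread
meat = X.T @ (X * (e ** 2)[:, None])
cov_hc0 = XtX_inv @ meat @ XtX_inv
se_robust = np.sqrt(np.diag(cov_hc0))

print(se_naive, se_robust)  # robust SEs diverge from naive ones under heteroskedasticity
```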

Code
moran_resid = esda.Moran(model_ols.resid, W)
Code
moran_resid_knn = esda.Moran(model_ols.resid, w_knn)
Code
moran_resid_knn.p_sim
0.001
Code
esdplt.plot_moran(moran_resid)
(<Figure size 720x288 with 2 Axes>,
 array([<AxesSubplot:title={'center':'Reference Distribution'}, xlabel='Moran I: 0.13', ylabel='Density'>,
        <AxesSubplot:title={'center':'Moran Scatterplot (0.13)'}, xlabel='Attribute', ylabel='Spatial Lag'>],
       dtype=object))

There is still significant autocorrelation in the residual, so we are either failing to account for a spatial spillover process (implying a SAR model), omitting crucial spatially-autocorrelated explanatory variables (implying a SEM), or both (implying a SARAR model).
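The classic Lagrange Multiplier diagnostics formalize this choice: for instance, the LM-error statistic \((e'We/\hat\sigma^2)^2 / \mathrm{tr}(W'W + WW)\) is distributed \(\chi^2(1)\) under the null of no residual spatial autocorrelation (spreg reports LM-lag and LM-error when `spat_diag=True` is passed to `spreg.OLS`). A hand-rolled numpy sketch on simulated spatially-correlated residuals (toy weights; everything here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
# toy ring-lattice weights, row-standardized
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = 0.5
    W[i, (i + 1) % n] = 0.5

lam = 0.7
# spatially autocorrelated errors: e = (I - lam W)^{-1} eps
e = np.linalg.inv(np.eye(n) - lam * W) @ rng.normal(size=n)

sigma2 = (e @ e) / n
T = np.trace(W.T @ W + W @ W)
lm_error = (e @ W @ e / sigma2) ** 2 / T  # chi2(1) under H0

print(round(lm_error, 2))  # well above the 5% critical value of 3.84 -> reject H0
```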

Code
esdplt.plot_moran(moran_resid_knn)
(<Figure size 720x288 with 2 Axes>,
 array([<AxesSubplot:title={'center':'Reference Distribution'}, xlabel='Moran I: 0.23', ylabel='Density'>,
        <AxesSubplot:title={'center':'Moran Scatterplot (0.23)'}, xlabel='Attribute', ylabel='Spatial Lag'>],
       dtype=object))

The autocorrelation is even stronger at smaller scales (Tobler’s First Law at work).

We could fit the same model with spreg.

We can use patsy to generate the X and y inputs spreg requires from the same formula as before.

Code
from patsy import dmatrices
Code
# spreg will add an intercept so redefine the formula without one

#formula_spreg = "log_price ~ C(DESCBLDG)  + C(RESITYP) + STRUGRAD + SQFTSTRC +LANDAREA + YEARBLT + I(YEARBLT**2) + p_nonhisp_black_persons + p_hispanic_persons + p_edu_college_greater + median_household_income + p_housing_units_multiunit_structures + p_vacant_housing_units"
Code
y, X = dmatrices(formula, balt_sales)
Code
# need to drop the intercept. Note you can't just define the formula without one,
# because then patsy won't drop a level from the first categorical variable,
# so you need this fix anyway

# slice off the intercept
name_x=X.design_info.column_names[1:]
name_y=y.design_info.column_names[0]

y = np.asarray(y).flatten()
X = np.asarray(X)[:,1:]
Code
from spreg import OLS
Code
model_ols_spreg = OLS(y=y, x=X, w=W, name_x=name_x, name_y=name_y)
Code
from rich.jupyter import print
Code
print(model_ols_spreg.summary)
REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :   log_price                Number of Observations:        5284
Mean dependent var  :     12.1935                Number of Variables   :          16
S.D. dependent var  :      0.5906                Degrees of Freedom    :        5268
R-squared           :      0.7195
Adjusted R-squared  :      0.7187
Sum squared residual:     517.042                F-statistic           :    900.6752
Sigma-square        :       0.098                Prob(F-statistic)     :           0
S.E. of regression  :       0.313                Log likelihood        :   -1356.831
Sigma-square ML     :       0.098                Akaike info criterion :    2745.661
S.E of regression ML:      0.3128                Schwarz criterion     :    2850.820

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     t-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT      76.8635679      10.3521089       7.4249188       0.0000000
    C(RESITYP)[T.TH]      -0.2046220       0.0135583     -15.0920117       0.0000000
C(DESCCNST)[T.CNST Frame]      -0.0117367       0.0123539      -0.9500417       0.3421347
C(DESCCNST)[T.CNST Shingle Asbestos]       0.0285631       0.0272525       1.0480936       0.2946436
C(DESCCNST)[T.CNST Shingle Wood]      -0.0418795       0.0337687      -1.2401878       0.2149612
C(DESCCNST)[T.CNST Siding]       0.0373530       0.0130997       2.8514383       0.0043691
C(DESCCNST)[T.CNST Stucco]      -0.0445163       0.0257458      -1.7290690       0.0838553
            STRUGRAD       0.1719022       0.0055669      30.8794260       0.0000000
    np.log(SQFTSTRC)       0.5049943       0.0150705      33.5088712       0.0000000
             YEARBLT      -0.0747352       0.0106498      -7.0175411       0.0000000
     I(YEARBLT ** 2)       0.0000196       0.0000027       7.1389532       0.0000000
p_nonhisp_white_persons       0.0014392       0.0002365       6.0855999       0.0000000
p_edu_college_greater       0.0052480       0.0003877      13.5378368       0.0000000
median_household_income       0.1500605       0.0170682       8.7918043       0.0000000
p_housing_units_multiunit_structures       0.0019057       0.0002444       7.7963177       0.0000000
p_owner_occupied_units       0.0046954       0.0004970       9.4466947       0.0000000
------------------------------------------------------------------------------------

REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER        18745.447

TEST ON NORMALITY OF ERRORS
TEST                             DF        VALUE           PROB
Jarque-Bera                       2        1236.193           0.0000

DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST                             DF        VALUE           PROB
Breusch-Pagan test               15         553.881           0.0000
Koenker-Bassett test             15         339.085           0.0000
================================ END OF REPORT =====================================

Most of the variables are significant, except for the construction materials, where only siding has a significant effect. Otherwise, the coefficients are as expected: housing prices increase with larger and newer homes (though historic homes also get a bump), townhomes are less expensive than single-family homes, and the neighborhood’s demographic composition has small effects, with the exception of median income, whose effect is relatively large.

Code
np.set_printoptions(suppress=True) #prevent scientific format
Code
balt_sales.DESCBLDG = balt_sales.DESCBLDG.astype('category')

35.3.2 Spatial Models

Code
from spreg import GM_Error, GM_Lag, GM_Combo, GM_Error_Het, GM_Combo_Het

35.3.2.1 Spatial Error

\[y = \beta X + u\] \[ u = \lambda Wu+ \epsilon \]
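The GM estimator behind `GM_Error` works by spatial filtering (a spatial analogue of Cochrane-Orcutt): once \(\lambda\) is estimated from moment conditions, it regresses \(y - \lambda Wy\) on \(X - \lambda WX\), so the filtered errors are iid. A numpy sketch of the filtering step with the true \(\lambda\) plugged in (simulated data on a toy lattice; in practice \(\lambda\) is estimated, not known):

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam, beta = 400, 0.6, 1.5
# toy ring-lattice weights, row-standardized
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = 0.5
    W[i, (i + 1) % n] = 0.5

x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
# SEM data-generating process: u = (I - lam W)^{-1} eps
u = np.linalg.inv(np.eye(n) - lam * W) @ rng.normal(scale=0.5, size=n)
y = X @ np.array([0.0, beta]) + u

# spatial Cochrane-Orcutt filter, then plain OLS on the filtered data
yf = y - lam * (W @ y)
Xf = X - lam * (W @ X)
beta_hat = np.linalg.solve(Xf.T @ Xf, Xf.T @ yf)

print(beta_hat)  # close to the true [0, 1.5]
```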

Code
model_sem = GM_Error(y=y, x=X, w=w_knn, name_x=name_x, name_y=name_y,)
Code
print(model_sem.summary)
REGRESSION
----------
SUMMARY OF OUTPUT: SPATIALLY WEIGHTED LEAST SQUARES
---------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :   log_price                Number of Observations:        5284
Mean dependent var  :     12.1935                Number of Variables   :          16
S.D. dependent var  :      0.5906                Degrees of Freedom    :        5268
Pseudo R-squared    :      0.7172

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT      61.1798061      11.5402663       5.3014206       0.0000001
    C(RESITYP)[T.TH]      -0.2290629       0.0158447     -14.4567132       0.0000000
C(DESCCNST)[T.CNST Frame]       0.0001360       0.0118460       0.0114815       0.9908393
C(DESCCNST)[T.CNST Shingle Asbestos]       0.0134232       0.0255281       0.5258218       0.5990121
C(DESCCNST)[T.CNST Shingle Wood]      -0.0224539       0.0321179      -0.6991078       0.4844846
C(DESCCNST)[T.CNST Siding]       0.0188702       0.0129996       1.4516048       0.1466115
C(DESCCNST)[T.CNST Stucco]      -0.0254485       0.0242396      -1.0498743       0.2937759
            STRUGRAD       0.1627187       0.0056562      28.7681928       0.0000000
    np.log(SQFTSTRC)       0.4907576       0.0155717      31.5159104       0.0000000
             YEARBLT      -0.0578912       0.0118881      -4.8696967       0.0000011
     I(YEARBLT ** 2)       0.0000153       0.0000031       4.9756099       0.0000007
p_nonhisp_white_persons       0.0021840       0.0003793       5.7576789       0.0000000
p_edu_college_greater       0.0043246       0.0005532       7.8170607       0.0000000
median_household_income       0.1136662       0.0221113       5.1406327       0.0000003
p_housing_units_multiunit_structures       0.0015271       0.0003356       4.5500314       0.0000054
p_owner_occupied_units       0.0034179       0.0006557       5.2124870       0.0000002
              lambda       0.5590364    
------------------------------------------------------------------------------------
================================ END OF REPORT =====================================

Still, most variables are significant (though siding construction falls out), and the remaining significant coefficients have shifted a bit.

Code
pd.DataFrame({'variable':model_sem.name_x, 
              'beta' : model_sem.betas.flatten(),
              'p_val': pd.concat([pd.DataFrame(model_sem.z_stat)[1].round(4), pd.Series([np.nan])], ignore_index=True)})
variable beta p_val
0 CONSTANT 61.179806 0.0000
1 C(RESITYP)[T.TH] -0.229063 0.0000
2 C(DESCCNST)[T.CNST Frame] 0.000136 0.9908
3 C(DESCCNST)[T.CNST Shingle Asbestos] 0.013423 0.5990
4 C(DESCCNST)[T.CNST Shingle Wood] -0.022454 0.4845
5 C(DESCCNST)[T.CNST Siding] 0.018870 0.1466
6 C(DESCCNST)[T.CNST Stucco] -0.025449 0.2938
7 STRUGRAD 0.162719 0.0000
8 np.log(SQFTSTRC) 0.490758 0.0000
9 YEARBLT -0.057891 0.0000
10 I(YEARBLT ** 2) 0.000015 0.0000
11 p_nonhisp_white_persons 0.002184 0.0000
12 p_edu_college_greater 0.004325 0.0000
13 median_household_income 0.113666 0.0000
14 p_housing_units_multiunit_structures 0.001527 0.0000
15 p_owner_occupied_units 0.003418 0.0000
16 lambda 0.559036 NaN
Code
moran_resid_sem = esda.Moran(model_sem.e_filtered, W,)
Code
esdplt.plot_moran(moran_resid_sem)
(<Figure size 720x288 with 2 Axes>,
 array([<AxesSubplot:title={'center':'Reference Distribution'}, xlabel='Moran I: 0.03', ylabel='Density'>,
        <AxesSubplot:title={'center':'Moran Scatterplot (0.03)'}, xlabel='Attribute', ylabel='Spatial Lag'>],
       dtype=object))

The SEM specification removes all spatial autocorrelation from the residual, as expected. But we also have no sense of how prices spill over from one unit to another. To estimate spillover effects, we need a spatial autoregressive (SAR) model.

Code
ax = pd.DataFrame({'predy': zscore(model_sem.predy.flatten()), 'u': zscore(model_sem.e_filtered.flatten())}).plot(x='predy', y='u', kind='scatter')
ax.hlines(0, -5, 5, 'red')
ax.vlines(0,-5, 5, 'red')

Code
ax = sns.kdeplot(y=zscore(model_sem.predy.flatten()), x=zscore(model_sem.e_filtered.flatten()))
ax.hlines(0, -5, 5, 'red')
ax.vlines(0,-5, 5, 'red')

Code
pd.DataFrame(zscore(model_sem.e_filtered)).plot(kind='density')

But the residuals are not perfectly random either, so it might be a good idea to refit with a model that accounts for heteroskedastic errors.

35.3.2.1.1 SEM w/ heteroskedastic errors
Code
model_sem_het = GM_Error_Het(y=y, x=X, w=W, name_x=name_x, name_y=name_y, )
Code
print(model_sem_het.summary)
REGRESSION
----------
SUMMARY OF OUTPUT: SPATIALLY WEIGHTED LEAST SQUARES (HET)
---------------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :   log_price                Number of Observations:        5284
Mean dependent var  :     12.1935                Number of Variables   :          16
S.D. dependent var  :      0.5906                Degrees of Freedom    :        5268
Pseudo R-squared    :      0.7060
N. of iterations    :           1                Step1c computed       :          No

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT      64.6414781       8.5488515       7.5614225       0.0000000
    C(RESITYP)[T.TH]      -0.2552127       0.0163554     -15.6041474       0.0000000
C(DESCCNST)[T.CNST Frame]      -0.0029566       0.0125368      -0.2358359       0.8135600
C(DESCCNST)[T.CNST Shingle Asbestos]      -0.0037957       0.0214947      -0.1765881       0.8598320
C(DESCCNST)[T.CNST Shingle Wood]      -0.0077443       0.0273618      -0.2830349       0.7771501
C(DESCCNST)[T.CNST Siding]       0.0137209       0.0120292       1.1406325       0.2540229
C(DESCCNST)[T.CNST Stucco]      -0.0260598       0.0215701      -1.2081476       0.2269905
            STRUGRAD       0.1657251       0.0065956      25.1267970       0.0000000
    np.log(SQFTSTRC)       0.5084304       0.0165392      30.7409600       0.0000000
             YEARBLT      -0.0612207       0.0087722      -6.9789183       0.0000000
     I(YEARBLT ** 2)       0.0000162       0.0000023       7.1719155       0.0000000
p_nonhisp_white_persons       0.0025021       0.0003738       6.6935313       0.0000000
p_edu_college_greater       0.0031854       0.0005013       6.3541544       0.0000000
median_household_income       0.0652495       0.0203442       3.2072839       0.0013399
p_housing_units_multiunit_structures       0.0008944       0.0002657       3.3664492       0.0007614
p_owner_occupied_units       0.0022578       0.0005810       3.8862755       0.0001018
              lambda       0.8651115       0.0329580      26.2488987       0.0000000
------------------------------------------------------------------------------------
================================ END OF REPORT =====================================
Code
pd.DataFrame({'variable':model_sem_het.name_x, 
              'beta' : model_sem_het.betas.flatten(),
              'p_val':pd.DataFrame(model_sem_het.z_stat)[1].round(4)})
variable beta p_val
0 CONSTANT 64.641478 0.0000
1 C(RESITYP)[T.TH] -0.255213 0.0000
2 C(DESCCNST)[T.CNST Frame] -0.002957 0.8136
3 C(DESCCNST)[T.CNST Shingle Asbestos] -0.003796 0.8598
4 C(DESCCNST)[T.CNST Shingle Wood] -0.007744 0.7772
5 C(DESCCNST)[T.CNST Siding] 0.013721 0.2540
6 C(DESCCNST)[T.CNST Stucco] -0.026060 0.2270
7 STRUGRAD 0.165725 0.0000
8 np.log(SQFTSTRC) 0.508430 0.0000
9 YEARBLT -0.061221 0.0000
10 I(YEARBLT ** 2) 0.000016 0.0000
11 p_nonhisp_white_persons 0.002502 0.0000
12 p_edu_college_greater 0.003185 0.0000
13 median_household_income 0.065250 0.0013
14 p_housing_units_multiunit_structures 0.000894 0.0008
15 p_owner_occupied_units 0.002258 0.0001
16 lambda 0.865112 0.0000
Code
pd.DataFrame(model_sem_het.e_filtered).plot(kind='density')

Code
ax = pd.DataFrame({'predy': zscore(model_sem_het.predy.flatten()), 'eps': zscore(model_sem_het.e_filtered.flatten())}).plot(x='predy', y='eps', kind='scatter')
ax.hlines(0, -5, 5, 'red')
ax.vlines(0,-5, 5, 'red')

Code
ax = sns.kdeplot(y=zscore(model_sem_het.predy.flatten()), x=zscore(model_sem_het.e_filtered.flatten()))
ax.hlines(0, -5, 5, 'red')
ax.vlines(0,-5, 5, 'red')

Code
moran_resid_sem_het = esda.Moran(model_sem_het.e_filtered, W,)
Code
esdplt.plot_moran(moran_resid_sem_het)
(<Figure size 720x288 with 2 Axes>,
 array([<AxesSubplot:title={'center':'Reference Distribution'}, xlabel='Moran I: -0.01', ylabel='Density'>,
        <AxesSubplot:title={'center':'Moran Scatterplot (-0.01)'}, xlabel='Attribute', ylabel='Spatial Lag'>],
       dtype=object))

Again, the spatially-filtered residuals take care of the unobserved spatial effects, but they also mask out any substantively interesting spillover effects.
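To make "spatially filtered" concrete: the `e_filtered` residuals correspond to the raw residuals with their autoregressive component removed, \(e = (I - \lambda W)u\). A minimal numpy sketch of that filtering step, using a toy dense row-standardized weights matrix (an assumption for illustration, not the chapter's `W` or the fitted model object):

```python
import numpy as np

# Toy illustration of spatial filtering: strip the autoregressive
# component from a residual vector, e = (I - lambda*W) u.
rng = np.random.default_rng(0)
n = 5
u = rng.normal(size=(n, 1))          # raw residuals (synthetic)
Wm = np.ones((n, n)) - np.eye(n)     # everyone neighbors everyone else
Wm /= Wm.sum(axis=1, keepdims=True)  # row-standardize
lam = 0.865                          # lambda estimate from the summary above
e_filtered = u - lam * (Wm @ u)      # spatially filtered residuals
print(e_filtered.shape)
```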

35.3.2.2 Spatial Lag

\[y = \rho W y + \beta X + \epsilon\]

Code
model_sar = GM_Lag(y=y, x=X, w=W, name_x=name_x, name_y=name_y)
Code
print(model_sar.summary)
REGRESSION
----------
SUMMARY OF OUTPUT: SPATIAL TWO STAGE LEAST SQUARES
--------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :   log_price                Number of Observations:        5284
Mean dependent var  :     12.1935                Number of Variables   :          17
S.D. dependent var  :      0.5906                Degrees of Freedom    :        5267
Pseudo R-squared    :      0.7456
Spatial Pseudo R-squared:  0.7223

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT      66.0741101       9.8612513       6.7003779       0.0000000
    C(RESITYP)[T.TH]      -0.1812384       0.0129573     -13.9873691       0.0000000
C(DESCCNST)[T.CNST Frame]      -0.0117211       0.0117461      -0.9978743       0.3183403
C(DESCCNST)[T.CNST Shingle Asbestos]       0.0236255       0.0259132       0.9117180       0.3619172
C(DESCCNST)[T.CNST Shingle Wood]      -0.0643914       0.0321320      -2.0039633       0.0450740
C(DESCCNST)[T.CNST Siding]       0.0236021       0.0124789       1.8913585       0.0585765
C(DESCCNST)[T.CNST Stucco]      -0.0538640       0.0244848      -2.1998982       0.0278141
            STRUGRAD       0.1507346       0.0054235      27.7928655       0.0000000
    np.log(SQFTSTRC)       0.4662676       0.0144914      32.1753650       0.0000000
             YEARBLT      -0.0669650       0.0101351      -6.6072180       0.0000000
     I(YEARBLT ** 2)       0.0000177       0.0000026       6.7618663       0.0000000
p_nonhisp_white_persons       0.0009151       0.0002268       4.0354273       0.0000545
p_edu_college_greater       0.0032512       0.0003851       8.4425292       0.0000000
median_household_income       0.0771006       0.0167325       4.6078247       0.0000041
p_housing_units_multiunit_structures       0.0007453       0.0002413       3.0888734       0.0020092
p_owner_occupied_units       0.0042039       0.0004734       8.8805435       0.0000000
         W_log_price       0.3557615       0.0198750      17.8999895       0.0000000
------------------------------------------------------------------------------------
Instrumented: W_log_price
Instruments: W_C(DESCCNST)[T.CNST Frame], W_C(DESCCNST)[T.CNST Shingle
             Asbestos], W_C(DESCCNST)[T.CNST Shingle Wood],
             W_C(DESCCNST)[T.CNST Siding], W_C(DESCCNST)[T.CNST Stucco],
             W_C(RESITYP)[T.TH], W_I(YEARBLT ** 2), W_STRUGRAD, W_YEARBLT,
             W_median_household_income, W_np.log(SQFTSTRC),
             W_p_edu_college_greater,
             W_p_housing_units_multiunit_structures,
             W_p_nonhisp_white_persons, W_p_owner_occupied_units
================================ END OF REPORT =====================================
Code
model_sar.name_x.append('rho')
Code
pd.DataFrame({'variable':model_sar.name_x, 
              'beta' : model_sar.betas.flatten(),
              'p_val':pd.DataFrame(model_sar.z_stat)[1].round(4)})
variable beta p_val
0 CONSTANT 66.074110 0.0000
1 C(RESITYP)[T.TH] -0.181238 0.0000
2 C(DESCCNST)[T.CNST Frame] -0.011721 0.3183
3 C(DESCCNST)[T.CNST Shingle Asbestos] 0.023626 0.3619
4 C(DESCCNST)[T.CNST Shingle Wood] -0.064391 0.0451
5 C(DESCCNST)[T.CNST Siding] 0.023602 0.0586
6 C(DESCCNST)[T.CNST Stucco] -0.053864 0.0278
7 STRUGRAD 0.150735 0.0000
8 np.log(SQFTSTRC) 0.466268 0.0000
9 YEARBLT -0.066965 0.0000
10 I(YEARBLT ** 2) 0.000018 0.0000
11 p_nonhisp_white_persons 0.000915 0.0001
12 p_edu_college_greater 0.003251 0.0000
13 median_household_income 0.077101 0.0000
14 p_housing_units_multiunit_structures 0.000745 0.0020
15 p_owner_occupied_units 0.004204 0.0000
16 rho 0.355761 0.0000
Code
ax = pd.DataFrame({'predy': zscore(model_sar.predy.flatten()), 'u': zscore(model_sar.e_pred.flatten())}).plot(x='predy', y='u', kind='scatter')
ax.hlines(0, -5, 5, 'red')
ax.vlines(0,-5, 5, 'red')

Code
import seaborn as sns
Code
ax = sns.kdeplot(y=zscore(model_sar.predy.flatten()), x=zscore(model_sar.e_pred.flatten()))
ax.hlines(0, -5, 5, 'red')
ax.vlines(0,-5, 5, 'red')

Code
pd.DataFrame(zscore(model_sar.e_pred)).plot(kind='density')

Code
moran_resid_sar = esda.Moran(model_sar.u, W)
moran_resid_sar_local= esda.Moran_Local(model_sar.u, W)
Code
esdplt.plot_moran(moran_resid_sar)
(<Figure size 720x288 with 2 Axes>,
 array([<AxesSubplot:title={'center':'Reference Distribution'}, xlabel='Moran I: 0.06', ylabel='Density'>,
        <AxesSubplot:title={'center':'Moran Scatterplot (0.06)'}, xlabel='Attribute', ylabel='Spatial Lag'>],
       dtype=object))

The scatterplot does not look alarming on its own, but compared to the reference distribution, it’s clear there is still significant autocorrelation in the residual. To keep the spillover term and also account for residual autocorrelation, we need to use a ‘spatial combo’ model.
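One way to interpret \(\rho\) substantively is through the spatial multiplier: with a row-standardized \(W\), the average total (direct plus indirect) effect of a covariate is approximately \(\beta/(1-\rho)\) (LeSage & Pace’s average total impact). A back-of-the-envelope calculation using values copied from the SAR summary above:

```python
# Back-of-the-envelope total-effect calculation for the SAR model,
# using coefficient values copied from the summary above.
rho = 0.3558        # W_log_price (the autoregressive parameter)
beta_sqft = 0.4663  # np.log(SQFTSTRC) coefficient

# average total effect ~ beta / (1 - rho) for a row-standardized W
total_effect = beta_sqft / (1 - rho)
print(round(total_effect, 4))
```

The total elasticity of price with respect to square footage is thus noticeably larger than the direct coefficient alone, once feedback through neighboring prices is counted.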

35.3.2.3 Spatial Combo

The spatial autoregressive combination (SAC) or “spatial combo” model includes both a lagged dependent variable to capture endogenous spillover effects and a spatial error term that can capture any residual autocorrelation from omitted variables.

\[y = \rho W y + \beta X + u\] \[ u = \lambda Wu+ \epsilon \]
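The reduced form of this model, \(y = (I-\rho W)^{-1}\left[X\beta + (I-\lambda W)^{-1}\varepsilon\right]\), shows how the two spatial processes enter. A hypothetical simulation of one draw from that data-generating process with numpy (toy ring-shaped \(W\); all parameter values are made up for illustration):

```python
import numpy as np

# Simulate y = rho*W*y + X*beta + u, with u = lambda*W*u + eps,
# via the reduced form. All values here are illustrative.
rng = np.random.default_rng(42)
n = 50
Wm = np.zeros((n, n))
for i in range(n):                  # simple "ring" contiguity, row-standardized
    Wm[i, (i - 1) % n] = 0.5
    Wm[i, (i + 1) % n] = 0.5

rho, lam, beta = 0.4, 0.6, 2.0
X = rng.normal(size=(n, 1))
eps = rng.normal(size=(n, 1))
I = np.eye(n)
u = np.linalg.solve(I - lam * Wm, eps)           # spatially correlated errors
y = np.linalg.solve(I - rho * Wm, X * beta + u)  # outcome with spillovers
print(y.shape)
```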

Code
model_combo = GM_Combo(y=y, x=X, w=W, name_x=name_x, name_y=name_y)
Code
print(model_combo.summary)
REGRESSION
----------
SUMMARY OF OUTPUT: SPATIALLY WEIGHTED TWO STAGE LEAST SQUARES
-------------------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :   log_price                Number of Observations:        5284
Mean dependent var  :     12.1935                Number of Variables   :          17
S.D. dependent var  :      0.5906                Degrees of Freedom    :        5267
Pseudo R-squared    :      0.7415
Spatial Pseudo R-squared:  0.7112

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT      62.5735521      10.5075415       5.9551087       0.0000000
    C(RESITYP)[T.TH]      -0.2418616       0.0153624     -15.7437733       0.0000000
C(DESCCNST)[T.CNST Frame]      -0.0053793       0.0116066      -0.4634686       0.6430285
C(DESCCNST)[T.CNST Shingle Asbestos]      -0.0000074       0.0255539      -0.0002885       0.9997698
C(DESCCNST)[T.CNST Shingle Wood]      -0.0129074       0.0323857      -0.3985527       0.6902228
C(DESCCNST)[T.CNST Siding]       0.0140847       0.0126342       1.1148071       0.2649331
C(DESCCNST)[T.CNST Stucco]      -0.0314003       0.0239792      -1.3094788       0.1903722
            STRUGRAD       0.1587686       0.0056232      28.2345850       0.0000000
    np.log(SQFTSTRC)       0.4976700       0.0150801      33.0018461       0.0000000
             YEARBLT      -0.0631704       0.0107961      -5.8512387       0.0000000
     I(YEARBLT ** 2)       0.0000167       0.0000028       6.0155036       0.0000000
p_nonhisp_white_persons       0.0017331       0.0003266       5.3061189       0.0000001
p_edu_college_greater       0.0027159       0.0004409       6.1592697       0.0000000
median_household_income       0.0515914       0.0179272       2.8778328       0.0040042
p_housing_units_multiunit_structures       0.0005846       0.0002621       2.2302934       0.0257280
p_owner_occupied_units       0.0024495       0.0005245       4.6696885       0.0000030
         W_log_price       0.3420096       0.0304468      11.2330309       0.0000000
              lambda       0.6102413    
------------------------------------------------------------------------------------
Instrumented: W_log_price
Instruments: W_C(DESCCNST)[T.CNST Frame], W_C(DESCCNST)[T.CNST Shingle
             Asbestos], W_C(DESCCNST)[T.CNST Shingle Wood],
             W_C(DESCCNST)[T.CNST Siding], W_C(DESCCNST)[T.CNST Stucco],
             W_C(RESITYP)[T.TH], W_I(YEARBLT ** 2), W_STRUGRAD, W_YEARBLT,
             W_median_household_income, W_np.log(SQFTSTRC),
             W_p_edu_college_greater,
             W_p_housing_units_multiunit_structures,
             W_p_nonhisp_white_persons, W_p_owner_occupied_units
================================ END OF REPORT =====================================
Code
model_combo.name_x.append('rho')
model_combo.name_x.append('lambda')
Code
pd.DataFrame({'variable':model_combo.name_x, 
              'beta' : model_combo.betas.flatten(),
              # lambda has no z-stat in GM_Combo, so pad the p-values with NaN
              # (Series.append was removed in pandas 2.0; use pd.concat instead)
              'p_val': pd.concat([pd.DataFrame(model_combo.z_stat)[1].round(4),
                                  pd.Series(np.nan)], ignore_index=True)})
variable beta p_val
0 CONSTANT 62.573552 0.0000
1 C(RESITYP)[T.TH] -0.241862 0.0000
2 C(DESCCNST)[T.CNST Frame] -0.005379 0.6430
3 C(DESCCNST)[T.CNST Shingle Asbestos] -0.000007 0.9998
4 C(DESCCNST)[T.CNST Shingle Wood] -0.012907 0.6902
5 C(DESCCNST)[T.CNST Siding] 0.014085 0.2649
6 C(DESCCNST)[T.CNST Stucco] -0.031400 0.1904
7 STRUGRAD 0.158769 0.0000
8 np.log(SQFTSTRC) 0.497670 0.0000
9 YEARBLT -0.063170 0.0000
10 I(YEARBLT ** 2) 0.000017 0.0000
11 p_nonhisp_white_persons 0.001733 0.0000
12 p_edu_college_greater 0.002716 0.0000
13 median_household_income 0.051591 0.0040
14 p_housing_units_multiunit_structures 0.000585 0.0257
15 p_owner_occupied_units 0.002449 0.0000
16 rho 0.342010 0.0000
17 lambda 0.610241 NaN
Code
pd.DataFrame(zscore(model_combo.u)).plot(kind='density')

Code
model_combo.e_filtered.flatten().mean().round(6)
-0.0
Code
ax = pd.DataFrame({'predy': zscore(model_combo.predy.flatten()), 'eps': zscore(model_combo.e_filtered.flatten())}).plot(x='predy', y='eps', kind='scatter')
ax.hlines(0, -5, 5, 'red')
ax.vlines(0,-5, 5, 'red')

Code
ax = sns.kdeplot(y=zscore(model_combo.predy.flatten()), x=zscore(model_combo.e_pred.flatten()))
ax.hlines(0, -5, 5, 'red')
ax.vlines(0,-5, 5, 'red')

The residuals still aren’t perfect, so it may be worth fitting a model that accounts for heteroskedastic errors again.

Code
moran_resid_combo = esda.Moran(model_combo.e_filtered, W)
Code
esdplt.plot_moran(moran_resid_combo)
(<Figure size 720x288 with 2 Axes>,
 array([<AxesSubplot:title={'center':'Reference Distribution'}, xlabel='Moran I: 0.0', ylabel='Density'>,
        <AxesSubplot:title={'center':'Moran Scatterplot (0.0)'}, xlabel='Attribute', ylabel='Spatial Lag'>],
       dtype=object))

The spatial distribution of the residuals looks great, though.

Code
model_combo_het = GM_Combo_Het(y=y, x=X, w=W, name_x=name_x, name_y=name_y)
Code
print(model_combo_het.summary)
REGRESSION
----------
SUMMARY OF OUTPUT: SPATIALLY WEIGHTED TWO STAGE LEAST SQUARES (HET)
-------------------------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :   log_price                Number of Observations:        5284
Mean dependent var  :     12.1935                Number of Variables   :          17
S.D. dependent var  :      0.5906                Degrees of Freedom    :        5267
Pseudo R-squared    :      0.7417
Spatial Pseudo R-squared:  0.7117
N. of iterations    :           1                Step1c computed       :          No

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT      62.7446451       8.4386316       7.4354051       0.0000000
    C(RESITYP)[T.TH]      -0.2398590       0.0159680     -15.0212085       0.0000000
C(DESCCNST)[T.CNST Frame]      -0.0055287       0.0125244      -0.4414344       0.6588986
C(DESCCNST)[T.CNST Shingle Asbestos]       0.0003794       0.0215204       0.0176305       0.9859336
C(DESCCNST)[T.CNST Shingle Wood]      -0.0140548       0.0273241      -0.5143753       0.6069896
C(DESCCNST)[T.CNST Siding]       0.0142242       0.0119241       1.1928940       0.2329109
C(DESCCNST)[T.CNST Stucco]      -0.0320732       0.0218914      -1.4651068       0.1428918
            STRUGRAD       0.1586508       0.0065799      24.1113520       0.0000000
    np.log(SQFTSTRC)       0.4969048       0.0165090      30.0989769       0.0000000
             YEARBLT      -0.0633501       0.0086671      -7.3092552       0.0000000
     I(YEARBLT ** 2)       0.0000168       0.0000022       7.5148710       0.0000000
p_nonhisp_white_persons       0.0017124       0.0003604       4.7519135       0.0000020
p_edu_college_greater       0.0027348       0.0004916       5.5627104       0.0000000
median_household_income       0.0521124       0.0201610       2.5848105       0.0097433
p_housing_units_multiunit_structures       0.0005871       0.0002659       2.2077930       0.0272587
p_owner_occupied_units       0.0024931       0.0005762       4.3268564       0.0000151
         W_log_price       0.3423613       0.0392504       8.7224912       0.0000000
              lambda       0.7480062       0.0545082      13.7228262       0.0000000
------------------------------------------------------------------------------------
Instrumented: W_log_price
Instruments: W_C(DESCCNST)[T.CNST Frame], W_C(DESCCNST)[T.CNST Shingle
             Asbestos], W_C(DESCCNST)[T.CNST Shingle Wood],
             W_C(DESCCNST)[T.CNST Siding], W_C(DESCCNST)[T.CNST Stucco],
             W_C(RESITYP)[T.TH], W_I(YEARBLT ** 2), W_STRUGRAD, W_YEARBLT,
             W_median_household_income, W_np.log(SQFTSTRC),
             W_p_edu_college_greater,
             W_p_housing_units_multiunit_structures,
             W_p_nonhisp_white_persons, W_p_owner_occupied_units
================================ END OF REPORT =====================================
Code
model_combo_het.name_x.append('rho')
model_combo_het.name_x.append('lambda')
Code
pd.DataFrame({'variable':model_combo_het.name_x, 
              'beta' : model_combo_het.betas.flatten(),
              'p_val':pd.DataFrame(model_combo_het.z_stat)[1].round(4)})
variable beta p_val
0 CONSTANT 62.744645 0.0000
1 C(RESITYP)[T.TH] -0.239859 0.0000
2 C(DESCCNST)[T.CNST Frame] -0.005529 0.6589
3 C(DESCCNST)[T.CNST Shingle Asbestos] 0.000379 0.9859
4 C(DESCCNST)[T.CNST Shingle Wood] -0.014055 0.6070
5 C(DESCCNST)[T.CNST Siding] 0.014224 0.2329
6 C(DESCCNST)[T.CNST Stucco] -0.032073 0.1429
7 STRUGRAD 0.158651 0.0000
8 np.log(SQFTSTRC) 0.496905 0.0000
9 YEARBLT -0.063350 0.0000
10 I(YEARBLT ** 2) 0.000017 0.0000
11 p_nonhisp_white_persons 0.001712 0.0000
12 p_edu_college_greater 0.002735 0.0000
13 median_household_income 0.052112 0.0097
14 p_housing_units_multiunit_structures 0.000587 0.0273
15 p_owner_occupied_units 0.002493 0.0000
16 rho 0.342361 0.0000
17 lambda 0.748006 0.0000
Code
pd.DataFrame(zscore(model_combo_het.u)).plot(kind='density')

Code
ax = pd.DataFrame({'predy': zscore(model_combo_het.predy.flatten()), 'u': zscore(model_combo_het.e_filtered.flatten())}).plot(x='predy', y='u', kind='scatter')
ax.hlines(0, -5, 5, 'red')
ax.vlines(0,-5, 5, 'red')

Code
ax = sns.kdeplot(y=zscore(model_combo_het.predy.flatten()), x=zscore(model_combo_het.e_filtered.flatten()))
ax.hlines(0, -5, 5, 'red')
ax.vlines(0,-5, 5, 'red')

The residuals still aren’t perfectly random, and we can’t get rid of that skew, but these are real data, and they’re imperfect. It is clear that we get closer to normal residuals once we account for autocorrelation.

Code
moran_resid_combo_het = esda.Moran(model_combo_het.e_filtered, W)
Code
esdplt.plot_moran(moran_resid_combo_het)
(<Figure size 720x288 with 2 Axes>,
 array([<AxesSubplot:title={'center':'Reference Distribution'}, xlabel='Moran I: -0.01', ylabel='Density'>,
        <AxesSubplot:title={'center':'Moran Scatterplot (-0.01)'}, xlabel='Attribute', ylabel='Spatial Lag'>],
       dtype=object))

Code
moran_resid_combo_het.p_sim
0.001

Interesting: if we fit a model with heteroskedastic errors, we can reintroduce some significant spatial dependence in the residuals, even if it’s small. In this case, the difference between the models with and without the heteroskedasticity correction is negligible; their results are basically identical.

Combining the coefficients from the different models into a single table lets us compare how the estimates change under each specification.

Code
# concat aligns on index, filling the shorter (sar/sem) columns with NaN
coefs = pd.concat([
    pd.Series(model_combo.betas.flatten(), name="combo"),
    pd.Series(model_combo_het.betas.flatten(), name="combo_het"),
    pd.Series(model_sar.betas.flatten(), name="sar"),
    pd.Series(model_sem.betas.flatten(), name="sem")],
    axis=1).round(4)
Code
coefs
combo combo_het sar sem
0 62.5736 62.7446 66.0741 61.1798
1 -0.2419 -0.2399 -0.1812 -0.2291
2 -0.0054 -0.0055 -0.0117 0.0001
3 -0.0000 0.0004 0.0236 0.0134
4 -0.0129 -0.0141 -0.0644 -0.0225
5 0.0141 0.0142 0.0236 0.0189
6 -0.0314 -0.0321 -0.0539 -0.0254
7 0.1588 0.1587 0.1507 0.1627
8 0.4977 0.4969 0.4663 0.4908
9 -0.0632 -0.0634 -0.0670 -0.0579
10 0.0000 0.0000 0.0000 0.0000
11 0.0017 0.0017 0.0009 0.0022
12 0.0027 0.0027 0.0033 0.0043
13 0.0516 0.0521 0.0771 0.1137
14 0.0006 0.0006 0.0007 0.0015
15 0.0024 0.0025 0.0042 0.0034
16 0.3420 0.3424 0.3558 0.5590
17 0.6102 0.7480 NaN NaN
Code
coeffs = pd.DataFrame(model_ols.params).reset_index().join(coefs, how='right')
Code
coeffs.at[16, 'index'] = 'rho'
coeffs.at[17, 'index'] = 'lambda'
Code
# move the SEM lambda estimate to the lambda row (avoid chained assignment)
coeffs.at[17, 'sem'] = coeffs.at[16, 'sem']
coeffs.at[16, 'sem'] = np.nan
Code
coeffs = coeffs.rename(columns={0: 'ols'})
Code
coeffs.set_index('index')[['ols', 'sem', 'sar', 'combo', 'combo_het']].round(3)
ols sem sar combo combo_het
index
Intercept 76.864 61.180 66.074 62.574 62.745
C(RESITYP)[T.TH] -0.205 -0.229 -0.181 -0.242 -0.240
C(DESCCNST)[T.CNST Frame] -0.012 0.000 -0.012 -0.005 -0.006
C(DESCCNST)[T.CNST Shingle Asbestos] 0.029 0.013 0.024 -0.000 0.000
C(DESCCNST)[T.CNST Shingle Wood] -0.042 -0.022 -0.064 -0.013 -0.014
C(DESCCNST)[T.CNST Siding] 0.037 0.019 0.024 0.014 0.014
C(DESCCNST)[T.CNST Stucco] -0.045 -0.025 -0.054 -0.031 -0.032
STRUGRAD 0.172 0.163 0.151 0.159 0.159
np.log(SQFTSTRC) 0.505 0.491 0.466 0.498 0.497
YEARBLT -0.075 -0.058 -0.067 -0.063 -0.063
I(YEARBLT ** 2) 0.000 0.000 0.000 0.000 0.000
p_nonhisp_white_persons 0.001 0.002 0.001 0.002 0.002
p_edu_college_greater 0.005 0.004 0.003 0.003 0.003
median_household_income 0.150 0.114 0.077 0.052 0.052
p_housing_units_multiunit_structures 0.002 0.002 0.001 0.001 0.001
p_owner_occupied_units 0.005 0.003 0.004 0.002 0.002
rho NaN NaN 0.356 0.342 0.342
lambda NaN 0.559 NaN 0.610 0.748

Comparing the estimated coefficients shows that median income was doing a lot of heavy lifting in the OLS model. That is, when we ignore spatial effects, the median income variable absorbs a lot of the spatial variance, inflating its coefficient and biasing our interpretation of how much residents value higher-income neighbors. In the spatial models, the estimated coefficient for income is less than half as large. The same is basically true of all the other neighborhood characteristics.

There are a few things going on here. First, it’s important to remember that our data are measured at two different spatial levels. Our housing sales are measured at the unit level, but the ‘neighborhood’ variables are measured at the census blockgroup level. Because we have several sales nested inside each census unit, we’re building spatial autocorrelation into the model by design. That is, because we don’t know the income, education, race, etc., of the home’s previous owner, the neighborhood variables are essentially \(Wx\) instead of just \(x\), where \(W\) is defined by discrete membership inside a census unit. That implies we’re actually fitting something closer to a spatial Durbin model (SDM), since our preferred models have a spatially lagged \(y\) and some of the \(x\) variables are “lagged” themselves. Strictly speaking, though, this is not a Durbin specification: a Durbin model includes both \(x\) and \(Wx\) for (at least some of) its covariates, whereas here some variables enter only as \(Wx\) and the rest only as \(x\).
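To see the distinction concretely, a full Durbin design matrix stacks each covariate alongside its spatial lag, whereas here some columns appear only as \(x\) and others effectively only as \(Wx\). A small numpy sketch of the full Durbin construction (toy dense \(W\), illustrative only):

```python
import numpy as np

# Durbin-style design matrix: stack X next to its spatial lag WX.
rng = np.random.default_rng(1)
n, k = 6, 2
X = rng.normal(size=(n, k))
Wm = np.ones((n, n)) - np.eye(n)      # toy weights: all other units
Wm /= Wm.sum(axis=1, keepdims=True)   # row-standardize
WX = Wm @ X                           # spatially lagged covariates
X_durbin = np.hstack([X, WX])         # [X, WX] for an SDM design
print(X_durbin.shape)
```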

Accounting for space, two things stand out. In the SAR and spatial combo models, the autoregressive term \(\rho\) is positive, significant, and fairly large in magnitude, providing strong evidence of a spillover effect in housing prices. But the combo models also have a large and significant spatial error parameter \(\lambda\), so there is also a substantial unmeasured spatial effect. There are several important variables we might be missing, like perceptions of crime, school quality, and access to amenities. Indeed, according to neoclassical urban theory, we have not included the determining factor in land rent prices: access to the CBD (or jobs) (e.g., Alonso, Muth, Mills).

35.4 Estimation Techniques

Classically, spatial econometric models are estimated using maximum likelihood (ML), but modern approaches prefer the generalized method of moments (GMM). GMM estimation is much faster and applicable to much larger datasets, but it depends on instrumental variables. You can provide these instruments yourself, or use spatial lags of the exogenous \(X\) variables, which is the standard technique. This matters when estimating a spatial Durbin model with GMM: since the \(WX\) terms already appear in the model as regressors, you will need either to use higher-order spatial lags as instruments or to provide your own.
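The standard instrument set for the endogenous \(Wy\) is built from spatial lags of the exogenous regressors, \([WX, W^2X, \dots]\); in spreg the number of orders is controlled by the `w_lags` argument to estimators like `GM_Lag`. A sketch of building such an instrument matrix by hand with numpy (toy dense \(W\); in a Durbin model the first-order lag \(WX\) is itself a regressor, leaving only the higher orders as valid instruments):

```python
import numpy as np

# Build the instrument matrix Q = [WX, W^2 X] by hand for a toy
# row-standardized W (illustrative stand-in for the chapter's W).
rng = np.random.default_rng(7)
n, k = 8, 2
X = rng.normal(size=(n, k))
Wm = np.ones((n, n)) - np.eye(n)
Wm /= Wm.sum(axis=1, keepdims=True)
WX = Wm @ X          # first-order lags
W2X = Wm @ WX        # second-order lags
Q = np.hstack([WX, W2X])
print(Q.shape)
```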

:::