31  Hedonic Models in Space

Code
import geopandas as gpd
import pandas as pd
import numpy as np
import esda
import libpysal as lps
import matplotlib.pyplot as plt
from geosnap import DataStore
from geosnap.io import get_acs
from splot import esda as esdplt
import seaborn as sns

%load_ext watermark
%watermark -a 'eli knaap' -v -d -u -p spreg,geopandas,geosnap,statsmodels
Author: eli knaap

Last updated: 2025-04-03

Python implementation: CPython
Python version       : 3.12.8
IPython version      : 8.31.0

spreg      : 1.8.1
geopandas  : 1.0.1
geosnap    : 0.14.1.dev14+g0443e2a.d20250103
statsmodels: 0.14.4

31.1 Understanding and Interpreting Spatial Models

In the last chapter, several commonly-used model specifications were defined, but the choice among them depends on the research question. Naturally, if spillover effects are of substantive interest, then it is necessary to choose a model that includes (at least) \(\rho WY\) or \(\theta WX\), but beyond that, the choice is more complex.

The literature offers essentially two strategies. (anselin_modern_2014?) argue that a “bottom-up” strategy is most effective: begin with simple models (e.g., OLS), then conduct spatial diagnostic tests to decide whether more complex specifications are necessary. LeSage (2014) and LeSage & Pace (2014) argue for a “top-down” strategy, where we begin with a ‘fully specified’ conceptual model. LeSage (2014, p. 19) goes so far as to say that “there are only two model specifications worth considering for applied work.” When there are theoretical reasons to consider global spillover processes (such as home prices in a hedonic model), the SDM is appropriate; in cases that consider only local spillover effects, the SDEM specification will provide unbiased estimates of all parameters.

Their logic (to always prefer the Durbin specifications) is simple: the inclusion of the \(\theta WX\) term ensures that if there is an exogenous spillover process, then it will be measured accurately with only a small efficiency penalty, but omitting the term would lead to bias.
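Concretely, the two specifications at issue, written in the notation used elsewhere in these chapters, are

\[SDM: \quad y = \rho Wy + \beta X + \theta WX + \epsilon\]

\[SDEM: \quad y = \beta X + \theta WX + u, \qquad u = \lambda Wu + \epsilon\]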

The distinction between the SDM versus the SDEM, however, is largely a theoretical choice for a given research question rather than an empirical matter. As LeSage & Pace (2014, p. 1551) describe:


“Global spillovers arise when changes in a characteristic of one region impact outcomes in more than just immediately neighboring regions. Global impacts fall on neighbors to the neighbors, neighbors to the neighbors to the neighbors, and so on…

Local spillovers represent a situation where the impacts fall only on nearby or immediate neighbors, dying out before they impact regions that are neighbors to the neighbors.

Another distinction between local and global spillovers is that feedback effects arise in the case of global spillovers, but not in the case of local spillovers.”


It follows “that practitioners cannot simply estimate an SDM and SDEM model and draw meaningful conclusions about the correct model specification from point estimates of \(\rho\) or \(\lambda\)… the SDM specification implies global spatial spillovers, while the SDEM is consistent with local spillovers. These have very different policy implications in applied work” (LeSage, 2014, p. 22): indirect (and total) effects will be underestimated when the SDM is the appropriate model but the SDEM is fit instead, because the SDEM does not include feedback effects.

The SAR and SDM models, which contain the spatially-lagged dependent variable \(\rho Wy\), specify global spillovers; the SLX, SDM, and SDEM models, which contain the \(\theta WX\) term, specify local spillovers (the SDM specifies both). The difference between global and local spillovers pertains to the ways that effects propagate through the system and the ways that model coefficients may be interpreted.
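The global/local distinction is easiest to see in matrix form. In a minimal numpy sketch (a toy chain graph with made-up \(\rho\), \(\beta\), and \(\theta\)), the SAR/SDM multiplier \((I - \rho W)^{-1}\) expands into \(I + \rho W + \rho^2 W^2 + \dots\), carrying a shock to neighbors of neighbors, while an SLX-style term \(\beta I + \theta W\) stops at first-order neighbors:

```python
import numpy as np

# toy 4-node chain graph (0-1-2-3), row-standardized
W = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
W /= W.sum(axis=1, keepdims=True)

rho, beta, theta = 0.5, 1.0, 0.3
I = np.eye(4)

# global multiplier: (I - rho W)^-1 = I + rho W + rho^2 W^2 + ...
global_mult = np.linalg.inv(I - rho * W) * beta

# local multiplier: effects stop at first-order neighbors
local_mult = I * beta + theta * W

# a shock at node 0 reaches node 3 under the global process,
# but nodes 0 and 3 are not neighbors, so the local process carries nothing
assert global_mult[3, 0] > 0 and local_mult[3, 0] == 0
```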

31.1.1 Model selection

Different specifications imply different data-generating processes (DGPs) and reflect what you want to study.

SAR assumes an explicit spillover process, whereas SEM assumes spatially correlated omitted variables. You should have theory guiding which specification you use, even though most applied work takes a fit-based model-selection approach.

A sobering critique of applied spatial econometric modeling is given by Gibbons & Overman (2012), who argue compellingly that it

does not provide a toolbox that gives satisfactory solutions to these problems for applied researchers interested in causality and the economic processes at work. It is in this sense that we call into question the approach of the burgeoning body of applied economic research that proceeds with mechanical application of spatial econometric techniques to causal questions of “spillovers” and other forms of spatial interaction, or that estimates spatial lag models as a quick fix for endogeneity issues, or that blindly applies spatial econometric models by default without serious consideration of whether this is necessary given the research question in hand.

To overcome these issues, they argue that applied work should adopt an experimentalist design, where some form of exogenous variation is induced to provide insight into a causal mechanism at play. Models that incorporate both spatial and temporal dimensions (explored in the next section) provide an important avenue forward to address this issue (Bardaka et al., 2018; delgado2015DifferenceInDifferencesTechniques?).

Alternatively, structural models should proceed by considering the SLX model as the point of departure, or use strong theory to guide the choice of a particular specification as described above (Halleck Vega & Elhorst, 2015). Although there remains debate in urban economics, we believe there is reasonable justification for assuming a potential autoregressive process in the specific context of hedonic housing price models.
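To make the SLX point of departure concrete, here is a hedged sketch on synthetic data (the ring-lattice weights, coefficients, and noise level are all made up): construct the spatial lag of the covariates, \(WX\), and include it in an otherwise ordinary least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# toy row-standardized weights: each unit averages its two ring neighbors
W = np.roll(np.eye(n), 1, axis=1) + np.roll(np.eye(n), -1, axis=1)
W /= 2

X = rng.normal(size=(n, 1))
beta, theta = 2.0, 0.5
y = beta * X[:, 0] + theta * (W @ X)[:, 0] + rng.normal(scale=0.1, size=n)

# SLX design matrix: [intercept, X, WX]
Z = np.column_stack([np.ones(n), X, W @ X])
coefs, *_ = np.linalg.lstsq(Z, y, rcond=None)
# coefs[1] recovers beta; coefs[2] recovers theta, the local spillover
```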

Direct and indirect effects of spillovers
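In models containing \(\rho Wy\), point estimates are not marginal effects. The standard summary from LeSage & Pace works from the effects matrix \(S_k(W) = (I - \rho W)^{-1}(\beta_k I + \theta_k W)\): its average diagonal is the direct effect, its average row sum the total effect, and their difference the indirect (spillover) effect. A minimal numpy sketch with made-up values:

```python
import numpy as np

def lesage_pace_effects(W, rho, beta_k, theta_k=0.0):
    """Average direct/indirect/total effects for one covariate
    in a SAR (theta_k=0) or SDM specification."""
    n = W.shape[0]
    I = np.eye(n)
    S = np.linalg.inv(I - rho * W) @ (I * beta_k + W * theta_k)
    direct = np.trace(S) / n    # average own-unit effect
    total = S.sum() / n         # average effect of a unit change in every unit
    indirect = total - direct   # average spillover onto other units
    return direct, indirect, total

# toy row-standardized W on a 4-node cycle (made-up rho and beta)
W = np.roll(np.eye(4), 1, axis=1) + np.roll(np.eye(4), -1, axis=1)
W /= W.sum(axis=1, keepdims=True)

direct, indirect, total = lesage_pace_effects(W, rho=0.4, beta_k=1.0)
# with rho > 0, total = beta / (1 - rho) > beta because of feedback effects
print(round(total, 3))  # → 1.667
```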

31.1.2 Graph (\(W\)) Selection

There are conflicting opinions on the importance of the neighbor graph.

Code
datasets = DataStore()
Code
balt_acs = get_acs(datasets, county_fips='24510', years=2019)

The state of Maryland makes available a great set of parcel-level data called PropertyView. They also provide sales listings broken down by quarter or aggregated by year.

Code
# read in the points (not the polys)

gdf = gpd.read_file('data/BACIparcels0622/BACIMDPV.shp')
Code
sales = gpd.read_file('data/Statewide-Res-Sales-CY2018/mdst_sale18.shp')
Code
gdf = gdf.set_index('ACCTID')
Code
sales = sales.set_index('acctid')
Code
balt_sales = gdf.drop(columns=['geometry']).join(sales, how='inner')
Code
balt_sales = gpd.GeoDataFrame(balt_sales)
Code
# this probably provides some interesting info 
balt_sales.DESCCNST.value_counts()
DESCCNST
CNST Brick               3559
CNST Siding              1108
CNST Frame               1053
CNST Stucco               182
CNST Shingle Asbestos     177
CNST Shingle Wood         117
CNST Block                 47
CNST 1/2 Brick Frame       44
CNST 1/2 Brick Siding      35
CNST Stone                 35
CNST 1/2 Stone Frame       10
CNST 1/2 Stone Siding       2
Name: count, dtype: int64
Code
balt_sales = balt_sales[~balt_sales.DESCCNST.isin(['CNST 1/2 Stone Frame', 'CNST 1/2 Stone Siding',
                                                  'CNST Block', 'CNST 1/2 Brick Frame', 'CNST Stone',
                                                  'CNST 1/2 Brick Siding'])]
Code
balt_sales = balt_sales[balt_sales.RESIDENT==1]
Code
balt_sales.STRUBLDG.value_counts()
STRUBLDG
0003    3071
0013    1400
0001    1263
0002    1192
0009     179
0008     137
0007      70
0005      13
0010       5
0004       4
Name: count, dtype: int64
Code
balt_sales.DESCBLDG.value_counts()
DESCBLDG
DWEL Center Unit        3071
DWEL Rental Dwelling    1400
DWEL Standard Unit      1263
DWEL End Unit           1192
DWEL Condo Hi Rise       179
DWEL Condo Garden        137
DWEL Condo Townhouse      70
DWEL Split Level          13
DWEL Penthouse             5
DWEL Split Foyer           4
Name: count, dtype: int64
Code
balt_sales = balt_sales[~balt_sales.STRUBLDG.isin(['0005', '0004'])]
Code
balt_sales["date"] = balt_sales.tradate.astype('datetime64[ns]')
Code
balt_sales['date']
0301011738 014   2018-01-12
0301011738 017   2018-09-06
0301011738 022   2018-01-08
0301011738 024   2018-06-13
0301011738 026   2018-01-16
                    ...    
0328058102I060   2018-04-18
0328058102J016   2018-11-21
0328058102L031   2018-11-21
0328058102M013   2018-10-30
0328058127A010   2018-08-22
Name: date, Length: 7335, dtype: datetime64[ns]
Code
balt_sales['month']=balt_sales.tradate.str[4:6]
Code
balt_sales.DESCBLDG.value_counts()
DESCBLDG
DWEL Center Unit        3071
DWEL Rental Dwelling    1400
DWEL Standard Unit      1263
DWEL End Unit           1192
DWEL Condo Hi Rise       179
DWEL Condo Garden        137
DWEL Condo Townhouse      70
DWEL Penthouse             5
Name: count, dtype: int64
Code
balt_sales.RESITYP.value_counts()
RESITYP
TH    4330
AP    1403
SF    1266
CN     336
Name: count, dtype: int64
Code
balt_sales.RESITYP.value_counts().plot(kind='bar')

Code
balt_sales = balt_sales[balt_sales.RESITYP != 'AP']
Code
balt_sales.SQFTSTRC.describe()
count    5932.000000
mean     1480.061362
std       652.125825
min         0.000000
25%      1112.000000
50%      1302.000000
75%      1652.000000
max      7816.000000
Name: SQFTSTRC, dtype: float64

Mostly townhomes that sold in 2018. Not surprising, given Baltimore’s housing stock.

Code
# landarea is the size of the parcel, sqftstrc is the size of the structure
balt_sales[['SQFTSTRC', 'LANDAREA']].corr()
SQFTSTRC LANDAREA
SQFTSTRC 1.000000 0.339829
LANDAREA 0.339829 1.000000
Code
balt_sales = balt_sales.dropna(subset=['STRUGRAD', 'YEARBLT'])
Code
balt_sales = balt_sales[balt_sales.DESCLU =='Residential']
Code
balt_sales[balt_sales.considr1> 1000000].considr1.plot(kind='density')

Code
# doesn't count if it was free
balt_sales = balt_sales[balt_sales.considr1 > 20000]
Code
balt_sales = balt_sales[balt_sales.considr1 <1500000]
Code
balt_sales.considr1.plot(kind='density')

Code
balt_sales.considr1.plot(kind='box')

Code
from scipy.stats import zscore
Code
balt_sales.considr1.apply(np.log).plot(kind='density')

The log transform gets close to a normal shape, with heavy tails.

Code
balt_sales.considr1.apply(np.log).plot(kind='box')

Code
balt_sales['log_price'] = balt_sales.considr1.apply(np.log)
Code
balt_sales.log_price.hist()

Code
pd.Series(zscore(balt_sales.considr1)).plot(kind='density')

Code
balt_sales[ (balt_sales.log_price <15) & (balt_sales.log_price >10)].log_price.hist()

Code
# trim outliers
balt_sales = balt_sales[ (balt_sales.log_price <15) & (balt_sales.log_price >10)]
Code
balt_sales = balt_sales.merge(balt_acs.drop(columns=['geometry']), left_on='bg2010', right_on='geoid')
Code
balt_sales['median_household_income'] = balt_sales['median_household_income'].apply(np.log)
Code
balt_sales.STRUGRAD.astype(int).plot(kind='hist')

Code
COLS = ['DESCBLDG','RESITYP','DESCCNST','SQFTSTRC', 'LANDAREA','YEARBLT','p_nonhisp_black_persons','p_hispanic_persons','p_edu_college_greater','median_household_income','p_housing_units_multiunit_structures','p_vacant_housing_units']
Code
balt_sales = balt_sales.dropna(subset=COLS)

31.2 ESDA

We generally start with exploratory spatial data analysis (ESDA, following the bottom-up strategy) to ensure we even care about spatial econometrics. We begin by mapping the dependent variable (housing price, here) and examining any global and local autocorrelation.

Code
balt_sales[['considr1', 'log_price', 'geometry']].explore('log_price', scheme='fisher_jenks', k=10, cmap='plasma',tiles='CartoDB Positron')

There are definitely spatial patterns, but they’re also not uniform. There’s probably autocorrelation in the sales prices, but it seems to be particularly strong in some pockets, so we should be sure to check both global and local statistics.

31.2.1 Global Stats

Code
# prices are probably influenced by a greater 'neighborhood' than just the adjacent units, so lets use distance band weights (3/4 km; ~10 min walk) 

W = lps.weights.DistanceBand.from_dataframe(balt_sales, threshold=750)
W.transform = 'r'
UserWarning: The weights matrix is not fully connected: 
 There are 5 disconnected components.
Code
# but we can also look at hyperlocal trends
w_knn = lps.weights.KNN.from_dataframe(balt_sales, k=10)
w_knn.transform = 'r'
UserWarning: The weights matrix is not fully connected: 
 There are 10 disconnected components.
Code
moran = esda.Moran(balt_sales['considr1'], W)
Code
moran_log = esda.Moran(balt_sales['log_price'], W)
Code
moran_knn = esda.Moran(balt_sales['log_price'], w_knn)
Code
esdplt.plot_moran(moran)
(<Figure size 960x384 with 2 Axes>,
 array([<Axes: title={'center': 'Reference Distribution'}, xlabel='Moran I: 0.48', ylabel='Density'>,
        <Axes: title={'center': 'Moran Scatterplot (0.48)'}, xlabel='Attribute', ylabel='Spatial Lag'>],
       dtype=object))

Code
esdplt.plot_moran(moran_log)
(<Figure size 960x384 with 2 Axes>,
 array([<Axes: title={'center': 'Reference Distribution'}, xlabel='Moran I: 0.52', ylabel='Density'>,
        <Axes: title={'center': 'Moran Scatterplot (0.52)'}, xlabel='Attribute', ylabel='Spatial Lag'>],
       dtype=object))

Code
esdplt.plot_moran(moran_knn)
(<Figure size 960x384 with 2 Axes>,
 array([<Axes: title={'center': 'Reference Distribution'}, xlabel='Moran I: 0.66', ylabel='Density'>,
        <Axes: title={'center': 'Moran Scatterplot (0.66)'}, xlabel='Attribute', ylabel='Spatial Lag'>],
       dtype=object))

31.2.2 Local Stats

Code
lisa = esda.Moran_Local(balt_sales.log_price, W, n_jobs=-1, )
Code
esdplt.plot_local_autocorrelation(lisa, balt_sales, 'log_price')
(<Figure size 1440x384 with 3 Axes>,
 array([<Axes: title={'center': 'Moran Local Scatterplot'}, xlabel='Attribute', ylabel='Spatial Lag'>,
        <Axes: >, <Axes: >], dtype=object))

Code
balt_sales.YEARBLT = balt_sales.YEARBLT.astype(int)

31.3 Models

Taking different modeling strategies, we can vary the way we operationalize \(p(h)\) to best identify the coefficients of interest, including the potential for spatial spillover (e.g., the price of one unit affecting the prices of those nearby).

Code
import statsmodels.formula.api as smf
Code
# Treat the structure grade variable as if it were continuous.
balt_sales.STRUGRAD = balt_sales.STRUGRAD.astype(int)
Code
formula = 'log_price ~ STRUGRAD + C(RESITYP) + C(DESCCNST) + np.log(SQFTSTRC) + YEARBLT + I(YEARBLT**2) + p_nonhisp_white_persons + p_edu_college_greater + median_household_income + p_housing_units_multiunit_structures + p_owner_occupied_units'

31.3.1 Aspatial (OLS)

\[y = \beta X + \epsilon\]

Code
model_ols = smf.ols(formula,balt_sales).fit()
Code
model_ols.summary()
OLS Regression Results
Dep. Variable: log_price R-squared: 0.719
Model: OLS Adj. R-squared: 0.719
Method: Least Squares F-statistic: 900.7
Date: Thu, 03 Apr 2025 Prob (F-statistic): 0.00
Time: 17:27:53 Log-Likelihood: -1356.8
No. Observations: 5284 AIC: 2746.
Df Residuals: 5268 BIC: 2851.
Df Model: 15
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 76.8636 10.352 7.425 0.000 56.569 97.158
C(RESITYP)[T.TH] -0.2046 0.014 -15.092 0.000 -0.231 -0.178
C(DESCCNST)[T.CNST Frame] -0.0117 0.012 -0.950 0.342 -0.036 0.012
C(DESCCNST)[T.CNST Shingle Asbestos] 0.0286 0.027 1.048 0.295 -0.025 0.082
C(DESCCNST)[T.CNST Shingle Wood] -0.0419 0.034 -1.240 0.215 -0.108 0.024
C(DESCCNST)[T.CNST Siding] 0.0374 0.013 2.851 0.004 0.012 0.063
C(DESCCNST)[T.CNST Stucco] -0.0445 0.026 -1.729 0.084 -0.095 0.006
STRUGRAD 0.1719 0.006 30.879 0.000 0.161 0.183
np.log(SQFTSTRC) 0.5050 0.015 33.509 0.000 0.475 0.535
YEARBLT -0.0747 0.011 -7.018 0.000 -0.096 -0.054
I(YEARBLT ** 2) 1.958e-05 2.74e-06 7.139 0.000 1.42e-05 2.5e-05
p_nonhisp_white_persons 0.0014 0.000 6.086 0.000 0.001 0.002
p_edu_college_greater 0.0052 0.000 13.538 0.000 0.004 0.006
median_household_income 0.1501 0.017 8.792 0.000 0.117 0.184
p_housing_units_multiunit_structures 0.0019 0.000 7.796 0.000 0.001 0.002
p_owner_occupied_units 0.0047 0.000 9.447 0.000 0.004 0.006
Omnibus: 771.083 Durbin-Watson: 1.428
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1236.192
Skew: -1.001 Prob(JB): 3.67e-269
Kurtosis: 4.267 Cond. No. 8.94e+09


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.94e+09. This might indicate that there are
strong multicollinearity or other numerical problems.
Code
model_ols.resid
0       0.366840
1       0.078007
2       0.166342
3       0.089591
4      -0.011663
          ...   
5489   -0.058454
5490   -0.028995
5491    0.336649
5492    0.249637
5493    0.133070
Length: 5284, dtype: float64
Code
ax = pd.DataFrame({'predy': zscore(model_ols.fittedvalues), 'u': zscore(model_ols.resid)}).plot(x='predy', y='u', kind='scatter')
ax.hlines(0, -5, 5, 'red')
ax.vlines(0,-5, 5, 'red')

Code
ax = sns.kdeplot(y=zscore(model_ols.fittedvalues), x=zscore(model_ols.resid))
ax.hlines(0, -5, 5, 'red')
ax.vlines(0,-5, 5, 'red')

The residuals are pretty heteroskedastic (and the Jarque-Bera test is highly significant in the statsmodels results table above), so we will need to handle that if we plan to make valid inferences.

Code
moran_resid = esda.Moran(model_ols.resid, W)
Code
moran_resid_knn = esda.Moran(model_ols.resid, w_knn)
Code
moran_resid_knn.p_sim
0.001
Code
esdplt.plot_moran(moran_resid)
(<Figure size 960x384 with 2 Axes>,
 array([<Axes: title={'center': 'Reference Distribution'}, xlabel='Moran I: 0.13', ylabel='Density'>,
        <Axes: title={'center': 'Moran Scatterplot (0.13)'}, xlabel='Attribute', ylabel='Spatial Lag'>],
       dtype=object))

There’s still significant autocorrelation in the residual, so we’re either failing to account for a spatial spillover process (implying a SAR model), omitting crucial autocorrelated explanatory variables (implying a spatial error model), or both (implying a SARAR model).

Code
esdplt.plot_moran(moran_resid_knn)
(<Figure size 960x384 with 2 Axes>,
 array([<Axes: title={'center': 'Reference Distribution'}, xlabel='Moran I: 0.23', ylabel='Density'>,
        <Axes: title={'center': 'Moran Scatterplot (0.23)'}, xlabel='Attribute', ylabel='Spatial Lag'>],
       dtype=object))

The autocorrelation is even stronger at smaller scales (Tobler’s First Law, and all).
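Since only the graph \(W\) changes between those two statistics, it helps to recall what Moran’s I actually computes. A minimal numpy version (equivalent in spirit to esda.Moran, on a toy binary W with made-up values):

```python
import numpy as np

def morans_i(y, W):
    """Moran's I: (n / S0) * (z' W z) / (z' z), with z the demeaned variable."""
    z = y - y.mean()
    s0 = W.sum()
    return (len(y) / s0) * (z @ W @ z) / (z @ z)

# two tight clusters of similar values on a toy binary W
W = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
y = np.array([10.0, 11.0, 1.0, 2.0])
print(round(morans_i(y, W), 3))  # → 0.976: strong positive autocorrelation
```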

We could fit the same model with spreg.

We can use patsy to generate the X and y inputs spreg requires, using the same formula as before.

Code
from patsy import dmatrices
Code
# spreg will add an intercept so redefine the formula without one

#formula_spreg = "log_price ~ C(DESCBLDG)  + C(RESITYP) + STRUGRAD + SQFTSTRC +LANDAREA + YEARBLT + I(YEARBLT**2) + p_nonhisp_black_persons + p_hispanic_persons + p_edu_college_greater + median_household_income + p_housing_units_multiunit_structures + p_vacant_housing_units"
Code
y, X = dmatrices(formula, balt_sales)
Code
# need to drop the intercept. Note you can't just define the formula without one,
# because then patsy won't drop a level from the first categorical variable, so you need this fix anyway

# slice off the intercept
name_x=X.design_info.column_names[1:]
name_y=y.design_info.column_names[0]

y = np.asarray(y).flatten()
X = np.asarray(X)[:,1:]
Code
from spreg import OLS
Code
model_ols_spreg = OLS(y=y, x=X, w=W, name_x=name_x, name_y=name_y)
Code
from rich.jupyter import print
Code
print(model_ols_spreg.summary)
REGRESSION RESULTS
------------------

SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :   log_price                Number of Observations:        5284
Mean dependent var  :     12.1935                Number of Variables   :          16
S.D. dependent var  :      0.5906                Degrees of Freedom    :        5268
R-squared           :      0.7195
Adjusted R-squared  :      0.7187
Sum squared residual:     517.042                F-statistic           :    900.6752
Sigma-square        :       0.098                Prob(F-statistic)     :           0
S.E. of regression  :       0.313                Log likelihood        :   -1356.831
Sigma-square ML     :       0.098                Akaike info criterion :    2745.661
S.E of regression ML:      0.3128                Schwarz criterion     :    2850.820

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     t-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT        76.86357        10.35211         7.42492         0.00000
    C(RESITYP)[T.TH]        -0.20462         0.01356       -15.09201         0.00000
C(DESCCNST)[T.CNST Frame]        -0.01174         0.01235        -0.95004         0.34213
C(DESCCNST)[T.CNST Shingle Asbestos]         0.02856         0.02725         1.04810         0.29464
C(DESCCNST)[T.CNST Shingle Wood]        -0.04188         0.03377        -1.24019         0.21496
C(DESCCNST)[T.CNST Siding]         0.03735         0.01310         2.85144         0.00437
C(DESCCNST)[T.CNST Stucco]        -0.04452         0.02575        -1.72907         0.08386
            STRUGRAD         0.17190         0.00557        30.87943         0.00000
    np.log(SQFTSTRC)         0.50499         0.01507        33.50887         0.00000
             YEARBLT        -0.07474         0.01065        -7.01754         0.00000
     I(YEARBLT ** 2)         0.00002         0.00000         7.13895         0.00000
p_nonhisp_white_persons         0.00144         0.00024         6.08560         0.00000
p_edu_college_greater         0.00525         0.00039        13.53784         0.00000
median_household_income         0.15006         0.01707         8.79180         0.00000
p_housing_units_multiunit_structures         0.00191         0.00024         7.79632         0.00000
p_owner_occupied_units         0.00470         0.00050         9.44669         0.00000
------------------------------------------------------------------------------------

REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER       18745.447

TEST ON NORMALITY OF ERRORS
TEST                             DF        VALUE           PROB
Jarque-Bera                       2       1236.193           0.0000

DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST                             DF        VALUE           PROB
Breusch-Pagan test               15        553.881           0.0000
Koenker-Bassett test             15        339.085           0.0000
================================ END OF REPORT =====================================

Most of the variables are significant, except for the construction materials, where only siding has a significant effect. Otherwise, the coefficients are as expected: housing prices increase with larger and newer homes (though historic homes also get a bump), townhomes are less expensive than single-family homes, and the neighborhood’s demographic composition has some small effects, the one exception being median income, whose effect is relatively large.
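Since the dependent variable is logged, a coefficient \(\beta\) on a level covariate implies roughly a \(100(e^{\beta}-1)\%\) change in price per unit increase, and the coefficient on np.log(SQFTSTRC) is an elasticity. A quick sanity check (not part of the original analysis), treating the point estimates printed above as given:

```python
import numpy as np

# semi-elasticity: the townhome coefficient from the table above
beta_th = -0.2046
pct_th = 100 * (np.exp(beta_th) - 1)
print(round(pct_th, 1))  # → -18.5: townhomes sell for ~18.5% less, all else equal

# elasticity: the log-log coefficient on structure size
beta_sqft = 0.5050
print(round(beta_sqft * 10, 2))  # → 5.05: a 10% larger structure -> ~5% higher price
```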

Code
np.set_printoptions(suppress=True) #prevent scientific format
Code
balt_sales.DESCBLDG = balt_sales.DESCBLDG.astype('category')

31.3.2 Spatial Models

Code
from spreg import GM_Error, GM_Lag, GM_Combo, GM_Error_Het, GM_Combo_Het

31.3.2.1 Spatial Error

\[y = \beta X + u\] \[ u = \lambda Wu+ \epsilon \]

Code
model_sem = GM_Error(y=y, x=X, w=w_knn, name_x=name_x, name_y=name_y,)
Code
print(model_sem.summary)
REGRESSION RESULTS
------------------

SUMMARY OF OUTPUT: GM SPATIALLY WEIGHTED LEAST SQUARES
------------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :   log_price                Number of Observations:        5284
Mean dependent var  :     12.1935                Number of Variables   :          16
S.D. dependent var  :      0.5906                Degrees of Freedom    :        5268
Pseudo R-squared    :      0.7172

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT        61.20792        11.53968         5.30413         0.00000
    C(RESITYP)[T.TH]        -0.22905         0.01584       -14.45575         0.00000
C(DESCCNST)[T.CNST Frame]         0.00014         0.01185         0.01144         0.99088
C(DESCCNST)[T.CNST Shingle Asbestos]         0.01347         0.02553         0.52755         0.59781
C(DESCCNST)[T.CNST Shingle Wood]        -0.02241         0.03212        -0.69770         0.48537
C(DESCCNST)[T.CNST Siding]         0.01895         0.01300         1.45739         0.14501
C(DESCCNST)[T.CNST Stucco]        -0.02542         0.02424        -1.04882         0.29426
            STRUGRAD         0.16273         0.00566        28.77062         0.00000
    np.log(SQFTSTRC)         0.49084         0.01557        31.52374         0.00000
             YEARBLT        -0.05792         0.01189        -4.87239         0.00000
     I(YEARBLT ** 2)         0.00002         0.00000         4.97828         0.00000
p_nonhisp_white_persons         0.00218         0.00038         5.75018         0.00000
p_edu_college_greater         0.00432         0.00055         7.81676         0.00000
median_household_income         0.11367         0.02211         5.14053         0.00000
p_housing_units_multiunit_structures         0.00153         0.00034         4.54766         0.00001
p_owner_occupied_units         0.00342         0.00066         5.20956         0.00000
              lambda         0.55915    
------------------------------------------------------------------------------------
================================ END OF REPORT =====================================

Still, most variables are significant (though siding construction falls out), and the remaining significant coefficients have changed values a bit.

Code
pd.DataFrame({'variable':model_sem.name_x, 
              'beta' : model_sem.betas.flatten(),
              'p_val':pd.concat([pd.DataFrame(model_sem.z_stat)[1].round(4), pd.Series(np.nan)], ignore_index=True)})
variable beta p_val
0 CONSTANT 61.207916 0.0000
1 C(RESITYP)[T.TH] -0.229048 0.0000
2 C(DESCCNST)[T.CNST Frame] 0.000135 0.9909
3 C(DESCCNST)[T.CNST Shingle Asbestos] 0.013467 0.5978
4 C(DESCCNST)[T.CNST Shingle Wood] -0.022408 0.4854
5 C(DESCCNST)[T.CNST Siding] 0.018945 0.1450
6 C(DESCCNST)[T.CNST Stucco] -0.025422 0.2943
7 STRUGRAD 0.162727 0.0000
8 np.log(SQFTSTRC) 0.490843 0.0000
9 YEARBLT -0.057920 0.0000
10 I(YEARBLT ** 2) 0.000015 0.0000
11 p_nonhisp_white_persons 0.002181 0.0000
12 p_edu_college_greater 0.004325 0.0000
13 median_household_income 0.113666 0.0000
14 p_housing_units_multiunit_structures 0.001526 0.0000
15 p_owner_occupied_units 0.003416 0.0000
16 lambda 0.559152 NaN
Code
moran_resid_sem = esda.Moran(model_sem.e_filtered, W,)
Code
esdplt.plot_moran(moran_resid_sem)
(<Figure size 960x384 with 2 Axes>,
 array([<Axes: title={'center': 'Reference Distribution'}, xlabel='Moran I: 0.03', ylabel='Density'>,
        <Axes: title={'center': 'Moran Scatterplot (0.03)'}, xlabel='Attribute', ylabel='Spatial Lag'>],
       dtype=object))

The SEM specification removes all spatial autocorrelation from the residual, as expected. But it gives us no sense of how prices spill over from one unit to another. To estimate spillover effects, we need a spatial autoregressive (SAR) model.

Code
ax = pd.DataFrame({'predy': zscore(model_sem.predy.flatten()), 'u': zscore(model_sem.e_filtered.flatten())}).plot(x='predy', y='u', kind='scatter')
ax.hlines(0, -5, 5, 'red')
ax.vlines(0,-5, 5, 'red')

Code
ax = sns.kdeplot(y=zscore(model_sem.predy.flatten()), x=zscore(model_sem.e_filtered.flatten()))
ax.hlines(0, -5, 5, 'red')
ax.vlines(0,-5, 5, 'red')

Code
pd.DataFrame(zscore(model_sem.e_filtered)).plot(kind='density')

But the residuals are not perfectly random, either. It might be a good idea to refit with a model that accounts for heteroskedastic errors.

31.3.2.1.1 SEM w/ heteroskedastic errors
Code
model_sem_het = GM_Error_Het(y=y, x=X, w=W, name_x=name_x, name_y=name_y, )
Code
print(model_sem_het.summary)
REGRESSION RESULTS
------------------

SUMMARY OF OUTPUT: GM SPATIALLY WEIGHTED LEAST SQUARES (HET)
------------------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :   log_price                Number of Observations:        5284
Mean dependent var  :     12.1935                Number of Variables   :          16
S.D. dependent var  :      0.5906                Degrees of Freedom    :        5268
Pseudo R-squared    :      0.7060
N. of iterations    :           1                Step1c computed       :          No

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT        64.64146         8.54884         7.56143         0.00000
    C(RESITYP)[T.TH]        -0.25521         0.01636       -15.60421         0.00000
C(DESCCNST)[T.CNST Frame]        -0.00296         0.01254        -0.23584         0.81356
C(DESCCNST)[T.CNST Shingle Asbestos]        -0.00380         0.02149        -0.17659         0.85983
C(DESCCNST)[T.CNST Shingle Wood]        -0.00774         0.02736        -0.28303         0.77715
C(DESCCNST)[T.CNST Siding]         0.01372         0.01203         1.14063         0.25402
C(DESCCNST)[T.CNST Stucco]        -0.02606         0.02157        -1.20815         0.22699
            STRUGRAD         0.16573         0.00660        25.12683         0.00000
    np.log(SQFTSTRC)         0.50843         0.01654        30.74095         0.00000
             YEARBLT        -0.06122         0.00877        -6.97893         0.00000
     I(YEARBLT ** 2)         0.00002         0.00000         7.17193         0.00000
p_nonhisp_white_persons         0.00250         0.00037         6.69357         0.00000
p_edu_college_greater         0.00319         0.00050         6.35416         0.00000
median_household_income         0.06525         0.02034         3.20728         0.00134
p_housing_units_multiunit_structures         0.00089         0.00027         3.36644         0.00076
p_owner_occupied_units         0.00226         0.00058         3.88627         0.00010
              lambda         0.86510         0.03296        26.25080         0.00000
------------------------------------------------------------------------------------
================================ END OF REPORT =====================================
Code
pd.DataFrame({'variable':model_sem_het.name_x, 
              'beta' : model_sem_het.betas.flatten(),
              'p_val':pd.DataFrame(model_sem_het.z_stat)[1].round(4)})
variable beta p_val
0 CONSTANT 64.641457 0.0000
1 C(RESITYP)[T.TH] -0.255213 0.0000
2 C(DESCCNST)[T.CNST Frame] -0.002957 0.8136
3 C(DESCCNST)[T.CNST Shingle Asbestos] -0.003796 0.8598
4 C(DESCCNST)[T.CNST Shingle Wood] -0.007744 0.7772
5 C(DESCCNST)[T.CNST Siding] 0.013721 0.2540
6 C(DESCCNST)[T.CNST Stucco] -0.026060 0.2270
7 STRUGRAD 0.165725 0.0000
8 np.log(SQFTSTRC) 0.508430 0.0000
9 YEARBLT -0.061221 0.0000
10 I(YEARBLT ** 2) 0.000016 0.0000
11 p_nonhisp_white_persons 0.002502 0.0000
12 p_edu_college_greater 0.003185 0.0000
13 median_household_income 0.065249 0.0013
14 p_housing_units_multiunit_structures 0.000894 0.0008
15 p_owner_occupied_units 0.002258 0.0001
16 lambda 0.865098 0.0000
Code
pd.DataFrame(model_sem_het.e_filtered).plot(kind='density')

Code
ax = pd.DataFrame({'predy': zscore(model_sem_het.predy.flatten()), 'eps': zscore(model_sem_het.e_filtered.flatten())}).plot(x='predy', y='eps', kind='scatter')
ax.hlines(0, -5, 5, 'red')
ax.vlines(0,-5, 5, 'red')

Code
ax = sns.kdeplot(y=zscore(model_sem_het.predy.flatten()), x=zscore(model_sem_het.e_filtered.flatten()))
ax.hlines(0, -5, 5, 'red')
ax.vlines(0,-5, 5, 'red')

Code
moran_resid_sem_het = esda.Moran(model_sem_het.e_filtered, W,)
Code
esdplt.plot_moran(moran_resid_sem_het)
(<Figure size 960x384 with 2 Axes>,
 array([<Axes: title={'center': 'Reference Distribution'}, xlabel='Moran I: -0.01', ylabel='Density'>,
        <Axes: title={'center': 'Moran Scatterplot (-0.01)'}, xlabel='Attribute', ylabel='Spatial Lag'>],
       dtype=object))

Again, the spatially-filtered residuals take care of the unobserved spatial effects, but they also mask out any substantively interesting spillover effects.

31.3.2.2 Spatial Lag

\[y = \rho W y + \beta X + \epsilon\]

Code
model_sar = GM_Lag(y=y, x=X, w=W, name_x=name_x, name_y=name_y)
Code
print(model_sar.summary)
REGRESSION RESULTS
------------------

SUMMARY OF OUTPUT: SPATIAL TWO STAGE LEAST SQUARES
--------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :   log_price                Number of Observations:        5284
Mean dependent var  :     12.1935                Number of Variables   :          17
S.D. dependent var  :      0.5906                Degrees of Freedom    :        5267
Pseudo R-squared    :      0.7456
Spatial Pseudo R-squared:  0.7223

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT        66.07409         9.86125         6.70038         0.00000
    C(RESITYP)[T.TH]        -0.18124         0.01296       -13.98737         0.00000
C(DESCCNST)[T.CNST Frame]        -0.01172         0.01175        -0.99787         0.31834
C(DESCCNST)[T.CNST Shingle Asbestos]         0.02363         0.02591         0.91172         0.36192
C(DESCCNST)[T.CNST Shingle Wood]        -0.06439         0.03213        -2.00396         0.04507
C(DESCCNST)[T.CNST Siding]         0.02360         0.01248         1.89136         0.05858
C(DESCCNST)[T.CNST Stucco]        -0.05386         0.02448        -2.19990         0.02781
            STRUGRAD         0.15073         0.00542        27.79287         0.00000
    np.log(SQFTSTRC)         0.46627         0.01449        32.17537         0.00000
             YEARBLT        -0.06696         0.01014        -6.60722         0.00000
     I(YEARBLT ** 2)         0.00002         0.00000         6.76186         0.00000
p_nonhisp_white_persons         0.00092         0.00023         4.03543         0.00005
p_edu_college_greater         0.00325         0.00039         8.44253         0.00000
median_household_income         0.07710         0.01673         4.60782         0.00000
p_housing_units_multiunit_structures         0.00075         0.00024         3.08887         0.00201
p_owner_occupied_units         0.00420         0.00047         8.88054         0.00000
         W_log_price         0.35576         0.01987        17.89999         0.00000
------------------------------------------------------------------------------------
Instrumented: W_log_price
Instruments: W_C(DESCCNST)[T.CNST Frame], W_C(DESCCNST)[T.CNST Shingle
             Asbestos], W_C(DESCCNST)[T.CNST Shingle Wood],
             W_C(DESCCNST)[T.CNST Siding], W_C(DESCCNST)[T.CNST Stucco],
             W_C(RESITYP)[T.TH], W_I(YEARBLT ** 2), W_STRUGRAD, W_YEARBLT,
             W_median_household_income, W_np.log(SQFTSTRC),
             W_p_edu_college_greater,
             W_p_housing_units_multiunit_structures,
             W_p_nonhisp_white_persons, W_p_owner_occupied_units

DIAGNOSTICS FOR SPATIAL DEPENDENCE
TEST                              DF         VALUE           PROB
Anselin-Kelejian Test             1        224.508           0.0000

SPATIAL LAG MODEL IMPACTS
Impacts computed using the 'simple' method.
            Variable         Direct        Indirect          Total
    C(RESITYP)[T.TH]        -0.1812         -0.1001         -0.2813
C(DESCCNST)[T.CNST Frame]        -0.0117         -0.0065         -0.0182
C(DESCCNST)[T.CNST Shingle Asbestos]         0.0236          0.0130          0.0367
C(DESCCNST)[T.CNST Shingle Wood]        -0.0644         -0.0356         -0.0999
C(DESCCNST)[T.CNST Siding]         0.0236          0.0130          0.0366
C(DESCCNST)[T.CNST Stucco]        -0.0539         -0.0297         -0.0836
            STRUGRAD         0.1507          0.0832          0.2340
    np.log(SQFTSTRC)         0.4663          0.2575          0.7237
             YEARBLT        -0.0670         -0.0370         -0.1039
     I(YEARBLT ** 2)         0.0000          0.0000          0.0000
p_nonhisp_white_persons         0.0009          0.0005          0.0014
p_edu_college_greater         0.0033          0.0018          0.0050
median_household_income         0.0771          0.0426          0.1197
p_housing_units_multiunit_structures         0.0007          0.0004          0.0012
p_owner_occupied_units         0.0042          0.0023          0.0065
================================ END OF REPORT =====================================
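The 'simple' impacts reported above can be reproduced by hand: the direct effect is the coefficient itself, the total effect is \(\beta_k / (1 - \rho)\), and the indirect (spillover) effect is the difference between the two. A quick sanity check using values copied from the SAR summary (rounded, so the reproduction is only exact up to rounding):

```python
# Values copied from the SAR summary above (rounded estimates).
rho = 0.355761            # coefficient on W_log_price
beta_sqft = 0.466268      # coefficient on np.log(SQFTSTRC)

direct = beta_sqft                # 'simple' method: direct effect is beta itself
total = beta_sqft / (1 - rho)     # total effect: beta / (1 - rho)
indirect = total - direct         # indirect (spillover) effect

print(direct, indirect, total)
```

Up to rounding, this reproduces the 0.4663 / 0.2575 / 0.7237 row for `np.log(SQFTSTRC)` in the impacts table.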
Code
model_sar.name_x.append('rho')
Code
pd.DataFrame({'variable':model_sar.name_x, 
              'beta' : model_sar.betas.flatten(),
              'p_val':pd.DataFrame(model_sar.z_stat)[1].round(4)})
variable beta p_val
0 CONSTANT 66.074088 0.0000
1 C(RESITYP)[T.TH] -0.181238 0.0000
2 C(DESCCNST)[T.CNST Frame] -0.011721 0.3183
3 C(DESCCNST)[T.CNST Shingle Asbestos] 0.023626 0.3619
4 C(DESCCNST)[T.CNST Shingle Wood] -0.064391 0.0451
5 C(DESCCNST)[T.CNST Siding] 0.023602 0.0586
6 C(DESCCNST)[T.CNST Stucco] -0.053864 0.0278
7 STRUGRAD 0.150735 0.0000
8 np.log(SQFTSTRC) 0.466268 0.0000
9 YEARBLT -0.066965 0.0000
10 I(YEARBLT ** 2) 0.000018 0.0000
11 p_nonhisp_white_persons 0.000915 0.0001
12 p_edu_college_greater 0.003251 0.0000
13 median_household_income 0.077101 0.0000
14 p_housing_units_multiunit_structures 0.000745 0.0020
15 p_owner_occupied_units 0.004204 0.0000
16 rho 0.355761 0.0000
Code
ax = pd.DataFrame({'predy': zscore(model_sar.predy.flatten()), 'u': zscore(model_sar.e_pred.flatten())}).plot(x='predy', y='u', kind='scatter')
ax.hlines(0, -5, 5, 'red')
ax.vlines(0,-5, 5, 'red')

Code
import seaborn as sns
Code
ax = sns.kdeplot(y=zscore(model_sar.predy.flatten()), x=zscore(model_sar.e_pred.flatten()))
ax.hlines(0, -5, 5, 'red')
ax.vlines(0,-5, 5, 'red')

Code
pd.DataFrame(zscore(model_sar.e_pred)).plot(kind='density')

Code
moran_resid_sar = esda.Moran(model_sar.u, W)
moran_resid_sar_local= esda.Moran_Local(model_sar.u, W)
Code
esdplt.plot_moran(moran_resid_sar)
(<Figure size 960x384 with 2 Axes>,
 array([<Axes: title={'center': 'Reference Distribution'}, xlabel='Moran I: 0.06', ylabel='Density'>,
        <Axes: title={'center': 'Moran Scatterplot (0.06)'}, xlabel='Attribute', ylabel='Spatial Lag'>],
       dtype=object))

Though the graph doesn't look alarming on its own, compared to the reference distribution it's clear there's still significant autocorrelation in the residual. To keep the spillover term and also account for autocorrelation in the residual, we need to use a 'spatial combo' model.

31.3.2.3 Spatial Combo

The spatial autoregressive combination (SAC) or "spatial combo" model includes both a lagged dependent variable to capture endogenous spillover effects and a spatial error term that can capture any residual autocorrelation from omitted variables.

\[y = \rho W y + \beta X + u\] \[ u = \lambda Wu+ \epsilon \]
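To make the data-generating process concrete, here is a toy simulation of the combo specification on a small random weights matrix (the parameter values are illustrative, not estimates from the models above). The reduced form solves \(u = (I - \lambda W)^{-1}\epsilon\) and then \(y = (I - \rho W)^{-1}(X\beta + u)\):

```python
import numpy as np

np.random.seed(1)
n = 50

# Random row-standardized weights matrix (guarding against empty rows).
W = np.random.binomial(1, 0.1, size=(n, n)).astype(float)
np.fill_diagonal(W, 0)
W /= np.maximum(W.sum(axis=1, keepdims=True), 1)

rho, lam, beta = 0.4, 0.5, 2.0          # illustrative parameters
x = np.random.normal(size=(n, 1))
eps = np.random.normal(size=(n, 1))

I = np.eye(n)
u = np.linalg.solve(I - lam * W, eps)            # u = (I - lambda W)^-1 eps
y = np.linalg.solve(I - rho * W, x * beta + u)   # y = (I - rho W)^-1 (X beta + u)
```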

Code
model_combo = GM_Combo(y=y, x=X, w=W, name_x=name_x, name_y=name_y)
Code
print(model_combo.summary)
REGRESSION RESULTS
------------------

SUMMARY OF OUTPUT: SPATIALLY WEIGHTED 2SLS - GM-COMBO MODEL
-----------------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :   log_price                Number of Observations:        5284
Mean dependent var  :     12.1935                Number of Variables   :          17
S.D. dependent var  :      0.5906                Degrees of Freedom    :        5267
Pseudo R-squared    :      0.7415
Spatial Pseudo R-squared:  0.7112

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT        62.57357        10.50754         5.95511         0.00000
    C(RESITYP)[T.TH]        -0.24186         0.01536       -15.74377         0.00000
C(DESCCNST)[T.CNST Frame]        -0.00538         0.01161        -0.46347         0.64303
C(DESCCNST)[T.CNST Shingle Asbestos]        -0.00001         0.02555        -0.00029         0.99977
C(DESCCNST)[T.CNST Shingle Wood]        -0.01291         0.03239        -0.39855         0.69022
C(DESCCNST)[T.CNST Siding]         0.01408         0.01263         1.11481         0.26493
C(DESCCNST)[T.CNST Stucco]        -0.03140         0.02398        -1.30948         0.19037
            STRUGRAD         0.15877         0.00562        28.23458         0.00000
    np.log(SQFTSTRC)         0.49767         0.01508        33.00185         0.00000
             YEARBLT        -0.06317         0.01080        -5.85124         0.00000
     I(YEARBLT ** 2)         0.00002         0.00000         6.01551         0.00000
p_nonhisp_white_persons         0.00173         0.00033         5.30612         0.00000
p_edu_college_greater         0.00272         0.00044         6.15927         0.00000
median_household_income         0.05159         0.01793         2.87783         0.00400
p_housing_units_multiunit_structures         0.00058         0.00026         2.23029         0.02573
p_owner_occupied_units         0.00245         0.00052         4.66969         0.00000
         W_log_price         0.34201         0.03045        11.23303         0.00000
              lambda         0.61024    
------------------------------------------------------------------------------------
Instrumented: W_log_price
Instruments: W_C(DESCCNST)[T.CNST Frame], W_C(DESCCNST)[T.CNST Shingle
             Asbestos], W_C(DESCCNST)[T.CNST Shingle Wood],
             W_C(DESCCNST)[T.CNST Siding], W_C(DESCCNST)[T.CNST Stucco],
             W_C(RESITYP)[T.TH], W_I(YEARBLT ** 2), W_STRUGRAD, W_YEARBLT,
             W_median_household_income, W_np.log(SQFTSTRC),
             W_p_edu_college_greater,
             W_p_housing_units_multiunit_structures,
             W_p_nonhisp_white_persons, W_p_owner_occupied_units
================================ END OF REPORT =====================================
Code
model_combo.name_x.append('rho')
model_combo.name_x.append('lambda')
Code
pd.DataFrame({'variable':model_combo.name_x, 
              'beta' : model_combo.betas.flatten(),
              'p_val':pd.concat([pd.DataFrame(model_combo.z_stat)[1].round(4),pd.Series(np.nan)])})
variable beta p_val
0 CONSTANT 62.573571 0.0000
1 C(RESITYP)[T.TH] -0.241862 0.0000
2 C(DESCCNST)[T.CNST Frame] -0.005379 0.6430
3 C(DESCCNST)[T.CNST Shingle Asbestos] -0.000007 0.9998
4 C(DESCCNST)[T.CNST Shingle Wood] -0.012907 0.6902
5 C(DESCCNST)[T.CNST Siding] 0.014085 0.2649
6 C(DESCCNST)[T.CNST Stucco] -0.031400 0.1904
7 STRUGRAD 0.158769 0.0000
8 np.log(SQFTSTRC) 0.497670 0.0000
9 YEARBLT -0.063170 0.0000
10 I(YEARBLT ** 2) 0.000017 0.0000
11 p_nonhisp_white_persons 0.001733 0.0000
12 p_edu_college_greater 0.002716 0.0000
13 median_household_income 0.051591 0.0040
14 p_housing_units_multiunit_structures 0.000585 0.0257
15 p_owner_occupied_units 0.002449 0.0000
16 rho 0.342010 0.0000
0 lambda 0.610241 NaN
Code
pd.DataFrame(zscore(model_combo.u)).plot(kind='density')

Code
model_combo.e_filtered.flatten().mean().round(6)
0.0
Code
ax = pd.DataFrame({'predy': zscore(model_combo.predy.flatten()), 'eps': zscore(model_combo.e_filtered.flatten())}).plot(x='predy', y='eps', kind='scatter')
ax.hlines(0, -5, 5, 'red')
ax.vlines(0,-5, 5, 'red')

Code
ax = sns.kdeplot(y=zscore(model_combo.predy.flatten()), x=zscore(model_combo.e_pred.flatten()))
ax.hlines(0, -5, 5, 'red')
ax.vlines(0,-5, 5, 'red')

The residuals still aren't perfect, so it might be good to fit a model accounting for heteroskedastic errors again.

Code
moran_resid_combo = esda.Moran(model_combo.e_filtered, W)
Code
esdplt.plot_moran(moran_resid_combo)
(<Figure size 960x384 with 2 Axes>,
 array([<Axes: title={'center': 'Reference Distribution'}, xlabel='Moran I: 0.0', ylabel='Density'>,
        <Axes: title={'center': 'Moran Scatterplot (0.0)'}, xlabel='Attribute', ylabel='Spatial Lag'>],
       dtype=object))

The spatial distribution of the residuals looks great, though.

Code
model_combo_het = GM_Combo_Het(y=y, x=X, w=W, name_x=name_x, name_y=name_y)
Code
print(model_combo_het.summary)
REGRESSION RESULTS
------------------

SUMMARY OF OUTPUT: SPATIALLY WEIGHTED 2SLS- GM-COMBO MODEL (HET)
----------------------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :   log_price                Number of Observations:        5284
Mean dependent var  :     12.1935                Number of Variables   :          17
S.D. dependent var  :      0.5906                Degrees of Freedom    :        5267
Pseudo R-squared    :      0.7417
Spatial Pseudo R-squared:  0.7117
N. of iterations    :           1                Step1c computed       :          No

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT        62.74464         8.43863         7.43540         0.00000
    C(RESITYP)[T.TH]        -0.23986         0.01597       -15.02121         0.00000
C(DESCCNST)[T.CNST Frame]        -0.00553         0.01252        -0.44143         0.65890
C(DESCCNST)[T.CNST Shingle Asbestos]         0.00038         0.02152         0.01763         0.98593
C(DESCCNST)[T.CNST Shingle Wood]        -0.01405         0.02732        -0.51438         0.60699
C(DESCCNST)[T.CNST Siding]         0.01422         0.01192         1.19289         0.23291
C(DESCCNST)[T.CNST Stucco]        -0.03207         0.02189        -1.46511         0.14289
            STRUGRAD         0.15865         0.00658        24.11135         0.00000
    np.log(SQFTSTRC)         0.49690         0.01651        30.09898         0.00000
             YEARBLT        -0.06335         0.00867        -7.30926         0.00000
     I(YEARBLT ** 2)         0.00002         0.00000         7.51487         0.00000
p_nonhisp_white_persons         0.00171         0.00036         4.75191         0.00000
p_edu_college_greater         0.00273         0.00049         5.56271         0.00000
median_household_income         0.05211         0.02016         2.58481         0.00974
p_housing_units_multiunit_structures         0.00059         0.00027         2.20779         0.02726
p_owner_occupied_units         0.00249         0.00058         4.32686         0.00002
         W_log_price         0.34236         0.03925         8.72249         0.00000
              lambda         0.74801         0.05451        13.72283         0.00000
------------------------------------------------------------------------------------
Instrumented: W_log_price
Instruments: W_C(DESCCNST)[T.CNST Frame], W_C(DESCCNST)[T.CNST Shingle
             Asbestos], W_C(DESCCNST)[T.CNST Shingle Wood],
             W_C(DESCCNST)[T.CNST Siding], W_C(DESCCNST)[T.CNST Stucco],
             W_C(RESITYP)[T.TH], W_I(YEARBLT ** 2), W_STRUGRAD, W_YEARBLT,
             W_median_household_income, W_np.log(SQFTSTRC),
             W_p_edu_college_greater,
             W_p_housing_units_multiunit_structures,
             W_p_nonhisp_white_persons, W_p_owner_occupied_units
================================ END OF REPORT =====================================
Code
model_combo_het.name_x.append('rho')
model_combo_het.name_x.append('lambda')
Code
pd.DataFrame({'variable':model_combo_het.name_x, 
              'beta' : model_combo_het.betas.flatten(),
              'p_val':pd.DataFrame(model_combo_het.z_stat)[1].round(4)})
variable beta p_val
0 CONSTANT 62.744638 0.0000
1 C(RESITYP)[T.TH] -0.239859 0.0000
2 C(DESCCNST)[T.CNST Frame] -0.005529 0.6589
3 C(DESCCNST)[T.CNST Shingle Asbestos] 0.000379 0.9859
4 C(DESCCNST)[T.CNST Shingle Wood] -0.014055 0.6070
5 C(DESCCNST)[T.CNST Siding] 0.014224 0.2329
6 C(DESCCNST)[T.CNST Stucco] -0.032073 0.1429
7 STRUGRAD 0.158651 0.0000
8 np.log(SQFTSTRC) 0.496905 0.0000
9 YEARBLT -0.063350 0.0000
10 I(YEARBLT ** 2) 0.000017 0.0000
11 p_nonhisp_white_persons 0.001712 0.0000
12 p_edu_college_greater 0.002735 0.0000
13 median_household_income 0.052112 0.0097
14 p_housing_units_multiunit_structures 0.000587 0.0273
15 p_owner_occupied_units 0.002493 0.0000
16 rho 0.342361 0.0000
17 lambda 0.748006 0.0000
Code
pd.DataFrame(zscore(model_combo_het.u)).plot(kind='density')

Code
ax = pd.DataFrame({'predy': zscore(model_combo_het.predy.flatten()), 'u': zscore(model_combo_het.e_filtered.flatten())}).plot(x='predy', y='u', kind='scatter')
ax.hlines(0, -5, 5, 'red')
ax.vlines(0,-5, 5, 'red')

Code
ax = sns.kdeplot(y=zscore(model_combo_het.predy.flatten()), x=zscore(model_combo_het.e_filtered.flatten()))
ax.hlines(0, -5, 5, 'red')
ax.vlines(0,-5, 5, 'red')

The residuals still aren't perfectly random, and we can't get rid of that skew, but these are real data, and they're imperfect. It is clear that we get closer to normal residuals once we account for autocorrelation.

Code
moran_resid_combo_het = esda.Moran(model_combo_het.e_filtered, W)
Code
esdplt.plot_moran(moran_resid_combo_het)
(<Figure size 960x384 with 2 Axes>,
 array([<Axes: title={'center': 'Reference Distribution'}, xlabel='Moran I: -0.01', ylabel='Density'>,
        <Axes: title={'center': 'Moran Scatterplot (-0.01)'}, xlabel='Attribute', ylabel='Spatial Lag'>],
       dtype=object))

Code
moran_resid_combo_het.p_sim
0.001

Interesting… If we fit a model with heteroskedastic errors, we can reintroduce some significant spatial dependence in the residuals, even if it's small. In this case, the difference between the models with and without the heteroskedasticity correction is negligible; their results are basically identical.
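The `p_sim` value esda reports comes from a permutation test: shuffle the attribute values across locations, recompute Moran's I under each permutation, and compare the observed statistic to the simulated reference distribution. A minimal sketch of that logic on toy data (illustrative, not esda's exact implementation):

```python
import numpy as np

def moran_i(z, W):
    # Moran's I for an attribute vector and a binary weights matrix.
    z = z - z.mean()
    return (len(z) / W.sum()) * (z @ W @ z) / (z @ z)

rng = np.random.default_rng(0)
n = 40
W = rng.binomial(1, 0.15, size=(n, n)).astype(float)
np.fill_diagonal(W, 0)

z = rng.normal(size=n)
observed = moran_i(z, W)
sims = np.array([moran_i(rng.permutation(z), W) for _ in range(999)])

# One-sided pseudo p-value: (extreme simulations + 1) / (permutations + 1)
p_sim = ((sims >= observed).sum() + 1) / (len(sims) + 1)
```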

Combining the coefficients from the different models into a single table lets us compare how the estimates change under each specification.

Code
coefs = pd.DataFrame({
              "combo": model_combo.betas.flatten(),
              "combo_het": model_combo_het.betas.flatten()}).join(pd.DataFrame({"sar": model_sar.betas.flatten()})).join(pd.DataFrame({"sem": model_sem.betas.flatten()})).round(4)
Code
coefs
combo combo_het sar sem
0 62.5736 62.7446 66.0741 61.2079
1 -0.2419 -0.2399 -0.1812 -0.2290
2 -0.0054 -0.0055 -0.0117 0.0001
3 -0.0000 0.0004 0.0236 0.0135
4 -0.0129 -0.0141 -0.0644 -0.0224
5 0.0141 0.0142 0.0236 0.0189
6 -0.0314 -0.0321 -0.0539 -0.0254
7 0.1588 0.1587 0.1507 0.1627
8 0.4977 0.4969 0.4663 0.4908
9 -0.0632 -0.0634 -0.0670 -0.0579
10 0.0000 0.0000 0.0000 0.0000
11 0.0017 0.0017 0.0009 0.0022
12 0.0027 0.0027 0.0033 0.0043
13 0.0516 0.0521 0.0771 0.1137
14 0.0006 0.0006 0.0007 0.0015
15 0.0024 0.0025 0.0042 0.0034
16 0.3420 0.3424 0.3558 0.5592
17 0.6102 0.7480 NaN NaN
Code
coeffs = pd.DataFrame(model_ols.params).reset_index().join(coefs, how='right')
Code
coeffs.at[16, 'index'] = 'rho'
coeffs.at[17, 'index'] = 'lambda'
Code
coeffs['sem'].at[17] = coeffs['sem'].at[16]
coeffs['sem'].at[16] = np.nan
Code
coeffs = coeffs.rename(columns={0: 'ols'})
Code
coeffs.set_index('index')[['ols', 'sem', 'sar', 'combo', 'combo_het']].round(3)
ols sem sar combo combo_het
index
Intercept 76.864 61.208 66.074 62.574 62.745
C(RESITYP)[T.TH] -0.205 -0.229 -0.181 -0.242 -0.240
C(DESCCNST)[T.CNST Frame] -0.012 0.000 -0.012 -0.005 -0.006
C(DESCCNST)[T.CNST Shingle Asbestos] 0.029 0.014 0.024 -0.000 0.000
C(DESCCNST)[T.CNST Shingle Wood] -0.042 -0.022 -0.064 -0.013 -0.014
C(DESCCNST)[T.CNST Siding] 0.037 0.019 0.024 0.014 0.014
C(DESCCNST)[T.CNST Stucco] -0.045 -0.025 -0.054 -0.031 -0.032
STRUGRAD 0.172 0.163 0.151 0.159 0.159
np.log(SQFTSTRC) 0.505 0.491 0.466 0.498 0.497
YEARBLT -0.075 -0.058 -0.067 -0.063 -0.063
I(YEARBLT ** 2) 0.000 0.000 0.000 0.000 0.000
p_nonhisp_white_persons 0.001 0.002 0.001 0.002 0.002
p_edu_college_greater 0.005 0.004 0.003 0.003 0.003
median_household_income 0.150 0.114 0.077 0.052 0.052
p_housing_units_multiunit_structures 0.002 0.002 0.001 0.001 0.001
p_owner_occupied_units 0.005 0.003 0.004 0.002 0.002
rho NaN NaN 0.356 0.342 0.342
lambda NaN 0.559 NaN 0.610 0.748

Comparing the estimated coefficients shows that median income was doing a lot of heavy lifting in the OLS model. That is, when we ignore spatial effects, the median income variable absorbs a lot of the spatial variance, inflating its coefficient and biasing our interpretation of how much residents value higher-income neighbors. In the spatial models, the estimated coefficient for income is less than half as large. That's basically true of all the other neighborhood characteristics as well.

There are a few things going on here. First, it's important to remember that our data are measured at two different spatial levels. Our housing sales are measured at the unit level, but the 'neighborhood' variables are measured at the census blockgroup level. Because we have several sales nested inside each census unit, we're building spatial autocorrelation into the model by design. That is, because we don't know the income, education, race, etc. of the home's previous owner, the neighborhood variables are essentially \(Wx\) instead of just \(x\), where \(W\) is defined by discrete membership inside a census unit. That implies we're actually fitting something closer to a spatial Durbin model (SDM), since our preferred models have a spatially lagged \(Y\) variable, and some of the \(X\) variables are "lagged" themselves. [note: I think I would say it's not technically Durbin because we don't have \(x\) and \(Wx\) for all inputs, but a mix of the two… I know you don't have to lag all the \(X\) in a Durbin specification, but I don't think I've seen one where you have only \(Wx\) for some \(x\) (as opposed to only \(x\), not \(Wx\))]
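That nesting is easy to see with a toy example (hypothetical numbers for illustration): assigning each sale its block group's value is the same as computing \(Wx\), where \(W\) is a row-standardized block-membership matrix.

```python
import numpy as np

# Five sales nested in two block groups (hypothetical values).
income = np.array([50., 60., 70., 90., 110.])   # unit-level attribute
groups = np.array([0, 0, 0, 1, 1])              # block-group membership

# Membership weights: w_ij = 1 if i and j share a block group, then row-standardized.
W = (groups[:, None] == groups[None, :]).astype(float)
W /= W.sum(axis=1, keepdims=True)

# Each row of W @ income is that unit's block-group mean.
group_means = np.array([income[groups == g].mean() for g in groups])
print(np.allclose(W @ income, group_means))  # True
```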

Once we account for space, two things stand out. In the SAR and spatial combo models, the autoregressive term \(\rho\) is positive, significant, and fairly large in magnitude, showing strong evidence of a spillover effect in housing prices. But the combo models also have a large and significant spatial error term \(\lambda\), so we also have a big unmeasured spatial effect. There are several important variables we might be missing, like perceptions of crime, school quality, and access to amenities. Indeed, according to neoclassical urban theory, we have not included the determining factor in land rent prices: access to the CBD (or jobs) (alonso?, muth, mills, …)

31.4 Estimation Techniques

Classically, spatial econometric models are estimated using maximum likelihood (ML), but modern approaches prefer the generalized method of moments (GMM). GMM estimation is much faster and applicable to much larger datasets, but it depends on instrumental variables. You can provide these instruments yourself, or you can use spatially lagged regressors (i.e., lags of the \(X\) variables), which is the standard technique. This matters because when estimating a spatial Durbin model with GMM, the \(WX\) terms are already regressors in the model, so you will need either to use higher-order spatial lags as instruments or to provide your own.