import geopandas as gpd
import pandas as pd
import numpy as np
import esda
import libpysal as lps
import matplotlib.pyplot as plt
from geosnap import DataStore
from geosnap.io import get_acs
from splot import esda as esdplt

%load_ext watermark
%watermark -a 'eli knaap' -v -d -u -p spreg,geopandas,geosnap,statsmodels
Author: eli knaap
Last updated: 2024-03-26
Python implementation: CPython
Python version : 3.10.8
IPython version : 8.9.0
spreg : 1.4.2
geopandas : 0.14.1
geosnap : 0.12.1.dev9+g3a1cb0f6de61.d20240110
statsmodels: 0.13.5
35.1 Understanding and Interpreting Spatial Models
In the last chapter, several commonly-used model specifications were defined, but the choice among them depends on the research question. Naturally, if spillover effects are of substantive interest, then it is necessary to choose a model that includes (at least) \(\rho WY\) or \(\theta WX\), but beyond that, the choice is more complex.
The literature basically offers two different strategies; Anselin & Rey (2014) argue that a “bottom-up” strategy is most effective, where we begin with simple models (e.g., OLS), then conduct spatial diagnostic tests to decide whether more complex specifications are necessary. LeSage (2014) and LeSage & Pace (2014) argue for a “top-down” strategy where we begin with a ‘fully specified’ conceptual model. LeSage (2014, p. 19) goes so far as to say that “there are only two model specifications worth considering for applied work.” When there are theoretical reasons to consider global spillover processes (such as home prices in a hedonic model), then the SDM model is appropriate. In other cases that consider only local spillover effects, the SDEM specification will provide unbiased estimates of all parameters.
Their logic (to always prefer the Durbin specifications) is simple: the inclusion of the \(\theta WX\) term ensures that if there is an exogenous spillover process, then it will be measured accurately with only a small efficiency penalty, but omitting the term would lead to bias.
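That omitted-variable logic is easy to demonstrate with a small simulation. The sketch below is entirely synthetic (the ring-graph \(W\), parameter values, and variable names are arbitrary illustrations, not this chapter's data): it generates \(y = \beta x + \theta Wx + \epsilon\) with a spatially autocorrelated \(x\), then fits the model with and without the \(Wx\) term.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Row-standardized ring graph: each unit averages its two neighbors
W = np.zeros((n, n))
idx = np.arange(n)
W[idx, (idx + 1) % n] = 0.5
W[idx, (idx - 1) % n] = 0.5

# A spatially autocorrelated regressor, so that x and Wx are correlated
x = np.linalg.solve(np.eye(n) - 0.7 * W, rng.normal(size=n))
wx = W @ x
beta, theta = 1.0, 0.5
y = beta * x + theta * wx + rng.normal(size=n)

# Omitting the exogenous spillover term biases the coefficient on x upward,
# because x proxies for the omitted (correlated) Wx
b_short = np.polyfit(x, y, 1)[0]

# Including Wx recovers both parameters (small efficiency cost, no bias)
XX = np.column_stack([np.ones(n), x, wx])
b_full, *_ = np.linalg.lstsq(XX, y, rcond=None)
print(b_short, b_full[1], b_full[2])
```

Here `b_short` lands well above the true \(\beta = 1\) because it absorbs part of \(\theta\), while `b_full` recovers both coefficients.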
The distinction between the SDM versus the SDEM, however, is largely a theoretical choice for a given research question rather than an empirical matter. As LeSage & Pace (2014, p. 1551) describe:
“Global spillovers arise when changes in a characteristic of one region impact outcomes in more than just immediately neighboring regions. Global impacts fall on neighbors to the neighbors, neighbors to the neighbors to the neighbors, and so on…
Local spillovers represent a situation where the impacts fall only on nearby or immediate neighbors, dying out before they impact regions that are neighbors to the neighbors.
Another distinction between local and global spillovers is that feedback effects arise in the case of global spillovers, but not in the case of local spillovers.”
It follows “that practitioners cannot simply estimate an SDM and SDEM model and draw meaningful conclusions about the correct model specification from point estimates of \(\rho\) or \(\lambda\)… the SDM specification implies global spatial spillovers, while the SDEM is consistent with local spillovers. These have very different policy implications in applied work” (LeSage, 2014, p. 22), because indirect (and total) effects will be underestimated when the SDM is the appropriate model but the SDEM is fit instead, since the SDEM does not include feedback effects.
The SAR and SDM models which contain the spatially-lagged dependent variable \(\rho Wy\) specify global spillovers; the SLX, SDM, and SDEM models which contain the \(\theta WX\) term specify local spillovers (SDM specifies both). The difference between global and local spillovers pertains to the ways that effects propagate through the system and the ways that model coefficients may be interpreted.
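The “global” label comes from the SAR reduced form \(y = (I - \rho W)^{-1}(X\beta + \epsilon)\): the multiplier \((I - \rho W)^{-1}\) expands into the series \(I + \rho W + \rho^2 W^2 + \dots\), so an effect reaches neighbors of neighbors at every order, and its nonzero diagonal terms are the feedback effects. A minimal numpy check on a hypothetical toy graph (nothing here comes from the chapter's data):

```python
import numpy as np

# Hypothetical 4-node row-standardized ring graph (two neighbors each)
W = np.array([
    [0.0, 0.5, 0.0, 0.5],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.5, 0.0, 0.5, 0.0],
])
rho = 0.6

# Global spillover multiplier from the SAR reduced form
multiplier = np.linalg.inv(np.eye(4) - rho * W)

# The truncated Neumann series I + rho*W + rho^2*W^2 + ... converges to it,
# showing effects travel through the whole graph, not just direct neighbors
series = sum(np.linalg.matrix_power(rho * W, k) for k in range(50))
print(np.allclose(multiplier, series, atol=1e-8))

# Feedback: diagonal terms exceed 1 because a shock to region i
# loops back to i through its neighbors (and their neighbors, ...)
print(np.diag(multiplier))
```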
35.1.1 Model selection
Different specifications imply different data-generating processes (DGPs) and reflect what you want to study.
SAR assumes an explicit spillover process, whereas SEM assumes spatially correlated omitted variables. You should have theory guiding which specification you use, even though most applied work takes a fit-based model selection approach.
A sobering critique of applied spatial econometric modeling is given by Gibbons & Overman (2012), who argue compellingly that it
does not provide a toolbox that gives satisfactory solutions to these problems for applied researchers interested in causality and the economic processes at work. It is in this sense that we call into question the approach of the burgeoning body of applied economic research that proceeds with mechanical application of spatial econometric techniques to causal questions of “spillovers” and other forms of spatial interaction, or that estimates spatial lag models as a quick fix for endogeneity issues, or that blindly applies spatial econometric models by default without serious consideration of whether this is necessary given the research question in hand.
To overcome these issues, they argue that applied work should adopt an experimentalist design, where some form of exogenous variation is induced to provide insight into a causal mechanism at play. Models that incorporate both spatial and temporal dimensions (explored in the next section) provide an important avenue forward to address this issue (Bardaka et al., 2018; Delgado & Florax, 2015).
Alternatively, structural models should proceed by considering the SLX model as the point of departure, or use strong theory to guide the choice of a particular specification as described above (Halleck Vega & Elhorst, 2015). Although there remains debate in urban economics, we believe there is reasonable justification for assuming a potential autoregressive process in the specific context of hedonic housing price models.
Direct and indirect effects of spillovers
35.1.2 Graph (\(W\)) Selection
There are conflicting opinions on the importance of the neighbor graph.
The state of Maryland makes available a great set of parcel-level data called PropertyView. They also provide sales listings broken down by quarter or aggregated by year.
Code
# read in the points (not the polys)
gdf = gpd.read_file('data/BACIparcels0622/BACIMDPV.shp')
We generally start with ESDA (following the bottom-up strategy) to ensure we even care about spatial econometrics. We begin by mapping the dependent variable (housing price, here) and examining any global and local autocorrelation.
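Global autocorrelation is typically measured with Moran's I (`esda.Moran` computes it along with permutation-based inference). The statistic itself is just a spatially-weighted cross-product of deviations, \(I = \frac{n}{S_0}\frac{z'Wz}{z'z}\); a hand-rolled sketch on toy data (the arrays and graph below are made up for illustration):

```python
import numpy as np

def morans_i(y, W):
    """Global Moran's I: (n / S0) * (z' W z) / (z' z), with z = y - mean(y)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    z = y - y.mean()
    S0 = W.sum()  # sum of all weights
    return (n / S0) * (z @ W @ z) / (z @ z)

# Binary contiguity on a 10-node line graph
W = np.diag(np.ones(9), 1) + np.diag(np.ones(9), -1)

# A smooth gradient is positively autocorrelated; an alternating
# high/low sequence is negatively autocorrelated (checkerboard)
smooth = np.arange(10, dtype=float)
noisy = np.array([0, 9, 1, 8, 2, 7, 3, 6, 4, 5], dtype=float)
print(morans_i(smooth, W))  # strongly positive
print(morans_i(noisy, W))   # strongly negative
```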
There are definitely spatial patterns, but they’re also not uniform. There’s probably autocorrelation in the sales prices, but it seems to be particularly strong in some pockets. We should be sure to check both global and local statistics.
35.2.1 Global Stats
Code
# prices are probably influenced by a greater 'neighborhood' than just the
# adjacent units, so let's use distance band weights (3/4 km; ~10 min walk)
W = lps.weights.DistanceBand.from_dataframe(balt_sales, threshold=750)
W.transform = 'r'
/Users/knaaptime/mambaforge/envs/urban_analysis/lib/python3.10/site-packages/libpysal/weights/weights.py:172: UserWarning: The weights matrix is not fully connected:
There are 5 disconnected components.
warnings.warn(message)
Code
# but we can also look at hyperlocal trends
w_knn = lps.weights.KNN.from_dataframe(balt_sales, k=10)
w_knn.transform = 'r'
/Users/knaaptime/mambaforge/envs/urban_analysis/lib/python3.10/site-packages/libpysal/weights/weights.py:172: UserWarning: The weights matrix is not fully connected:
There are 10 disconnected components.
warnings.warn(message)
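The “not fully connected” warnings above mean the graph splits into islands of observations that never reach one another, which matters for several estimators. If you want to inspect this yourself, the component structure of any adjacency matrix can be checked with scipy; a sketch on synthetic coordinates (the two-cluster layout is contrived to trigger the issue):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(0)
# Two synthetic clusters of points farther apart than the threshold
pts = np.vstack([rng.uniform(0, 500, (20, 2)),
                 rng.uniform(2000, 2500, (20, 2))])

# Distance-band adjacency: neighbors within 750 map units
d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
A = (d <= 750) & (d > 0)

# Count the islands in the resulting graph
n_comp, labels = connected_components(csr_matrix(A), directed=False)
print(n_comp)  # 2: the two clusters never connect
```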
Taking different modeling strategies, we can vary the way we operationalize \(p(h)\) to best identify the coefficients of interest, including the potential for spatial spillover (e.g. the price of one unit affecting the prices of those nearby).
Code
import statsmodels.formula.api as smf
Code
# Treat the structure grade variable like it's continuous. whatever.
balt_sales.STRUGRAD = balt_sales.STRUGRAD.astype(int)
Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 8.94e+09. This might indicate that there are strong multicollinearity or other numerical problems.
The residuals are pretty heteroskedastic (and the Jarque-Bera test is highly significant in the statsmodels results table above), so we will need to handle that if we plan to make valid inferences.
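One standard remedy is heteroskedasticity-consistent (White-type) standard errors, available in statsmodels via `fit(cov_type="HC1")`. The sandwich estimator behind that option is short enough to sketch directly with synthetic data (HC1 is the White estimator with a degrees-of-freedom correction; everything below is illustrative, not the chapter's model):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
x = rng.uniform(0, 1, n)
X = np.column_stack([np.ones(n), x])
# Heteroskedastic errors: the variance grows with x
y = 1.0 + 2.0 * x + rng.normal(0, 0.1 + x, n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta

XtX_inv = np.linalg.inv(X.T @ X)
# Classical covariance assumes a constant error variance
cov_ols = (e @ e) / (n - 2) * XtX_inv
# White sandwich: (X'X)^{-1} X' diag(e^2) X (X'X)^{-1}, HC1 dof scaling
meat = X.T @ (X * e[:, None] ** 2)
cov_hc1 = n / (n - 2) * XtX_inv @ meat @ XtX_inv

print(np.sqrt(np.diag(cov_ols)))  # classical standard errors
print(np.sqrt(np.diag(cov_hc1)))  # robust standard errors
```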
There’s still significant autocorrelation in the residual, so we’re either failing to account for a spatial spillover process (implying a SAR model), or omitting crucial autocorrelated explanatory variables (implying an SEM), or both (implying a SARAR model).
The autocorrelation is even stronger at smaller scales (Tobler’s First Law, and all).
We could fit the same model with spreg.
We can use patsy to generate the X and y inputs spreg requires, using the same formula as before.
Code
from patsy import dmatrices
Code
# spreg will add an intercept so redefine the formula without one
#formula_spreg = "log_price ~ C(DESCBLDG) + C(RESITYP) + STRUGRAD + SQFTSTRC + LANDAREA + YEARBLT + I(YEARBLT**2) + p_nonhisp_black_persons + p_hispanic_persons + p_edu_college_greater + median_household_income + p_housing_units_multiunit_structures + p_vacant_housing_units"
Code
y, X = dmatrices(formula, balt_sales)
Code
# need to drop the intercept. Note you can't just define the formula without one
# because then patsy won't drop a level from the first categorical variable,
# so you need this fix anyway

# slice off the intercept
name_x = X.design_info.column_names[1:]
name_y = y.design_info.column_names[0]
y = np.asarray(y).flatten()
X = np.asarray(X)[:, 1:]
REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set : unknown
Weights matrix : unknown
Dependent Variable : log_price Number of Observations: 5284
Mean dependent var : 12.1935 Number of Variables : 16
S.D. dependent var : 0.5906 Degrees of Freedom : 5268
R-squared : 0.7195
Adjusted R-squared : 0.7187
Sum squared residual: 517.042 F-statistic : 900.6752
Sigma-square        :       0.098                Prob(F-statistic)     :           0
S.E. of regression : 0.313 Log likelihood : -1356.831
Sigma-square ML : 0.098 Akaike info criterion : 2745.661
S.E of regression ML: 0.3128 Schwarz criterion : 2850.820
------------------------------------------------------------------------------------
Variable Coefficient Std.Error t-Statistic Probability
------------------------------------------------------------------------------------
CONSTANT                               76.8635679      10.3521089       7.4249188       0.0000000
C(RESITYP)[T.TH]                       -0.2046220       0.0135583     -15.0920117       0.0000000
C(DESCCNST)[T.CNST Frame]              -0.0117367       0.0123539      -0.9500417       0.3421347
C(DESCCNST)[T.CNST Shingle Asbestos]    0.0285631       0.0272525       1.0480936       0.2946436
C(DESCCNST)[T.CNST Shingle Wood]       -0.0418795       0.0337687      -1.2401878       0.2149612
C(DESCCNST)[T.CNST Siding]              0.0373530       0.0130997       2.8514383       0.0043691
C(DESCCNST)[T.CNST Stucco]             -0.0445163       0.0257458      -1.7290690       0.0838553
STRUGRAD                                0.1719022       0.0055669      30.8794260       0.0000000
np.log(SQFTSTRC)                        0.5049943       0.0150705      33.5088712       0.0000000
YEARBLT                                -0.0747352       0.0106498      -7.0175411       0.0000000
I(YEARBLT ** 2)                         0.0000196       0.0000027       7.1389532       0.0000000
p_nonhisp_white_persons                 0.0014392       0.0002365       6.0855999       0.0000000
p_edu_college_greater                   0.0052480       0.0003877      13.5378368       0.0000000
median_household_income                 0.1500605       0.0170682       8.7918043       0.0000000
p_housing_units_multiunit_structures    0.0019057       0.0002444       7.7963177       0.0000000
p_owner_occupied_units                  0.0046954       0.0004970       9.4466947       0.0000000
------------------------------------------------------------------------------------
REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER 18745.447
TEST ON NORMALITY OF ERRORS
TEST DF VALUE PROB
Jarque-Bera                2        1236.193        0.0000
DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST DF VALUE PROB
Breusch-Pagan test        15         553.881        0.0000
Koenker-Bassett test      15         339.085        0.0000
================================ END OF REPORT =====================================
Most of the variables are significant, except for the construction materials, among which only siding has a significant effect. Otherwise, the coefficients are as expected: housing prices increase with larger and newer homes (though historic homes also get a bump), townhomes are less expensive than single-family homes, and there are some small effects from the neighborhood’s demographic composition, the one exception being median income, whose effect is relatively large.
Code
np.set_printoptions(suppress=True) #prevent scientific format
The SEM specification removes all spatial autocorrelation from the residual, as expected. But it gives us no sense of how prices spill over from one unit to another. To estimate spillover effects, we need a spatial autoregressive (SAR) model.
Again, the spatially-filtered residuals take care of the unobserved spatial effects, but they also mask out any substantively interesting spillover effects.
The graph doesn’t look crazy, but compared to the distribution, it’s clear there’s still significant autocorrelation in the residual. To keep the spillover term and also account for autocorrelation in the residual, we need to use a ‘spatial combo’ model.
35.3.2.3 Spatial Combo
The spatial autoregressive combination (SAC) or “spatial combo” model includes both a lagged dependent variable to capture endogenous spillover effects and a spatial error term that can capture any residual autocorrelation from omitted variables.
\[y = \rho W y + \beta X + u\]
\[u = \lambda W u + \epsilon\]
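The two equations combine into the reduced form \(y = (I - \rho W)^{-1}\left(\beta X + (I - \lambda W)^{-1}\epsilon\right)\), which makes the data-generating process easy to simulate. A sketch of the SAC DGP on a toy ring graph (all parameter values and data below are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Row-standardized ring graph as a simple W (two neighbors per unit)
W = np.zeros((n, n))
idx = np.arange(n)
W[idx, (idx + 1) % n] = 0.5
W[idx, (idx - 1) % n] = 0.5

rho, lam, beta = 0.5, 0.4, 2.0
I = np.eye(n)
X = rng.normal(size=(n, 1))
eps = rng.normal(size=n)

# u = (I - lambda*W)^{-1} eps : autocorrelated disturbance
u = np.linalg.solve(I - lam * W, eps)
# y = (I - rho*W)^{-1} (X beta + u) : endogenous spillover on top
y = np.linalg.solve(I - rho * W, X[:, 0] * beta + u)

# Sanity check: y satisfies the structural equation y = rho*W y + beta*X + u
print(np.allclose(y - rho * W @ y, X[:, 0] * beta + u))
```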
The residuals still aren’t perfectly random, and we can’t get rid of that skew, but these are real data, and they’re imperfect. It is clear that we get closer to normal residuals once we account for autocorrelation
Interesting… if we fit a model with heteroskedastic errors, we can reintroduce some significant spatial dependence in the residuals, even if it’s small. In this case, the difference between the models with and without heteroskedasticity is negligible; their results are basically identical.
Combining the coefficients from the different models into a single table lets us compare how the estimated coefficients change under each specification.
Comparing the estimated coefficients shows that median income was doing a lot of heavy lifting in the OLS model. That is, when we ignore spatial effects, the median income variable absorbs a lot of the spatial variance, inflating its coefficient and biasing our interpretation of how much residents value higher-income neighbors. In the spatial models, the estimated coefficient for income is less than half as large. That’s basically true of all the other neighborhood characteristics as well.
There are a few things going on here. First, it’s important to remember that our data are measured at two different spatial levels. Our housing sales are measured at the unit level, but the ‘neighborhood’ variables are measured at the census blockgroup level. Because we have several sales nested inside each census unit, we’re building spatial autocorrelation into the model by design. That is, because we don’t know the income, education, race, etc. of the home’s previous owner, the neighborhood variables are essentially \(Wx\) instead of just \(x\), where \(W\) is defined by discrete membership inside a census unit. That implies we’re actually fitting something closer to a spatial Durbin model (SDM), since our preferred models have spatially lagged \(Y\) variables, and some of the \(X\) variables are “lagged” themselves. [note: I think I would say it’s not technically Durbin because we don’t have \(x\) and \(Wx\) for all inputs, but a mix of the two… I know you don’t have to lag all the \(X\) in a Durbin specification, but I don’t think I’ve seen one where you have only \(Wx\) for some \(x\) (as opposed to only \(x\), not \(Wx\))]
Accounting for space, there are two things going on. In the SAR and spatial combo models, the autoregressive term \(\rho\) is positive, significant, and pretty large in magnitude, showing pretty strong evidence of a spillover effect in housing prices. But the combo models also have a large and significant spatial error term \(\lambda\), so we also have a big unmeasured spatial effect. There are several important variables we might be missing, like perceptions of crime, school quality, and access to amenities. Indeed, according to neoclassical urban theory, we have not included the determining factor in land rents: access to the CBD (or jobs) (Alonso; Muth; Mills).
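With a significant \(\rho\), the \(\beta\) coefficients are no longer marginal effects: a change in \(x_k\) in one unit ripples through \((I - \rho W)^{-1}\). LeSage & Pace (2014) summarize the resulting matrix of partial derivatives as average direct, indirect, and total effects, which are simple to compute once \(\rho\) and \(\beta_k\) are estimated. A sketch with made-up estimates on a toy graph (the values of `rho` and `beta_k` are hypothetical, not from the models above):

```python
import numpy as np

# Hypothetical estimates from a SAR-type model
rho, beta_k = 0.5, 0.15   # autoregressive parameter, coefficient on x_k

n = 50
W = np.zeros((n, n))
idx = np.arange(n)
W[idx, (idx + 1) % n] = 0.5   # row-standardized ring graph
W[idx, (idx - 1) % n] = 0.5

# Matrix of partial derivatives dy/dx_k' = (I - rho*W)^{-1} * beta_k
S = np.linalg.inv(np.eye(n) - rho * W) * beta_k

ade = np.diag(S).mean()   # average direct effect (includes feedback, so > beta_k)
ate = S.sum() / n         # average total effect (= beta_k / (1 - rho) here)
aie = ate - ade           # average indirect (spillover) effect
print(ade, aie, ate)
```

With a row-standardized \(W\), the average total effect collapses to \(\beta_k / (1 - \rho)\), which is why a large \(\rho\) implies total effects much bigger than the raw coefficient.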
35.4 Estimation Techniques
Classically, spatial econometric models are estimated using Maximum Likelihood (ML), but modern approaches prefer the Generalized Method of Moments (GMM). GMM estimation is much faster and applicable to much larger datasets, but it also depends on instrumental variables. You can provide these instruments yourself, or you can use spatial lags of the \(X\) variables, which is the standard technique. This is important because when estimating a spatial Durbin model using GMM, you will need either to use higher-order neighbors as instruments, or to provide your own.
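For intuition, the default Kelejian–Prucha-style instrument set for \(Wy\) consists of spatial lags of the exogenous regressors. A sketch of building such an instrument matrix by hand (toy data; spreg constructs the equivalent internally, with the lag order controlled by arguments like `w_lags`):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 20, 2
X = rng.normal(size=(n, k))   # exogenous regressors (synthetic)

# Row-standardized ring graph standing in for the weights matrix
W = np.zeros((n, n))
idx = np.arange(n)
W[idx, (idx + 1) % n] = 0.5
W[idx, (idx - 1) % n] = 0.5

# Instruments for the endogenous Wy term: [X, WX, W^2 X]
Q = np.hstack([X, W @ X, W @ W @ X])
print(Q.shape)  # (20, 6)
```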
Anselin, L., & Rey, S. J. (2014). Modern spatial econometrics in practice: A guide to GeoDa, GeoDaSpace and PySAL. GeoDa Press.
Bardaka, E., Delgado, M. S., & Florax, R. J. G. M. (2018). Causal identification of transit-induced gentrification and spatial spillover effects: The case of the Denver light rail. Journal of Transport Geography, 71(June), 15–31. https://doi.org/10.1016/j.jtrangeo.2018.06.025
Delgado, M. S., & Florax, R. J. G. M. (2015). Difference-in-differences techniques for spatial data: Local autocorrelation and spatial interaction dynamics. Economics Letters, 137, 123–126.
Halleck Vega, S., & Elhorst, J. P. (2015). The SLX model. Journal of Regional Science, 55(3), 339–363. https://doi.org/10.1111/jors.12188
LeSage, J. P. (2014). What Regional Scientists Need to Know About Spatial Econometrics. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.2420725
LeSage, J. P., & Pace, R. K. (2014). Interpreting Spatial Econometric Models. In Handbook of Regional Science (pp. 1535–1552). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-23430-9_91