Cross-Sectional Model Comparison: bayespecon vs spreg

Columbus example

This notebook compares equivalent cross-sectional models between bayespecon and spreg on a real example dataset from the PySAL/spreg ecosystem.

The dependent variable is HOVAL, with INC and CRIME used as explanatory variables.

Model mapping used:

  • SLX (bayespecon) vs spreg.OLS(slx_lags=1)

  • SAR (bayespecon) vs spreg.GM_Lag

  • SEM (bayespecon) vs spreg.GM_Error

  • SDM (bayespecon) vs spreg.GM_Lag(slx_lags=1, w_lags=2)

  • SDEM (bayespecon) vs spreg.GM_Error(slx_lags=1)

Dataset: Columbus neighborhood data from libpysal.examples.

import geopandas as gpd
import libpysal
import numpy as np
import pandas as pd
import spreg
from IPython.display import display
from spreg.sputils import _sp_effects, spmultiplier

from bayespecon import SAR, SDEM, SDM, SEM, SLX
columbus = gpd.read_file(libpysal.examples.get_path("columbus.shp"))
columbus.plot("HOVAL", scheme="quantiles")
<Axes: >
../_images/2ccceea91c4ddf67d5952e0d098c8966a823a3298922a44173c79638bea160cf.png
w = libpysal.graph.Graph.build_contiguity(columbus).transform("r")
y = columbus.HOVAL.values
X = columbus[["INC", "CRIME"]]

spreg_results = {
    "SLX": spreg.OLS(y, X, w=w, slx_lags=1, name_y="HOVAL", name_x=["INC", "CRIME"]),
    "SAR": spreg.ML_Lag(y, X, w=w, name_y="HOVAL", name_x=["INC", "CRIME"]),
    "SEM": spreg.ML_Error(y, X, w=w, name_y="HOVAL", name_x=["INC", "CRIME"]),
    "SDM": spreg.ML_Lag(
        y, X, w=w, slx_lags=1, w_lags=2, name_y="HOVAL", name_x=["INC", "CRIME"]
    ),
    "SDEM": spreg.GM_Error(
        y, X, w=w, slx_lags=1, name_y="HOVAL", name_x=["INC", "CRIME"]
    ),
}

spec = "HOVAL ~  1 + INC + CRIME"
sample_kwargs = dict(draws=1000, tune=1000, chains=4, random_seed=42, progressbar=False)
bayes_models = {
    "SLX": SLX(spec, data=columbus, W=w),
    "SAR": SAR(spec, data=columbus, W=w),
    "SEM": SEM(spec, data=columbus, W=w),
    "SDM": SDM(spec, data=columbus, W=w),
    "SDEM": SDEM(spec, data=columbus, W=w),
}
bayes_idata = {name: model.fit(**sample_kwargs) for name, model in bayes_models.items()}
ML_Lag
ML_Error
ML_Lag
GM_Error
/home/runner/micromamba/envs/test/lib/python3.14/site-packages/spreg/ml_error.py:184: RuntimeWarning: Method 'bounded' does not support relative tolerance in x; defaulting to absolute tolerance.
  res = minimize_scalar(
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 2 jobs)
NUTS: [beta, sigma2]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 16 seconds.
bayes_models["SAR"].summary()
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
rho 0.216 0.169 -0.109 0.517 0.003 0.003 3550.0 2530.0 1.0
sigma 14.997 1.509 12.341 17.907 0.026 0.020 3502.0 3543.0 1.0
sigma2 227.184 46.626 150.051 317.653 0.802 0.716 3502.0 3543.0 1.0
Intercept 38.030 14.114 11.366 64.094 0.225 0.167 3935.0 3973.0 1.0
INC 0.554 0.510 -0.351 1.565 0.008 0.006 3964.0 3770.0 1.0
CRIME -0.453 0.178 -0.793 -0.125 0.003 0.002 3769.0 4058.0 1.0

Sampler adequacy for \(\rho\)

Before trusting the SAR posterior summary above, verify that the spatial parameter \(\rho\) has been sampled efficiently. Wolf, Anselin & Arribas-Bel (2018) [Wolf et al., 2018] show that \(\rho\) tends to mix slowly, so a chain that converges in mean can still under-cover the tails by 10–12 %.

from bayespecon import spatial_mcmc_diagnostic

spatial_mcmc_diagnostic(bayes_models["SAR"], emit_warnings=False).to_frame()
ess_bulk ess_tail r_hat mcse_mean yield_pct hpdi_drift_pct adequate
parameter
rho 3550.256335 2530.415191 1.000928 0.002821 88.756408 5.049753 False
print(spreg_results["SAR"].summary)
REGRESSION RESULTS
------------------

SUMMARY OF OUTPUT: MAXIMUM LIKELIHOOD SPATIAL LAG (METHOD = FULL)
------------------------------------------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :       HOVAL                Number of Observations:          49
Mean dependent var  :     38.4362                Number of Variables   :           4
S.D. dependent var  :     18.4661                Degrees of Freedom    :          45
Pseudo R-squared    :      0.3826
Spatial Pseudo R-squared:  0.3271
Log likelihood      :   -200.4394
Sigma-square ML     :    206.3220                Akaike info criterion :     408.879
S.E of regression   :     14.3639                Schwarz criterion     :     416.446

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT        37.26813        13.94565         2.67238         0.00753
                 INC         0.56316         0.51316         1.09742         0.27246
               CRIME        -0.45102         0.17770        -2.53815         0.01114
             W_HOVAL         0.23063         0.15496         1.48836         0.13666
------------------------------------------------------------------------------------

SPATIAL LAG MODEL IMPACTS
Impacts computed using the 'simple' method.
            Variable         Direct        Indirect          Total
                 INC         0.5632          0.1688          0.7320
               CRIME        -0.4510         -0.1352         -0.5862
================================ END OF REPORT =====================================
def extract_spreg_params(model):
    names = list(getattr(model, "name_x", []))
    betas = np.asarray(model.betas).flatten()
    if len(betas) == len(names) + 1 and type(model).__name__ == "GM_Lag":
        names = names + ["rho"]
    if len(names) != len(betas):
        names = [f"param_{i}" for i in range(len(betas))]
    return pd.Series(betas, index=names, dtype=float)


def extract_bayespecon_params(model_name, model, idata):
    out = {}
    beta_mean = idata.posterior["beta"].mean(("chain", "draw")).values
    beta_names = (
        model._beta_names()
        if model_name in {"SLX", "SDM", "SDEM"}
        else list(model._feature_names)
    )
    for name, val in zip(beta_names, beta_mean):
        out[name] = float(val)
    if model_name in {"SAR", "SDM"}:
        out["rho"] = float(idata.posterior["rho"].mean())
    if model_name in {"SEM", "SDEM"}:
        out["lambda"] = float(idata.posterior["lam"].mean())
    harmonized = {}
    for k, v in out.items():
        key = k.replace("W*", "W_")
        if key.lower() == "intercept":
            key = "CONSTANT"
        harmonized[key] = v
    return pd.Series(harmonized, dtype=float)


def bayes_effects_df(model):
    eff = model.spatial_effects()
    # spatial_effects() returns a DataFrame indexed by feature names
    return pd.DataFrame(
        {
            "feature": eff.index.tolist(),
            "direct_bayespecon": eff["direct"].values,
            "indirect_bayespecon": eff["indirect"].values,
            "total_bayespecon": eff["total"].values,
        }
    )


def spreg_effects_df(model_name, sp_model):
    if model_name == "SEM":
        # SEM has no spatial multiplier on the systematic component,
        # so direct = coefficient, indirect = 0, total = coefficient.
        # bayespecon excludes the intercept from effects, so we
        # only compare non-constant covariates here.
        sp_params = extract_spreg_params(sp_model)
        features = [f for f in ["INC", "CRIME"] if f in sp_params.index]
        return pd.DataFrame(
            {
                "feature": features,
                "direct_spreg": [float(sp_params[f]) for f in features],
                "indirect_spreg": [0.0 for _ in features],
                "total_spreg": [float(sp_params[f]) for f in features],
            }
        )

    output = sp_model.output.copy()
    output["regime"] = "global"

    def classify_var(var_name):
        if var_name == "CONSTANT":
            return "o"
        if var_name in {"lambda", "W_HOVAL"}:
            return "rho"
        if var_name.startswith("W_"):
            return "wx"
        return "x"

    output["var_type"] = output["var_names"].map(classify_var)
    sp_model.output = output

    vars_x = output.query("var_type == 'x'")
    rho = (
        float(np.squeeze(getattr(sp_model, "rho", 0.0)))
        if hasattr(sp_model, "rho")
        else 0.0
    )
    multipliers = spmultiplier(w, rho)
    slx_lags = getattr(sp_model, "slx_lags", 0) or (
        1 if model_name in {"SLX", "SDEM"} else 0
    )
    slx_vars = getattr(sp_model, "slx_vars", None)
    if slx_vars is None:
        slx_vars = "All"

    total_spreg, direct_spreg, indirect_spreg = _sp_effects(
        sp_model,
        vars_x,
        multipliers,
        slx_lags=slx_lags,
        slx_vars=slx_vars,
    )

    return pd.DataFrame(
        {
            "feature": vars_x["var_names"].tolist(),
            "direct_spreg": np.asarray(direct_spreg).flatten(),
            "indirect_spreg": np.asarray(indirect_spreg).flatten(),
            "total_spreg": np.asarray(total_spreg).flatten(),
        }
    )


coef_rows = []
effect_tables = {}

for model_name in ["SLX", "SAR", "SEM", "SDM", "SDEM"]:
    sp = extract_spreg_params(spreg_results[model_name])
    by = extract_bayespecon_params(
        model_name, bayes_models[model_name], bayes_idata[model_name]
    )

    common = sorted(set(sp.index).intersection(set(by.index)))
    for param in common:
        coef_rows.append(
            {
                "model": model_name,
                "parameter": param,
                "bayespecon_posterior_mean": by[param],
                "spreg_estimate": sp[param],
                "difference": by[param] - sp[param],
            }
        )

    beff = bayes_effects_df(bayes_models[model_name]).copy()
    seff = spreg_effects_df(model_name, spreg_results[model_name]).copy()
    merged = beff.merge(seff, on="feature", how="inner")
    if not merged.empty:
        merged["direct_difference"] = (
            merged["direct_bayespecon"] - merged["direct_spreg"]
        )
        merged["indirect_difference"] = (
            merged["indirect_bayespecon"] - merged["indirect_spreg"]
        )
        merged["total_difference"] = merged["total_bayespecon"] - merged["total_spreg"]
        merged.insert(0, "model", model_name)
    effect_tables[model_name] = merged

comparison = (
    pd.DataFrame(coef_rows).sort_values(["model", "parameter"]).reset_index(drop=True)
)
spatial_effects_comparison = pd.concat(
    [tbl for tbl in effect_tables.values() if not tbl.empty],
    ignore_index=True,
)

Coefficient comparison

display(comparison)
model parameter bayespecon_posterior_mean spreg_estimate difference
0 SAR CONSTANT 38.029925 37.268135 0.761791
1 SAR CRIME -0.452607 -0.451020 -0.001588
2 SAR INC 0.553765 0.563156 -0.009391
3 SDEM CONSTANT 31.767462 29.460538 2.306923
4 SDEM CRIME -0.583475 -0.575430 -0.008044
5 SDEM INC 0.823601 0.820920 0.002682
6 SDEM W_CRIME 0.250069 0.300693 -0.050624
7 SDEM W_INC 0.468414 0.478086 -0.009672
8 SDEM lambda 0.435167 0.358330 0.076837
9 SDM CONSTANT 23.536444 17.045361 6.491083
10 SDM CRIME -0.604761 -0.608430 0.003669
11 SDM INC 0.759630 0.803686 -0.044056
12 SDM W_CRIME 0.396576 0.475451 -0.078876
13 SDM W_INC -0.072949 0.042213 -0.115162
14 SEM CONSTANT 47.545280 48.008252 -0.462971
15 SEM CRIME -0.556544 -0.559458 0.002914
16 SEM INC 0.744724 0.711532 0.033192
17 SEM lambda 0.422006 0.390499 0.031508
18 SLX CONSTANT 31.174816 27.727885 3.446932
19 SLX CRIME -0.586536 -0.590675 0.004139
20 SLX INC 0.712946 0.742442 -0.029496
21 SLX W_CRIME 0.342662 0.390156 -0.047494
22 SLX W_INC 0.385933 0.486179 -0.100246

Direct/Indirect/Total effects comparison

display(spatial_effects_comparison)
model feature direct_bayespecon indirect_bayespecon total_bayespecon direct_spreg indirect_spreg total_spreg direct_difference indirect_difference total_difference
0 SLX INC 0.712946 0.385933 1.098879 0.742442 0.486179 1.228621 -0.029496 -0.100246 -0.129742
1 SLX CRIME -0.586536 0.342662 -0.243874 -0.590675 0.390156 -0.200519 0.004139 -0.047494 -0.043355
2 SAR INC 0.565918 0.171577 0.737495 0.563156 0.168814 0.731970 0.002762 0.002763 0.005525
3 SAR CRIME -0.463210 -0.147184 -0.610394 -0.451020 -0.135200 -0.586219 -0.012190 -0.011985 -0.024175
4 SEM INC 0.744724 0.000000 0.744724 0.711532 0.000000 0.711532 0.033192 0.000000 0.033192
5 SEM CRIME -0.556544 0.000000 -0.556544 -0.559458 0.000000 -0.559458 0.002914 0.000000 0.002914
6 SDM INC 0.787065 0.314866 1.101930 0.803686 0.523510 1.327196 -0.016621 -0.208645 -0.225266
7 SDM CRIME -0.591440 0.257530 -0.333910 -0.608430 0.399790 -0.208640 0.016990 -0.142260 -0.125270
8 SDEM INC 0.823601 0.468414 1.292015 0.820920 0.478086 1.299005 0.002682 -0.009672 -0.006990
9 SDEM CRIME -0.583475 0.250069 -0.333406 -0.575430 0.300693 -0.274737 -0.008044 -0.050624 -0.058669

bayespecon uses Bayesian inference while spreg reports frequentist point estimates, so exact equality is not expected. The coefficient and effects tables are intended to verify directional and numerical agreement under matched model formulas with HOVAL as the dependent variable.

For effects, this notebook now uses spreg’s own spatial-impact utilities: spreg.sputils.spmultiplier and spreg.sputils._sp_effects. SEM is the one exception, since it has no spatial multiplier on the systematic component and therefore uses direct = coefficient, indirect = 0, total = coefficient.

Note on intercept exclusion: bayespecon excludes the intercept from spatial effects for all model types, since the intercept has no meaningful spatial impact interpretation. This is consistent with spreg’s _sp_effects, which also excludes the constant (CONSTANT) from its impact calculations.

Example: Synthetic Data on a 30x30 Grid

This example demonstrates how to generate synthetic spatial data on a 30x30 regular grid, construct a spatial weights matrix, and fit a cross-sectional spatial regression model using the bayespecon package.

from bayespecon import SAR, dgp

# Set random seed for reproducibility
np.random.seed(42)

# True parameters
beta_true = np.array([1.0, -1.0])
rho_true = 0.5
sigma_true = 1.0

gdf = dgp.simulate_sar(
    n=40,
    beta=beta_true,
    rho=rho_true,
    sigma=sigma_true,
    create_gdf=True,
    contiguity="queen",
)

w = libpysal.graph.Graph.build_contiguity(gdf, rook=False).transform("r")

# Fit SAR model
model = SAR("y ~ X_1", W=w, data=gdf)
res = model.fit()

model.summary()
/home/runner/micromamba/envs/test/lib/python3.14/site-packages/rich/live.py:260: UserWarning: install "ipywidgets" 
for Jupyter support
  warnings.warn('install "ipywidgets" for Jupyter support')

Sampling 4 chains for 1000 tune and 2000 draw iterations, 4 x 3,000 draws total took 2s (7225 draws/s)
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
rho 0.490 0.030 0.433 0.546 0.000 0.000 8409.0 6102.0 1.0
sigma 1.077 0.019 1.039 1.111 0.000 0.000 7994.0 7473.0 1.0
sigma2 1.159 0.041 1.080 1.234 0.000 0.000 7994.0 7473.0 1.0
Intercept 1.004 0.064 0.884 1.125 0.001 0.001 8377.0 6549.0 1.0
X_1 -0.987 0.027 -1.039 -0.935 0.000 0.000 8005.0 8085.0 1.0
model.spatial_effects()
direct direct_ci_lower direct_ci_upper direct_pvalue indirect indirect_ci_lower indirect_ci_upper indirect_pvalue total total_ci_lower total_ci_upper total_pvalue
variable
X_1 -1.029497 -1.087445 -0.971668 0.0 -0.912236 -1.154061 -0.710385 0.0 -1.941734 -2.216333 -1.707385 0.0
sim_spreg = spreg.GM_Lag(
    gdf["y"],
    gdf["X_1"],
    w=w,
)
GM_Lag
print(sim_spreg.summary)
REGRESSION RESULTS
------------------

SUMMARY OF OUTPUT: SPATIAL TWO STAGE LEAST SQUARES
------------------------------------------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :           y                Number of Observations:        1600
Mean dependent var  :      1.9266                Number of Variables   :           3
S.D. dependent var  :      1.5490                Degrees of Freedom    :        1597
Pseudo R-squared    :      0.5152
Spatial Pseudo R-squared:  0.4441

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT         1.07676         0.10942         9.84083         0.00000
                 X_1        -0.99025         0.02809       -35.25234         0.00000
                 W_y         0.45223         0.05498         8.22523         0.00000
------------------------------------------------------------------------------------
Instrumented: W_y
Instruments: W_X_1

DIAGNOSTICS FOR SPATIAL DEPENDENCE
TEST                              DF         VALUE           PROB
Anselin-Kelejian Test             1          0.590           0.4426

SPATIAL LAG MODEL IMPACTS
Impacts computed using the 'simple' method.
            Variable         Direct        Indirect          Total
                 X_1        -0.9902         -0.8175         -1.8078
================================ END OF REPORT =====================================