JGSY2024 Replication with bayespecon

This notebook replicates the core panel-SDM analysis in Elhorst, J. P., Tziolas, I., Tan, C., & Milionis, P. (2024) using the replication files from spatial-panels. It re-estimates the models with bayespecon and provides a detailed side-by-side comparison against the published MATLAB outputs (Table 1, columns [1] and [2] from JGSY2024_MainResults.m).

Replication targets in this notebook:

  • Spec A: 6-nearest-neighbor row-normalized weights

  • Spec B: exponential-distance row-normalized weights with fixed decay exp(-distance/100)

Important scope notes:

  • JGSY2024 original estimation uses MATLAB routines (sar_panelE_FE) with classical asymptotics.

  • This notebook uses Bayesian estimation (bayespecon.SDMPanelFE) with posterior means for comparison.

  • We mirror two-way FE behavior via model=3 (unit and time demeaning), which is equivalent to including unit FE and time dummies in balanced panels.

  • Parameterized-weight robustness specifications from parameterized_W are not estimated here because those routines are not currently part of bayespecon.
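The two-way FE equivalence noted above can be checked directly: in a balanced panel, two-way demeaning reproduces the residuals from regressing on unit and time dummies. A minimal numpy sketch with hypothetical small dimensions (not the JGSY2024 panel):

```python
import numpy as np

rng = np.random.default_rng(0)
N_demo, T_demo = 4, 3  # toy balanced panel, stacked as T blocks of N like the notebook
y_demo = rng.normal(size=N_demo * T_demo)

Y = y_demo.reshape(T_demo, N_demo)  # rows = periods, columns = units

# Two-way demeaning: subtract unit means and time means, add back the grand mean.
Y_dm = Y - Y.mean(axis=0) - Y.mean(axis=1, keepdims=True) + Y.mean()

# Equivalent dummy-variable residuals: project y on unit and time dummies.
D_unit = np.kron(np.ones((T_demo, 1)), np.eye(N_demo))
D_time = np.kron(np.eye(T_demo), np.ones((N_demo, 1)))
Dmat = np.column_stack([D_unit, D_time])  # rank-deficient; lstsq handles it
resid = y_demo - Dmat @ np.linalg.lstsq(Dmat, y_demo, rcond=None)[0]

print(np.allclose(Y_dm.ravel(), resid))  # True in a balanced panel
```

With an unbalanced panel this shortcut breaks, which is why the note above is restricted to balanced panels.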

import arviz as az
import numpy as np
import pandas as pd
from IPython.display import display
from libpysal.graph import Graph

from bayespecon import SDMPanelFE

az.style.use("arviz-white")
pd.options.display.float_format = lambda x: f"{x:,.6f}"
# Paths and constants from JGSY2024
static_data_dir = "../_static/jgsy2024"

N = 266
T = 17
NT = N * T

# Coerce mixed-type spreadsheets to numeric and drop label rows/cols.
A_df = pd.read_excel(f"{static_data_dir}/Data.xlsx", header=None)
A_num = (
    A_df.apply(pd.to_numeric, errors="coerce")
    .dropna(axis=0, how="all")
    .dropna(axis=1, how="all")
)
A = A_num.to_numpy(dtype=float)

D_df = pd.read_excel(f"{static_data_dir}/distance.xlsx", header=None)
D_num = (
    D_df.apply(pd.to_numeric, errors="coerce")
    .dropna(axis=0, how="all")
    .dropna(axis=1, how="all")
)
D = D_num.to_numpy(dtype=float)

print(f"Loaded A shape: {A.shape}")
print(f"Loaded distance shape: {D.shape}")
Loaded A shape: (5054, 10)
Loaded distance shape: (266, 266)
# Reproduce the data construction from JGSY2024_MainResults.m.
# A stacks (T + 2) * N = 19 * 266 rows (two pre-sample cross-sections feed the
# lags), so offsets of N and 2*N shift the sample back by one and two periods.
# MATLAB indices are 1-based; Python is 0-based.
A_work = A.copy()

y = np.zeros(NT)
ylag = np.zeros(NT)
for t in range(T):
    t1 = t * N
    t2 = (t + 1) * N
    y[t1:t2] = A_work[t1 + 2 * N : t2 + 2 * N, 3] - A_work[t1 + N : t2 + N, 3]
    ylag[t1:t2] = A_work[t1 + N : t2 + N, 3] - A_work[t1:t2, 3]

lnpopg = A_work[:, 7].copy()
popg = np.exp(lnpopg) - 1.0
popg[popg <= 0] = 0.0001
A_work[:, 7] = np.log(popg)

x_core = np.column_stack(
    [
        A_work[2 * N : 2 * N + NT, 4],
        A_work[2 * N : 2 * N + NT, 7],
        A_work[N : N + NT, 3],
        ylag,
    ]
)

X_core = pd.DataFrame(x_core, columns=["inv", "pop", "initial_gdp", "lagged_growth"])

print("Correlation matrix of core regressors (matches script intent):")
display(X_core.corr())
Correlation matrix of core regressors (matches script intent):
                    inv       pop  initial_gdp  lagged_growth
inv            1.000000  0.085062    -0.060977       0.137647
pop            0.085062  1.000000     0.300365      -0.083651
initial_gdp   -0.060977  0.300365     1.000000      -0.164301
lagged_growth  0.137647 -0.083651    -0.164301       1.000000
def graph_from_dense(W_np: np.ndarray) -> Graph:
    rows, cols = np.where(W_np > 0)
    vals = W_np[rows, cols].astype(float)
    return Graph.from_arrays(rows, cols, vals)


def build_knn6_weights(distance_matrix: np.ndarray, m: int = 6) -> np.ndarray:
    Dm = distance_matrix.copy().astype(float)
    np.fill_diagonal(Dm, 0.0)
    order = np.argsort(Dm, axis=1)
    # The zero diagonal sorts first, so column m of the sorted distances is
    # each row's m-th-nearest-neighbor distance.
    kth = np.take_along_axis(Dm, order[:, [m]], axis=1).ravel()

    W = np.zeros_like(Dm, dtype=float)
    for i in range(Dm.shape[0]):
        # Binary link to every unit within the m-th-nearest distance
        # (ties at the cutoff are all included).
        mask = (Dm[i] > 0) & (Dm[i] <= kth[i])
        W[i, mask] = 1.0

    row_sum = W.sum(axis=1, keepdims=True)
    row_sum[row_sum == 0] = 1.0  # guard against isolated units
    return W / row_sum


def build_ed_weights(distance_matrix: np.ndarray) -> np.ndarray:
    # exp(-distance/100) kernel with the decay fixed at 100, as in the script.
    DD0 = distance_matrix.astype(float) / 100.0
    W = np.exp(-DD0)
    np.fill_diagonal(W, 0.0)
    row_sum = W.sum(axis=1, keepdims=True)
    row_sum[row_sum == 0] = 1.0  # guard against isolated units
    return W / row_sum


W_knn6 = build_knn6_weights(D, m=6)
W_ed = build_ed_weights(D)

print("W_knn6 row sums:", np.min(W_knn6.sum(axis=1)), np.max(W_knn6.sum(axis=1)))
print("W_ed row sums:", np.min(W_ed.sum(axis=1)), np.max(W_ed.sum(axis=1)))
W_knn6 row sums: 0.9999999999999999 1.0
W_ed row sums: 0.999999999999998 1.0000000000000016
def fit_sdm_fe(y_vec, X_df, W_np, draws=600, tune=600, chains=1, seed=42):
    model = SDMPanelFE(
        y=y_vec,
        X=X_df,
        W=graph_from_dense(W_np),
        N=N,
        T=T,
        model=3,
        logdet_method="eigenvalue",
    )

    idata = model.fit(
        draws=draws,
        tune=tune,
        chains=chains,
        cores=1,
        target_accept=0.9,
        random_seed=seed,
        progressbar=False,
        idata_kwargs={"log_likelihood": True},
    )
    return model, idata


model_knn6, idata_knn6 = fit_sdm_fe(y, X_core, W_knn6)
model_ed, idata_ed = fit_sdm_fe(y, X_core, W_ed)

print("Fit complete for both specifications.")
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (1 chains in 1 job)
NUTS: [rho, beta, sigma]
Sampling 1 chain for 600 tune and 600 draw iterations (600 + 600 draws total) took 2 seconds.
Only one chain was sampled, this makes it impossible to run some convergence checks
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (1 chains in 1 job)
NUTS: [rho, beta, sigma]
Sampling 1 chain for 600 tune and 600 draw iterations (600 + 600 draws total) took 3 seconds.
Only one chain was sampled, this makes it impossible to run some convergence checks
Fit complete for both specifications.
# Original JGSY2024 Table 1 values (from readme / script output)
orig_coef = {
    "knn6": {
        "inv": 0.002422,
        "pop": -0.008074,
        "initial_gdp": -0.093832,
        "lagged_growth": 0.013879,
        "W*inv": 0.008933,
        "W*pop": -0.002317,
        "W*initial_gdp": 0.065751,
        "W*lagged_growth": 0.191021,
        "rho": 0.534381,
    },
    "ed": {
        "inv": 0.001616,
        "pop": -0.007904,
        "initial_gdp": -0.088731,
        "lagged_growth": 0.011519,
        "W*inv": 0.009081,
        "W*pop": -0.004958,
        "W*initial_gdp": 0.070415,
        "W*lagged_growth": 0.194065,
        "rho": 0.650187,
    },
}

orig_effects = {
    "knn6": pd.DataFrame(
        {
            "feature": ["inv", "pop", "initial_gdp", "lagged_growth"],
            "direct": [0.0035, -0.0088, -0.0921, 0.0351],
            "indirect": [0.0212, -0.0135, 0.0322, 0.4029],
            "total": [0.0247, -0.0223, -0.0599, 0.4380],
        }
    ),
    "ed": pd.DataFrame(
        {
            "feature": ["inv", "pop", "initial_gdp", "lagged_growth"],
            "direct": [0.0026, -0.0089, -0.0876, 0.0291],
            "indirect": [0.0281, -0.0282, 0.0352, 0.5566],
            "total": [0.0307, -0.0371, -0.0523, 0.5857],
        }
    ),
}


def extract_bayes_coef(model, idata):
    beta_names = model._beta_names()
    beta_mean = idata.posterior["beta"].mean(("chain", "draw")).values
    out = dict(zip(beta_names, beta_mean))
    out["rho"] = float(idata.posterior["rho"].mean())
    keep = [
        "inv",
        "pop",
        "initial_gdp",
        "lagged_growth",
        "W*inv",
        "W*pop",
        "W*initial_gdp",
        "W*lagged_growth",
        "rho",
    ]
    return pd.Series({k: out[k] for k in keep}, name="bayespecon")


def compare_coef(spec_name, model, idata):
    b = extract_bayes_coef(model, idata).rename("bayespecon")
    o = pd.Series(orig_coef[spec_name], name="original")
    cmp = pd.concat([o, b], axis=1)
    cmp["diff"] = cmp["bayespecon"] - cmp["original"]
    cmp["abs_diff"] = cmp["diff"].abs()
    cmp["pct_diff_abs"] = (
        100 * cmp["abs_diff"] / cmp["original"].abs().replace(0, np.nan)
    )
    return cmp


def compare_effects(spec_name, model):
    eff = model.spatial_effects()
    be = pd.DataFrame(
        {
            "feature": eff.index.tolist(),
            "direct": eff["direct"].values,
            "indirect": eff["indirect"].values,
            "total": eff["total"].values,
        }
    )
    merged = orig_effects[spec_name].merge(
        be, on="feature", suffixes=("_orig", "_bayes")
    )
    for col in ["direct", "indirect", "total"]:
        merged[f"{col}_diff"] = merged[f"{col}_bayes"] - merged[f"{col}_orig"]
        merged[f"{col}_abs_diff"] = merged[f"{col}_diff"].abs()
    return merged


coef_cmp_knn6 = compare_coef("knn6", model_knn6, idata_knn6)
coef_cmp_ed = compare_coef("ed", model_ed, idata_ed)

eff_cmp_knn6 = compare_effects("knn6", model_knn6)
eff_cmp_ed = compare_effects("ed", model_ed)

print("Coefficient comparison: 6NN")
display(coef_cmp_knn6)
print("Coefficient comparison: ED")
display(coef_cmp_ed)

print("Effects comparison: 6NN")
display(eff_cmp_knn6)
print("Effects comparison: ED")
display(eff_cmp_ed)
Coefficient comparison: 6NN

                  original  bayespecon      diff  abs_diff  pct_diff_abs
inv               0.002422    0.002609  0.000187  0.000187      7.717411
pop              -0.008074   -0.008147 -0.000073  0.000073      0.909721
initial_gdp      -0.093832   -0.094011 -0.000179  0.000179      0.190612
lagged_growth     0.013879    0.014838  0.000959  0.000959      6.907949
W*inv             0.008933    0.008564 -0.000369  0.000369      4.126779
W*pop            -0.002317   -0.002305  0.000012  0.000012      0.504599
W*initial_gdp     0.065751    0.065966  0.000215  0.000215      0.326912
W*lagged_growth   0.191021    0.191565  0.000544  0.000544      0.284555
rho               0.534381    0.533018 -0.001363  0.001363      0.254987

Coefficient comparison: ED

                  original  bayespecon      diff  abs_diff  pct_diff_abs
inv               0.001616    0.001481 -0.000135  0.000135      8.338828
pop              -0.007904   -0.007923 -0.000019  0.000019      0.236887
initial_gdp      -0.088731   -0.088649  0.000082  0.000082      0.092323
lagged_growth     0.011519    0.011876  0.000357  0.000357      3.100515
W*inv             0.009081    0.009298  0.000217  0.000217      2.395079
W*pop            -0.004958   -0.004900  0.000058  0.000058      1.171032
W*initial_gdp     0.070415    0.070298 -0.000117  0.000117      0.166408
W*lagged_growth   0.194065    0.192101 -0.001964  0.001964      1.012088
rho               0.650187    0.649639 -0.000548  0.000548      0.084295

Effects comparison: 6NN

  feature direct_orig indirect_orig total_orig direct_bayes indirect_bayes total_bayes direct_diff direct_abs_diff indirect_diff indirect_abs_diff total_diff total_abs_diff
0 inv 0.003500 0.021200 0.024700 0.003676 0.020255 0.023931 0.000176 0.000176 -0.000945 0.000945 -0.000769 0.000769
1 pop -0.008800 -0.013500 -0.022300 -0.008860 -0.013523 -0.022383 -0.000060 0.000060 -0.000023 0.000023 -0.000083 0.000083
2 initial_gdp -0.092100 0.032200 -0.059900 -0.092313 0.032246 -0.060067 -0.000213 0.000213 0.000046 0.000046 -0.000167 0.000167
3 lagged_growth 0.035100 0.402900 0.438000 0.036207 0.405759 0.441966 0.001107 0.001107 0.002859 0.002859 0.003966 0.003966

Effects comparison: ED

  feature direct_orig indirect_orig total_orig direct_bayes indirect_bayes total_bayes direct_diff direct_abs_diff indirect_diff indirect_abs_diff total_diff total_abs_diff
0 inv 0.002600 0.028100 0.030700 0.002412 0.028388 0.030800 -0.000188 0.000188 0.000288 0.000288 0.000100 0.000100
1 pop -0.008900 -0.028200 -0.037100 -0.008834 -0.027812 -0.036647 0.000066 0.000066 0.000388 0.000388 0.000453 0.000453
2 initial_gdp -0.087600 0.035200 -0.052300 -0.087498 0.035119 -0.052379 0.000102 0.000102 -0.000081 0.000081 -0.000079 0.000079
3 lagged_growth 0.029100 0.556600 0.585700 0.029995 0.552818 0.582813 0.000895 0.000895 -0.003782 0.003782 -0.002887 0.002887
def summarize_comparison(coef_cmp, eff_cmp, label):
    return pd.Series(
        {
            "spec": label,
            "coef_MAE": coef_cmp["abs_diff"].mean(),
            "coef_median_abs_pct_diff": np.nanmedian(
                coef_cmp["pct_diff_abs"].to_numpy()
            ),
            "direct_effect_MAE": eff_cmp["direct_abs_diff"].mean(),
            "indirect_effect_MAE": eff_cmp["indirect_abs_diff"].mean(),
            "total_effect_MAE": eff_cmp["total_abs_diff"].mean(),
        }
    )


summary_table = pd.DataFrame(
    [
        summarize_comparison(coef_cmp_knn6, eff_cmp_knn6, "Table 1 col[1] 6NN"),
        summarize_comparison(coef_cmp_ed, eff_cmp_ed, "Table 1 col[2] ED"),
    ]
)

display(summary_table)

print("Top absolute coefficient gaps (6NN):")
display(coef_cmp_knn6.sort_values("abs_diff", ascending=False).head(5))

print("Top absolute coefficient gaps (ED):")
display(coef_cmp_ed.sort_values("abs_diff", ascending=False).head(5))
                 spec  coef_MAE  coef_median_abs_pct_diff  direct_effect_MAE  indirect_effect_MAE  total_effect_MAE
0  Table 1 col[1] 6NN  0.000433                  0.504599           0.000389             0.000968          0.001246
1   Table 1 col[2] ED  0.000389                  1.012088           0.000313             0.001135          0.000880

Top absolute coefficient gaps (6NN):

                  original  bayespecon      diff  abs_diff  pct_diff_abs
rho               0.534381    0.533018 -0.001363  0.001363      0.254987
lagged_growth     0.013879    0.014838  0.000959  0.000959      6.907949
W*lagged_growth   0.191021    0.191565  0.000544  0.000544      0.284555
W*inv             0.008933    0.008564 -0.000369  0.000369      4.126779
W*initial_gdp     0.065751    0.065966  0.000215  0.000215      0.326912

Top absolute coefficient gaps (ED):

                  original  bayespecon      diff  abs_diff  pct_diff_abs
W*lagged_growth   0.194065    0.192101 -0.001964  0.001964      1.012088
rho               0.650187    0.649639 -0.000548  0.000548      0.084295
lagged_growth     0.011519    0.011876  0.000357  0.000357      3.100515
W*inv             0.009081    0.009298  0.000217  0.000217      2.395079
inv               0.001616    0.001481 -0.000135  0.000135      8.338828

Detailed Comparison Notes

How to interpret discrepancies between bayespecon and JGSY2024 MATLAB output:

  1. Estimation paradigm

    • JGSY2024 reports classical point estimates and asymptotic t-statistics.

    • This notebook reports posterior means from Bayesian MCMC.

  2. Uncertainty computation

    • JGSY2024 uses delta-method and bootstrap variants for effects in some columns.

    • Here effects come from bayespecon’s posterior-mean impact formulas.

  3. Weight matrix treatment

    • The 6NN and ED matrix constructions are matched to the script logic.

    • Minor numeric differences can still arise from floating-point and log-determinant evaluation methods.

  4. Fixed-effects implementation

    • JGSY2024 uses explicit time dummies plus FE controls.

    • We use two-way demeaning (model=3), which is equivalent in balanced panels but may not be numerically identical in finite precision.

  5. Practical next steps for tighter matching

    • Increase draws, tune, and chains.

    • Keep priors diffuse so posterior means track the ML point estimates; tighten them only when substantively justified.

    • Compare posterior medians and HDIs against classical confidence intervals.
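The direct/indirect/total effects compared above are the standard LeSage-Pace SDM impacts. The internals of bayespecon's spatial_effects() are not shown here; as a sketch of the formulas it is assumed to implement, using made-up toy values for rho, beta_r, theta_r, and W:

```python
import numpy as np

# Toy inputs (illustrative only; not estimates from this notebook).
rho, beta_r, theta_r = 0.5, 0.1, 0.05
W = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])  # row-normalized weights
n = W.shape[0]

# SDM partial-derivative matrix for regressor r:
#   S_r(W) = (I - rho*W)^{-1} (beta_r * I + theta_r * W)
S = np.linalg.solve(np.eye(n) - rho * W, beta_r * np.eye(n) + theta_r * W)

direct = np.trace(S) / n   # average own-unit effect (mean diagonal)
total = S.sum() / n        # average row sum of S
indirect = total - direct  # average spillover

print(direct, indirect, total)
```

With row-normalized W the average total effect reduces to (beta_r + theta_r) / (1 - rho), which is a useful hand check on any reported effects table.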

This notebook provides a reproducible bridge from the original MATLAB replication package to a bayespecon workflow while preserving the key specification choices of Table 1, columns [1] and [2].