JGSY2024 Replication with bayespecon

This notebook replicates the core panel-SDM analysis in Elhorst, J. P., Tziolas, I., Tan, C., & Milionis, P. (2024) using the replication files from spatial-panels. It re-estimates the models with bayespecon and provides a detailed side-by-side comparison against the published MATLAB outputs (Table 1, columns [1] and [2] from JGSY2024_MainResults.m).

Replication targets in this notebook:

  • Spec A: 6-nearest-neighbor row-normalized weights

  • Spec B: exponential-distance row-normalized weights with fixed decay exp(-distance/100)

Important scope notes:

  • JGSY2024 original estimation uses MATLAB routines (sar_panelE_FE) with classical asymptotics.

  • This notebook uses Bayesian estimation (bayespecon.SDMPanelFE) with posterior means for comparison.

  • We mirror two-way FE behavior via model=3 (unit and time demeaning), which is equivalent to including unit FE and time dummies in balanced panels.

  • Parameterized-weight robustness specifications from parameterized_W are not estimated here because those routines are not currently part of bayespecon.
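The two-way FE equivalence noted above can be checked directly: in a balanced panel, two-way demeaning reproduces the residuals from regressing on unit and time dummies. A minimal numpy sketch with hypothetical small dimensions (not the JGSY2024 panel):

```python
import numpy as np

rng = np.random.default_rng(0)
N_demo, T_demo = 4, 3  # toy balanced panel, stacked as T blocks of N like the notebook
y_demo = rng.normal(size=N_demo * T_demo)

Y = y_demo.reshape(T_demo, N_demo)  # rows = periods, columns = units

# Two-way demeaning: subtract unit means and time means, add back the grand mean.
Y_dm = Y - Y.mean(axis=0) - Y.mean(axis=1, keepdims=True) + Y.mean()

# Equivalent dummy-variable residuals: project y on unit and time dummies.
D_unit = np.kron(np.ones((T_demo, 1)), np.eye(N_demo))
D_time = np.kron(np.eye(T_demo), np.ones((N_demo, 1)))
Dmat = np.column_stack([D_unit, D_time])  # rank-deficient; lstsq handles it
resid = y_demo - Dmat @ np.linalg.lstsq(Dmat, y_demo, rcond=None)[0]

print(np.allclose(Y_dm.ravel(), resid))  # True in a balanced panel
```

With an unbalanced panel this shortcut breaks, which is why the note above is restricted to balanced panels.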

import arviz as az
import numpy as np
import pandas as pd
from IPython.display import display
from libpysal.graph import Graph

from bayespecon import SDMPanelFE

az.style.use("arviz-white")
pd.options.display.float_format = lambda x: f"{x:,.6f}"
# Paths and constants from JGSY2024
static_data_dir = "../_static/jgsy2024"

N = 266
T = 17
NT = N * T

# Coerce mixed-type spreadsheets to numeric and drop label rows/cols.
A_df = pd.read_excel(f"{static_data_dir}/Data.xlsx", header=None)
A_num = (
    A_df.apply(pd.to_numeric, errors="coerce")
    .dropna(axis=0, how="all")
    .dropna(axis=1, how="all")
)
A = A_num.to_numpy(dtype=float)

D_df = pd.read_excel(f"{static_data_dir}/distance.xlsx", header=None)
D_num = (
    D_df.apply(pd.to_numeric, errors="coerce")
    .dropna(axis=0, how="all")
    .dropna(axis=1, how="all")
)
D = D_num.to_numpy(dtype=float)

print(f"Loaded A shape: {A.shape}")
print(f"Loaded distance shape: {D.shape}")
Loaded A shape: (5054, 10)
Loaded distance shape: (266, 266)
# Reproduce the data construction from JGSY2024_MainResults.m.
# A stacks (T + 2) * N = 19 * 266 rows (two pre-sample cross-sections feed the
# lags), so offsets of N and 2*N shift the sample back by one and two periods.
# MATLAB indices are 1-based; Python is 0-based.
A_work = A.copy()

y = np.zeros(NT)
ylag = np.zeros(NT)
for t in range(T):
    t1 = t * N
    t2 = (t + 1) * N
    y[t1:t2] = A_work[t1 + 2 * N : t2 + 2 * N, 3] - A_work[t1 + N : t2 + N, 3]
    ylag[t1:t2] = A_work[t1 + N : t2 + N, 3] - A_work[t1:t2, 3]

lnpopg = A_work[:, 7].copy()
popg = np.exp(lnpopg) - 1.0
popg[popg <= 0] = 0.0001
A_work[:, 7] = np.log(popg)

x_core = np.column_stack(
    [
        A_work[2 * N : 2 * N + NT, 4],
        A_work[2 * N : 2 * N + NT, 7],
        A_work[N : N + NT, 3],
        ylag,
    ]
)

X_core = pd.DataFrame(x_core, columns=["inv", "pop", "initial_gdp", "lagged_growth"])

print("Correlation matrix of core regressors (matches script intent):")
display(X_core.corr())
Correlation matrix of core regressors (matches script intent):
                    inv       pop  initial_gdp  lagged_growth
inv            1.000000  0.085062    -0.060977       0.137647
pop            0.085062  1.000000     0.300365      -0.083651
initial_gdp   -0.060977  0.300365     1.000000      -0.164301
lagged_growth  0.137647 -0.083651    -0.164301       1.000000
def graph_from_dense(W_np: np.ndarray) -> Graph:
    rows, cols = np.where(W_np > 0)
    vals = W_np[rows, cols].astype(float)
    return Graph.from_arrays(rows, cols, vals)


def build_knn6_weights(distance_matrix: np.ndarray, m: int = 6) -> np.ndarray:
    Dm = distance_matrix.copy().astype(float)
    np.fill_diagonal(Dm, 0.0)
    order = np.argsort(Dm, axis=1)
    # The zero diagonal sorts first, so column m of the sorted distances is
    # each row's m-th-nearest-neighbor distance.
    kth = np.take_along_axis(Dm, order[:, [m]], axis=1).ravel()

    W = np.zeros_like(Dm, dtype=float)
    for i in range(Dm.shape[0]):
        # Binary link to every unit within the m-th-nearest distance
        # (ties at the cutoff are all included).
        mask = (Dm[i] > 0) & (Dm[i] <= kth[i])
        W[i, mask] = 1.0

    row_sum = W.sum(axis=1, keepdims=True)
    row_sum[row_sum == 0] = 1.0  # guard against isolated units
    return W / row_sum


def build_ed_weights(distance_matrix: np.ndarray) -> np.ndarray:
    # exp(-distance/100) kernel with the decay fixed at 100, as in the script.
    DD0 = distance_matrix.astype(float) / 100.0
    W = np.exp(-DD0)
    np.fill_diagonal(W, 0.0)
    row_sum = W.sum(axis=1, keepdims=True)
    row_sum[row_sum == 0] = 1.0  # guard against isolated units
    return W / row_sum


W_knn6 = build_knn6_weights(D, m=6)
W_ed = build_ed_weights(D)

print("W_knn6 row sums:", np.min(W_knn6.sum(axis=1)), np.max(W_knn6.sum(axis=1)))
print("W_ed row sums:", np.min(W_ed.sum(axis=1)), np.max(W_ed.sum(axis=1)))
W_knn6 row sums: 0.9999999999999999 1.0
W_ed row sums: 0.999999999999998 1.0000000000000016
def fit_sdm_fe(y_vec, X_df, W_np, draws=600, tune=600, chains=1, seed=42):
    model = SDMPanelFE(
        y=y_vec,
        X=X_df,
        W=graph_from_dense(W_np),
        N=N,
        T=T,
        model=3,
        logdet_method="eigenvalue",
    )

    idata = model.fit(
        draws=draws,
        tune=tune,
        chains=chains,
        cores=1,
        target_accept=0.9,
        random_seed=seed,
        progressbar=False,
        idata_kwargs={"log_likelihood": True},
    )
    return model, idata


model_knn6, idata_knn6 = fit_sdm_fe(y, X_core, W_knn6)
model_ed, idata_ed = fit_sdm_fe(y, X_core, W_ed)

print("Fit complete for both specifications.")
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (1 chains in 1 job)
NUTS: [rho, beta, sigma]
Sampling 1 chain for 600 tune and 600 draw iterations (600 + 600 draws total) took 2 seconds.
Only one chain was sampled, this makes it impossible to run some convergence checks
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (1 chains in 1 job)
NUTS: [rho, beta, sigma]
Sampling 1 chain for 600 tune and 600 draw iterations (600 + 600 draws total) took 3 seconds.
Only one chain was sampled, this makes it impossible to run some convergence checks
Fit complete for both specifications.
# Original JGSY2024 Table 1 values (from readme / script output)
orig_coef = {
    "knn6": {
        "inv": 0.002422,
        "pop": -0.008074,
        "initial_gdp": -0.093832,
        "lagged_growth": 0.013879,
        "W*inv": 0.008933,
        "W*pop": -0.002317,
        "W*initial_gdp": 0.065751,
        "W*lagged_growth": 0.191021,
        "rho": 0.534381,
    },
    "ed": {
        "inv": 0.001616,
        "pop": -0.007904,
        "initial_gdp": -0.088731,
        "lagged_growth": 0.011519,
        "W*inv": 0.009081,
        "W*pop": -0.004958,
        "W*initial_gdp": 0.070415,
        "W*lagged_growth": 0.194065,
        "rho": 0.650187,
    },
}

orig_effects = {
    "knn6": pd.DataFrame(
        {
            "feature": ["inv", "pop", "initial_gdp", "lagged_growth"],
            "direct": [0.0035, -0.0088, -0.0921, 0.0351],
            "indirect": [0.0212, -0.0135, 0.0322, 0.4029],
            "total": [0.0247, -0.0223, -0.0599, 0.4380],
        }
    ),
    "ed": pd.DataFrame(
        {
            "feature": ["inv", "pop", "initial_gdp", "lagged_growth"],
            "direct": [0.0026, -0.0089, -0.0876, 0.0291],
            "indirect": [0.0281, -0.0282, 0.0352, 0.5566],
            "total": [0.0307, -0.0371, -0.0523, 0.5857],
        }
    ),
}


def extract_bayes_coef(model, idata):
    beta_names = model._beta_names()
    beta_mean = idata.posterior["beta"].mean(("chain", "draw")).values
    out = dict(zip(beta_names, beta_mean))
    out["rho"] = float(idata.posterior["rho"].mean())
    keep = [
        "inv",
        "pop",
        "initial_gdp",
        "lagged_growth",
        "W*inv",
        "W*pop",
        "W*initial_gdp",
        "W*lagged_growth",
        "rho",
    ]
    return pd.Series({k: out[k] for k in keep}, name="bayespecon")


def compare_coef(spec_name, model, idata):
    b = extract_bayes_coef(model, idata).rename("bayespecon")
    o = pd.Series(orig_coef[spec_name], name="original")
    cmp = pd.concat([o, b], axis=1)
    cmp["diff"] = cmp["bayespecon"] - cmp["original"]
    cmp["abs_diff"] = cmp["diff"].abs()
    cmp["pct_diff_abs"] = (
        100 * cmp["abs_diff"] / cmp["original"].abs().replace(0, np.nan)
    )
    return cmp


def compare_effects(spec_name, model):
    eff = model.spatial_effects()
    be = pd.DataFrame(
        {
            "feature": eff.index.tolist(),
            "direct": eff["direct"].values,
            "indirect": eff["indirect"].values,
            "total": eff["total"].values,
        }
    )
    merged = orig_effects[spec_name].merge(
        be, on="feature", suffixes=("_orig", "_bayes")
    )
    for col in ["direct", "indirect", "total"]:
        merged[f"{col}_diff"] = merged[f"{col}_bayes"] - merged[f"{col}_orig"]
        merged[f"{col}_abs_diff"] = merged[f"{col}_diff"].abs()
    return merged


coef_cmp_knn6 = compare_coef("knn6", model_knn6, idata_knn6)
coef_cmp_ed = compare_coef("ed", model_ed, idata_ed)

eff_cmp_knn6 = compare_effects("knn6", model_knn6)
eff_cmp_ed = compare_effects("ed", model_ed)

print("Coefficient comparison: 6NN")
display(coef_cmp_knn6)
print("Coefficient comparison: ED")
display(coef_cmp_ed)

print("Effects comparison: 6NN")
display(eff_cmp_knn6)
print("Effects comparison: ED")
display(eff_cmp_ed)
Coefficient comparison: 6NN

                  original  bayespecon      diff  abs_diff  pct_diff_abs
inv               0.002422    0.002609  0.000187  0.000187      7.717411
pop              -0.008074   -0.008147 -0.000073  0.000073      0.909721
initial_gdp      -0.093832   -0.094011 -0.000179  0.000179      0.190612
lagged_growth     0.013879    0.014838  0.000959  0.000959      6.907949
W*inv             0.008933    0.008564 -0.000369  0.000369      4.126779
W*pop            -0.002317   -0.002305  0.000012  0.000012      0.504599
W*initial_gdp     0.065751    0.065966  0.000215  0.000215      0.326912
W*lagged_growth   0.191021    0.191565  0.000544  0.000544      0.284555
rho               0.534381    0.533018 -0.001363  0.001363      0.254987

Coefficient comparison: ED

                  original  bayespecon      diff  abs_diff  pct_diff_abs
inv               0.001616    0.001481 -0.000135  0.000135      8.338828
pop              -0.007904   -0.007923 -0.000019  0.000019      0.236887
initial_gdp      -0.088731   -0.088649  0.000082  0.000082      0.092323
lagged_growth     0.011519    0.011876  0.000357  0.000357      3.100515
W*inv             0.009081    0.009298  0.000217  0.000217      2.395079
W*pop            -0.004958   -0.004900  0.000058  0.000058      1.171032
W*initial_gdp     0.070415    0.070298 -0.000117  0.000117      0.166408
W*lagged_growth   0.194065    0.192101 -0.001964  0.001964      1.012088
rho               0.650187    0.649639 -0.000548  0.000548      0.084295

Effects comparison: 6NN

  feature direct_orig indirect_orig total_orig direct_bayes indirect_bayes total_bayes direct_diff direct_abs_diff indirect_diff indirect_abs_diff total_diff total_abs_diff
0 inv 0.003500 0.021200 0.024700 0.003676 0.020255 0.023931 0.000176 0.000176 -0.000945 0.000945 -0.000769 0.000769
1 pop -0.008800 -0.013500 -0.022300 -0.008860 -0.013523 -0.022383 -0.000060 0.000060 -0.000023 0.000023 -0.000083 0.000083
2 initial_gdp -0.092100 0.032200 -0.059900 -0.092313 0.032246 -0.060067 -0.000213 0.000213 0.000046 0.000046 -0.000167 0.000167
3 lagged_growth 0.035100 0.402900 0.438000 0.036207 0.405759 0.441966 0.001107 0.001107 0.002859 0.002859 0.003966 0.003966

Effects comparison: ED

  feature direct_orig indirect_orig total_orig direct_bayes indirect_bayes total_bayes direct_diff direct_abs_diff indirect_diff indirect_abs_diff total_diff total_abs_diff
0 inv 0.002600 0.028100 0.030700 0.002412 0.028388 0.030800 -0.000188 0.000188 0.000288 0.000288 0.000100 0.000100
1 pop -0.008900 -0.028200 -0.037100 -0.008834 -0.027812 -0.036647 0.000066 0.000066 0.000388 0.000388 0.000453 0.000453
2 initial_gdp -0.087600 0.035200 -0.052300 -0.087498 0.035119 -0.052379 0.000102 0.000102 -0.000081 0.000081 -0.000079 0.000079
3 lagged_growth 0.029100 0.556600 0.585700 0.029995 0.552818 0.582813 0.000895 0.000895 -0.003782 0.003782 -0.002887 0.002887
def summarize_comparison(coef_cmp, eff_cmp, label):
    return pd.Series(
        {
            "spec": label,
            "coef_MAE": coef_cmp["abs_diff"].mean(),
            "coef_median_abs_pct_diff": np.nanmedian(
                coef_cmp["pct_diff_abs"].to_numpy()
            ),
            "direct_effect_MAE": eff_cmp["direct_abs_diff"].mean(),
            "indirect_effect_MAE": eff_cmp["indirect_abs_diff"].mean(),
            "total_effect_MAE": eff_cmp["total_abs_diff"].mean(),
        }
    )


summary_table = pd.DataFrame(
    [
        summarize_comparison(coef_cmp_knn6, eff_cmp_knn6, "Table 1 col[1] 6NN"),
        summarize_comparison(coef_cmp_ed, eff_cmp_ed, "Table 1 col[2] ED"),
    ]
)

display(summary_table)

print("Top absolute coefficient gaps (6NN):")
display(coef_cmp_knn6.sort_values("abs_diff", ascending=False).head(5))

print("Top absolute coefficient gaps (ED):")
display(coef_cmp_ed.sort_values("abs_diff", ascending=False).head(5))
                 spec  coef_MAE  coef_median_abs_pct_diff  direct_effect_MAE  indirect_effect_MAE  total_effect_MAE
0  Table 1 col[1] 6NN  0.000433                  0.504599           0.000389             0.000968          0.001246
1   Table 1 col[2] ED  0.000389                  1.012088           0.000313             0.001135          0.000880

Top absolute coefficient gaps (6NN):

                  original  bayespecon      diff  abs_diff  pct_diff_abs
rho               0.534381    0.533018 -0.001363  0.001363      0.254987
lagged_growth     0.013879    0.014838  0.000959  0.000959      6.907949
W*lagged_growth   0.191021    0.191565  0.000544  0.000544      0.284555
W*inv             0.008933    0.008564 -0.000369  0.000369      4.126779
W*initial_gdp     0.065751    0.065966  0.000215  0.000215      0.326912

Top absolute coefficient gaps (ED):

                  original  bayespecon      diff  abs_diff  pct_diff_abs
W*lagged_growth   0.194065    0.192101 -0.001964  0.001964      1.012088
rho               0.650187    0.649639 -0.000548  0.000548      0.084295
lagged_growth     0.011519    0.011876  0.000357  0.000357      3.100515
W*inv             0.009081    0.009298  0.000217  0.000217      2.395079
inv               0.001616    0.001481 -0.000135  0.000135      8.338828

Detailed Comparison Notes

How to interpret discrepancies between bayespecon and JGSY2024 MATLAB output:

  1. Estimation paradigm

    • JGSY2024 reports classical point estimates and asymptotic t-statistics.

    • This notebook reports posterior means from Bayesian MCMC.

  2. Uncertainty computation

    • JGSY2024 uses delta-method and bootstrap variants for effects in some columns.

    • Here effects come from bayespecon’s posterior-mean impact formulas.

  3. Weight matrix treatment

    • The 6NN and ED matrix constructions are matched to the script logic.

    • Minor numeric differences can still arise from floating-point and log-determinant evaluation methods.

  4. Fixed-effects implementation

    • JGSY2024 uses explicit time dummies plus FE controls.

    • We use two-way demeaning (model=3), which is equivalent in balanced panels but may not be numerically identical in finite precision.

  5. Practical next steps for tighter matching

    • Increase draws, tune, and chains.

    • Keep priors diffuse so posterior means track the ML point estimates; tighten them only when substantively justified.

    • Compare posterior medians and HDIs against classical confidence intervals.
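The direct/indirect/total effects compared above are the standard LeSage-Pace SDM impacts. The internals of bayespecon's spatial_effects() are not shown here; as a sketch of the formulas it is assumed to implement, using made-up toy values for rho, beta_r, theta_r, and W:

```python
import numpy as np

# Toy inputs (illustrative only; not estimates from this notebook).
rho, beta_r, theta_r = 0.5, 0.1, 0.05
W = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])  # row-normalized weights
n = W.shape[0]

# SDM partial-derivative matrix for regressor r:
#   S_r(W) = (I - rho*W)^{-1} (beta_r * I + theta_r * W)
S = np.linalg.solve(np.eye(n) - rho * W, beta_r * np.eye(n) + theta_r * W)

direct = np.trace(S) / n   # average own-unit effect (mean diagonal)
total = S.sum() / n        # average row sum of S
indirect = total - direct  # average spillover

print(direct, indirect, total)
```

With row-normalized W the average total effect reduces to (beta_r + theta_r) / (1 - rho), which is a useful hand check on any reported effects table.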

This notebook provides a reproducible bridge from the original MATLAB replication package to a bayespecon workflow while preserving the key specification choices of Table 1, columns [1] and [2].