bayespecon.dgp.generate_flow_data

bayespecon.dgp.generate_flow_data(n=None, G=None, rho_d=0.3, rho_o=0.2, rho_w=0.1, beta_d=None, beta_o=None, sigma=1.0, X=None, col_names=None, dist=None, gamma_dist=-0.5, alpha=0.0, seed=None, gdf=None, err_hetero=False, knn_k=4, distribution='lognormal')

Simulate flow data from a SAR flow model.

Generates \(N = n^2\) flow observations. The latent SAR-filtered process is

\[\eta = A^{-1}(X\beta + \varepsilon), \quad A = I_N - \rho_d W_d - \rho_o W_o - \rho_w W_w, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2 I_N)\]

and the observed flows are either

\[y = \exp(\eta) \quad \text{(default, } \texttt{distribution="lognormal"})\]

so that \(y > 0\) and \(\mathbb{E}[y] = \exp(\eta + \sigma^2/2)\), or \(y = \eta\) when distribution="normal" (legacy Gaussian-on-y behaviour).

To recover the SAR parameters with the existing SARFlow / SARFlowSeparable models, fit on np.log(y_vec), which by construction equals eta_vec when distribution="lognormal".
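As a rough illustration of the data-generating process above, the following NumPy sketch builds the latent process and the lognormal flows by hand. It is not the library's implementation. The Kronecker forms W_d = I_n ⊗ W, W_o = W ⊗ I_n and W_w = W ⊗ W are an assumption here (a common convention for SAR flow models), as are the toy ring graph, the stand-in design matrix and the helper name simulate_sar_flows.

```python
import numpy as np


def simulate_sar_flows(W, X, beta, rho_d=0.3, rho_o=0.2, rho_w=0.1,
                       sigma=1.0, seed=None):
    """Hypothetical helper: draw eta = A^{-1}(X beta + eps) and y = exp(eta)."""
    n = W.shape[0]
    N = n * n
    I_n = np.eye(n)
    # Assumed Kronecker structure of the three flow weight matrices.
    W_d = np.kron(I_n, W)
    W_o = np.kron(W, I_n)
    W_w = np.kron(W, W)
    A = np.eye(N) - rho_d * W_d - rho_o * W_o - rho_w * W_w
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, sigma, size=N)
    # Solve A eta = X beta + eps instead of forming A^{-1} explicitly.
    eta = np.linalg.solve(A, X @ beta + eps)
    return eta, np.exp(eta)


# Toy row-standardised ring graph on n = 5 units (two neighbours each).
n = 5
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = 0.5
    W[i, (i + 1) % n] = 0.5

rng = np.random.default_rng(0)
X = rng.normal(size=(n * n, 2))   # stand-in O-D design matrix with N rows
beta = np.array([1.0, -0.5])

eta_vec, y_vec = simulate_sar_flows(W, X, beta, seed=1)
```

Because y = exp(eta) in this sketch, np.log(y_vec) reproduces eta_vec up to floating-point error, which is the property the log-transform advice above relies on.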

Parameters:
n : int

Number of spatial units. Must match the size of G.

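G : libpysal.graph.Graph, optional

Row-standardised spatial graph on n units. If omitted, a default k-nearest-neighbour graph can be synthesised from point geometry (see knn_k).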

rho_d : float

Destination spatial autoregressive parameter.

rho_o : float

Origin spatial autoregressive parameter.

rho_w : float

Network (origin-destination) spatial autoregressive parameter.

beta_d : array-like, shape (k_d,)

Destination-side regression coefficients.

beta_o : array-like, shape (k_o,)

Origin-side regression coefficients. When k_o != k_d, separate destination and origin attribute matrices are generated or required.

sigma : float, default 1.0

Standard deviation of the error term.

X : np.ndarray, shape (n, k) or (n, k_d + k_o), optional

Regional attribute matrix. If None, draws X_d and X_o separately from N(0, 1). If a single matrix is provided with k_d == k_o, it is used for both destination and origin blocks. If it has k_d + k_o columns, the first k_d are used as destination attributes and the remaining k_o as origin attributes.

col_names : list[str], optional

Names for the k columns of X.

dist : np.ndarray, shape (n, n), optional

Distance / cost matrix. If None (default), one is computed automatically from gdf (or from a synthetic point grid when gdf is also None) and entered as log(1 + d) in the design matrix. Pass an array explicitly to override.

gamma_dist : float, default -0.5

True coefficient on the (log-) distance column in the DGP. Defaults to -0.5 to mimic gravity-model distance decay; set to 0.0 to neutralize the effect.

alpha : float, default 0.0

Intercept term added uniformly to all latent flow cells. Under distribution="lognormal" (default) this multiplies the observed flows by exp(alpha); under distribution="normal" it is an additive shift on y.

seed : int, optional

Random seed for reproducibility.

gdf : geopandas.GeoDataFrame, optional

Geometry source used to derive distance. If None and dist is also None, a synthetic point grid is built via synth_point_geodataframe().

err_hetero : bool, default False

If True, generate heteroskedastic innovations: each flow cell \((i,j)\) has standard deviation \(\sigma \sqrt{1 + \|x_i\|^2 + \|x_j\|^2}\) where \(x_i\), \(x_j\) are the destination and origin attribute vectors for that cell.

knn_k : int, default 4

Number of nearest neighbours used when synthesising a default graph from a synthetic point grid (see _resolve_flow_geometry()).

distribution : {"lognormal", "normal"}, default "lognormal"

Observation-scale family. "lognormal" returns y = exp(eta) (strictly positive flows, the default). "normal" returns y = eta (legacy Gaussian-on-y behaviour). In both cases "eta_vec"/"eta_mat" is also exposed in the return dict.
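The column rule described for X and the heteroskedastic standard deviation described for err_hetero can be sketched in a few lines of NumPy. This is an illustration of the documented rules, not the library's code: the helper names split_regional_attributes and hetero_sd are hypothetical, and placing destinations on the rows and origins on the columns of the resulting matrix is an assumption.

```python
import numpy as np


def split_regional_attributes(X, k_d, k_o):
    """Hypothetical helper mirroring the documented column rule for X."""
    k = X.shape[1]
    if k == k_d and k_d == k_o:
        # A single shared matrix serves both destination and origin blocks.
        return X, X
    if k == k_d + k_o:
        # First k_d columns are destination attributes, the rest origin.
        return X[:, :k_d], X[:, k_d:]
    raise ValueError("X must have k_d (== k_o) or k_d + k_o columns")


def hetero_sd(X_d, X_o, sigma=1.0):
    """Standard deviation sigma * sqrt(1 + ||x_i||^2 + ||x_j||^2) per cell."""
    d_norm2 = np.sum(X_d ** 2, axis=1)   # ||x_i||^2 for destination i
    o_norm2 = np.sum(X_o ** 2, axis=1)   # ||x_j||^2 for origin j
    # Assumed layout: entry [i, j] pairs destination i with origin j.
    return sigma * np.sqrt(1.0 + d_norm2[:, None] + o_norm2[None, :])


rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))              # n = 4 regions, k_d + k_o = 5
X_d, X_o = split_regional_attributes(X, k_d=2, k_o=3)
sd = hetero_sd(X_d, X_o, sigma=1.0)      # (4, 4) matrix of cell-level sds
```

Every entry of sd is at least sigma, so enabling err_hetero in this form can only inflate the noise relative to the homoskedastic case.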

Returns:

Dictionary with keys:

  • "y_vec" (N,): vectorised flows on the observation scale.

  • "y_mat" (n, n): flow matrix form.

  • "eta_vec" (N,): latent SAR-filtered linear predictor (equals log(y_vec) when distribution="lognormal").

  • "eta_mat" (n, n): eta_vec reshaped.

  • "distribution" str: the value of the distribution argument.

  • "X" (N, p): full O-D design matrix (for model fitting).

  • "X_regional" (n, k_d): destination-side regional attribute matrix.

  • "X_regional_d" (n, k_d): destination-side regional attribute matrix.

  • "X_regional_o" (n, k_o): origin-side regional attribute matrix.

  • "design" FlowDesignMatrix: full design.

  • "W" scipy.sparse.csr_matrix: n×n weight matrix.

  • "G" libpysal.graph.Graph: spatial graph.

  • "rho_d", "rho_o", "rho_w", "sigma": true parameters.

  • "beta_d", "beta_o": true coefficient vectors.

Return type:

dict

Raises:

ValueError – If the A matrix is singular (invalid parameter combination).
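To see which parameter combinations make A singular, note that if W is row-standardised and the flow weight matrices take the Kronecker forms I_n ⊗ W, W ⊗ I_n and W ⊗ W (an assumption, not stated above), every row of each of them sums to one. Then A applied to the vector of ones gives (1 - rho_d - rho_o - rho_w) times that vector, so A is singular when the three parameters sum to one. The sketch below checks this numerically with a condition-number test; how the library itself detects singularity is not documented here, and both helper names are hypothetical.

```python
import numpy as np


def flow_filter_matrix(W, rho_d, rho_o, rho_w):
    """Hypothetical helper: A = I_N - rho_d W_d - rho_o W_o - rho_w W_w."""
    n = W.shape[0]
    I_n = np.eye(n)
    return (np.eye(n * n)
            - rho_d * np.kron(I_n, W)    # assumed W_d
            - rho_o * np.kron(W, I_n)    # assumed W_o
            - rho_w * np.kron(W, W))     # assumed W_w


def is_numerically_singular(A, tol=1e12):
    """Flag A as singular when its 2-norm condition number exceeds tol."""
    return np.linalg.cond(A) > tol


# Row-standardised complete graph on n = 4 units.
n = 4
W = (np.ones((n, n)) - np.eye(n)) / (n - 1)

A_ok = flow_filter_matrix(W, 0.3, 0.2, 0.1)    # parameters sum to 0.6
A_bad = flow_filter_matrix(W, 0.5, 0.3, 0.2)   # parameters sum to 1.0
ones = np.ones(n * n)
```

Here A_bad @ ones is zero up to rounding, so the ones vector lies in the null space of A_bad, while A_ok is well conditioned.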