11  Space-Time Event Clustering

Code
%load_ext autoreload
%autoreload 2

%load_ext watermark
%watermark -a 'eli knaap & serge rey' -v -d -u -p pointpats,geopandas,libpysal
Author: eli knaap & serge rey

Last updated: 2023-09-28

Python implementation: CPython
Python version       : 3.11.5
IPython version      : 8.15.0

pointpats: 2.3.0
geopandas: 0.13.2
libpysal : 4.7.0

A classic tradition in spatial analysis examines the clustering of events in time and space. A set of “events” in this case is a dataset with both a geographic location (i.e. a point with X and Y coordinates) and a temporal location (i.e. a timestamp or some record of when something occurred). Events that cluster togther in time and space can often be substantively important across a wide variety of social, behavioral, and natural sciences. For example in epidemiology, a cluster of events may help identify a disease outbreak, but understanding clustering in space-time event data is widely applicable in many fields. Further, the data are often already present in the most useful format. Example research questions and geocoded/timestamped data might include:

public health & epidemiology, e.g. to describe outbreaks or help identify inflection points in an epidemic

transportation, e.g. to identify dangerous intersections or high-demand transit locations

housing e.g. to identify transitioning land markets or residential displacement

marketing e.g. to understand which campaigns do better in which locations

criminology e.g. to understand inequality in police enforcement or victimization

earth & climate science e.g. to examine impacts of climate change on catastrophic events

One of the most commonly-used approaches is the Knox statistic, first formulated in epidemiology to study disease outbreaks. The Knox statistic looks for clustering of events in space and time using two distance thresholds that define a set of “spatial neighbors” who are nearby in the geographical sense, and a set of “temporal neighbors” who are nearby in time. These thresholds are commonly called delta (\(\delta\), for distance) and tau (\(\tau\), for time) respectively.

To use a Knox statistic, we adopt the null hypothesis that there is no space-time interaction between events, and we look for non-random groups in the data. That is, if we find there are more events than expected within the two thresholds (i.e. more space-time neighbors than expected), then there is evidence to reject the null in favor of space-time clustering.

Code
import geopandas as gpd
import pandas as pd
from pointpats import Knox, KnoxLocal, plot_density

11.1 Traffic Collisions in San Diego County

To demonstrate the Knox and KnoxLocal functionality in pointpats, we will use the example of traffic collisions in San Diego using an open dataset collected from the CA Highway Patrol

These data contain an extract from https://iswitrs.chp.ca.gov/Reports/jsp/SampleReports.jsp from dates 1/1/2001 - 8/1/2023. More information on the dataset including field codes is available from https://iswitrs.chp.ca.gov/Reports/jsp/samples/RawData_template.pdf

Code
# currently, you can download this file from
# https://www.dropbox.com/scl/fi/ddduihxo5mnhbtkhsp3vb/sd_collisions.parquet?rlkey=cy19kaxu0zalcpewd3u8gzdwr&dl=0

sd_collisions = gpd.read_parquet("sd_collisions.parquet")
Code
sd_collisions.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 183525 entries, 0 to 183524
Data columns (total 80 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   index                    183525 non-null  int64         
 1   CASE_ID                  183525 non-null  int64         
 2   ACCIDENT_YEAR            183525 non-null  int64         
 3   PROC_DATE                183525 non-null  int64         
 4   JURIS                    183525 non-null  object        
 5   COLLISION_DATE           183525 non-null  datetime64[us]
 6   COLLISION_TIME           183525 non-null  int64         
 7   OFFICER_ID               183424 non-null  object        
 8   REPORTING_DISTRICT       11628 non-null   object        
 9   DAY_OF_WEEK              183525 non-null  int64         
 10  CHP_SHIFT                183525 non-null  int64         
 11  POPULATION               183525 non-null  int64         
 12  CNTY_CITY_LOC            183525 non-null  int64         
 13  SPECIAL_COND             183525 non-null  int64         
 14  BEAT_TYPE                183525 non-null  int64         
 15  CHP_BEAT_TYPE            183525 non-null  object        
 16  CITY_DIVISION_LAPD       0 non-null       object        
 17  CHP_BEAT_CLASS           183525 non-null  int64         
 18  BEAT_NUMBER              182139 non-null  object        
 19  PRIMARY_RD               183525 non-null  object        
 20  SECONDARY_RD             183522 non-null  object        
 21  DISTANCE                 183525 non-null  float64       
 22  DIRECTION                169765 non-null  object        
 23  INTERSECTION             183525 non-null  object        
 24  WEATHER_1                183525 non-null  object        
 25  WEATHER_2                183525 non-null  object        
 26  STATE_HWY_IND            183520 non-null  object        
 27  CALTRANS_COUNTY          41699 non-null   object        
 28  CALTRANS_DISTRICT        41699 non-null   float64       
 29  STATE_ROUTE              41699 non-null   float64       
 30  ROUTE_SUFFIX             41699 non-null   object        
 31  POSTMILE_PREFIX          41699 non-null   object        
 32  POSTMILE                 41699 non-null   float64       
 33  LOCATION_TYPE            41699 non-null   object        
 34  RAMP_INTERSECTION        41699 non-null   object        
 35  SIDE_OF_HWY              41698 non-null   object        
 36  TOW_AWAY                 183105 non-null  object        
 37  COLLISION_SEVERITY       183525 non-null  int64         
 38  NUMBER_KILLED            183525 non-null  int64         
 39  NUMBER_INJURED           183525 non-null  int64         
 40  PARTY_COUNT              183525 non-null  object        
 41  PRIMARY_COLL_FACTOR      183525 non-null  object        
 42  PCF_CODE_OF_VIOL         183525 non-null  object        
 43  PCF_VIOL_CATEGORY        183525 non-null  object        
 44  PCF_VIOLATION            175576 non-null  float64       
 45  PCF_VIOL_SUBSECTION      52938 non-null   object        
 46  HIT_AND_RUN              183525 non-null  object        
 47  TYPE_OF_COLLISION        183525 non-null  object        
 48  MVIW                     183525 non-null  object        
 49  PED_ACTION               183525 non-null  object        
 50  ROAD_SURFACE             183525 non-null  object        
 51  ROAD_COND_1              183525 non-null  object        
 52  ROAD_COND_2              183525 non-null  object        
 53  LIGHTING                 183525 non-null  object        
 54  CONTROL_DEVICE           183525 non-null  object        
 55  CHP_ROAD_TYPE            183525 non-null  float64       
 56  PEDESTRIAN_ACCIDENT      2484 non-null    object        
 57  BICYCLE_ACCIDENT         1787 non-null    object        
 58  MOTORCYCLE_ACCIDENT      12177 non-null   object        
 59  TRUCK_ACCIDENT           8740 non-null    object        
 60  NOT_PRIVATE_PROPERTY     183525 non-null  object        
 61  ALCOHOL_INVOLVED         20612 non-null   object        
 62  STWD_VEHTYPE_AT_FAULT    183525 non-null  object        
 63  CHP_VEHTYPE_AT_FAULT     183011 non-null  object        
 64  COUNT_SEVERE_INJ         183525 non-null  int64         
 65  COUNT_VISIBLE_INJ        183525 non-null  int64         
 66  COUNT_COMPLAINT_PAIN     183525 non-null  int64         
 67  COUNT_PED_KILLED         183525 non-null  int64         
 68  COUNT_PED_INJURED        183525 non-null  int64         
 69  COUNT_BICYCLIST_KILLED   183525 non-null  int64         
 70  COUNT_BICYCLIST_INJURED  183525 non-null  int64         
 71  COUNT_MC_KILLED          183525 non-null  int64         
 72  COUNT_MC_INJURED         183525 non-null  object        
 73  PRIMARY_RAMP             183525 non-null  object        
 74  SECONDARY_RAMP           183525 non-null  object        
 75  LATITUDE                 183525 non-null  float64       
 76  LONGITUDE                183525 non-null  float64       
 77  year                     183525 non-null  object        
 78  geometry                 183525 non-null  geometry      
 79  time_in_days             183525 non-null  int64         
dtypes: datetime64[us](1), float64(8), geometry(1), int64(24), object(46)
memory usage: 112.0+ MB
Code
sd_collisions.plot()

Because these points are so dense, a common way to search for “hotspots” is to plot a map using a kernel density estimator (KDE), which we can do for all events, or for subsets of events like pedestrian or bicycle collisions

Code
# note you need statsmodels installed to run this line
ax = plot_density(sd_collisions, bandwidth=2000)

Code
ped_collisions = sd_collisions[sd_collisions["PEDESTRIAN_ACCIDENT"] == "Y"]
bike_collisions = sd_collisions[sd_collisions["BICYCLE_ACCIDENT"] == "Y"]
Code
plot_density(ped_collisions, bandwidth=2000)

Code
plot_density(bike_collisions, bandwidth=2000)

Plotting a heatmap of these points using a kernel-density estimator can give us a sense for how the events cluster in space, but it ignores the time dimension entirely. We can see that the heatmaps for pedestriand and bicycle collisions are a little different, but it is not clear why. It could be because there are more cycle commuters in some regions of the county versus others (so more opportunities for a collision to occur) or because collision risk is higher in some places, or both. Comparisons across kernel density maps for these two classes of events is complicated because we don’t know whether they share the same population-at-risk surface underneath or risks surfaces.

11.2 Global Clustering

The classic Knox statistic is a test for global clustering. It examines whether the pattern of spatio-temporal interaction in the events is random or not.

.The two critical parameters for the Knox statistic are delta and tau which define our neighborhood thresholds in space and time. The delta argument is measured in the units of the geodataframe CRS, and tau can be measured using any time measurement represented by an integer.

Here, we’ll measure time by the number of days elapsed since the start of data collection. Since the collision date is stored as a string, we can convert to a pandas datetime dtype, then take the difference from the minium date, stored in days (so our time is measured in days).

Code
# convert to datetime
sd_collisions.COLLISION_DATE = pd.to_datetime(
    sd_collisions.COLLISION_DATE, format="%Y%m%d"
)

min_date = sd_collisions.COLLISION_DATE.min()

sd_collisions["time_in_days"] = sd_collisions.COLLISION_DATE.apply(
    lambda x: x - min_date
).dt.days

11.2.1 Pedestrian Collisions

Here we set delta to 2000 and tau to 30, meaning our space-time neighborhood is looking for clusters of events that occur within one month and two kilometers of one another.

Code
ped_knox = Knox.from_dataframe(
    ped_collisions, time_col="time_in_days", delta=2000, tau=30
)
Code
ped_knox.p_poisson
5.88418203051333e-15
Code
ped_knox.p_sim
0.01

11.2.2 Bike Collisions

Code
bike_knox = Knox.from_dataframe(
    bike_collisions, time_col="time_in_days", delta=2000, tau=30
)
Code
bike_knox.p_poisson
0.0
Code
bike_knox.p_sim
0.01

In both the pedestrian and bicycle collision data the p-values (for both analytical and permutation-based inference) are significant, demonstrating evidence of space-time clustering in the collisions. That is, some times and places appear to be more dangerous than others. Using a local Knox statistic, we can start investigating where and when these places might be.

11.3 Local Clustering

As with a statistic like Moran’s \(I\), the Knox statistic can be decomposed into a local version that describes, at an observation level, whether the event at a given location and time is a member of a significant space-time cluster. This can be considered a kind of “hotspot” analysis where the locally significant values identify particular pockets of the study region (and time period) where events are clustered.

11.3.1 Pedestrian Collisions

Code
ped_knox_local = KnoxLocal.from_dataframe(
    ped_collisions.set_index("CASE_ID"), time_col="time_in_days", delta=2000, tau=30
)

As a local statistic, the resulting KnoxLocal class no longer has a single \(p\)-value, but a value for each observation. Since permutations is included as a default argument in the KnoxLocal class, p-values from both the analytical and simulation-based inference are available as attributes on the fitted class

Code
ped_knox_local.p_hypergeom
array([1.        , 0.17142085, 1.        , ..., 1.        , 1.        ,
       1.        ])
Code

ped_knox_local.p_sims
array([1.  , 0.19, 1.  , ..., 1.  , 1.  , 1.  ])

Together with the space-time neighbor relationships, we can use these p-values to identify local “hotspots”

Code
ped_knox_local.hotspots()
pvalue focal_time focal neighbor orientation
0 0.05 3760 9263257 9250758 lag
1 0.03 4431 9547792 9546979 lag
2 0.03 4431 9547792 9560633 lead
3 0.01 2399 90533779 8424929 lead
4 0.01 2399 90533779 90604293 lag
... ... ... ... ... ...
478 0.01 702 5950645 5925674 lag
479 0.01 691 5925674 5901518 lag
480 0.01 691 5925674 5970599 lead
481 0.01 691 5925674 5849678 lag
482 0.01 691 5925674 5950645 lead

483 rows × 5 columns

Observations are identified as hotspots if the observed number of space-time neighbors at that location exceeds the expected number. This means that every observation in the dataset is assigned a p-value that determines whether it is a significant locus of space-time interaction. From a graph theoretic perspective, the number of space-time neighbors for a unit is the number of incident edges for that node, with the graph being formed as the intersection of two other graphs: one for temporal neighbors and one for spatial neighbors. p-values for each node are determined as the probability of exceeding the number of incident edges in the space-time graph given the number of incident edges in each of the temporal and spatial graphs for that unit under the assumption of no space-time interaction.

Since we can use either computational/permutation-based inference or an analyical approximation, we can also use the analytical p-values to defined hotspots

Code
ped_knox_local.hotspots(inference="analytic")
pvalue focal_time focal neighbor orientation
0 0.017825 4431 9547792 9546979 lag
1 0.017825 4431 9547792 9560633 lead
2 0.001876 2399 90533779 8424929 lead
3 0.001876 2399 90533779 90604293 lag
4 0.001876 2399 90533779 8411306 lag
... ... ... ... ... ...
517 0.000168 702 5950645 5925674 lag
518 0.000417 691 5925674 5901518 lag
519 0.000417 691 5925674 5970599 lead
520 0.000417 691 5925674 5849678 lag
521 0.000417 691 5925674 5950645 lead

522 rows × 5 columns

And finally we can plot the hotspots to see where they are. By default:

  • any observation that is itself significant shows up as red
  • any observation that is not itself significant, but is a space-time neighbor of a significant observation shows up as yellow
  • any observation that is insignificant (at the defined crit value) shows up as gray
Code
ped_knox_local.plot()

You can also customize the plot if you like

Code
import matplotlib.pyplot as plt
import contextily as ctx

f, ax = plt.subplots(figsize=(9, 9))
ax = ped_knox_local.plot(point_kwargs={"alpha": 0.2}, plot_edges=False, ax=ax)
ctx.add_basemap(ax, source=ctx.providers.CartoDB.Positron, crs=sd_collisions.crs)
ax.axis("off")

For convenience, there is also an interactive version of the hot spot map that uses the same defaults. In the interactive version, an edge (line) is drawn between members of space-time neighborhoods. This makes it easy to identify the “neighborhood” of significant events

Code
ped_knox_local.explore(plot_edges=True)
Make this Notebook Trusted to load map: File -> Trust Notebook

11.3.2 Bike Collisions

Code
bike_knox_local = KnoxLocal.from_dataframe(
    bike_collisions, time_col="time_in_days", delta=2000, tau=30
)
Code
bike_knox_local.explore()
Make this Notebook Trusted to load map: File -> Trust Notebook

Comparing these maps, it looks like with these delta and tau parameters, the maps show similarities but also differences. Places like downtown, La Presa, or Escondido can be collision hotspots for both cyclists and pedestrians. But for cyclists, Coronado and Carlsbad show up as particularly dangerous locations.

Continuing with the analysis, the difference between these two hotspot locations could be because of differences in the local environment, such as poorer cycling infrastructure in Coronado and Carlsbad. But these hotspots could also just reflect that these places have greater numbers of cyclists, and thus, tend to have more frequent collisions.

Note: these data span a long time horizon, and in this simplified demonstration, we haven’t considered the population shift bias. In the case of vehicle collisions shown here, there is reason to believe that the underlying population at risk may have shifted over time, which could introduce a known bias into our inference. One important assumption of the Knox statistic is that population growth is constant over the geographic subareas over time, but San Diego county is a large, sprawling region and it’s possible there is more traffic activity in some parts of the region in later time periods due to uneven urban development. If so, this increase in activity will lead bias out count of observed space-time neighbors upwards. It is important to consider the performance of the Knox statistic (and its local variant) any time the underlying population surface may vary over time.

:::