Geospatial Network Analysis of Homicide Data from the Cook County Medical Examiner

Introduction

Cook County, Illinois recorded 1,079 homicides in 2021, its highest annual count in five years [@{https://maps.cookcountyil.gov/medexammaps/}]. Although homicide does not capture the full extent of violence in any given region, it is one of the most reliable proxy measures because reporting is relatively complete and counts can be cross-validated between two independent sources: law enforcement and medical examiners [@{https://publications.iadb.org/en/publication/11626/how-violence-measured}].

Practitioners wishing to prevent homicides often model risk with random effects derived from spatial autocorrelation, the phenomenon where related events tend to cluster in the same geographic areas. However, spatial autocorrelation does not capture the full story. Work by Papachristos et al. (citation) demonstrates how social networks can provide more precise estimates of risk in the aftermath of a given shooting by tracing connections between individuals.

It may be useful to combine the geospatial and network frameworks for determining risk. Here we explore data from the Cook County Medical Examiner’s office to inform future modeling work.

The Cook County Medical Examiner’s Office serves a jurisdiction of 5.2 million people. It performs autopsies on any death pertaining to a broad set of categories, including “criminal violence”. Records of its findings can be obtained from the Cook County Open Data Portal.

Broadly speaking, my analysis calculates common network metrics, then overlays them onto maps to identify patterns that might not otherwise be apparent. The data model treats zip codes as nodes and homicide cases as directed edges, each pointing from the zip code where the homicide took place to the victim’s zip code of residence. In network parlance, the out-degree represents the number of homicides taking place in a zip code (which I will call homicide-degree to avoid ambiguity), and the in-degree represents the number of homicide victims who called a zip code home (residence-degree). These connections may represent ties between zip codes that aren’t apparent from spatial proximity alone. Below, I list metrics of interest and my initial hypotheses:

Node Centrality Measures:
- Degree: Both homicide- and residence-degree will increase with proximity to the Loop (due to higher population density)
- Closeness: Will increase with proximity to highways or transit lines (due to their use for traveling between zip codes)
- Betweenness: Will increase with proximity to highway interchanges or transit transfer points (due to their use as hubs for travel)
Node Cluster Assignment: Will reflect patterns of racial geographic segregation (due to high racial disparities in homicide rates within Chicago)
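
As a minimal sketch of this data model, the toy example below builds a directed graph from three made-up zip codes (A, B, and C are placeholders, not real data) and computes both degrees with {tidygraph}. A homicide that occurs in the victim’s home zip code becomes a self-loop, which adds one to both degrees.

```r
library(dplyr)
library(tidygraph)

# Hypothetical cases: each row is one homicide, directed from the
# zip code of the incident to the zip code of the victim's residence.
toy_cases <- tibble(
  from = c("A", "A", "B", "C"),
  to   = c("B", "B", "C", "C")
)

toy_nodes <- as_tbl_graph(toy_cases, directed = TRUE) |>
  activate(nodes) |>
  mutate(
    homicide_degree  = centrality_degree(mode = "out"),  # incidents in the zip code
    residence_degree = centrality_degree(mode = "in")    # victims residing in the zip code
  ) |>
  as_tibble()

print(toy_nodes)
```

In this toy graph, A has a homicide-degree of 2 and a residence-degree of 0, while the self-loop on C counts once toward each of C’s degrees.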

Methods

Analysis

I carried out analysis in R with the following dependencies:

# Orchestration
library(targets)
# Generic Data Wrangling
library(dplyr)
library(readr)
library(stringr)
library(tidyr)
# Visualization
library(ggplot2)
# Mapping
library(ggmap)
library(leaflet)
library(mapdeck)
library(sf)
# Network Analysis
library(tidygraph)
# User-defined functions
source("R/describing_ccme.R")
source("R/visualizing_ccme.R")
source("R/wrangling_archive.R")
source("R/wrangling_esri.R")

sessionInfo()

## R version 4.2.0 (2022-04-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Pop!_OS 22.04 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] tidygraph_1.2.1 sf_1.0-7        mapdeck_0.3.4   leaflet_2.1.1  
##  [5] ggmap_3.0.0     ggplot2_3.3.6   tidyr_1.2.0     stringr_1.4.0  
##  [9] readr_2.1.2     dplyr_1.0.9     targets_0.12.0 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.8.3        lattice_0.20-45     class_7.3-20       
##  [4] png_0.1-7           ps_1.7.0            assertthat_0.2.1   
##  [7] digest_0.6.29       utf8_1.2.2          R6_2.5.1           
## [10] plyr_1.8.7          backports_1.4.1     e1071_1.7-9        
## [13] evaluate_0.15       httr_1.4.3          pillar_1.7.0       
## [16] RgoogleMaps_1.4.5.3 rlang_1.0.2         rstudioapi_0.13    
## [19] data.table_1.14.2   callr_3.7.0         jquerylib_0.1.4    
## [22] rmarkdown_2.14      htmlwidgets_1.5.4   igraph_1.3.1       
## [25] munsell_0.5.0       proxy_0.4-26        compiler_4.2.0     
## [28] xfun_0.31           pkgconfig_2.0.3     htmltools_0.5.2    
## [31] tidyselect_1.1.2    tibble_3.1.7        codetools_0.2-18   
## [34] fansi_1.0.3         crayon_1.5.1        tzdb_0.3.0         
## [37] withr_2.5.0         bitops_1.0-7        grid_4.2.0         
## [40] jsonlite_1.8.0      gtable_0.3.0        lifecycle_1.0.1    
## [43] DBI_1.1.2           magrittr_2.0.3      units_0.8-0        
## [46] scales_1.2.0        KernSmooth_2.23-20  cli_3.3.0          
## [49] stringi_1.7.6       sp_1.4-7            bslib_0.3.1        
## [52] ellipsis_0.3.2      generics_0.1.2      vctrs_0.4.1        
## [55] rjson_0.2.21        tools_4.2.0         glue_1.6.2         
## [58] purrr_0.3.4         hms_1.1.1           crosstalk_1.2.0    
## [61] jpeg_0.1-9          processx_3.5.3      fastmap_1.1.0      
## [64] yaml_2.3.5          colorspace_2.0-3    classInt_0.4-3     
## [67] base64url_1.4       knitr_1.39          sass_0.4.1

I orchestrated my analytic pipeline using the {targets} package. Here is a visualization of the full pipeline:

tar_visnetwork()
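
For readers unfamiliar with {targets}, a pipeline is declared in a _targets.R file as a list of targets. The sketch below is illustrative only and is not the actual pipeline definition: the target and function names are taken from elsewhere in this report, but the CSV path, the ccme_archive_raw_csv target name, and the package list are assumptions.

```r
# _targets.R (illustrative sketch; not the project's actual file)
library(targets)

# User-defined functions, as listed in the dependencies above
source("R/describing_ccme.R")
source("R/visualizing_ccme.R")
source("R/wrangling_archive.R")
source("R/wrangling_esri.R")

# Packages made available to every target (assumed list)
tar_option_set(packages = c("dplyr", "readr", "stringr", "tidyr", "tidygraph"))

list(
  # Hypothetical path; format = "file" reruns downstream targets when the file changes
  tar_target(ccme_archive_raw_csv, "data/ccme_archive.csv", format = "file"),
  tar_target(ccme_archive_raw, read_ccme_archive_raw(ccme_archive_raw_csv)),
  tar_target(ccme_homicide_edges, wrangle_ccme_homicide_edges(ccme_archive_raw)),
  tar_target(ccme_homicide_graph, wrangle_ccme_homicide_graph(ccme_homicide_edges))
)
```

With a file like this in place, tar_make() builds any outdated targets and tar_read() retrieves a stored result.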

I obtained the following raw data sources:

Dataset	Source	URL
CCME Case Archive	Cook County Medical Examiner	https://datacatalog.cookcountyil.gov/Public-Safety/Medical-Examiner-Case-Archive/cjeq-bs86
Cook County Boundary	Cook County GIS	https://hub-cookcountyil.opendata.arcgis.com/datasets/ea127f9e96b74677892722069c984198_1/explore
US Zip Code Areas	Esri	https://www.arcgis.com/home/item.html?id=8d2012a2016e484dafaac0451f9aea24

Wrangling CCME Cases

This section describes how I transformed a table of CCME cases into a graph with zip codes acting as nodes and cases serving as edges. First, I read in the CCME Case Archive using my function read_ccme_archive_raw(). The function is verbose because readr::read_csv() inferred some column types incorrectly, so I supplied a full type specification:

read_ccme_archive_raw

## function (archive_raw_csv) 
## {
##     parse_spec <- cols(`Case Number` = col_character(), `Date of Incident` = col_character(), 
##         `Date of Death` = col_character(), Age = col_double(), 
##         Gender = col_character(), Race = col_character(), Latino = col_logical(), 
##         `Manner of Death` = col_character(), `Primary Cause` = col_character(), 
##         `Primary Cause Line A` = col_character(), `Primary Cause Line B` = col_character(), 
##         `Primary Cause Line C` = col_character(), `Secondary Cause` = col_character(), 
##         `Gun Related` = col_logical(), `Opioid Related` = col_logical(), 
##         `Cold Related` = col_logical(), `Heat Related` = col_logical(), 
##         `Commissioner District` = col_double(), `Incident Address` = col_character(), 
##         `Incident City` = col_character(), `Incident Zip Code` = col_character(), 
##         longitude = col_double(), latitude = col_double(), location = col_character(), 
##         `Residence City` = col_character(), `Residence Zip` = col_character(), 
##         OBJECTID = col_double(), `Chicago Ward` = col_double(), 
##         `Chicago Community Area` = col_character(), `COVID Related` = col_logical())
##     df1 <- read_csv(archive_raw_csv, col_types = parse_spec)
##     return(df1)
## }

The raw case archive looks like:

glimpse(tar_read(ccme_archive_raw))

## Rows: 66,950
## Columns: 30
## $ `Case Number`            <chr> "ME2015-02039", "ME2015-04003", "ME2015-00307…
## $ `Date of Incident`       <chr> NA, NA, "01/21/2015 01:17:00 AM", "12/20/2014…
## $ `Date of Death`          <chr> "05/18/2015 11:30:00 AM", NA, "01/21/2015 02:…
## $ Age                      <dbl> NA, NA, 82, 43, 72, 52, 42, 55, 52, 82, 52, 0…
## $ Gender                   <chr> "Male", NA, "Male", "Female", "Male", "Male",…
## $ Race                     <chr> NA, NA, "White", "Black", "White", "White", "…
## $ Latino                   <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALS…
## $ `Manner of Death`        <chr> "NATURAL", NA, "NATURAL", "ACCIDENT", "NATURA…
## $ `Primary Cause`          <chr> "UNDETERMINED NATURAL CAUSES", "NONHUMAN REMA…
## $ `Primary Cause Line A`   <chr> NA, NA, NA, NA, NA, "FALL", NA, "ASSAULT", "S…
## $ `Primary Cause Line B`   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ `Primary Cause Line C`   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ `Secondary Cause`        <chr> NA, NA, NA, NA, NA, "CHRONIC ALCOHOLISM", NA,…
## $ `Gun Related`            <lgl> FALSE, NA, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ `Opioid Related`         <lgl> FALSE, NA, FALSE, TRUE, FALSE, FALSE, FALSE, …
## $ `Cold Related`           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ `Heat Related`           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ `Commissioner District`  <dbl> NA, NA, NA, NA, NA, NA, NA, 16, NA, NA, NA, 1…
## $ `Incident Address`       <chr> NA, NA, NA, NA, NA, NA, NA, "8516 47TH ST  #4…
## $ `Incident City`          <chr> NA, NA, NA, NA, NA, NA, NA, "LYONS", "CHICAGO…
## $ `Incident Zip Code`      <chr> NA, NA, NA, NA, NA, NA, NA, "60534", "60609",…
## $ longitude                <dbl> NA, NA, NA, NA, NA, NA, NA, -87.83478, NA, NA…
## $ latitude                 <dbl> NA, NA, NA, NA, NA, NA, NA, 41.80606, NA, NA,…
## $ location                 <chr> NA, NA, NA, NA, NA, NA, NA, "(41.80606362, -8…
## $ `Residence City`         <chr> NA, NA, "Elk Grove Village", "Bolingbrook", "…
## $ `Residence Zip`          <chr> NA, NA, "60007", "60440", "60616", "60632", "…
## $ OBJECTID                 <dbl> 60, 121, 248, 311, 381, 467, 573, 872, 918, 9…
## $ `Chicago Ward`           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2…
## $ `Chicago Community Area` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "…
## $ `COVID Related`          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…

Next, I converted the raw cases into a table of edges with my function wrangle_ccme_homicide_edges(). This function uses the {janitor} package’s clean_names() function to convert variable names to snake case, drops cases with missing or invalid zip codes (including the placeholder values 00000 and 99999), keeps only cases whose manner of death is HOMICIDE, and truncates hyphenated ZIP+4 codes to their 5-digit prefix.

wrangle_ccme_homicide_edges

## function (archive_raw_df) 
## {
##     df1 <- separate(separate(filter(filter(filter(filter(filter(filter(filter(filter(filter(filter(filter(janitor::clean_names(archive_raw_df), 
##         !is.na(incident_zip_code)), !is.na(residence_zip)), str_detect(incident_zip_code, 
##         "[:digit:]{5}")), !str_detect(incident_zip_code, "[:digit:]{6,}")), 
##         str_detect(residence_zip, "[:digit:]{5}")), !str_detect(residence_zip, 
##         "[:digit:]{6,}")), incident_zip_code != "00000"), incident_zip_code != 
##         "99999"), residence_zip != "00000"), residence_zip != 
##         "99999"), manner_of_death == "HOMICIDE"), incident_zip_code, 
##         into = "from", sep = "-", remove = FALSE, extra = "drop"), 
##         residence_zip, into = "to", sep = "-", remove = FALSE, 
##         extra = "drop")
##     return(df1)
## }
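
To make the zip code rules concrete, here is a small sketch that applies the same patterns to made-up values (the toy column and its contents are hypothetical). Together, the two str_detect() filters keep strings that contain a run of five digits but no run of six or more, so an unhyphenated 9-digit code is dropped rather than truncated.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical zip code strings covering the cases handled above
toy <- tibble(zip = c("60644", "60601-1234", "606011234", "6064", "00000", NA))

cleaned <- toy |>
  filter(!is.na(zip)) |>
  filter(str_detect(zip, "[:digit:]{5}")) |>     # at least five digits in a row
  filter(!str_detect(zip, "[:digit:]{6,}")) |>   # but never six or more in a row
  filter(!zip %in% c("00000", "99999")) |>       # placeholder codes
  # Keep only the part before the hyphen of a ZIP+4 code
  separate(zip, into = "zip5", sep = "-", remove = FALSE, extra = "drop")

print(cleaned)
```

Only "60644" and "60601-1234" survive, and the latter is reduced to "60601" in the new zip5 column.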

Next, I converted the edge table into a graph object with my function wrangle_ccme_homicide_graph(), which also adds centrality measures and a Louvain group assignment using functions from the {tidygraph} package. Note that degree and betweenness are computed on the directed graph, while closeness and grouping are computed on the undirected version. The harmonic variant of closeness, centrality_closeness_harmonic(), remains defined for graphs with disconnected components. Finally, the function computes percentile ranks for the centrality measures and a standardized difference between each zip code’s residence_degree and homicide_degree.

wrangle_ccme_homicide_graph

## function (ccme_homicide_edges_df) 
## {
##     df1 <- unmorph(mutate(morph(mutate(activate(as_tbl_graph(ccme_homicide_edges_df, 
##         directed = TRUE), nodes), homicide_degree = centrality_degree(mode = "out"), 
##         residence_degree = centrality_degree(mode = "in"), std_residence_homicide_diff = (residence_degree - 
##             homicide_degree)/(residence_degree + homicide_degree), 
##         homicide_degree_perc_rank = percent_rank(homicide_degree), 
##         residence_degree_perc_rank = percent_rank(residence_degree), 
##         betweenness = centrality_betweenness(), betweenness_perc_rank = percent_rank(betweenness)), 
##         to_undirected), closeness = centrality_closeness_harmonic(), 
##         neighborhood = group_louvain()))
##     return(df1)
## }
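
The morph()/unmorph() pattern in this function may be easier to follow on a toy graph. The sketch below uses made-up node names (two triangles joined by a single edge) and computes out-degree on the directed graph, then temporarily switches to the undirected representation for Louvain grouping before merging the result back.

```r
library(dplyr)
library(tidygraph)

# Hypothetical directed edges: triangle a-b-c, triangle d-e-f, bridge c -> d
toy_edges <- tibble(
  from = c("a", "b", "c", "d", "e", "f", "c"),
  to   = c("b", "c", "a", "e", "f", "d", "d")
)

toy_graph <- as_tbl_graph(toy_edges, directed = TRUE) |>
  activate(nodes) |>
  mutate(out_degree = centrality_degree(mode = "out")) |>  # uses edge direction
  morph(to_undirected) |>                                   # temporary undirected view
  mutate(community = group_louvain()) |>                    # ignores edge direction
  unmorph()                                                 # back to the directed graph

toy_nodes <- toy_graph |>
  activate(nodes) |>
  as_tibble()

print(toy_nodes)
```

After unmorph(), the graph is directed again, but the node table keeps the community column that was computed on the undirected view.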

tar_read(ccme_homicide_graph)

## # A tbl_graph: 408 nodes and 6117 edges
## #
## # A directed multigraph with 18 components
## #
## # Node Data: 408 × 10 (active)
##   name  homicide_degree residence_degree std_residence_h… homicide_degree…
##   <chr>           <dbl>            <dbl>            <dbl>            <dbl>
## 1 60534               5                4          -0.111             0.762
## 2 60402              21               43           0.344             0.894
## 3 60411              75               77           0.0132            0.946
## 4 60120               4                7           0.273             0.722
## 5 60617             235              204          -0.0706            0.983
## 6 60123               3                4           0.143             0.695
## # … with 402 more rows, and 5 more variables: residence_degree_perc_rank <dbl>,
## #   betweenness <dbl>, betweenness_perc_rank <dbl>, closeness <dbl>,
## #   neighborhood <int>
## #
## # Edge Data: 6,117 × 32
##    from    to case_number date_of_incident date_of_death   age gender race 
##   <int> <int> <chr>       <chr>            <chr>         <dbl> <chr>  <chr>
## 1     1     1 ME2016-033… 07/08/2016 09:3… 07/08/2016 1…    55 Male   Black
## 2     2     2 ME2015-041… 11/16/2012 06:3… 09/26/2015 1…    37 Male   Black
## 3     3    69 ME2016-041… 08/23/2016 07:4… 08/23/2016 0…    41 Male   Black
## # … with 6,114 more rows, and 24 more variables: latino <lgl>,
## #   manner_of_death <chr>, primary_cause <chr>, primary_cause_line_a <chr>,
## #   primary_cause_line_b <chr>, primary_cause_line_c <chr>,
## #   secondary_cause <chr>, gun_related <lgl>, opioid_related <lgl>,
## #   cold_related <lgl>, heat_related <lgl>, commissioner_district <dbl>,
## #   incident_address <chr>, incident_city <chr>, incident_zip_code <chr>,
## #   longitude <dbl>, latitude <dbl>, location <chr>, residence_city <chr>,
## #   residence_zip <chr>, objectid <dbl>, chicago_ward <dbl>,
## #   chicago_community_area <chr>, covid_related <lgl>

Wrangling Geographic Features

In order to map our network, I needed data on the geographic boundaries of each zip code, which I obtained from the Esri dataset. However, I didn’t want to store this entire dataset in persistent memory since it encompasses every zip code in the entire country. Instead I filtered just to Cook County zip codes with my function wrangle_cook_county_zip_code_boundaries(), which first filters to zip codes in Illinois, then filters to zip code boundaries that geographically intersect with Cook County. The first filtering step is not strictly necessary, but reduces the computational cost of the second step.

wrangle_cook_county_zip_code_boundaries

## function (esri_zip_code_boundaries_gdb, cook_county_boundary_sf) 
## {
##     sf1 <- filter(st_read(esri_zip_code_boundaries_gdb), STATE == 
##         "IL")
##     sf2 <- filter(sf1, st_intersects(sf1, cook_county_boundary_sf, 
##         sparse = FALSE)[, 1])
##     return(sf2)
## }
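
The intersection filter can be sketched with toy geometries. The squares below are made up and have no coordinate reference system; they only illustrate how the logical matrix from st_intersects() is used to subset rows.

```r
library(dplyr)
library(sf)

# Helper (hypothetical) that builds an axis-aligned square polygon
square <- function(xmin, ymin, size = 1) {
  st_polygon(list(rbind(
    c(xmin, ymin),
    c(xmin + size, ymin),
    c(xmin + size, ymin + size),
    c(xmin, ymin + size),
    c(xmin, ymin)
  )))
}

# Three toy "zip code" polygons and one toy "county" polygon
zips <- st_sf(
  zip = c("A", "B", "C"),
  geometry = st_sfc(square(0, 0), square(0.5, 0.5), square(5, 5))
)
county <- st_sf(
  name = "county",
  geometry = st_sfc(square(0.8, 0.8))
)

# One row per zip polygon, one column per county polygon
hits <- st_intersects(zips, county, sparse = FALSE)[, 1]
inside <- filter(zips, hits)

print(inside$zip)
```

Because st_intersects() is also true for polygons that merely overlap or touch the boundary, zip codes straddling the county line are kept rather than clipped.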

Wrangling Nodes and Edges for Visualization

The full homicide graph includes zip code nodes that do not intersect with Cook County. I have included these nodes and their edges when calculating graph metrics, but for simplified visualization, we will only include those intersecting with Cook County. The {mapdeck} visualization library requires graph data to be structured differently than {tidygraph}. We’ll generate the corresponding edge table with my function wrangle_ccme_homicide_vis_edges(), which collapses all edges with the same from-to pair (weighting each collapsed edge by its number of cases), joins edges to the coordinates for the geographic centers of each from/to zip code, and adds the from/to nodes’ graph metrics to the edge table:

wrangle_ccme_homicide_vis_edges

## function (ccme_homicide_zip_code_boundaries_sf, ccme_homicide_edges_df, 
##     ccme_homicide_nodes_df) 
## {
##     zip_cent_coords <- st_as_sf(st_drop_geometry(mutate(select(ccme_homicide_zip_code_boundaries_sf, 
##         ZIP_CODE), centroid = st_centroid(Shape))))
##     zip_cent_coords <- st_drop_geometry(bind_cols(zip_cent_coords, 
##         st_coordinates(zip_cent_coords)))
##     vis_edges <- rename(inner_join(rename(inner_join(rename(inner_join(rename(inner_join(summarize(group_by(ccme_homicide_edges_df, 
##         from, to), weight = n()), zip_cent_coords, by = c(from = "ZIP_CODE")), 
##         from_lon = X, from_lat = Y), zip_cent_coords, by = c(to = "ZIP_CODE")), 
##         to_lon = X, to_lat = Y), ccme_homicide_nodes_df, by = c(from = "name")), 
##         from_residence_degree = residence_degree, from_homicide_degree = homicide_degree, 
##         from_residence_homicide_diff = std_residence_homicide_diff, 
##         from_betweenness = betweenness, from_closeness = closeness, 
##         from_neighborhood = neighborhood), ccme_homicide_nodes_df, 
##         by = c(to = "name")), to_residence_degree = residence_degree, 
##         to_homicide_degree = homicide_degree, to_residence_homicide_diff = std_residence_homicide_diff, 
##         to_betweenness = betweenness, to_closeness = closeness, 
##         to_neighborhood = neighborhood)
##     return(vis_edges)
## }
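
The collapsing and joining steps can be sketched on a toy edge list. All zip labels and coordinates below are made up for illustration; the ZIP_CODE, X, and Y column names follow the function above.

```r
library(dplyr)

# Hypothetical case-level edges (one row per homicide)
toy_edges <- tibble(
  from = c("A", "A", "A", "B", "A"),
  to   = c("A", "A", "B", "A", "Z")
)

# Hypothetical centroid coordinates; "Z" is deliberately missing
toy_centroids <- tibble(
  ZIP_CODE = c("A", "B"),
  X = c(-87.7, -87.6),
  Y = c(41.9, 41.8)
)

vis_edges <- toy_edges |>
  count(from, to, name = "weight") |>                   # one row per from-to pair
  inner_join(toy_centroids, by = c(from = "ZIP_CODE")) |>
  rename(from_lon = X, from_lat = Y) |>
  inner_join(toy_centroids, by = c(to = "ZIP_CODE")) |>
  rename(to_lon = X, to_lat = Y)

print(vis_edges)
```

The inner joins silently drop any edge whose endpoint has no centroid (here, the edge to "Z"), which is how edges leaving the mapped area fall out of the visualization table.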

Mapping

We’ll generate maps of centrality measures using my function map_ccme_centrality(), where the argument centrality_var is the name of the metric of interest, given as a string. Note: to reproduce these maps in your own environment, you will need to follow the {mapdeck} instructions for setting up a Mapbox API token.

map_ccme_centrality

## function (centrality_var, cook_county_homicide_vis_nodes_df, 
##     cook_county_homicide_vis_edges_df) 
## {
##     m <- add_arc(add_polygon(mapdeck(style = "mapbox://styles/mapbox/dark-v10", 
##         pitch = 45), data = cook_county_homicide_vis_nodes_df, 
##         fill_colour = centrality_var, fill_opacity = 175, stroke_colour = "#FFFFFFFF", 
##         stroke_width = 100, legend = list(fill_colour = TRUE, 
##             stroke_colour = FALSE)), data = cook_county_homicide_vis_edges_df, 
##         origin = c("from_lon", "from_lat"), destination = c("to_lon", 
##             "to_lat"), stroke_from = paste0("from_", centrality_var), 
##         stroke_from_opacity = 175, stroke_to = paste0("to_", 
##             centrality_var), stroke_to_opacity = 175, stroke_width = "weight")
##     return(m)
## }

We’ll generate a map of algorithm-identified node groups with my function map_ccme_neighborhoods(), which pre-filters to groups 1 through 7, the seven largest, for simplified interpretation:

map_ccme_neighborhoods

## function (cook_county_homicide_vis_nodes_df) 
## {
##     neighborhoods <- filter(cook_county_homicide_vis_nodes_df, 
##         neighborhood %in% c(1:7))
##     m <- add_polygon(mapdeck(style = "mapbox://styles/mapbox/dark-v10", 
##         location = c(-87.7495, 41.816544), zoom = 5), data = neighborhoods, 
##         fill_colour = "neighborhood", fill_opacity = 175, stroke_colour = "#FFFFFFFF", 
##         stroke_width = 100, legend = list(fill_colour = TRUE, 
##             stroke_colour = FALSE), update_view = FALSE, tooltip = "neighborhood")
##     return(m)
## }

Results

Homicide Summary Statistics

When summarizing the homicide cases in the CCME archive, I found that most victims were male (88%) and Black (78%), with a median age of 27 (IQR 21 to 36).

tar_read(ccme_homicide_edges) |> 
        select(age, gender, race, latino) |> 
        gtsummary::tbl_summary()

Characteristic	N = 6,117¹
age	27 (21, 36)
Unknown	1
gender
Female	708 (12%)
Male	5,409 (88%)
race
Am. Indian	3 (<0.1%)
Asian	25 (0.4%)
Black	4,765 (78%)
Other	47 (0.8%)
White	1,271 (21%)
Unknown	6
latino	970 (16%)
¹ Median (IQR); n (%)

Degree

Homicide-Degree

Here is the distribution of homicide-degree of zip codes:

tar_load(ccme_homicide_nodes)
tar_load(ccme_homicide_edges)
tar_load(ccme_homicide_vis_nodes)
tar_load(cook_county_boundary)

ccme_homicide_nodes |> 
    ggplot(aes(x = homicide_degree)) +
    geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Here is a map of the homicide network with coloring by homicide-degree. To zoom the map, use your scroll wheel. To pan the map, click and drag. To rotate the map, click and drag while holding the ctrl key:

The top 10 zip codes with the highest homicide-degree were:

top_10_homicide_degree <-
    ccme_homicide_nodes |> 
    arrange(desc(homicide_degree)) |> 
    head(10)

top_10_homicide_degree

## # A tibble: 10 × 12
##    PO_NAME STATE name  homicide_degree residence_degree std_residence_homicide_…
##    <chr>   <chr> <chr>           <dbl>            <dbl>                    <dbl>
##  1 Chicago IL    60644             351              264                 -0.141  
##  2 Chicago IL    60628             335              317                 -0.0276 
##  3 Chicago IL    60624             332              249                 -0.143  
##  4 Chicago IL    60620             297              309                  0.0198 
##  5 Chicago IL    60623             290              266                 -0.0432 
##  6 Chicago IL    60619             280              245                 -0.0667 
##  7 Chicago IL    60651             239              238                 -0.00210
##  8 Chicago IL    60617             235              204                 -0.0706 
##  9 Chicago IL    60621             229              162                 -0.171  
## 10 Chicago IL    60636             222              211                 -0.0254 
## # … with 6 more variables: homicide_degree_perc_rank <dbl>,
## #   residence_degree_perc_rank <dbl>, betweenness <dbl>,
## #   betweenness_perc_rank <dbl>, closeness <dbl>, neighborhood <int>

Here they are on the map. Hover over the zip code to see its ID:

Residence-Degree

The top 10 zip codes with the highest residence-degree were:

top_10_residence_degree <-
    ccme_homicide_nodes |> 
    arrange(desc(residence_degree)) |> 
    head(10)

top_10_residence_degree

## # A tibble: 10 × 12
##    PO_NAME STATE name  homicide_degree residence_degree std_residence_homicide_…
##    <chr>   <chr> <chr>           <dbl>            <dbl>                    <dbl>
##  1 Chicago IL    60628             335              317                 -0.0276 
##  2 Chicago IL    60620             297              309                  0.0198 
##  3 Chicago IL    60623             290              266                 -0.0432 
##  4 Chicago IL    60644             351              264                 -0.141  
##  5 Chicago IL    60624             332              249                 -0.143  
##  6 Chicago IL    60619             280              245                 -0.0667 
##  7 Chicago IL    60651             239              238                 -0.00210
##  8 Chicago IL    60636             222              211                 -0.0254 
##  9 Chicago IL    60617             235              204                 -0.0706 
## 10 Chicago IL    60629             183              189                  0.0161 
## # … with 6 more variables: homicide_degree_perc_rank <dbl>,
## #   residence_degree_perc_rank <dbl>, betweenness <dbl>,
## #   betweenness_perc_rank <dbl>, closeness <dbl>, neighborhood <int>

Here they are on the map:

Correlation between Homicide- and Residence-Degree

Looking at these tables and maps, we intuit that homicide-degree and residence-degree have a strong, positive correlation, which can be formally tested:

correlation::cor_test(
    ccme_homicide_nodes, 
    "homicide_degree", "residence_degree"
) |> 
plot()

However, outliers to this correlation might provide useful intelligence. To help identify discrepancies, I calculated a standardized difference, which ranges from -1 to 1. If the standardized difference is negative, it means homicide_degree is relatively greater; if it is positive, it means residence_degree is relatively greater; if it is zero, it means the degrees are equal. I also calculated percentile ranks for homicide- and residence-degrees.
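
The standardized difference is simple enough to check by hand. The helper below is a hypothetical restatement of the formula used in wrangle_ccme_homicide_graph(), applied to degree values taken from the tables in this report.

```r
# Hypothetical helper restating the formula from wrangle_ccme_homicide_graph()
standardized_diff <- function(residence_degree, homicide_degree) {
  (residence_degree - homicide_degree) / (residence_degree + homicide_degree)
}

# 60611: 12 homicides in the zip code, 2 resident victims
diff_60611 <- standardized_diff(residence_degree = 2, homicide_degree = 12)

# 60440 (Bolingbrook): 0 homicides in the zip code, 9 resident victims
diff_60440 <- standardized_diff(residence_degree = 9, homicide_degree = 0)

# 60651: 239 homicides in the zip code, 238 resident victims
diff_60651 <- standardized_diff(residence_degree = 238, homicide_degree = 239)

round(c(diff_60611, diff_60440, diff_60651), 3)
```

These reproduce the tabulated values of -0.714, 1, and about -0.002. The measure ignores scale, so a zip code with degrees of 1 and 0 scores the same as one with 100 and 0, which is why it is paired with percentile-rank thresholds.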

High Homicide-Degree, Relatively Low Residence-Degree

Iteratively adjusting thresholds, I identified the following zip codes that have both a high homicide-degree and relatively low residence-degree.

top_homicide_discrepancies <-
    ccme_homicide_nodes |> 
    filter(homicide_degree_perc_rank > 0.75) |> 
    filter(std_residence_homicide_diff < -0.50) 

top_homicide_discrepancies

## # A tibble: 4 × 12
##   PO_NAME          STATE name  homicide_degree residence_degree std_residence_h…
##   <chr>            <chr> <chr>           <dbl>            <dbl>            <dbl>
## 1 Chicago          IL    60611              12                2           -0.714
## 2 Chicago          IL    60606               5                1           -0.667
## 3 Chicago          IL    60654              12                3           -0.6  
## 4 Elk Grove Villa… IL    60007               5                1           -0.667
## # … with 6 more variables: homicide_degree_perc_rank <dbl>,
## #   residence_degree_perc_rank <dbl>, betweenness <dbl>,
## #   betweenness_perc_rank <dbl>, closeness <dbl>, neighborhood <int>

Here are those zip codes on the map:

High Residence-Degree, Relatively Low Homicide-Degree

We need to be careful when evaluating the converse phenomenon with high residence-degree and relatively low homicide-degree. That’s because we do not have a full accounting of homicide incidents taking place outside of Cook County. Therefore, the relative amount of residence-degree compared to homicide-degree will look artificially high in outside zip codes. We can only get an accurate assessment of the relative difference for zip codes with high residence-degree inside of the county:

tar_load(ccme_homicide_zip_code_boundaries)

cook_county_zip_codes <-
    ccme_homicide_zip_code_boundaries |> 
    filter(
        st_intersects(
            ccme_homicide_zip_code_boundaries, cook_county_boundary,
            sparse = FALSE
        )[,1]
    )

top_residence_discrepancies <-
    ccme_homicide_nodes |> 
    filter(name %in% cook_county_zip_codes$ZIP_CODE) |> 
    filter(residence_degree_perc_rank > 0.75) |> 
    filter(std_residence_homicide_diff > 0.50)

top_residence_discrepancies

## # A tibble: 6 × 12
##   PO_NAME       STATE name  homicide_degree residence_degree std_residence_homi…
##   <chr>         <chr> <chr>           <dbl>            <dbl>               <dbl>
## 1 Hillside      IL    60162               2                8               0.6  
## 2 Richton Park  IL    60471               5               17               0.545
## 3 Burbank       IL    60459               2                7               0.556
## 4 Justice       IL    60458               4               14               0.556
## 5 Broadview     IL    60155               4               23               0.704
## 6 Chicago Ridge IL    60415               1                8               0.778
## # … with 6 more variables: homicide_degree_perc_rank <dbl>,
## #   residence_degree_perc_rank <dbl>, betweenness <dbl>,
## #   betweenness_perc_rank <dbl>, closeness <dbl>, neighborhood <int>

Here are those zip codes on the map:

Top Residence-Degree Outside Cook County

Since we can’t make a relative comparison for zip codes outside the county, let’s look instead at the 10 outside zip codes with the highest absolute residence-degree.

top_outside_residences <-
    ccme_homicide_nodes |> 
    filter(!(name %in% cook_county_zip_codes$ZIP_CODE)) |> 
    arrange(desc(residence_degree)) |> 
    head(10)

top_outside_residences

## # A tibble: 10 × 12
##    PO_NAME      STATE name  homicide_degree residence_degree std_residence_homi…
##    <chr>        <chr> <chr>           <dbl>            <dbl>               <dbl>
##  1 Merrillville IN    46410               1               14               0.867
##  2 East Chicago IN    46312               5               12               0.412
##  3 Aurora       IL    60505               1               10               0.818
##  4 Bolingbrook  IL    60440               0                9               1    
##  5 Gary         IN    46404               8                8               0    
##  6 Gary         IN    46403               6                8               0.143
##  7 Gary         IN    46406               2                7               0.556
##  8 Joliet       IL    60433               3                7               0.4  
##  9 Kankakee     IL    60901               7                7               0    
## 10 Gary         IN    46408               4                7               0.273
## # … with 6 more variables: homicide_degree_perc_rank <dbl>,
## #   residence_degree_perc_rank <dbl>, betweenness <dbl>,
## #   betweenness_perc_rank <dbl>, closeness <dbl>, neighborhood <int>

Closeness

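The closeness of each zip code node summarizes how few edges must be traversed to reach the other zip codes. Because I used the harmonic variant, it is based on the reciprocals of the shortest-path lengths to every other node rather than on their average, so unreachable zip codes contribute zero and higher values indicate a more central zip code.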
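
Harmonic closeness, the variant computed in the Methods, can be worked by hand on a toy graph. The sketch below uses {igraph} directly on made-up nodes (a path a-b-c plus an isolated node d) and sums the reciprocals of shortest-path lengths; it illustrates the idea and is not necessarily the exact normalization that {tidygraph} applies.

```r
library(igraph)

# Hypothetical undirected graph: path a - b - c, plus isolated node d
toy <- graph_from_data_frame(
  d = data.frame(from = c("a", "b"), to = c("b", "c")),
  directed = FALSE,
  vertices = data.frame(name = c("a", "b", "c", "d"))
)

path_lengths <- distances(toy)      # Inf marks unreachable pairs
reciprocal <- 1 / path_lengths      # 1 / Inf evaluates to 0
diag(reciprocal) <- 0               # ignore each node's distance to itself
harmonic_closeness <- rowSums(reciprocal)

# For comparison: the mean distance is infinite for every node,
# because no node can reach (or be reached from) d
mean_distance <- rowMeans(path_lengths)

print(harmonic_closeness)
print(mean_distance)
```

Node b scores highest (2), the path ends a and c score 1.5, and the isolated node d scores 0, whereas an average of distances cannot distinguish any of them. This matters here because the homicide graph has 18 components.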

Here is the distribution of closeness:

ccme_homicide_nodes |> 
    ggplot(aes(x = closeness)) +
    geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Here is a map of the homicide network with coloring by closeness:

Betweenness

The betweenness of each zip code node counts the shortest paths between other pairs of zip codes that pass through it. High betweenness can be interpreted as high social traffic.
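
A toy directed graph shows how a node can have high betweenness while having a low out-degree. The node names below are made up, and {igraph} is called directly for brevity.

```r
library(igraph)

# Hypothetical directed edges: a, e, and f all point to b; b -> c -> d
toy <- graph_from_data_frame(
  data.frame(
    from = c("a", "e", "f", "b", "c"),
    to   = c("b", "b", "b", "c", "d")
  ),
  directed = TRUE
)

toy_betweenness <- betweenness(toy, directed = TRUE)
toy_out_degree <- degree(toy, mode = "out")

print(rbind(betweenness = toy_betweenness, out_degree = toy_out_degree))
```

Node b lies on all six shortest paths from a, e, or f to c or d, so its betweenness is 6 even though its out-degree is only 1.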

Here is the distribution of betweenness:

ccme_homicide_nodes |> 
    ggplot(aes(x = betweenness)) +
    geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Here is a map of the homicide network with coloring by betweenness:

High Betweenness, Low Homicide-Degree

Zip codes with high betweenness and low homicide-degree can be thought of as bottlenecks of social traffic in the homicide network. Here, “high” means a percentile rank above 0.5 and “low” means a percentile rank below 0.5:

top_bottlenecks <-
    ccme_homicide_nodes |> 
    filter(betweenness_perc_rank > 0.5) |> 
    filter(homicide_degree_perc_rank < 0.5) |> 
    select(
        PO_NAME, STATE, name, 
        homicide_degree, homicide_degree_perc_rank,
        betweenness, betweenness_perc_rank
    )

top_bottlenecks

## # A tibble: 18 × 7
##    PO_NAME          STATE name  homicide_degree homicide_degree_per… betweenness
##    <chr>            <chr> <chr>           <dbl>                <dbl>       <dbl>
##  1 Prospect Heights IL    60070               1                0.484     286    
##  2 Northbrook       IL    60062               1                0.484     426    
##  3 Chicago          IL    60603               1                0.484       0.143
##  4 Morton Grove     IL    60053               1                0.484       0.165
##  5 Schererville     IN    46375               1                0.484       0.158
##  6 Merrillville     IN    46410               1                0.484     143    
##  7 Addison          IL    60101               1                0.484     144    
##  8 Lincolnwood      IL    60712               1                0.484     142    
##  9 Joliet           IL    60431               1                0.484       3.43 
## 10 Aurora           IL    60504               1                0.484       8.01 
## 11 Crestwood        IL    60418               1                0.484       1    
## 12 Olympia Fields   IL    60461               1                0.484      21.2  
## 13 Naperville       IL    60563               1                0.484       2.28 
## 14 Franklin Park    IL    60131               1                0.484       4.71 
## 15 Hammond          IN    46327               1                0.484       0.154
## 16 Mundelein        IL    60060               1                0.484     366    
## 17 Villa Park       IL    60181               1                0.484       2.22 
## 18 Zion             IL    60099               1                0.484       0.539
## # … with 1 more variable: betweenness_perc_rank <dbl>
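Every row in this table has the same homicide-degree percentile rank because tied values receive the same rank. The sketch below shows that behavior on hypothetical data, assuming the `*_perc_rank` columns were derived with something like `dplyr::percent_rank()`; the actual derivation is not shown in this section.

```r
library(dplyr)

# Hypothetical homicide-degree values for six toy zip codes.
toy_nodes <- tibble(
  name = c("z1", "z2", "z3", "z4", "z5", "z6"),
  homicide_degree = c(0, 0, 1, 1, 1, 12)
)

# percent_rank() rescales the minimum rank to the range [0, 1];
# tied values all receive the same (lowest) percentile rank.
toy_nodes <-
  toy_nodes |>
  mutate(homicide_degree_perc_rank = percent_rank(homicide_degree))

print(toy_nodes)
```

Because ties collapse to a single percentile rank, a threshold such as `< 0.5` either keeps or drops an entire group of tied zip codes together.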

Louvain-Clustered Neighborhoods

The Louvain algorithm is a method for grouping nodes in a network. It searches for the grouping that maximizes modularity, meaning that connections within each group are denser, and connections between groups sparser, than would be expected by chance.

ccme_homicide_nodes |> 
    group_by(neighborhood) |> 
    summarize(n = n()) |> 
    arrange(desc(n))

## # A tibble: 33 × 2
##    neighborhood     n
##           <int> <int>
##  1            1    72
##  2            2    49
##  3            3    48
##  4            4    39
##  5            5    36
##  6            6    34
##  7            7    33
##  8            8    28
##  9            9    13
## 10           10    13
## # … with 23 more rows
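As a minimal sketch of what the Louvain algorithm does, the toy example below clusters a hypothetical network made of two triangles joined by a single edge. Note that `igraph::cluster_louvain()` works on undirected graphs, so the toy graph is built without edge direction; how direction was handled for the homicide network is not shown in this section.

```r
library(igraph)

# Hypothetical toy network: triangles A-B-C and D-E-F joined by the edge C-D.
toy_edges <- data.frame(
  from = c("A", "A", "B", "C", "D", "D", "E"),
  to   = c("B", "C", "C", "D", "E", "F", "F")
)
toy_graph <- graph_from_data_frame(toy_edges, directed = FALSE)

# Louvain visits nodes in random order, so fix the seed for reproducibility.
set.seed(1)
toy_communities <- cluster_louvain(toy_graph)

# Group label for each node, and the modularity of the resulting partition.
toy_membership <- membership(toy_communities)
toy_modularity <- modularity(toy_communities)

print(toy_membership)
print(toy_modularity)
```

The algorithm assigns each triangle to its own group, since that split keeps six of the seven edges inside a group.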

Here is a map of the 7 largest Louvain neighborhoods (it is hard to tell more than about 7 colors apart on one map). If you have trouble differentiating two colors, hover over a zip code to see its neighborhood number.

tar_read(ccme_homicide_neighborhood_map)

Conclusions

Overall, by my subjective assessment, my hypotheses about measures of centrality within the network were inaccurate, but my hypothesis about cluster assignment was accurate.

Degree

In the context of my network model, the out-degree of each zip code’s node represents the number of homicide incidents that took place there (homicide-degree). The in-degree represents the number of times the zip code was listed as place of residence for a victim of a homicide case (residence-degree).

Contrary to expectations, homicide- and residence-degree were not greatest in the downtown zip codes with the highest population density. Rather, they concentrated on the South and West Sides of Chicago, with an extremely strong correlation between the two. That said, outliers from that correlation showed interesting patterns. Three zip codes with high homicide-degree but relatively low residence-degree straddled the Loop. These zip codes may represent areas where residents themselves are at low risk of experiencing violence, but environmental conditions leave outsiders vulnerable to attack. Zip codes with high residence-degree but relatively low homicide-degree tended to lie on the western and southern outer margins of the county. These zip codes may represent areas where residents are vulnerable to violence through social ties, but only when they travel to other zip codes where those social ties congregate.

Closeness

On the map, the magnitude of zip code closeness to others seemed less associated with proximity to transportation throughways, and more associated with proximity to areas with the highest population density. This could be more formally tested by pulling in data on the transportation geospatial network, then calculating distances between each zip code’s center and the closest highway ramp or transit stop. I hypothesize that there may be a raw inverse correlation between a zip code’s network closeness score and distance to nearest throughway, but it will disappear when adjusting for the population density of each zip code.

Betweenness

The patterns for overall betweenness did not reflect my hypotheses regarding proximity to transportation interchanges. Rather, they resemble the pattern seen for degree, but with higher specificity: only a select number of zip codes on the South and West Sides show high betweenness. These zip codes may represent areas of high intermediary social traffic through which other homicides propagate. I could not discern a geographic pattern among zip codes with high betweenness and low homicide-degree (bottlenecks), but these zip codes may represent regions of social bridging between different neighborhood groups.

Louvain-Clustered Neighborhoods

The algorithm-assigned zip code clusters closely mirror the 9 broad categories of neighborhood areas defined by the City of Chicago. The separation of these neighborhood areas is deeply rooted in patterns of racial segregation imposed by systemically racist policies. One could more formally test racial segregation in these clusters by pulling in aggregate statistics about the age, race, and gender composition of each zip code. Using this information, I would test the following hypotheses:

The race composition of each zip code will be a significant predictor of homicide-degree as the outcome, even when age, gender, and spatial random effects are included in a multivariate model.
The race composition of each zip code will be a significant predictor in a classifier model of Chicago-defined neighborhood area, even when age, gender, and spatial random effects are included as covariates.
In aggregate, the number of zip codes where the Louvain cluster assignment and the Chicago-defined neighborhood area assignment overlap will be high, more than expected by chance alone.
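One possible way to test the third hypothesis is sketched below: measure agreement between the two sets of labels with the adjusted Rand index, then compare it against a permutation null distribution. The two label vectors are invented for illustration and are not CCME results; the choice of the adjusted Rand index and of 2000 permutations are assumptions rather than part of the analysis above.

```r
library(igraph)

# Hypothetical labels for 12 toy zip codes (illustrative values only).
louvain_cluster   <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
neighborhood_area <- c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 1)

# Adjusted Rand index: 1 means identical partitions,
# values near 0 mean agreement no better than chance.
observed_ari <- compare(
  louvain_cluster, neighborhood_area,
  method = "adjusted.rand"
)

# Permutation null: shuffle one set of labels to break any real association.
set.seed(2022)
permuted_ari <- replicate(
  2000,
  compare(louvain_cluster, sample(neighborhood_area), method = "adjusted.rand")
)

# One-sided permutation p-value, with the +1 correction so it is never zero.
p_value <- (sum(permuted_ari >= observed_ari) + 1) / (length(permuted_ari) + 1)

print(observed_ari)
print(p_value)
```

A permutation test of this kind ignores spatial autocorrelation between neighboring zip codes, so with real data it would likely overstate significance unless the shuffling scheme is adapted to respect spatial structure.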

Strengths

Marries analytic strengths of network and geospatial analyses
Intuitive, visual results
Subject to few analytic assumptions that might introduce more bias
Analysis could be easily adapted to other subcategories of CCME data

Weaknesses

Purely observational and descriptive - although means of formal modeling and testing are described above
Could be more spatially precise if using a point pattern analysis (as opposed to using areal zip codes) - although this would necessitate modeling a different network structure

Summary

Homicides in Chicago predominantly affect young, Black males
Zip codes with high degree and betweenness reflect areas with lower socioeconomic status and higher segregation of young, Black males
These patterns crystallize even when not adjusting for population density
Using a modularity-based community detection algorithm (Louvain), you can approximate the geospatial community areas of the city

Geospatial Network Analysis of Homicide Data from the Cook County Medical Examiner

Daniel P. Hall Riggins, MD

2022-06-15
