Estimating Risk for Rare Events in Small Areas where the Denominator is Unknown

Categories: suid, infant-safety, pediatrics

Author: Daniel P. Hall Riggins, MD

Published: August 31, 2022

This blog post is an adaptation of methods taught by David Robinson in his book Introduction to Empirical Bayes: Examples from Baseball Statistics.

My research group has compiled a dataset that counts the number of cases of Sudden Unexpected Infant Death (SUID) in each census tract of Cook County, IL, from 2015 to 2019. We want an accurate way to estimate the risk of SUID in each tract using this data. SUID is defined as the death of an infant less than one year old that has no known cause before a medical autopsy is performed. While the state of Illinois releases statistics about SUID for the county as a whole, we think it's important to understand this phenomenon at a more granular geographic level. That way, we have a better understanding of geographic disparities, and service agencies can more precisely target their interventions.

The goal of this post is to convince you that Bayesian techniques offer a more principled way to estimate an underlying risk process when the observed data are limited on their own.

Intro to the Data

Here’s what our dataset looks like:

library(tidyverse)
library(leaflet)

suid_base_table <- 
    read_csv(here::here("data", "suid_snapshot_2022_08_31.csv")) |>
    # Remove rows with NA values for count of SUID cases or for population under five
    filter(!is.na(suid_count) & !is.na(pop_under_five)) |> 
    select(fips, suid_count, pop_under_five)

suid_base_table
# A tibble: 1,284 × 3
          fips suid_count pop_under_five
         <dbl>      <dbl>          <dbl>
 1 17031010100          0            364
 2 17031010201          0            592
 3 17031010202          0            257
 4 17031010300          0            312
 5 17031010400          0             38
 6 17031010501          0            277
 7 17031010502          0            139
 8 17031010503          0             65
 9 17031010600          1            177
10 17031010701          0            288
# … with 1,274 more rows

Each row is an individual census tract. The fips column lists each tract's unique identifier, suid_count is the number of SUID cases that took place from 2015 to 2019, and pop_under_five is the population of children under age five as estimated by the U.S. Census Bureau's American Community Survey.

It’s important to note that SUID is a rare event. The vast majority of census tracts did not suffer any cases of SUID during this time period. Here’s a histogram to illustrate:
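If you want to draw the histogram yourself, a minimal ggplot2 sketch like this would do it (a bin width of 1 is the natural choice for case counts):

# Distribution of SUID case counts across census tracts, 2015-2019
suid_base_table |>
    ggplot(aes(x = suid_count)) +
    geom_histogram(binwidth = 1) +
    labs(
        x = "SUID cases per census tract, 2015-2019",
        y = "Number of census tracts"
    )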

The first challenge in estimating risk is that we don't have the proper denominator. SUID incidence is typically reported as cases per 100,000 live births, but we don't have disaggregated counts of live births in each census tract. The best approximation I have found is pop_under_five, which I hope reasonably mirrors the number of live births over a five-year period. Here are the approximated incidences for each tract using pop_under_five as the denominator:

suid_incidence_table <-
    suid_base_table |> 
    mutate(suid_incidence = round(suid_count / pop_under_five * 1E5, 2))

suid_incidence_table
# A tibble: 1,284 × 4
          fips suid_count pop_under_five suid_incidence
         <dbl>      <dbl>          <dbl>          <dbl>
 1 17031010100          0            364             0 
 2 17031010201          0            592             0 
 3 17031010202          0            257             0 
 4 17031010300          0            312             0 
 5 17031010400          0             38             0 
 6 17031010501          0            277             0 
 7 17031010502          0            139             0 
 8 17031010503          0             65             0 
 9 17031010600          1            177           565.
10 17031010701          0            288             0 
# … with 1,274 more rows

Let’s calculate the mean of these values:

mean(suid_incidence_table$suid_incidence)
[1] NaN

Hmm, it seems like there are some "not a number" values messing up the calculation. Let's filter them out:

mean(suid_incidence_table$suid_incidence, na.rm = TRUE)
[1] Inf

That’s also not what we were looking for. What’s going on?

A problem that emerges from using pop_under_five as the denominator is that five tracts are estimated to have zero children under the age of five. In such cases, R calculates the incidence as NaN (not a number) when suid_count is zero and as Inf (infinity) when suid_count is anything greater than zero:

suid_incidence_table |> 
    filter(pop_under_five == 0)
# A tibble: 5 × 4
         fips suid_count pop_under_five suid_incidence
        <dbl>      <dbl>          <dbl>          <dbl>
1 17031081000          0              0            NaN
2 17031280800          0              0            NaN
3 17031380200          0              0            NaN
4 17031670100          1              0            Inf
5 17031834900          1              0            Inf

We know that the true risk for these tracts can't be a non-existent number or infinitely large, so we'll do our best to adjust for these circumstances:

suid_incidence_table <-
    suid_incidence_table |> 
    mutate(
        suid_incidence = 
            case_when(
                # If incidence is NaN, change to 0
                is.nan(suid_incidence) ~ 0,
                # If it's Inf, change to 100,000
                is.infinite(suid_incidence) ~ 1E5,
                # Otherwise, keep as is
                TRUE ~ suid_incidence
            )
    )

Now let’s try and calculate the mean:

mean(suid_incidence_table$suid_incidence)
[1] 339.1101

A much more sensible number, although it still seems like an overestimate compared to the overall incidence for Cook County, which is 88.3 cases per 100,000 live births according to the Illinois Department of Public Health.

Let’s take a look at the census tracts with lowest and highest incidences:

suid_incidence_table |> 
    arrange(suid_incidence) |> 
    head(10)
# A tibble: 10 × 4
          fips suid_count pop_under_five suid_incidence
         <dbl>      <dbl>          <dbl>          <dbl>
 1 17031010100          0            364              0
 2 17031010201          0            592              0
 3 17031010202          0            257              0
 4 17031010300          0            312              0
 5 17031010400          0             38              0
 6 17031010501          0            277              0
 7 17031010502          0            139              0
 8 17031010503          0             65              0
 9 17031010701          0            288              0
10 17031010702          0            530              0
suid_incidence_table |> 
    arrange(desc(suid_incidence)) |> 
    head(10)
# A tibble: 10 × 4
          fips suid_count pop_under_five suid_incidence
         <dbl>      <dbl>          <dbl>          <dbl>
 1 17031670100          1              0        100000 
 2 17031834900          1              0        100000 
 3 17031530502          2             15         13333.
 4 17031081401          3             30         10000 
 5 17031831300          1             10         10000 
 6 17031480300          1             13          7692.
 7 17031834700          5             92          5435.
 8 17031612000          2             37          5405.
 9 17031670200          4             74          5405.
10 17031690300          2             46          4348.

At both extremes, the incidence values seem intuitively implausible as characterizations of the underlying risk of SUID. On one end, we shouldn't conclude that a tract's risk of SUID is zero just because there were no observed cases. SUID is rare, so we should expect many tracts to count zero cases just by luck of the draw.

On the other end, notice how the tracts with highest incidence of SUID also tend to have low populations under five. Let’s take an aside for a minute and think about flipping a coin and using the results to estimate whether that coin is weighted to favor a certain side. If you flip a coin three times and get heads every single time, do you think it’s actually weighted to favor heads? What about if you flip it 500 times and still get heads every single time? You should be a lot more confident after 500 coin flips that the coin is truly weighted because you’ve accumulated more evidence. Similarly, think of every live birth as a coin flip that accumulates evidence about the underlying risk of SUID. The more live births there are, the more confident we can be that the observed incidence matches the true risk for SUID. Said another way, census tracts with high incidence but low counts of pop_under_five might just be very unlucky, rather than truly at higher risk.
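To make that intuition concrete, here's a quick aside (separate from the SUID data) using base R's binom.test: the 95% confidence interval for the probability of heads stays very wide after 3 straight heads, but collapses toward 1 after 500 straight heads.

# Coin-flip intuition: how much evidence do 3 vs. 500 all-heads flips provide?
binom.test(x = 3, n = 3)$conf.int      # roughly (0.29, 1): still very uncertain
binom.test(x = 500, n = 500)$conf.int  # roughly (0.993, 1): almost certainly weighted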

Given these limitations of approximating incidence, we are going to use a Bayesian approach to adjust estimations to incorporate our “prior” expectations of what the underlying risk process should look like.

Step 1: Set your prior expectations

We are going to use the "Beta distribution" to represent our prior expectations. The Beta distribution is not nearly as well known as others like the normal distribution (aka the bell curve), but the Beta is particularly well suited to representing plausible values for a risk process. Check out David Robinson's post here for more explanation. The Beta distribution has two "parameters": the number of observed events is represented by \(\alpha\) (SUID cases) and the number of trials that don't result in an event is represented by \(\beta\) (live births that don't result in SUID, aka survivals). Let's try using Cook County's overall incidence of SUID and count of live births to set our prior expectations:

# Sourced from Illinois Vital Statistics
# https://dph.illinois.gov/topics-services/life-stages-populations/infant-mortality/sids/sleep-related-death-statistics.html
overall_incidence <- 88.3 / 1E5 # per live birth for Cook County in 2014
overall_incidence
[1] 0.000883
# https://dph.illinois.gov/data-statistics/vital-statistics/birth-statistics.html
total_live_births <- 139398 # for Cook County from 2015-2019
extrapolated_cases <- overall_incidence * total_live_births
extrapolated_cases
[1] 123.0884
extrapolated_survivals <- total_live_births - extrapolated_cases
extrapolated_survivals
[1] 139274.9

A cool thing about the Beta distribution is that it adjusts its shape to reflect the number of observations (aka evidence) we give it. I'll try to illustrate by rescaling the parameters to reflect the relative number of live births (coin flips of evidence) in the county versus in a typical census tract.

scaling_factor <- median(suid_incidence_table$pop_under_five) / total_live_births
scaling_factor
[1] 0.001531586

Let's simulate 1,284 draws (one per census tract) from each of two Beta distributions. One will use parameters reflecting the amount of evidence accumulated from a whole county's worth of live births; the other will use parameters scaled down to just a census tract's worth.
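If you want to run the simulation yourself, a sketch like this (drawing with rbeta from the quantities computed above) would do it; the plotting choices are just for illustration:

# Simulate 1,284 expected-risk values from each of the two Beta distributions
n_tracts <- nrow(suid_incidence_table)

simulated_risk <-
    tibble(
        evidence = rep(
            c("County's worth of live births", "Census tract's worth of live births"),
            each = n_tracts
        ),
        expected_risk = c(
            # County-scale evidence: alpha = extrapolated cases, beta = extrapolated survivals
            rbeta(n_tracts, extrapolated_cases, extrapolated_survivals),
            # Tract-scale evidence: same ratio of cases to survivals, rescaled down
            rbeta(
                n_tracts,
                extrapolated_cases * scaling_factor,
                extrapolated_survivals * scaling_factor
            )
        ) * 1E5  # convert to cases per 100,000 live births
    )

simulated_risk |>
    ggplot(aes(x = expected_risk, fill = evidence)) +
    geom_density(alpha = 0.5) +
    labs(x = "Expected SUID risk (cases per 100,000 live births)", y = "Density")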

Both distributions have the same ratio of cases (\(\alpha\)) to survivals (\(\beta\)), so their mean expected SUID risk is about the same at 88.3 cases per 100,000 live births. But the blue distribution is built from a county's worth of live-birth evidence (139,398 births), so its range of plausible values is much more compact (about 70 to 120) than the distribution built from only a census tract's worth of evidence, which estimates risk anywhere from 0 to about 500.

In Bayesian analysis, we combine our prior expectations with observed data to get a “posterior” estimate. In this case, I want the prior expectation to weigh about the same as the observed data, so I’ll use census-tract-scaled parameters in my prior.

Step 2: Combine the prior expectation and observed data

To get each posterior estimate of risk from the Beta distribution, we are going to perform the following calculation:

\[ \frac{\text{cases}_{\text{prior}} + \text{cases}_{\text{observed}}}{\text{cases}_{\text{prior}} + \text{cases}_{\text{observed}} + \text{survivals}_{\text{prior}} + \text{survivals}_{\text{observed}}} = \frac{0.189 + \text{cases}_{\text{observed}}}{0.189 + \text{cases}_{\text{observed}} + 213.311 + \text{survivals}_{\text{observed}}} \]

prior_alpha <- extrapolated_cases * scaling_factor
prior_beta <- extrapolated_survivals * scaling_factor
suid_posterior_table <-
    suid_incidence_table |> 
    mutate(
        posterior_alpha = prior_alpha + suid_count,
        posterior_beta = prior_beta + pop_under_five - suid_count
    ) |> 
    mutate(
        posterior_risk = posterior_alpha / (posterior_alpha + posterior_beta) * 1E5,
        # Use the Beta Distribution Quantile function to also get 95% credible intervals
        posterior_risk_low = qbeta(0.025, posterior_alpha, posterior_beta) * 1E5,
        posterior_risk_high = qbeta(0.975, posterior_alpha, posterior_beta) * 1E5,
        .before = posterior_alpha
    )

suid_posterior_table
# A tibble: 1,284 × 9
          fips suid_co…¹ pop_u…² suid_…³ poste…⁴ poste…⁵ poste…⁶ poste…⁷ poste…⁸
         <dbl>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 17031010100         0     364      0     32.6 3.56e-7    253.   0.189    577.
 2 17031010201         0     592      0     23.4 2.55e-7    181.   0.189    805.
 3 17031010202         0     257      0     40.1 4.38e-7    310.   0.189    470.
 4 17031010300         0     312      0     35.9 3.92e-7    278.   0.189    525.
 5 17031010400         0      38      0     75.0 8.20e-7    580.   0.189    251.
 6 17031010501         0     277      0     38.4 4.20e-7    298.   0.189    490.
 7 17031010502         0     139      0     53.5 5.84e-7    414.   0.189    352.
 8 17031010503         0      65      0     67.7 7.40e-7    524.   0.189    278.
 9 17031010600         1     177    565.   304.  1.27e+1   1041.   1.19     389.
10 17031010701         0     288      0     37.6 4.11e-7    291.   0.189    501.
# … with 1,274 more rows, and abbreviated variable names ¹​suid_count,
#   ²​pop_under_five, ³​suid_incidence, ⁴​posterior_risk, ⁵​posterior_risk_low,
#   ⁶​posterior_risk_high, ⁷​posterior_alpha, ⁸​posterior_beta

Let’s visualize how our posterior estimates of risk relate to incidence calculations and prior expectations:
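If you'd like to build a similar comparison yourself, here's a rough ggplot2 sketch. The nine tracts in the figure described below aren't individually specified, so this version just pulls three tracts from each end of the incidence range and three from the middle, after dropping the zero-population tracts whose adjusted incidence of 100,000 would blow out the x-axis:

# County-wide prior expectation, expressed per 100,000 live births
prior_risk <- prior_alpha / (prior_alpha + prior_beta) * 1E5

plot_table <-
    suid_posterior_table |>
    # Drop the zero-population tracts whose adjusted incidence is 100,000
    filter(pop_under_five > 0) |>
    arrange(suid_incidence) |>
    # Three tracts from each extreme of incidence plus three from the middle
    slice(c(1:3, (n() %/% 2 - 1):(n() %/% 2 + 1), (n() - 2):n())) |>
    mutate(
        label = str_glue("{fips}: {suid_count} cases / {pop_under_five} children under five")
    )

plot_table |>
    # Lowest-incidence tracts at the top, as in the figure described below
    ggplot(aes(y = fct_reorder(label, suid_incidence, .desc = TRUE))) +
    # Hollow circles: approximate incidence from the observed data
    geom_point(aes(x = suid_incidence), shape = 1, size = 3) +
    # Filled points with bars: posterior risk and its 95% credible interval
    geom_pointrange(
        aes(x = posterior_risk, xmin = posterior_risk_low, xmax = posterior_risk_high)
    ) +
    # Dashed line: the county-wide prior expectation
    geom_vline(xintercept = prior_risk, linetype = "dashed") +
    labs(x = "SUID risk (cases per 100,000)", y = NULL)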

On the Y axis, we have 9 census tracts for which we’ve estimated risk. We label each tract with the observed data used to calculate approximate incidence. For example, the top row shows a census tract that observed zero cases of SUID and had a population under five of 1,011 children. Each approximate incidence is marked on the x axis as a hollow circle for reference. Each filled black circle marks our posterior estimate of SUID risk and is flanked by a 95% credible interval (the Bayesian cousin to the confidence interval). The dashed vertical line marks our prior expectation of risk as observed for the whole county.

What I want you to notice is how the posterior estimate represents a tug-of-war between the prior expectation (dashed line) and the observed data (hollow circle). The prior expectation is able to pull our estimates away from extreme observed incidence values (like 0 in the top three rows, or over 5000 in the bottom three rows) to more plausible values. This balance between the forces of expectation and observation is what makes Bayesian estimation so powerful!

Geographic Pattern?

When we map the census tracts with the 100 highest and 100 lowest estimates of SUID risk, we notice a pattern starting to emerge. The highest estimates are concentrated on the West and South Sides of Chicago and in the southern suburbs, areas where historic patterns of segregation have concentrated socioeconomic vulnerability. When you think about it, we might expect these areas to have higher risk of SUID even before we observe the data. In a future blog post, I'll show how we can use auxiliary information about tracts (like location or SES) to fine-tune our prior expectations of risk for a better posterior estimate.
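For reference, here's a rough sketch of how such a map could be put together, assuming census tract boundaries pulled with the tigris package (not loaded earlier in this post) and joined to suid_posterior_table by FIPS code; the basemap and colors are arbitrary choices:

# Sketch: map the 100 highest and 100 lowest posterior risk estimates
library(tigris)
library(sf)

cook_tracts <-
    tracts(state = "IL", county = "Cook", year = 2019) |>
    mutate(fips = as.numeric(GEOID))

extreme_tracts <-
    suid_posterior_table |>
    arrange(desc(posterior_risk)) |>
    # Keep the 100 highest and 100 lowest estimates
    slice(c(1:100, (n() - 99):n())) |>
    mutate(extreme = if_else(row_number() <= 100, "Highest 100", "Lowest 100"))

cook_tracts |>
    inner_join(extreme_tracts, by = "fips") |>
    leaflet() |>
    addProviderTiles(providers$CartoDB.Positron) |>
    addPolygons(
        color = ~if_else(extreme == "Highest 100", "#d7301f", "#2b8cbe"),
        label = ~as.character(
            str_glue("{fips}: {round(posterior_risk)} cases per 100,000")
        )
    )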