Un-counting – Ken’s blog

Description

Why you would want to do the opposite of counting.

Packages

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Introduction

You probably know about count, which tells you how many observations you have in each group. Here is a small data set to try it on:

d <- tribble(
  ~g, ~y,
  "a", 10,
  "a", 13,
  "a", 14,
  "a", 14,
  "b", 6,
  "b", 7,
  "b", 9
)

There are four observations in group a and three in group b:

d %>% count(g) -> counts
counts

I didn’t know about this until fairly recently. Until then, I thought you had to do this:

d %>% group_by(g) %>% 
  summarize(count=n())

which works, but is a lot more typing.
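As a quick check that the two approaches really agree, here is a minimal sketch using the same d as above. The names short and long are mine, and I have called the summary column n so that it matches what count produces by default:

```r
library(tidyverse)

d <- tribble(
  ~g, ~y,
  "a", 10,
  "a", 13,
  "a", 14,
  "a", 14,
  "b", 6,
  "b", 7,
  "b", 9
)

# the short way: count() names its column n by default
short <- d %>% count(g)

# the long way, naming the column n to match
long <- d %>%
  group_by(g) %>%
  summarize(n = n())

# same groups and same counts, compared column by column
identical(short$g, long$g) && identical(short$n, long$n)
```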

Going the other way

The other day, I had the opposite problem. I had a table of frequencies, and I wanted to get it back to one row per observation (I was fitting a model in Stan, and I didn’t know how to deal with frequencies). I had no idea how you might do that (without something ugly like loops), and I was almost embarrassed to stumble upon this:

counts %>% uncount(n)
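To see what uncount is doing, here is a small sketch of the round trip. Each row is repeated as many times as its weight says, and the weight column is dropped. Note that only g comes back: the y values were thrown away by count, so they cannot be recovered. The by_hand and with_id names are just for illustration, and the base R version is my rough equivalent, not how tidyr implements it:

```r
library(tidyverse)

d <- tribble(
  ~g, ~y,
  "a", 10,
  "a", 13,
  "a", 14,
  "a", 14,
  "b", 6,
  "b", 7,
  "b", 9
)
counts <- d %>% count(g)

# repeat each row n times, then drop n
long <- counts %>% uncount(n)
nrow(long)  # 7 rows, one per original observation

# roughly the same thing by hand: repeat each row index n times
by_hand <- counts[rep(seq_len(nrow(counts)), times = counts$n), "g"]
identical(long$g, by_hand$g)

# .id adds a within-group row counter, if you want one
with_id <- counts %>% uncount(n, .id = "obs")

# counting again gets you back to where you started
long %>% count(g)
```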

My situation was a bit less trivial than that. I had disease category counts of coal miners with different exposures to coal dust:

my_url <- "https://www.utsc.utoronto.ca/~butler/d29/miners-tab.txt"
miners0 <- read_table(my_url)


── Column specification ────────────────────────────────────────────────────────
cols(
  Exposure = col_double(),
  None = col_double(),
  Moderate = col_double(),
  Severe = col_double()
)

miners0

This needs tidying to get the frequencies all into one column:

miners0 %>% 
  gather(disease, freq, -Exposure) -> miners
miners
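These days gather is superseded by pivot_longer, so here is a sketch of the same tidying done both ways. To keep it self-contained I am using a toy table (toy0) with the same column names as miners0 but invented frequencies, not the real data. The two results have the same rows in a different order, so I sort before comparing:

```r
library(tidyverse)

# toy stand-in for miners0: same column names, invented frequencies
toy0 <- tribble(
  ~Exposure, ~None, ~Moderate, ~Severe,
  5,  3, 1, 0,
  10, 2, 2, 1
)

# gather(), as used above: stacks one disease column at a time
toy_gather <- toy0 %>%
  gather(disease, freq, -Exposure)

# pivot_longer() version: works through one row at a time
toy_pivot <- toy0 %>%
  pivot_longer(-Exposure, names_to = "disease", values_to = "freq")

# same rows, different order, so sort both before comparing
a <- toy_gather %>% arrange(Exposure, disease)
b <- toy_pivot %>% arrange(Exposure, disease)
identical(a$disease, b$disease) && identical(a$freq, b$freq)
```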

So I wanted to fit an ordered logistic regression in Stan, predicting disease category from exposure, but I didn’t know how to handle the frequencies. If I had one row per miner, I thought…

miners %>% uncount(freq) %>% rmarkdown::paged_table()

… and so I do. (I scrolled down to check, and eventually got past the 98 miners with 5.8 years of exposure and no disease).
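Scrolling is one way to check. A less tedious one is to compare the number of rows with the total of the frequencies, and to count again. Here is a sketch on a toy table with invented frequencies (toy and toy_long are my names, and these are not the real miners data). A cell with frequency zero contributes no rows at all:

```r
library(tidyverse)

# toy stand-in for the tidied miners table; frequencies are invented
toy <- tribble(
  ~Exposure, ~disease,   ~freq,
  5,  "None",     3,
  5,  "Moderate", 1,
  5,  "Severe",   0,
  10, "None",     2,
  10, "Moderate", 2,
  10, "Severe",   1
)

toy_long <- toy %>% uncount(freq)

# one row per individual, so the row count should be the total frequency
nrow(toy_long) == sum(toy$freq)

# counting again recovers the non-zero cells; the zero cell has vanished
recount <- toy_long %>% count(Exposure, disease)
recount
```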

From there, you can use this to fit the model, though I would rather have weakly informative priors for their beta and c. The cutpoint vector c is tricky, since it is ordered, but I used the idea here (near the bottom) and it worked smoothly.
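For what it's worth, here is a sketch of how the one-row-per-miner data might be packaged for Stan. This is only an assumption about what the model wants: the list names N, K, y and x are hypothetical and need to match the data block of whatever Stan program you use, and the table is a toy one with invented frequencies. The one thing I am sure about is that the response needs to be integer category codes in severity order, not text:

```r
library(tidyverse)

# toy stand-in for the tidied miners table; frequencies are invented
toy <- tribble(
  ~Exposure, ~disease,   ~freq,
  5,  "None",     3,
  5,  "Moderate", 1,
  5,  "Severe",   0,
  10, "None",     2,
  10, "Moderate", 2,
  10, "Severe",   1
)

severity <- c("None", "Moderate", "Severe")

one_per_row <- toy %>%
  uncount(freq) %>%
  # code the categories 1, 2, 3 in severity order (not alphabetical)
  mutate(y = as.integer(factor(disease, levels = severity)))

# hypothetical names: match them to the data block of your Stan program
stan_data <- list(
  N = nrow(one_per_row),      # number of individuals
  K = length(severity),       # number of response categories
  y = one_per_row$y,          # response, coded 1..K
  x = one_per_row$Exposure    # the one explanatory variable
)
str(stan_data)
```

Setting the levels explicitly matters here, because the default factor ordering is alphabetical, which would put Moderate before None.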