What subway station has the most passengers getting on the train? What subway line has the most riders? What subway station has the most passengers getting off the train? These should be easy questions for the MTA to answer, with all the data we collect from the turnstiles, right?
Nope. We do get subway entry data for every rider swiping their MetroCard or tapping their OMNY card or device, but since riders aren’t swiping or tapping out at their destination, we don’t directly know where they’re going. Furthermore, we only know which station the rider entered; if multiple lines serve the station, we don’t know which one they boarded. Adding to the mystery, our fare data can’t tell us if a rider changed trains during their subway trip. At some stations, those mysterious transfers are a big portion of ridership! Do we have any hope of making sense of subway ridership patterns in NYC, given all these unknowns?
Actually, we do, through algorithmic magic called ridership modeling. The fundamental “trick” of MTA subway ridership modeling is that by applying a set of simplifying assumptions about subway rider behavior, we can make a reasonable guess of each passenger journey’s destination, and from that we can estimate the trains they took to get there. This guess for each individual trip will be wrong a lot of the time; as much as the MTA Data & Analytics team would like to be, we’re not omniscient. But here is the trick: as long as the errors in the journey inference are random rather than skewed in one direction, they should largely cancel out when aggregated across thousands or millions of journeys, resulting in reasonably accurate estimates of line- and station-level ridership patterns.
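To see why random errors can wash out in aggregate, here is a toy simulation. It is only a sketch and not the MTA's actual model: the four stations, their destination shares, and the 60% "correct guess" rate are all made-up assumptions. The key assumption is that when a guess is wrong, it is drawn from the same overall destination mix as real trips, so the mistakes don't favor any one station.

```python
import random
from collections import Counter


def simulate(n_trips=200_000, p_correct=0.6, seed=0):
    """Toy model: noisy per-trip destination guesses vs. aggregate shares.

    All stations, shares, and rates here are hypothetical illustrations.
    """
    rng = random.Random(seed)
    stations = ["A", "B", "C", "D"]      # hypothetical stations
    shares = [0.4, 0.3, 0.2, 0.1]        # hypothetical true destination mix

    true_counts, guess_counts = Counter(), Counter()
    n_right = 0
    for _ in range(n_trips):
        true_dest = rng.choices(stations, shares)[0]
        if rng.random() < p_correct:
            guess = true_dest            # the inference got this trip right
        else:
            # Unbiased error: the wrong guess follows the overall mix,
            # so it does not systematically favor any station.
            guess = rng.choices(stations, shares)[0]
        true_counts[true_dest] += 1
        guess_counts[guess] += 1
        n_right += guess == true_dest

    per_trip_accuracy = n_right / n_trips
    max_share_error = max(
        abs(guess_counts[s] - true_counts[s]) / n_trips for s in stations
    )
    return per_trip_accuracy, max_share_error


if __name__ == "__main__":
    accuracy, share_error = simulate()
    print(f"Per-trip accuracy:         {accuracy:.1%}")
    print(f"Largest station-share gap: {share_error:.2%}")
```

In this setup, a sizable fraction of individual trips are guessed wrong, yet each station's estimated share of destinations should land very close to its true share. If the errors were skewed instead (say, wrong guesses always landed on station A), the aggregate would be off too, which is why the randomness of the errors matters.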
Let’s break down how this works.