Part 1
Andrew Lilley, Gianluca Rinaldi, and Matthew Lilley write:
You might be familiar with a recent paper by Correia, Luck, and Verner, who argued that cities that enacted non-pharmaceutical interventions earlier / for longer during the Spanish Flu of 1918 had higher subsequent economic growth. The paper has had extensive media coverage – e.g. it has been covered by the NYT, Washington Post, The Economist, The Boston Globe, Bloomberg, and of course NPR.
We were surprised by their results and tried to replicate them. To investigate the question further, we collected additional data to extend their sample to cover 1899 to 1927. Unfortunately, extending their sample backwards suggests that their headline results are driven by pre-existing differential growth patterns across US cities. We found that NPI measures in 1918 explain growth between 1899 and 1914 just as well as they explain growth between 1914 and 1919. Further, we found that the 1914-1919 growth result is mostly driven by estimated city-level population growth from 1910 to 1917. The approaches that we have taken to deal with these spurious results leave us finding only a noisy zero; we cannot rule out either substantial positive or negative economic effects from NPIs.
Here you can find our comment as well as the data we used and the replication code.
First off, I don’t find this depressing. We all make mistakes. If Correia et al. made a mistake on a newsworthy topic and it got lots of press coverage, then, sure, that’s too bad, but such things happen. It’s a sign of cheer, not depression, that a paper appearing on 26 Mar 2020 gets shot down on 2 May. A solid review in 5 weeks—that’s pretty good!
Part 2
Now it’s time for me to look at the articles. First the Correia et al. paper:
It seems funny that they use 1914 as a baseline for an intervention in 1918. Wouldn’t it make more sense to use the change from 1917 to 1921, or something like that? I guess that they had to make use of the data that were available—but I’m already concerned, given that more than half of any change would have happened before the treatment even occurred.
Beyond that, the pattern in the graph seems driven by a combination of relatively low mortality and high growth in western cities. I could imagine lots of reasons for both of these factors, having little or nothing to do with social distancing etc.
And check this out:
Just pretend they did the right thing and had the y-axis go down to 0. Then you’ll notice two things: First, yeah, the flu in 1918 really was a big deal—almost 4 times the death rate of earlier years. Second, it was only 4 times the death rate. I mean, yeah, that’s horrible, but only a factor of 4, not a factor of 10. I guess what I’m saying is, I hadn’t realized how much of a scourge flu/pneumonia was even in non-“pandemic” years. Interesting.
OK, now to the big thing. They have an observational study, comparing cities with more or less social distancing policies (“school closure, public gathering bans, and isolation and quarantine”). They make use of an existing database from 2007 with information on flu-control policies in 43 cities. And here are the pre-treatment variables they adjust for:
State agriculture and manufacturing employment shares, state and city population, and the urban population share are from the 1910 census. . . . State 1910 income per capita . . . State-level WWI servicemen mortality . . . annual state-level population and density estimates from the Census Bureau. City-level public spending, including health spending . . . City and state-level exports. . . . the importance of agriculture in each city’s state. . . . we also present tests controlling for health spending and total public spending per capita in a city. . . . 1910 [state] agriculture employment share, the [state] 1910 urban population share, and the 1910 income per-capita at the state level. . . . log of 1910 population and the 1914 manufacturing employment to population ratio. However, unlike in our analysis on the effect of the 1918 Flu on the real economy, here we control for past city-level mortality as of 1917
They discuss and dismiss the possibility of endogeneity:
For instance, local officials may be more inclined to intervene if the local exposure to the flu is higher . . . An alternative concern is that interventions reflect the quality of local institutions, including the local health care system. Places with better institutions may have a lower cost of intervening, as well as higher growth prospects. There are, however, important details that suggest that the variation across cities is unrelated to economic fundamentals and is instead largely explained by city location. First, local responses were not driven by a federal response . . . Second . . . the second wave of the 1918 flu pandemic spread from east to west . . . the distance to the east coast explains a large part of the variation in NPIs across cities . . .
That’s the pattern that I noticed in the top graph above. I’m still concerned that the cities on the west coast were just growing faster already. Remember, their outcome is growth from 1914 to 1919, so more than half the growth had already happened before the pandemic came.
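To make this worry concrete, here’s a toy simulation (made-up numbers, not the paper’s data) in which NPIs have exactly zero effect on growth, but NPI intensity is correlated with a persistent city-level growth trend. The naive regression of 1914-1919 growth on NPI days still comes out positive:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cities = 43

# Persistent city-level growth trend (think: fast-growing western cities).
trend = rng.normal(0.05, 0.03, n_cities)

# NPI intensity (days of interventions), correlated with the trend; this
# stands in for "western cities had more warning and adopted more NPIs."
npi_days = 60 + 800 * trend + rng.normal(0, 15, n_cities)

# In this simulated world, NPIs have exactly zero effect on growth:
# 1914-1919 growth is just five years of trend plus noise.
growth_1914_1919 = 5 * trend + rng.normal(0, 0.05, n_cities)

# OLS slope of growth on NPI days (np.polyfit returns [slope, intercept]).
slope = np.polyfit(npi_days, growth_1914_1919, 1)[0]
print(f"estimated NPI 'effect' per day: {slope:.5f}")  # positive, despite a true effect of zero
```

Nothing deep here, just the usual omitted-variable story: the regression credits NPIs with growth that the trend was producing anyway.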
They also look at changes in city-level bank assets from October 1918 to October 1919. But I’m not so thrilled with this measure. First, I’m not quite sure what to make of it. Second, using a baseline of October 1918 seems a bit too late, as the pandemic had already started by then. So I’ll set that analysis aside, which I think is fine. The abstract only mentions manufacturing output, so let’s just stick with that one. But then I really am concerned about differences between east and west.
They report:
Reassuringly for our purposes, other than differences in the longitude and the variation in the local industry structure, there are no observable differences across cities with different NPIs.
Anyway, here are their main city-level findings. I think we’re supposed to focus on the estimates for 1919, but I guess you can look at 1921 and 1923 also:
The confidence intervals for manufacturing output exclude zero, while the intervals for manufacturing employment in 1919 include zero; I guess that’s why they included manufacturing output but not manufacturing employment in the abstract. Fair enough: they reported all their findings in the paper but focused on the statistically significant results in the abstract. But then I’m confused why in their pretty graph (see above) they showed manufacturing employment instead of manufacturing output. Shouldn’t the graph match what’s in the abstract and the news reports?
In any case, I remain concerned that the cities in the west had more time to prepare for the pandemic, so they implemented social distancing etc., and they were growing anyway. This would not say that the authors’ substantive hypothesis—social distancing is good for the economy—is wrong; it just says that the data are hopelessly confounded so they don’t really answer the question as implied in the paper.
They do some robustness checks:
These controls are included to account for other time-varying shocks that may be correlated with local NPI policies. . . . city-level manufacturing employment and output growth from 1909 to 1914 to control for potential city-level pre-trends. . . . census region fixed effects . . . longitude. . . . state WWI servicemen mortality per capita . . . city population density . . . health and total public spending per capita in 1917 . . . city-level estimates of total exports and exports per capita in 1915.
They don’t adjust for all of these at once. That would be difficult, of course. I’m just thinking that maybe linear regression isn’t the right way to do this. Linear regression is great, but maybe before throwing these cities into the regression, you need to do some subsetting, maybe separate analyses for slower- and faster-growing cities? Or some matching. Maybe Lowell and Fall River weren’t going to increase their manufacturing base anyway?
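Here’s a minimal sketch of the matching idea, using made-up city names and growth numbers (nothing below comes from the paper’s dataset): pair each high-NPI city with the low-NPI city whose pre-period growth is closest, then compare outcomes within pairs.

```python
# Toy data: city -> (1909-1914 growth, 1914-1919 growth).
# Names and numbers are purely illustrative.
high_npi = {"CityA": (0.10, 0.20), "CityB": (0.02, 0.05), "CityC": (0.25, 0.40)}
low_npi  = {"CityD": (0.09, 0.17), "CityE": (0.03, 0.06), "CityF": (0.24, 0.35)}

diffs = []
used = set()
for pre, post in high_npi.values():
    # Nearest-neighbor match on pre-period growth, without replacement.
    match = min((m for m in low_npi if m not in used),
                key=lambda m: abs(low_npi[m][0] - pre))
    used.add(match)
    diffs.append(post - low_npi[match][1])

est = sum(diffs) / len(diffs)
print(f"matched-pairs NPI effect estimate: {est:.3f}")
```

This doesn’t solve the confounding problem, but it at least restricts the comparison to cities that looked similar before the pandemic, instead of letting Seattle be compared to Fall River through a linearity assumption.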
At this point you might say that I’m just being an asshole. These researchers wrote this nice paper on an important problem, then I had a question about geography, but they had already addressed it in their robustness checks, so what’s my complaint? Sure, not all the numbers in the robustness table have p-values less than 0.05, but I’m the last person to be a statistical significance nazi.
All I can say is . . . sorry, I’m still concerned! Again, I’m not saying the authors’ economic story is wrong—I have no idea—I just question the idea that this analysis is anything more than correlational. To put it another way—I could also easily imagine the authors’ economic story being correct but the statistical analysis going in the opposite direction. The tide of the correlation is so much larger than the swells of the treatment effect.
To put it yet another way, they’re not looking for a needle in a haystack, they’re looking for a needle in a pile of needles.
Again, this is not meant as a slam on the authors of the article in question. They’re using standard methods, they’ve obviously put in a lot of work (except for those bizarre stretchy maps in Figure 3; I don’t know how that happened!), the paper is clearly written, etc. etc. Nothing improper here; I just don’t think this method will solve their problem.
Interlude
And that’s where the story would usually end. A controversial paper is published and gets splashed all over NPR, but people have questions. I read the paper and I’m unsatisfied too. Half the world concludes that if I’m not convinced by the paper, it’s no good; the other half believes in baseball, apple pie, Chevrolet, and econometrics and concludes that I’m just a crabapple; and the authors of the original paper don’t know what to think, so they wait for their referee reports to decide how to buttress their conclusions (or maybe to reassess their claims, but, over the years, I’ve seen a lot more buttressing than I’ve seen reassessing).
Part 3
But then there was that email from Lilley et al., who in their paper write:
Using data from 43 US cities, Correia, Luck, and Verner [2020] find that the 1918 Flu pandemic had strong negative effects on economic growth, but that Non Pharmaceutical Interventions (NPIs) mitigated these adverse economic effects. . . . We collect additional data which shows that those results are driven by population growth between 1910 and 1917, before the pandemic. We also extend their difference in differences analysis to earlier periods, and find that once we account for pre-existing differential trends, the estimated effect of NPIs on economic growth is a noisy zero; we can neither rule out substantial positive nor negative effects of NPIs on employment growth.
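The logic of their placebo check can be sketched in a few lines (again with simulated data, not theirs): regress growth in a window that ended before the NPIs existed on NPI intensity, and compare that coefficient to the “treatment”-period coefficient.

```python
import numpy as np

rng = np.random.default_rng(2)
n_cities = 43

trend = rng.normal(0.05, 0.03, n_cities)  # persistent city growth trend
npi_days = 60 + 800 * trend + rng.normal(0, 15, n_cities)  # correlated with trend

# Two growth windows built from the same trend; NPIs affect neither.
growth_1899_1914 = 15 * trend + rng.normal(0, 0.10, n_cities)  # placebo: ends pre-NPI
growth_1914_1919 = 5 * trend + rng.normal(0, 0.05, n_cities)   # "treatment" window

def ols_slope(x, y):
    """OLS slope of y on x (np.polyfit returns [slope, intercept])."""
    return np.polyfit(x, y, 1)[0]

b_placebo = ols_slope(npi_days, growth_1899_1914)
b_treat = ols_slope(npi_days, growth_1914_1919)
print(f"placebo (1899-1914) coefficient:     {b_placebo:.5f}")
print(f"'treatment' (1914-1919) coefficient: {b_treat:.5f}")
# Both come out positive: the pre-period "effect" cannot have been caused
# by NPIs adopted in 1918, so the headline regression is suspect.
```

When the placebo coefficient looks as strong as the treatment one, the natural next step is to allow for differential city trends, and once you do that here, the NPI effect is the “noisy zero” that Lilley et al. report.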
“A noisy zero” . . . I kinda like that. But I don’t like the “zero” part. Better, I think, to just say that any effects are hard to ascertain from these data.
Anyway, here’s a key graph:
We see those western cities with high growth. But check out San Francisco. Yes, it was in the west. But it had already filled out by 1900 and was not growing fast like Oakland, Seattle, etc. So maybe no need to use lack of social distancing policies to explain San Francisco’s relatively slow post-1917 growth.
Lilley et al. run some regressions of their own and conclude, “locations which implemented NPIs more aggressively grew faster than those which did not both before the policy implementation, and afterward.” That makes sense to me, given the structure of the problem.
I have not examined Lilley et al.’s analysis in detail, and it’s possible they missed something important. It’s hard for me to see how the original strong conclusions of Correia et al. could be salvaged, but who knows. As always, I’m open to evidence and argument.
There’s the potential for bias in my conclusions, if for no other reason than that the authors of the first article didn’t contact me; the authors of the second article did. It’s natural for me to take the side of people who email me things. I don’t always, but a bias in that direction would be expected. On the other hand, there could be selection bias in the other direction: I expect that people who send me things are going to expect some scrutiny.
The larger issue is that there seem to be more outlets for positive claims than negative claims. “We found X” can get you publication in a top journal and major media coverage—in this case, even before publication. “We didn’t find X” . . . that’s a tougher sell. Sure, every once in a while there’s some attention for a non-replication of some well-known finding, but (a) such events are not so common, and (b) they still require the prior existence of a celebrated positive finding to react to. You could say that this is fine—positive claims should be more newsworthy than negative claims—and that’s fair enough, as long as we come in with the expectation that those positive claims will often be wrong, or, at least, not supported by the data.