
Get your research project reviewed by The Red Team: this seems like a good idea!

Ruben Arslan writes:

A colleague recently asked me to be a neutral arbiter on his Red Team challenge. He picked me because I was skeptical of his research plans at a conference and because I recently put out a bug bounty program for my blog, preprints, and publications (where people get paid if they find programming errors in my scientific code).

I’m writing to you of course because I’m hoping you’ll find the challenge interesting enough to share with your readers, so that we can recruit some of the critical voices from your commentariat. Unfortunately, it’s time-sensitive (they are recruiting until May 14th) and I know you have a long backlog on the blog.

OK, OK, I’ll post it now . . .

Arslan continues:

The Red Team approach is a bit different to my bounty program. Their challenge recruits five people who are given a $200 stipend to examine data, code, and manuscript. Each critical error they find yields a donation to charity, but it’s restricted to about a month of investigation. I have to arbitrate what is and isn’t critical (we set out some guidelines beforehand).

I [Arslan] am very curious to see how this goes. I have had only small submissions to my bug bounty program, but I have not put out many highly visible publications since starting the program and I don’t pay a stipend for people to take a look. Maybe the Red Team approach yields a more focused effort. In addition, he will know how many have actually looked, whereas I probably only hear from people who find errors.

My own interest in this comes from my work as a reviewer and supervisor, where I often find errors, especially if people share their data cleaning scripts and not just their modelling scripts, but also from my own work. When I write software, I have some best practices to rely on and still make tons of mistakes. I’m trying to import these best practices to my scientific code. I’ve especially tried to come up with ways to improve after I recently corrected a published paper twice after someone found coding errors during a reanalysis (I might send you that debate too since you blogged the paper, it was about menstrual cycles and is part of the aftermath of dealing with the problems you wrote about so often).

Here’s some text from the blog post introducing the challenge:

We are looking for five individuals to join “The Red Team”. Unlike traditional peer review, this Red Team will receive financial incentives to identify problems. Each Red Team member will receive a $200 stipend to find problems, including (but not limited to) errors in the experimental design, materials, code, analyses, logic, and writing. In addition to these stipends, we will donate $100 to a GiveWell top-ranked charity (maximum total donations: $2,000) for every new “critical problem” detected by a Red Team member. Defining a “critical problem” is subjective, but a neutral arbiter—Ruben Arslan—will make these decisions transparently. At the end of the challenge, we will release: (1) the names of the Red Team members (if they wish to be identified), (2) a summary of the Red Team’s feedback, (3) how much each Red Team member raised for charity, and (4) the authors’ responses to the Red Team’s feedback.

Daniël has also written a commentary about the importance of recruiting good critics, especially now for fast-track pandemic research (although I still think Anne Scheel’s blog post on our 100% CI blog made the point even clearer).

OK, go for it! Seems a lot better than traditional peer review, the incentives are better aligned, etc. Too bad Perspectives on Psychological Science didn’t decide to do this when they were spreading lies about people.

This “red team” thing could be the wave of the future. For one thing, it seems scalable. Here are some potential objections, along with refutations to these objections:

– You need to find five people who will review your paper—but for most topics that are interesting enough to publish on in the first place, you should be able to find five such people. If not, your project must be pretty damn narrow.

– You need to find up to $3000 to pay your red team members and make possible charitable donations. $3000 is a lot, not everyone has $3000. But I think the approach would also work with smaller payments. Also, journal refereeing isn’t free! 3 referee reports, the time of an editor and an associate editor . . . put it all together, and the equivalent cost could be well over $1000. For projects that are grant funded, the red team budget could be incorporated into the funding plan. And for unfunded projects, you could find people like Alexey Guzey or Ulrich Schimmack who might “red team” your paper for free—if you’re lucky!

2 perspectives on the relevance of social science to our current predicament: (1) social scientists should back off, or (2) social science has a lot to offer

Perspective 1: Social scientists should back off

This is what the political scientist Anthony Fowler wrote the other day:

The public appetite for more information about Covid-19 is understandably insatiable. Social scientists have been quick to respond. . . . While I understand the impulse, the rush to publish findings quickly in the midst of the crisis does little for the public and harms the discipline of social science. Even in normal times, social science suffers from a host of pathologies. Results reported in our leading scientific journals are often unreliable because researchers can be careless, they might selectively report their results, and career incentives could lead them to publish as many exciting results as possible, regardless of validity. A global crisis only exacerbates these problems. . . . and the promise of favorable news coverage in a time of crisis further distorts incentives. . . .

Perspective 2: Social science has a lot to offer

42 people published an article that begins:

The COVID-19 pandemic represents a massive global health crisis. Because the crisis requires large-scale behaviour change and places significant psychological burdens on individuals, insights from the social and behavioural sciences can be used to help align human behaviour with the recommendations of epidemiologists and public health experts. Here we discuss evidence from a selection of research topics relevant to pandemics, including work on navigating threats, social and cultural influences on behaviour, science communication, moral decision-making, leadership, and stress and coping.

The author list includes someone named Nassim, but not Taleb, and someone named Fowler, but not Anthony. It includes someone named Sander but not Greenland. Indeed it contains no authors with names of large islands. It includes someone named Zion but no one who, I’d guess, can dunk. Also no one from Zion. It contains someone named Dean and someone named Smith but . . . ok, you get the idea. It includes someone named Napper but no sleep researchers named Walker. It includes someone named Rand but no one from Rand. It includes someone named Richard Petty but not the Richard Petty. It includes Cass Sunstein but not Richard Epstein. Make of all this what you will.

As befits an article with 42 authors, there are a lot of references: 6.02 references per author, to be precise. But, even with all these citations, I’m not quite sure where this research can be used to “support COVID-19 pandemic response,” as promised in the title of the article.

The trouble is that so many of the claims are so open-ended that they don’t tell us much about policy. For example, I’m not sure what we can do with a statement such as this:

Negative emotions resulting from threat can be contagious, and fear can make threats appear more imminent. A meta-analysis found that targeting fears can be useful in some situations, but not others: appealing to fear leads people to change their behaviour if they feel capable of dealing with the threat, but leads to defensive reactions when they feel helpless to act. The results suggest that strong fear appeals produce the greatest behaviour change only when people feel a sense of efficacy, whereas strong fear appeals with low-efficacy messages produce the greatest levels of defensive responses.

Beyond the very indirect connection to policy, I’m also concerned because, of the three references cited in the above passage, one is from PNAS in 2014 and one was from Psychological Science in 2013. That’s not a good sign!

Looking at the papers in more detail . . . The PNAS study found that if you manipulate people’s Facebook news feeds by increasing the proportion of happy or sad stories, people will post more happy or sad things themselves. The Psychological Science study is based on two lab experiments: 101 undergraduates who “participated in a study ostensibly measuring their thoughts about ‘island life,’” and 48 undergraduates who were “randomly assigned to watch one of three videos” of a shill. Also a bunch of hypothesis tests with p-values like 0.04. Anyway, the point here is not to relive the year 2013 but rather to note that the relevance of these p-hacked lab experiments to policy is pretty low.

Also, the abstract of the 42-author paper says, “In each section, we note the nature and quality of prior research, including uncertainty and unsettled issues.” But then the paper goes on to make unqualified statements that the authors don’t even seem to agree with.

For example, from the article, under the heading, “Disaster and ‘panic’” [scare quotes in original]:

There is a common belief in popular culture that, when in peril, people panic, especially when in crowds. That is, they act blindly and excessively out of self-preservation, potentially endangering the survival of all. . . . However, close inspection of what happens in disasters reveals a different picture. . . . Indeed, in fires and other natural hazards, people are less likely to die from over-reaction than from under-reaction, that is, not responding to signs of danger until it is too late. In fact, the concept of ‘panic’ has largely been abandoned by researchers because it neither describes nor explains what people usually do in disaster. . . . use of the notion of panic can be actively harmful. News stories that employ the language of panic often create the very phenomena that they purport to condemn. . . .

But, just a bit over two months ago, one of the authors of this article wrote an op-ed titled, “The Cognitive Bias That Makes Us Panic About Coronavirus”—and he cited lots of social-science research in making that argument.

Now, I don’t think social science research has changed so much between 28 Feb 2020 (when this pundit wrote about panic and backed it up with citations) and 30 Apr 2020 (when this same pundit coauthored a paper saying that researchers shouldn’t be talking about panic). And, yes, I know that the author of an op-ed doesn’t write the headline. But, for a guy who thinks that “the concept of ‘panic'” is not useful in describing behavior, it’s funny how quickly he leaps to use that word. A quick google turned up this from 2016: “How Pro Golf Explains the Stock Market Panic.”

All joking aside, this just gets me angry. These so-called behavioral scientists are so high and mighty, with big big plans for how they’re going to nudge us to do what they want. Bullfight tickets all around! Any behavior they see, they can come up with an explanation for. They have an N=100 lab experiment for everything. They can go around promoting themselves and their friends with the PANIC headline whenever they want. But then in their review article, they lay down the law and tell us how foolish we are to believe in “‘panic.'” They get to talk about panic whenever they want, but when we want to talk about it, the scare quotes come out.

Don’t get me wrong. I’m sure these people mean well. They’re successful people who’ve climbed to the top of the greasy academic pole; their students and colleagues tell them, week after week and month after month, how brilliant they are. We’re facing a major world event, they want to help, so they do what they can do.

Fair enough. If you’re an interpretive dancer like that character from Jules Feiffer, and you want to help with a world crisis, you do an interpretive dance. If you’re a statistician, you fit models and make graphs. If you’re a blogger, you blog. If you’re a pro athlete, you wait until you’re allowed to play again, and then you go out and entertain people. You do what you can do.

The problem is not with social scientists doing their social science thing; the problem is with them overclaiming, overselling, and then going around telling people what to do.

A synthesis?

Can we find any overlap between the back-off recommendation of Fowler and the we-can-do-it attitude of the 42 authors? Maybe.

Back to Fowler:

Social scientists have for decades studied questions of great importance for pandemics and beyond: How should we structure our political system to best respond to crises? How should responses be coordinated between local, state and federal governments? How should we implement relief spending to have the greatest economic benefits? How can we best communicate health information to the public and maximize compliance with new norms? To the extent that we have insights to share with policy makers, we should focus much of our energy on that.

Following Fowler, maybe the 42 authors and their brothers and sisters in the world of social science should focus not on “p less than 0.05” psychology experiments, Facebook experiments, and ANES crosstabs, but on some more technical work on political and social institutions, tracing where people are spending their money, and communicating health information.

On the plus side, I didn’t notice anything in that 42-authored article promoting B.S. social science claims such as beauty and sex ratio, ovulation and voting, himmicanes, Cornell students with ESP, the critical positivity ratio, etc etc. I choose these particular claims as examples because they weren’t just mistakes—like, here’s a cool idea, too bad it didn’t replicate—but because they were quantitatively wrong, and no failed replication was needed to reveal their problems. A little bit of thought and real-world knowledge was enough. Also, these were examples with no strong political content, so there’s no reason to think the journals involved were “doing a Lancet” and publishing fatally flawed work because it pushed a political agenda.

So, yeah, it’s good that they didn’t promote any of these well-publicized bits of bad science. On the other hand, the article doesn’t make it clear that not all of the science it does promote can be trusted.

Also, remember the problems with the scientist-as-hero narrative.

P.S. More here from Simine Vazire.

“Stay-at-home” behavior: A pretty graph but I have some questions

Or, should I say, a pretty graph and so I have some questions. It’s a positive property of a graph that it makes you want to see more.

Clare Malone and Kyle Bourassa write:

Cuebiq, a private data company, assessed the movement of people via GPS-enabled mobile devices across the U.S. If you look at movement data in a cross-section of states President Trump won in the southeast in 2016 — Tennessee, Georgia, Louisiana, North Carolina, South Carolina and Kentucky — 23 percent of people were staying home on average during the first week of March. That proportion jumped to 47 percent a month later across these six states.

And then they display this graph by Julia Wolfe:

So here are my questions:

1. Why did they pick those particular states to focus on? If they’re focusing on the south, why leave out Mississippi and Alabama? If they’re focusing on Republican-voting states, why leave out Idaho and Wyoming?

2. I’m surprised that it says that the proportion of New Yorkers staying at home increased by only 30 percentage points compared to last year. I would’ve thought it was higher. Maybe it’s a data issue? People like me are not in their database at all!

3. It’s weird how all the states show a pink line—fewer people staying at home compared to last year—at the beginning of the time series (I can’t quite tell when that is, maybe early March?). I’m guessing this is an artifact of measurement, that the number of GPS-enabled mobile devices has been gradually increasing over time, so the company that gathered these data would by default show an increase in movement (an apparent “Fewer people stayed home”) even in the absence of any change in behavior.

I’m thinking it would make sense to shift the numbers, or the color scheme, accordingly. As it is, the graph shows a dramatic change at the zero point, but if this zero is artifactual, then this could be misleading.

I guess what I’d like to see is a longer time series. Show another month at the beginning of each series, and that will give us a baseline.

Again, it’s not a slam on this graph to say that it makes me want to learn more.

Stay-at-home orders

The above-linked article discusses the idea that people were already staying at home, before any official stay-at-home orders were issued. And, if you believe the graphs, it looks like stay-at-home behavior did not even increase following the orders. This raises the question of why issue stay-at-home orders at all, and it also raises statistical questions about estimating the effects of such orders.

An argument against stay-at-home or social-distancing orders is that, even in the absence of any government policies on social distancing, at some point people would’ve become so scared that they would’ve socially distanced themselves, canceling trips, no longer showing up to work and school, etc., so the orders are not necessary.

Conversely, an argument in favor of governmentally mandated social distancing is that it coordinates expectations. I remember in early March that we had a sense that there were big things going on but we weren’t sure what to do. If everyone is deciding on their own whether to go to work etc., things can be a mess. Yes, there is an argument in favor of decentralized decision making, but what do you do, for example, if schools are officially open but half the kids are too scared to show up?

P.S. In comments, Brent points out a problem with framing this based on “stay-at-home orders”:

In my [Hutto’s] state the order closing schools was on March 15. The “stay at home” order came on April 7.

As best as I can interpret the x-axis of the graphs, they have the April 7 order marked with the vertical line.

It’s no puzzle why mobility data showed more people staying at home three weeks earlier. Mobility became limited on Monday, March 16 when a million or so families suddenly had children to take care of at home instead of going off to school.

This also raises questions about estimates of the effects of interventions such as lockdowns and school closings. Closing schools induces some social distancing and staying home from work, even beyond students and school employees.

“Young Lions: How Jewish Authors Reinvented the American War Novel”

I read this book by Leah Garrett and I liked it a lot. Solid insights on Joseph Heller, Saul Bellow, and Norman Mailer, of course, but also the now-forgotten Irwin Shaw (see here and here) and Herman Wouk. Garrett’s discussion of The Caine Mutiny was good: she takes it seriously, enough to point out its flaws, showing respect to it as a work of popular art.

I’d read many of the novels that Garrett wrote about, but when I’d read them I’d not thought much about the Jewish element (except in The Young Lions, where it’s central to the book, no close reading necessary). Garrett’s book gave me insight into the Jewish themes but also helped me reinterpret the novels in their social and political context.

P.S. Garrett is speaking online next week about her new book, X Troop: The Secret Jewish Commandos of World War II.

“1919 vs. 2020”

We had this discussion the other day about a questionable claim regarding the effects of social distancing policies during the 1918/1919 flu epidemic, and then I ran across this post by Erik Loomis who compares the social impact of today’s epidemic to what happened 102 years ago:

It’s really remarkable to me [Loomis] that the flu of a century ago killed 675,000 Americans out of a population of 110 million, meaning that roughly works out to the 2.2 million upper range guess of projections for COVID-19 by proportion of the population. And yet, the cultural response to it was primarily to shrug our collective shoulders and get on with our lives. . . . Some communities did engage in effective quarantining, for instance, and there were real death rate differentials between them. But to my knowledge anyway, sports weren’t cancelled. The World Series went on as normal (and quite famously in 1919!). There was no effective government response at the federal level.

Moreover, when it ended, the Spanish flu had almost no impact on American culture. There’s a very few references to it in American literature. Katherine Anne Porter’s Pale Horse Pale Rider. Hemingway mentions it in Death in the Afternoon. There’s a good John O’Hara story about it. And….that’s basically it? . . .

Now, yes it is true that the years of 1918 and 1919 were fast-paced years in the U.S. Over 100,000 people died in World War I . . . [but] while the war and its aftermath obviously were dominant features of American life at the time, there’s hardly anything in there that would erase the memory of a situation where nearly 7 times as many people died as in the war.

So what is going on here? . . . Americans were simply more used to death in 1919 than in 2020. People died younger and it was a more common fact of life then. Now, don’t underestimate the science in 1919. The germ theory was pretty well-established. Cities were being cleaned up. People knew that quarantining worked. The frequent pandemics of the 16th-19th centuries were largely in the past. But still….between deaths in pregnancy and deaths on the job, deaths from poisonings of various sorts and deaths from any number of accidents in overcrowded and dangerous cities, people died young. . . .

I remember thinking about this in the 1970s and 1980s, when we were all scared of being blown up in a nuclear war. (Actually, I’m still scared about that.) My reasoning went like this: (1) The post-1960s period was the first time in human history that we had the ability to destroy our civilization. (2) This seemed particularly horrifying for my generation because we had grown up with the assumption that we’d all live long and full lives. (3) If it wasn’t nuclear war, it would be biological weapons: the main reason that the U.S. and the Soviet Union didn’t have massive bioweapons programs was that nuclear weapons were more effective at mass killing. (4) It made sense that the ability to develop devastating biological weapons came at around the same time as we could cure so many diseases. So, immortality and potential doom came together.

Regarding 1918, remember this graph:

Just pretend they did the right thing and had the y-axis go down to 0. Then you’ll notice two things: First, yeah, the flu in 1918 really was a big deal—almost 4 times the death rate compared to early years. Second, it was only 4 times the death rate. I mean, yeah, that’s horrible, but only a factor of 4, not a factor of 10. I guess what I’m saying is, I hadn’t realized how much of a scourge flu/pneumonia was even in non-“pandemic” years. Interesting.

These are all just scattered thoughts. There must be some books on the 1918/1919 flu that would give some more perspective on all this.

Coronavirus Grab Bag: deaths vs qalys, safety vs safety theater, ‘all in this together’, and more.

This post is by Phil Price, not Andrew.

This blog’s readership has a very nice wind-em-up-and-watch-them-go quality that I genuinely appreciate: a thought-provoking topic provokes some actual thoughts. So here are a few things I’ve been thinking about, without necessarily coming to firm conclusions. Help me think about some of these. This post is rather long so I’m putting most of it below the fold.


Uncertainty and variation as distinct concepts

Jake Hofman, Dan Goldstein, and Jessica Hullman write:

Scientists presenting experimental results often choose to display either inferential uncertainty (e.g., uncertainty in the estimate of a population mean) or outcome uncertainty (e.g., variation of outcomes around that mean). How does this choice impact readers’ beliefs about the size of treatment effects? We investigate this question in two experiments comparing 95% confidence intervals (means and standard errors) to 95% prediction intervals (means and standard deviations). The first experiment finds that participants are willing to pay more for and overestimate the effect of a treatment when shown confidence intervals relative to prediction intervals. The second experiment evaluates how alternative visualizations compare to standard visualizations for different effect sizes. We find that axis rescaling reduces error, but not as well as prediction intervals or animated hypothetical outcome plots (HOPs), and that depicting inferential uncertainty causes participants to underestimate variability in individual outcomes.

These results make sense. Sometimes I try to make this point by distinguishing between uncertainty and variation. I’ve always thought these two concepts were conceptually distinct (we can speak of uncertainty in the estimate of a population average, or variation across the population), but then I started quizzing students, and I learned that, to them, “uncertainty” and “variation” were not distinct concepts. Part of this is wording—there’s an idea that these two words are roughly synonyms—but I think part of it is that most people don’t think of these as being two different ideas. And if lots of students don’t get this distinction, it’s no surprise that researchers and consumers of research also get stuck on this.
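Here’s a quick sketch in R, with made-up numbers, of what the distinction looks like in practice: the standard error of the mean (inferential uncertainty) shrinks as the sample grows, while the standard deviation of the outcomes (variation) does not.

# Quick illustration with simulated data: inferential uncertainty vs. outcome variation.
set.seed(42)
y <- rnorm(200, mean = 10, sd = 3)   # made-up outcomes

m    <- mean(y)
sd_y <- sd(y)                        # variation across individuals
se_m <- sd_y / sqrt(length(y))       # uncertainty about the population mean

# Approximate 95% interval for the population mean (confidence interval)
c(m - 2 * se_m, m + 2 * se_m)
# Approximate 95% interval for a new individual outcome (prediction interval)
c(m - 2 * sd_y, m + 2 * sd_y)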

I’m reminded of the example from a few months ago where someone published a paper including graphs that revealed the sensitivity of its headline conclusions to some implausible assumptions. The question then arose: if the paper had not included the graph, maybe no one would’ve realized the problem. I argued that, had the graph not been there, I would’ve wanted to see the data. But a lot of people would just accept the estimate and standard error and not want to know more.

They want open peer review for their paper, and they want it now. Any suggestions?

Someone writes:

We’re in the middle of what feels like a drawn out process of revise and resubmit with one of the big journals (though by pre-pandemic standards everything has moved quite quickly), and what’s most frustrating is that the helpful criticisms and comments from the reviewers, plus our extensive responses and new sensitivity analyses and model improvements, are all happening not in public. (We could post them online, but I think we’re not allowed to share the reviews! And we didn’t originally put our report on a preprint server, which I now regret, so a little hard to get an update disseminated.)

For our next report, I wonder if you know of any platforms that’d allow us to do the peer review out in the open. Medrxiv/arxiv.org are great for getting the preprints out there, but not collecting reviews. Something like OpenReview.net (used for machine learning conferences) might work. Maybe there’s something else out there you know about? Do any journals do public peer review?

My reply:

Can PubPeer include reviews of preprints? And there is a site called Researchers One that does open review.

Also, you could send me your paper, I’ll post it here and people can give open reviews in the comments section!

Standard deviation, standard error, whatever!

Ivan Oransky points us to this amusing retraction of a meta-analysis. The problem: “Standard errors were used instead of standard deviations when using data from one of the studies”!

Actually, I saw something similar happen in a consulting case once. The other side had a report with estimates and standard errors . . . the standard errors were suspiciously low . . . I could see that the numbers were wrong right away, but it took me a couple hours to figure out that what they’d done was to divide by sqrt(N) rather than sqrt(n)—that is, they used the population size rather than the sample size when computing their standard errors.
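To see how big a difference that makes, here’s a toy version in R (all numbers invented for illustration):

# Toy illustration of the error described above: dividing by the square root of
# the population size N instead of the sample size n when computing a standard error.
set.seed(7)
N <- 1e6                         # population size (made up)
n <- 400                         # sample size (made up)
y <- rnorm(n, mean = 50, sd = 10)

se_correct <- sd(y) / sqrt(n)    # on the order of 0.5
se_wrong   <- sd(y) / sqrt(N)    # on the order of 0.01: suspiciously low
c(se_correct, se_wrong)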

As Bob Carpenter might say, it doesn’t help that statistics uses such confusing jargon. Standard deviation, standard error, variance, bla bla bla.

But what really amused me about this Retraction Watch article was this quote at the end:

As Ingram Olkin stated years ago, “Doing a meta-analysis is easy . . . Doing one well is hard.”

Whenever I see the name Ingram Olkin, I think of this story from the cigarette funding archives:

Much of the cancer-denial work was done after the 1964 Surgeon General’s report. For example,

The statistician George L. Saiger from Columbia University received [Council for Tobacco Research] Special Project funds “to seek to reduce the correlation of smoking and diseases by introduction of additional variables”; he also was paid $10,873 in 1966 to testify before Congress, denying the cigarette-cancer link.

. . .

Ingram Olkin, chairman of Stanford’s Department of Statistics, received $12,000 to do a similar job (SP-82) on the Framingham Heart Study . . . Lorillard’s chief of research okayed Olkin’s contract, commenting that he was to be funded using “considerations other than practical scientific merit.”

So maybe doing a meta-analysis badly is hard, too!

It’s “a single arena-based heap allocation” . . . whatever that is!

After getting 80 zillion comments on that last post with all that political content, I wanted to share something that’s purely technical.

It’s something Bob Carpenter wrote in a conversation regarding implementing algorithms in Stan:

One thing we are doing is having the matrix library return more expression templates rather than copying on return as it does now. This is huge in that it avoids a lot of intermediate copies. Some of these don’t look so big when running a single chain, but stand out more when running multiple chains in parallel when there’s overall more memory pressure.

Another current focus for fusing operations is the GPU so that we don’t need to move data on and off GPU between operations.

Stan only does a single arena-based heap allocation other than for local vector and Eigen::Matrix objects (which are standard RAII). Actually, it’s more of an expanding thing, since it’s not pre-sized. But each bit only gets allocated once in exponentially increasing chunk sizes, so there’ll be at most log(N) chunks. It then reuses that heap memory across iterations and only frees at the end.

We’ve found that it helps to use a standard functional map operation internally that partially evaluates reverse-mode autodiff for each block over which the function is mapped. This reduces overall memory size and keeps the partial evaluations more memory local, all of which speeds things up at the expense of clunkier Stan code.

The other big thing we’re doing now is looking at static matrix types, of the sort used by Autograd and JAX. Stan lets you assign into a matrix, which destroys any hope of memory locality. If matrices never have their entries modified after creation (for example through a comprehension or other operation), then the values and adjoints can be kept as separate double matrices. Our early experiments are showing a 5–50-fold speedup depending on the order of data copying reduced and operations provided. Addition’s great at O(N^2) copies currently for O(N^2) operations (on N x N matrices). Multiplication with an O(N^2) copy cost and O(N^3) operation cost is less optimizable when matrices get big.

I love this because I don’t understand half of what Bob’s talking about, but I know it’s important. To make decisions under uncertainty, we want to fit hierarchical models. This can increase computational cost, etc. In short: Technical progress on computing allows us to fit better models, so we can learn more.

Recall the problems with this study, which could’ve been avoided by a Bayesian analysis of uncertainty in specificity and sensitivity, and multilevel regression and poststratification for adjusting for differences between sampling and population. In that particular example, no new computational developments are required—Stan will work just fine as it is, and to the extent that Stan improvements would help, it would be in documentation and interfaces. But, moving forward, computational improvements will help us fit bigger and better models. This stuff can make a real difference in our lives, so I wanted to highlight how technical it can all get.
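For the specificity issue mentioned above, here’s a minimal sketch in R (made-up numbers, and just the basic point-estimate identity, not the full Bayesian analysis the post refers to) showing how strongly an estimated prevalence depends on the assumed specificity when the raw positive rate is small:

# Sketch of the standard test-adjustment identity:
#   P(test positive) = prevalence * sensitivity + (1 - prevalence) * (1 - specificity)
# Solving for prevalence shows how much the answer depends on specificity.
raw_pos <- 0.015                  # made-up raw positive rate
sens    <- 0.80                   # assumed sensitivity
spec    <- c(0.995, 0.990, 0.985) # a few plausible specificities
prev    <- (raw_pos + spec - 1) / (sens + spec - 1)
pmax(prev, 0)                     # estimated prevalence under each assumed specificity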

“So the real scandal is: Why did anyone ever listen to this guy?”

John Fund writes:

[Imperial College epidemiologist Neil] Ferguson was behind the disputed research that sparked the mass culling of eleven million sheep and cattle during the 2001 outbreak of foot-and-mouth disease. He also predicted that up to 150,000 people could die. There were fewer than 200 deaths. . . .

In 2002, Ferguson predicted that up to 50,000 people would likely die from exposure to BSE (mad cow disease) in beef. In the U.K., there were only 177 deaths from BSE.

In 2005, Ferguson predicted that up to 150 million people could be killed from bird flu. In the end, only 282 people died worldwide from the disease between 2003 and 2009.

In 2009, a government estimate, based on Ferguson’s advice, said a “reasonable worst-case scenario” was that the swine flu would lead to 65,000 British deaths. In the end, swine flu killed 457 people in the U.K.

Last March, Ferguson admitted that his Imperial College model of the COVID-19 disease was based on undocumented, 13-year-old computer code that was intended to be used for a feared influenza pandemic, rather than a coronavirus. Ferguson declined to release his original code so other scientists could check his results. He only released a heavily revised set of code last week, after a six-week delay.

So the real scandal is: Why did anyone ever listen to this guy?

I don’t know. It’s a good question. When Ferguson was in the news a few months ago, why wasn’t there more discussion of his atrocious track record? Or was his track record not so bad? A google search turned up this op-ed by Bob Ward referring to Ferguson’s conclusions as “evidence that Britain’s political-media complex finds too difficult to accept.” Regarding the foot-and-mouth-disease thing, Ward notes that Ferguson “received an OBE in recognition for his important role in the crisis” and “was afterwards elected a fellow of the prestigious Academy of Medical Sciences.” Those sorts of awards don’t cut much ice with me—they remind me too much of the U.S. National Academy of Sciences—but maybe there’s more of the story I haven’t heard.

I guess I’d have to see the exact quotes that are being referred to in the paragraphs excerpted above. For example, what did Ferguson exactly say when he “predicted that up to 150,000 people could die” of foot-and-mouth disease. Did he say, “I expect it will be under 200 deaths if we cull the herds, but otherwise it could be up to 2000 or more, and worst case it could even be as high as 150,000?” Or did he flat out say, “150,000, baby! Buy your gravestone now while supplies last.”? I wanna see the quotes.

But, if Ferguson really did have a series of previous errors, then, yeah, Why did anyone ever listen to this guy?

In the above-linked article, Fund seems to be asking the question rhetorically.

But it’s a good question, so let’s try to answer it. Here are a few possibilities:

1. Ferguson didn’t really make all those errors; if you look at his actual statements, he was sane and reasonable.

Could be. I can’t evaluate this one based on the information available to me right now, so let’s move on.

[Indeed, there seems to be some truth to this explanation; see P.S. below.]

2. Nobody realized Ferguson had made all those errors. That’s true of me—I’d never heard of the guy before all this coronavirus news.

We may be coming to a real explanation here. If a researcher has success, you can find evidence of it—you’ll see lots of citations, a prestigious position, etc. But if a researcher makes mistakes, it’s more of a secret. Google the name and you’ll find some criticism, but it’s hard to know what to make of it. Online criticism doesn’t seem like hard evidence. Even published papers criticizing published work typically don’t have the impact of the original publications.

3. Ferguson played a role in the system. He told people what they wanted to hear—or, at least, what some people wanted to hear. Maybe he played the role of professional doomsayer.

There must be something to this. You might say: Sure, but if they wanted a doomsayer, why not find someone who hadn’t made all those bad predictions? But that misses the point. If someone’s job is to play a role, to speak from the script no matter what the data say, then doing bad work is a kind of positive qualification, in that it demonstrates one’s willingness to play that role.

But this only takes us part of the way there. OK, so Ferguson played a role. But why would the government want him to play that role? If you buy the argument of Fund (the author of the above-quoted article), the shutdowns were a mistake, destructive economically and unnecessary from the standpoint of public health. For the government to follow such advice, presumably someone must have been convinced of Ferguson’s argument from a policy perspective. So that brings us back to points 1 and 2 above.

4. A reputational incumbency effect. Once someone is considered an expert, they stay an expert, absent unusual circumstances. Consider Dr. Oz, who’s an expert because people consider him an expert.

5. Low standards. We’ve talked about this before. Lots of tenured and accoladed professors at top universities do bad work. I’m not just talking about scandals such as pizzagate or that ESP paper or epic embarrassments such as himmicanes; I’m talking more about everyday mediocrity: bestselling books or papers in top journals that are constructed out of weak evidence. See for example here, here, and here.

The point is, what it takes to be a celebrated academic is to have some successes. You’re defined by the best thing you did, not the worst.

And maybe that’s a good thing. After all, lots of people can do bad work: doing bad work doesn’t make you special. I proved a false theorem once! But doing good work, that’s something. Now, some of these celebrity academics have never done any wonderful work, at least as far as I can tell. But they’re benefiting from the general principle.

On the other hand, if the goal is policy advice, maybe it’s better to judge people by their worst. I’m not sure.

Not that we’re any better here in the U.S., where these academics have had influence in government.

Taking the long view, organizations continue to get staffed with knaves and fools. Eternal vigilance etc. Screaming at people in the press isn’t a full solution, but it’s a start.

P.S. There seems to be some truth to explanation 1 above, “Ferguson didn’t really make all those errors; if you look at his actual statements, he was sane and reasonable.” From Tom in comments:

Mad Cow paper:
https://www.ncbi.nlm.nih.gov/pubmed/11786878

From abstract:
“Extending the analysis to consider absolute risk, we estimate the 95% confidence interval for future vCJD mortality to be 50 to 50,000 human deaths considering exposure to bovine BSE alone, with the upper bound increasing to 150,000 once we include exposure from the worst-case ovine BSE scenario examined.”

Consistent with the “up to 50,000” quote but the quote fails to mention the lower bound.

See also Vidur’s comment which discusses some of the other forecasts.

Laplace’s Demon: A Seminar Series about Bayesian Machine Learning at Scale

David Rohde points us to this new seminar series that has the following description:

Machine learning is changing the world we live in at a breakneck pace. From image recognition and generation, to the deployment of recommender systems, it seems to be breaking new ground constantly and influencing almost every aspect of our lives. In this seminar series we ask distinguished speakers to comment on what role Bayesian statistics and Bayesian machine learning have in this rapidly changing landscape. Do we need to optimally process information or borrow strength in the big data era? Are philosophical concepts such as coherence and the likelihood principle relevant when you are running a large scale recommender system? Are variational approximations, MCMC or EP appropriate in a production environment? Can I use the propensity score and call myself a Bayesian? How can I elicit a prior over a massive dataset? Is Bayes a reasonable theory of how to be perfect but a hopeless theory of how to be good? Do we need Bayes when we can just A/B test? What combinations of pragmatism and idealism can be used to deploy Bayesian machine learning in a large scale live system? We ask Bayesian believers, Bayesian pragmatists and Bayesian sceptics to comment on all of these subjects and more.
The audience is machine learning practitioners and statisticians from academia and industry.

The first talk is by Christian Robert, at 11am on Wed 13 May.

Make Andrew happy with one simple ggplot trick

By default, ggplot expands the space above and below the x-axis (and to the left and right of the y-axis). Andrew has made it pretty clear that he thinks the x axis should be drawn at y = 0. To remove the extra space around the axes when you have continuous (not discrete or log scale) axes, add the following to a ggplot plot,

plot <-
  plot + 
  scale_x_continuous(expand = c(0, 0)) + 
  scale_y_continuous(expand = c(0, 0))

Maybe it could even go in a theme.

Hats off to A5C1D2H2I1M1N2O1R2T1 (I can't make these handles up) for posting the solution on Stack Overflow.
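In case you want to try it directly, here’s a self-contained version of the trick with a made-up data frame (assuming continuous, non-negative data on both axes):

library(ggplot2)

# Minimal example: remove ggplot's default padding so the axes meet the data at zero.
df <- data.frame(x = seq(0, 10, by = 0.5))
df$y <- df$x^2

p <- ggplot(df, aes(x, y)) +
  geom_line() +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0))

print(p)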

We need better default plots for regression.

Robin Lee writes:

To check for linearity and homoscedasticity, we are taught to plot residuals against y fitted value in many statistics classes.

However, plotting residuals against fitted values has always been a confusing practice that I know I should use but can’t quite explain why.
It was not until this week that I understood the concept of “Y fitted values as a function of all explanatory variables.”

Since homoscedasticity means that the conditional variance of the error term is constant and does not vary as a function of the explanatory variable, it seems logical that we compare residuals (error term) with x (explanatory).

Why don’t we plot for residuals against x observed values? Or why don’t we teach students to plot residuals against x variable?

One reason I can think of to explain this practice is that while it is possible to plot residuals against observed x in a one-variable regression model, it becomes tricky when there are two or more X variables. It might require multiple scatter plots, one per X variable.
Since the fitted values are a function of all explanatory variables, it becomes convenient to make a single plot.

Another reason is that this practice is reinforced by tool availability.
Calling plot() on an lm model in R returns four diagnostic plots, the first of which plots residuals against fitted values.
To plot residuals against each of the X variables, users need to know how to extract the residuals from the lm model object.

Any other reasons, from a statistical, communication, or teaching angle, to explain the conventional practice of plotting residuals against fitted values?

My reply: Yes, we do recommend plotting residuals vs. individual x variables. Jennifer and I give an example in the logistic regression chapter of our book. But I guess this is not so well known, given that you asked this question! Hence this post.
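Here’s a minimal sketch in R (simulated data, not the example from the book) of plotting residuals against each predictor in addition to the default residuals-versus-fitted plot:

# Simulated example: residuals vs. fitted values, and residuals vs. each predictor.
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# One plot of residuals against each explanatory variable
mf <- model.frame(fit)
for (v in c("x1", "x2")) {
  plot(mf[[v]], resid(fit), xlab = v, ylab = "Residuals")
  abline(h = 0, lty = 2)
}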

My other reason for posting this is to echo Robin’s point about the importance of defaults. We need better default plots for regression so that practitioners will do a better job learning about their data. Otherwise we’re looking at endless vistas of useless q-q plots.

“Positive Claims get Publicity, Refutations do Not: Evidence from the 2020 Flu”

Part 1

Andrew Lilley, Gianluca Rinaldi, and Matthew Lilley write:

You might be familiar with a recent paper by Correia, Luck, and Verner, who argued that cities that enacted non-pharmaceutical interventions earlier / for longer during the Spanish Flu of 1918 had higher subsequent economic growth. The paper has had extensive media coverage – e.g. it has been covered by the NYT, Washington Post, The Economist, The Boston Globe, Bloomberg, and of course NPR.

We were surprised by their results and tried to replicate them. To investigate the question further, we collected additional data to extend their sample from 1899 to 1927. Unfortunately, from extending their sample backwards, it seems that their headline results are driven by pre-existing differential growth patterns across US cities. We found that NPI measures in 1918 explain growth between 1899 and 1914 just as well as 1914 to 1919 growth. Further, we also found that the 1914-1919 growth result is mostly driven by estimated city level population growth that happened from 1910-1917. The approaches that we have taken to deal with these spurious results leave us finding only a noisy zero; we cannot rule out either substantial positive or negative economic effects from NPIs.

Here you can find our comment as well as the data we used and the replication code.

First off, I don’t find this depressing. We all make mistakes. If Correira et al. made a mistake on a newsworthy topic and it got lots of press coverage, then, sure, that’s too bad, but such things happen. It’s a sign of cheer, not depression, that a paper appearing on 26 Mar 2020 gets shot down on 2 May. A solid review in 5 weeks—that’s pretty good!

Part 2

Now it’s time for me to look at the articles. First, the Correia et al. paper:

It seems funny that they use 1914 as a baseline for an intervention in 1918. Wouldn’t it make more sense to use the change from 1917 to 1921, or something like that? I guess that they had to make use of the data that were available—but I’m already concerned, given that more than half of any changes are happening before the treatment even occurred.

Beyond that, the pattern in the graph seems driven by a combination of relatively low mortality and high growth in western cities. I could imagine lots of reasons for both of these factors, having little or nothing to do with social distancing etc.

And check this out:

Just pretend they did the right thing and had the y-axis go down to 0. Then you’ll notice two things: First, yeah, the flu in 1918 really was a big deal—almost 4 times the death rate compared to early years. Second, it was only 4 times the death rate. I mean, yeah, that’s horrible, but only a factor of 4, not a factor of 10. I guess what I’m saying is, I hadn’t realized how much of a scourge flu/pneumonia was even in non-“pandemic” years. Interesting.

OK, now to the big thing. They have an observational study, comparing cities with more or less social distancing policies (“school closure, public gathering bans, and isolation and quarantine”). They make use of an existing database from 2007 with information on flu-control policies in 43 cities. And here are the pre-treatment variables they adjust for:

State agriculture and manufacturing employment shares, state and city population, and the urban population share are from the 1910 census. . . . State 1910 income per capita . . . State-level WWI servicemen mortality . . . annual state-level population and density estimates from the Census Bureau. City-level public spending, including health spending . . . City and state-level exports. . . . the importance of agriculture in each city’s state. . . . we also present tests controlling for health spending and total public spending per capita in a city. . . . 1910 [state] agriculture employment share, the [state] 1910 urban population share, and the 1910 income per-capita at the state level. . . . log of 1910 population and the 1914 manufacturing employment to population ratio. However, unlike in our analysis on the effect of the 1918 Flu on the real economy, here we control for past city-level mortality as of 1917

They discuss and dismiss the possibility of endogeneity:

For instance, local officials may be more inclined to intervene if the local exposure to the flu is higher . . . An alternative concern is that interventions reflect the quality of local institutions, including the local health care system. Places with better institutions may have a lower cost of intervening, as well as higher growth prospects. There are, however, important details that suggest that the variation across cities is unrelated to economic fundamentals and is instead largely explained by city location. First, local responses were not driven by a federal response . . . Second . . . the second wave of the 1918 flu pandemic spread from east to west . . . the distance to the east coast explains a large part of the variation in NPIs across cities . . .

That’s the pattern that I noticed in the top graph above. I’m still concerned that the cities on the west coast were just growing faster already. Remember, their outcome is growth from 1914 to 1919, so more than half the growth had already happened before the pandemic came.

They also look at changes in city-level bank assets from October 1918 to October 1919. But I’m not so thrilled with this measure. First, I’m not quite sure what to make of it. Second, using a baseline of October 1918 seems a bit too late, as the pandemic had already started by then. So I’ll set that analysis aside, which I think is fine. The abstract only mentions manufacturing output, so let’s just stick with that one. But then I really am concerned about differences between east and west.

They report:

Reassuringly for our purposes, other than differences in the longitude and the variation in the local industry structure, there are no observable differences across cities with different NPIs.

Anyway, here are their main city-level findings. I think we’re supposed to focus on the estimates for 1919, but I guess you can look at 1921 and 1923 also:

The confidence intervals for manufacturing output exclude zero and the intervals for manufacturing employment in 1919 include zero; I guess that’s why they included manufacturing output but not manufacturing employment in the abstract. Fair enough: they reported all their findings in the paper but focused on the statistically significant results in the abstract. But then I’m confused why in their pretty graph (see above) they showed manufacturing employment instead of manufacturing output. Shouldn’t the graph match what’s in the abstract and the news reports?

In any case, I remain concerned that the cities in the west had more time to prepare for the pandemic, so they implemented social distancing etc., and they were growing anyway. This would not say that the authors’ substantive hypothesis—social distancing is good for the economy—is wrong; it just says that the data are hopelessly confounded so they don’t really answer the question as implied in the paper.

They do some robustness checks:

These controls are included to account for other time-varying shocks that may be correlated with local NPI policies. . . . city-level manufacturing employment and output growth from 1909 to 1914 to control for potential city-level pre-trends. . . . census region fixed effects . . . longitude. . . . state WWI servicemen mortality per capita . . . city population density . . . health and total public spending per capita in 1917 . . . city-level estimates of total exports and exports per capita in 1915.

They don’t adjust for all of these at once. That would be difficult, of course. I’m just thinking that maybe linear regression isn’t the right way to do this. Linear regression is great, but maybe before throwing these cities in the regression, you need to do some subsetting, maybe separate analyses for slower and faster-growing cities? Or some matching. Maybe Lowell and Fall River weren’t going to increase their manufacturing base anyway?

At this point you might say that I’m just being an asshole. These researchers wrote this nice paper on an important problem, then I had a question about geography but they had already addressed it in their robustness check, so what’s my complaint? Sure, not all the numbers in the robustness table have p-values less than 0.05, but I’m the last person to be a statistical significance nazi.

All I can say is . . . sorry, I’m still concerned! Again, I’m not saying the authors’ economic story is wrong—I have no idea—I just question the idea that this analysis is anything more than correlational. To put it another way—I could also easily imagine the authors’ economic story being correct but the statistical analysis going in the opposite direction. The tide of the correlation is so much larger than the swells of the treatment effect.

To put it yet another way, they’re not looking for a needle in a haystack, they’re looking for a needle in a pile of needles.

Again, this is not meant as a slam on the authors of the article in question. They’re using standard methods, they’ve obviously put in a lot of work (except for those bizarre stretchy maps in Figure 3; I don’t know how that happened!), the paper is clearly written, etc. etc. Nothing improper here; I just don’t think this method will solve their problem.

Interlude

And that’s where the story would usually end. A controversial paper is published, gets splashed all over NPR but people have questions. I read the paper and I’m unsatisfied too. Half the world concludes that if I’m not convinced by the paper, that it’s no good; the other half believes in baseball, apple pie, Chevrolet, and econometrics and concludes that I’m just a crabapple; and the authors of the original paper don’t know what to think so they wait for their referee reports to decide how to buttress their conclusions (or maybe to reassess their claims, but, over the years, I’ve seen a lot more buttressing than I’ve seen reassessing).

Part 3

But then there was that email from Lilley et al., who in their paper write:

Using data from 43 US cities, Correia, Luck, and Verner [2020] find that the 1918 Flu pandemic had strong negative effects on economic growth, but that Non Pharmaceutical Interventions (NPIs) mitigated these adverse economic effects. . . . We collect additional data which shows that those results are driven by population growth between 1910 to 1917, before the pandemic. We also extend their difference in differences analysis to earlier periods, and find that once we account for pre-existing differential trends, the estimated effect of NPIs on economic growth are a noisy zero; we can neither rule out substantial positive nor negative effects of NPIs on employment growth.

“A noisy zero” . . . I kinda like that. But I don’t like the “zero” part. Better, I think, to just say that any effects are hard to ascertain from these data.

Anyway, here’s a key graph:

We see those western cities with high growth. But check out San Francisco. Yes, it was in the west. But it had already filled out by 1900 and was not growing fast like Oakland, Seattle, etc. So maybe no need to use lack of social distancing policies to explain San Francisco’s relatively slow post-1917 growth.

Lilley et al. run some regressions of their own and conclude, “locations which implemented NPIs more aggressively grew faster than those which did not both before the policy implementation, and afterward.” That makes sense to me, given the structure of the problem.
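To make the structure of the problem concrete, here’s a toy simulation in R (all numbers invented; this is not the Correia et al. or Lilley et al. data) in which the true NPI effect is zero but a naive cross-city regression picks up the pre-existing growth trend, and a placebo regression on pre-period growth "works" just as well:

# Toy simulation of the confound: faster-growing cities also adopt more NPIs,
# so post-period growth correlates with NPIs even when the true effect is zero.
set.seed(1918)
n_cities <- 43
trend <- rnorm(n_cities, mean = 0.02, sd = 0.02)          # city-specific growth trend
npi   <- 0.5 + 5 * trend + rnorm(n_cities, sd = 0.1)      # NPI intensity, correlated with trend
growth_pre  <- 5 * trend + rnorm(n_cities, sd = 0.03)     # growth before the pandemic
growth_post <- 5 * trend + 0 * npi + rnorm(n_cities, sd = 0.03)  # true NPI effect = 0

coef(summary(lm(growth_post ~ npi)))  # spurious "effect" of NPIs on post-period growth
coef(summary(lm(growth_pre  ~ npi)))  # placebo: NPIs "explain" pre-period growth too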

I have not examined Lilley et al.’s analysis in detail, and it’s possible they missed something important. It’s hard for me to see how the original strong conclusions of Correia et al. could be salvaged, but who knows. As always, I’m open to evidence and argument.

There’s the potential for bias in my conclusions, if for no other reason than that the authors of the first article didn’t contact me; the authors of the second article did. It’s natural for me to take the side of people who email me things. I don’t always, but a bias in that direction would be expected. On the other hand, there could be selection bias in the other direction: I expect that people who send me things are going to expect some scrutiny.

The larger issue is that there seem to be more outlets for positive claims than negative claims. “We found X” can get you publication in a top journal and major media coverage—in this case, even before publication. “We didn’t find X” . . . that’s a tougher sell. Sure, every once in a while there’s some attention for a non-replication of some well-known finding, but (a) such events are not so common, and (b) they still require the prior existence of a celebrated positive finding to react to. You could say that this is fine—positive claims should be more newsworthy than negative claims—and that’s fair enough, as long as we come in with the expectation that those positive claims will often be wrong, or, at least, not supported by the data.

University of Washington biostatistician unhappy with ever-changing University of Washington coronavirus projections

The University of Washington in Seattle is a big place.

It includes the Institute for Health Metrics and Evaluation (IHME), which has produced a widely-circulated and widely-criticized coronavirus model. As we’ve discussed, the IHME model is essentially a curve-fitting exercise that makes projections using the second derivative of the time trend on the log scale. It is what it is; unfortunately the methods are not so transparent and their forecasts keep changing, not just because new data come in but also because they keep rejiggering their model. Model improvement is always a possibility; on important problems we’re always working near Cantor’s corner. But if your model is a hot steaming mess and you keep having to add new terms to keep it under control, then that might be a sign that curve fitting isn’t doing the job.
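To give a sense of what that kind of curve fitting can look like, here is a minimal sketch of the general idea (this is not the IHME code or their exact functional form; all names are mine). Fitting a quadratic time trend to log daily deaths amounts to fitting a Gaussian-shaped epidemic curve, and the quadratic coefficient is the second-derivative term that governs how quickly the projected curve turns over:

// Minimal sketch of log-scale curve fitting, not the IHME model.
// The curvature c (the second derivative on the log scale) determines
// how fast the projected daily-death curve peaks and declines.
data {
  int<lower=1> T;
  vector[T] t;                      // day index
  vector<lower=0>[T] daily_deaths;  // assumed strictly positive counts
}
parameters {
  real a;
  real b;
  real<upper=0> c;                  // negative curvature: curve eventually falls
  real<lower=0> sigma;              // residual sd on the log scale
}
model {
  log(daily_deaths) ~ normal(a + b * t + c * square(t), sigma);
}

The point of the sketch is only that, once you fit on the log scale, a small change in the assumed curvature translates into a large change in the projected tail, which is why changes to the curve family can move the forecasts so much.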

Anyway, back to the University of Washington. In addition to the doctors, economists, and others who staff the aforementioned institute, this university also has a biostatistics department. In that department is Ruth Etzioni, who does cancer modeling and longitudinal modeling more generally. And she’s not happy with the IHME report.

Etzioni writes:

As a long-time modeler, the latest update from Chris Murray and the IHME model makes me [Etzioni] cringe.

On May 4, the IHME called a press conference to release the results of their model update which showed a staggering departure from their prior predictions of about 60,000 deaths.

The new prediction through August is more than double that, at over 130,000 deaths. Murray, the institute’s director and chief model spokesman told reporters that the update had taken mobility data into account which captured an uptick in movement in re-opening states.

According to Politico, the primary reason given by Murray for the increase was many states’ “premature relaxation of social distancing.” . . .

It makes a nice story, to tell the world that the reason your model’s predictions have changed is because the population’s behavior has changed. Indeed, that was the explanation given by IHME for their model revising its early death toll of about 90,000 dramatically downward. Then, Dr Murray explained that the change in predictions showed that social distancing had been a wild success – better than we could ever have imagined. That, in turn, led to howls that we had over-reacted by shutting down and staying home. I said then that that interpretation was wrong and misleading and placed far too much credibility in the early predictions. The early model results, based on a massive assumption about the shape of the mortality curve, and driven by thin local data, were not to be compared with later predictions based on major updates to both its inputs and its structure.

The same thing is happening now. A quick skim of the IHME’s model updates site leads one to an eye-glazing list of changes, including some that have nothing to do with mobility and everything to do with improving the fit to the past. Here is one that truly raised my eyebrows: “Since our initial release, we have increased the number of multi-Gaussian distribution weights that inform our death model’s predictions for epidemic peaks and downward trends. As of today’s release, we are including 29 elements… this expansion now allows for longer epidemic peaks and tails, such that daily COVID-19 deaths are not predicted to fall as steeply as in previous releases.” This change alters the shape of the assumed mortality curve so it does not go down as fast; it alone could explain a substantial portion of the inflation in the revised mortality predictions.

The proof is in the Washington State pudding. The IHME is no longer predicting that we will have less than one case per million on May 28th and can therefore safely reopen, as it did in its previous incarnation. But little has changed here, either on the policy front or according to Google Mobility reports through April.

I am not aiming my comments at the IHME modeling team, which I imagine is sincerely doing its best to deliver results that match the data and produce ever-more-complex predictions of the future. They are working overtime to fit the rapidly evolving and imperfect data, marching to an impossible drumbeat of deadlines. To their credit, the update does note that both model changes and increased mobility projections could account for the change in predictions. But that never made it into the headlines. And that is a problem of transparency.

Transparency begins in the sincere effort by those who communicate models to make sure that they are properly interpreted by the policymakers and public that are using them. The IHME pays lip service to transparency by documenting their model’s updates on their website. But their pages-long description is chock full of technical fine print and is hard to understand, even for a seasoned modeler like myself. A key part of transparency is acknowledging your model’s limitations, and uncertainties. This is never front and center in the IHME’s updates. It needs to be.

Etzioni concludes:

This epidemic took root and grew massively in every state from miniscule beginnings. We should all be sobered by our real-time experience of exponential growth. If there is an ambient prevalence of more than a handful of cases in any state, then anything that increases the potential for transmission will lead to a re-growth. We do not need a model to be able to predict that. But, as we plan for how to reopen in each state of our union, we need to know what extent of regrowth we can manage. And models can help us with that.

When and how much we can reopen will depend on the surveillance and containment infrastructure that we put in place to control upticks and outbreaks. I am convinced that models can help to think clearly about complex policy questions about the balance between changes that increase transmission and measures to contain it. Models, along with other data and evidence, can guide us towards making sensible policy decisions. I have seen this happen time and time again in my work advising national cancer screening policy panels. But as modelers, we have a responsibility. We must make sure that the key caveats of our work find their way into the headlines and are not relegated to the fine print.

I agree. There is a fundamental difficulty here in that one thing that modelers have to offer is their expertise in modeling. If you make your model too transparent, then anyone can run it! And then what happened to your expertise?

I’m not saying that anyone is making their methods obscure on purpose. Not at all. Rather, the challenge is that it’s hard to write things up. Unless you have a good workflow, obscurity comes by default, and it takes extra effort to be transparent. That’s one advantage of using Stan.

Anyway, transparency takes effort, so you’re only gonna do it if you have an incentive to do so. One incentive is that if your work is transparent, it will be easier for other people to help you. Another advantage is that your work might ultimately be more influential, as other people can take your code and alter it for their purposes. These are big advantages in the new world of open science, but not everyone has moved to this world.

New Within-Chain Parallelisation in Stan 2.23: This One’s Easy for Everyone!

What’s new? The new and shiny reduce_sum facility released with Stan 2.23 is far more user-friendly than what came before and makes it much easier to scale Stan programs across more CPU cores. While Stan is awesome for writing models, as the size of the data or the complexity of the model grows it can become impractical to work iteratively with the model because execution times get too long. Our new reduce_sum facility allows users to utilise more than one CPU per chain, so that performance can be scaled to the needs of the user, provided they have access to the necessary resources, such as a multi-core computer or (even better) a large cluster. reduce_sum is designed to calculate, in parallel, a (large) sum of independent function evaluations; in most Stan programs this is the evaluation of the likelihood of the observed data, where each observation contributes an independent term (Gaussian-process models would not qualify, though).

Where do we come from? Before 2.23, the map_rect facility was the only tool in Stan enabling CPU-based parallelisation. Unfortunately, map_rect has an awkward interface, since it forces the user to pack their model into a set of weird data structures. Using map_rect often requires a complete rewrite of the model, which is error-prone, time-intensive, and certainly not user-friendly. In addition, the chunks of work had to be formed manually, leading to great confusion around how to “shard” things. As a result, map_rect was only used by a small number of super-users. I feel like I should apologise for map_rect, given that I proposed the design. Still, map_rect did drive some crazy analyses with up to 600 cores!

What is it about? reduce_sum leverages the fact that summation is associative: a large sum of independent terms can be broken into an arbitrary number of partial sums. The user therefore provides a “partial sum” function, which must follow a convention that allows it to evaluate any partial sum. The key to user-friendliness is that the partial sum function accepts an arbitrary number of additional arguments of arbitrary structure, so the user can formulate their model naturally, with no awkward packing and unpacking. Finally, the slicing into smaller partial sums is fully automated, which tunes the computational work to the available resources.
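To make this concrete, here is a minimal sketch of a reduce_sum-parallelised likelihood for a simple logistic regression (the data and parameter names are mine, not from any particular application):

functions {
  // Partial sum over a slice of the outcomes; start and end are the 1-based
  // positions of y_slice within the full data and are supplied automatically.
  real partial_sum(int[] y_slice, int start, int end,
                   vector x, real alpha, real beta) {
    return bernoulli_logit_lpmf(y_slice | alpha + beta * x[start:end]);
  }
}
data {
  int<lower=0> N;
  int<lower=0, upper=1> y[N];
  vector[N] x;
}
parameters {
  real alpha;
  real beta;
}
model {
  int grainsize = 1;  // 1 = let the scheduler pick the slice sizes automatically
  alpha ~ normal(0, 2);
  beta ~ normal(0, 2);
  // Equivalent to bernoulli_logit_lpmf(y | alpha + beta * x),
  // but evaluated as parallel partial sums.
  target += reduce_sum(partial_sum, y, grainsize, x, alpha, beta);
}

Additional shared arguments are simply appended after grainsize and passed unchanged to every partial-sum call, and moving more per-observation work into the partial sum function (as discussed below) only requires editing that one function.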

What can users expect? As usual, the answer is “it depends.” Great… but on what? Well, first of all, we do not parallelise the entire Stan program; only a fraction of it runs in parallel. The theoretical speedup in this case is described by Amdahl’s law (the plot below is taken from the Wikipedia page):

[Figure: Amdahl’s law speedup curves: theoretical speedup vs. number of processors for several parallel fractions (from Wikipedia)]
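For reference, Amdahl’s law says that if a fraction $p$ of the work is parallelised over $n$ cores, the best possible speedup is

$$ S(n) = \frac{1}{(1 - p) + p/n}, $$

so with $p = 0.95$ the speedup is capped at 20 no matter how many cores you add.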

You can see that only when the parallel fraction is really large (beyond 95%) can you expect very good scaling up to many cores. Still, doubling the speed is easy in most cases with just 2-3 cores. Thus, users should pack as much of their Stan program as possible into the partial sum function to increase the parallel fraction of the workload – not only the data likelihood, but ideally also the calculation of each observation’s model mean, for example. For Stan programs this will usually mean moving code from the transformed parameters and model blocks into the partial sum function. As a bonus, we have observed that doing so will speed up your program even when using only a single core! The reason is that reduce_sum slices the given task into many small ones, which improves the use of CPU caches.

How can users apply it? Easy! Grab CmdStan 2.23 and dive into our documentation (R / Python users may use CmdStanR / CmdStanPy – RStan 2.23 is underway). I recommend going through the documentation in this order:

1. A case study which adapts Richard McElreath’s intro to map_rect for reduce_sum
2. User manual introduction to reduce_sum parallelism with a simple example as well: 23.1 Reduce-Sum
3. Function reference: 9.4 Reduce-Sum Function

I am very happy with the new facility. It was a tremendous piece of work to get this into Stan and I want to thank my Stan team colleagues Ben Bales, Steve Bronder, Rok Cesnovar, and Mitzi Morris for making all of this possible in a really short time frame. We are looking forward to what our users will do with it. We definitely encourage everyone to try it out!

Calibration and recalibration. And more recalibration. IHME forecasts by publication date

Carlos Ungil writes:

The IHME released an update to their model yesterday.

Using now a better model and taking into account the relaxation of mitigation measures their forecast for US deaths has almost doubled to 134k (95% uncertainty range 95k-243k).

My [Ungil’s] charts of the evolution of forecasts across time can be found here.

I haven’t checked Ungil’s graphs—I haven’t looked at the data or code—but, conditional on them being accurate summaries of the IHME’s reports, they tell an interesting story.

Above are a few of the graphs. Lots more at the link.

Imperial College report on Italy is now up

See here. Please share your reactions and suggestions in comments. I’ll be talking with Seth Flaxman tomorrow, and we’d appreciate all your criticisms and suggestions.

All this is important not just for Italy but for making sensible models to inform policy all over the world, including here.

Bayesian analysis of Santa Clara study: Run it yourself in Google Colab, play around with the model, etc!

The other day we posted some Stan models of coronavirus infection rate from the Stanford study in Santa Clara county.

The Bayesian setup worked well because it allowed us to directly incorporate uncertainty in the specificity, sensitivity, and underlying infection rate.
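For reference, the simplest of those models looks roughly like this (a sketch along the lines of the earlier post; the variable names here are mine):

data {
  int<lower=0> y_sample;   // positive tests in the Santa Clara sample
  int<lower=0> n_sample;   // total tests in the Santa Clara sample
  int<lower=0> y_spec;     // negative results among known-negative calibration samples
  int<lower=0> n_spec;     // known-negative calibration samples
  int<lower=0> y_sens;     // positive results among known-positive calibration samples
  int<lower=0> n_sens;     // known-positive calibration samples
}
parameters {
  real<lower=0, upper=1> p;     // prevalence
  real<lower=0, upper=1> spec;  // specificity
  real<lower=0, upper=1> sens;  // sensitivity
}
model {
  // Probability that a sampled person tests positive, mixing true and false positives.
  real p_pos = p * sens + (1 - p) * (1 - spec);
  y_sample ~ binomial(n_sample, p_pos);
  y_spec ~ binomial(n_spec, spec);
  y_sens ~ binomial(n_sens, sens);
}

The point is that sensitivity and specificity are estimated jointly with the prevalence from the calibration data, so their uncertainty propagates directly into the inference for the infection rate.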

Mitzi Morris put all this in a Google Colab notebook so you can run it online, here.

Python

To run it in Python, go here:

Click on Open in Colab and log in using your Gmail account, and you’ll see this:

Then just run the code online, one paragraph at a time, by clicking in the open brackets [ ] at the top left of each paragraph, going down the page.

The first time you click, you’ll get this annoying warning message:

Just click on Run Anyway and it will work.

The first few paragraphs load in CmdStan and upload the model and data. After that the fun begins and you can run the models.

R

Same thing, just start here.

Altering the Stan programs online

There was a way on these Colab pages to go in and alter the code and then re-run the model, which was really helpful for understanding what was going on, as it allowed you to play around with the Stan code. I can’t figure out how to do this with the above pages, but for now you can find everything on GitHub.

P.S. Loki (pictured above) wants to push this commit, but first he wants you to do some unit tests and clean his litter box.