
Capitalist science: The solution to the replication crisis?

Bruce Knuteson pointed me to this article, which begins:

The solution to science’s replication crisis is a new ecosystem in which scientists sell what they learn from their research. In each pairwise transaction, the information seller makes (loses) money if he turns out to be correct (incorrect). Responsibility for the determination of correctness is delegated, with appropriate incentives, to the information purchaser. Each transaction is brokered by a central exchange, which holds money from the anonymous information buyer and anonymous information seller in escrow, and which enforces a set of incentives facilitating the transfer of useful, bluntly honest information from the seller to the buyer. This new ecosystem, capitalist science, directly addresses socialist science’s replication crisis by explicitly rewarding accuracy and penalizing inaccuracy.

The idea seems interesting to me, even though I don’t think it would quite work for my own research as my work tends to be interpretive and descriptive without many true/false claims. But it could perhaps work for others. Some effort is being made right now to set up prediction markets for scientific papers.

Knuteson replied:

Prediction markets have a few features that led me to make different design decisions. Two of note:
– Prices on prediction markets are public. The people I have spoken with in industry seem more willing to pay for information if the information they receive is not automatically made public.
– Prediction markets generally deal with true/false claims. People like being able to ask a broader set of questions.

A bit later, Knuteson wrote:

I read your post “Authority figures in psychology spread more happy talk, still don’t get the point . . .”

You may find this Physics World article interesting: Figuring out a handshake.

I fully agree with you that not all broken eggs can be made into omelets.

Also relevant is this paper where Eric Loken and I consider the idea of peer review as an attempted quality control system, and we discuss proposals such as prediction markets for improving scientific communication.

Bad Numbers: Media-savvy Ivy League prof publishes textbook with a corrupted dataset

[cat picture]

I might not have noticed this one, except that it happened to involve Congressional elections, and this is an area I know something about.

The story goes like this. I’m working to finish up Regression and Other Stories, going through the examples. There’s one where we fit a model to predict the 1988 elections for the U.S. House of Representatives, district by district, given the results from the previous election and incumbency status. We fit a linear regression, then used the fitted model to predict 1990, then compared to the actual election results from 1990. A clean example with just a bit of realism—the model doesn’t fit perfectly, there’s some missing data, there are some choices in how to set up the model.

This example was in Data Analysis Using Regression and Multilevel/Hierarchical Models—that’s the book that Regression and Other Stories is the updated version of the first half of—and for this new book I just want to redo the predictions using stan_glm() and posterior_predict(), which is simpler and more direct than the hacky way we were doing predictions before.
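
For readers who want a feel for that workflow, here’s a minimal sketch of the stan_glm()/posterior_predict() approach. The data frames and variable names (data_88, data_90, vote, past_vote, inc) are hypothetical stand-ins for whatever is in the actual files, not the book’s code:

library(rstanarm)

# Fit the regression: Democratic vote share in 1988 as a function of the
# previous election's vote share and incumbency status.
fit <- stan_glm(vote ~ past_vote + inc, data = data_88)

# Simulate the 1990 elections from the fitted model, then summarize a
# nonlinear function of those predictions: the number of seats won.
sims <- posterior_predict(fit, newdata = data_90)  # one row per posterior draw
dem_seats <- rowSums(sims > 0.5)                   # predicted Democratic seats, by draw
quantile(dem_seats, c(0.05, 0.5, 0.95))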

So, no problem. In the new book chapter I adapt the code, cleaning it in various places, then I open an R window and an emacs window for my R script and check that everything works ok. Ummm, first I gotta find the directory with the old code and data, I do that, everything seems to work all right. . . .

I look over what I wrote one more time. It’s kinda complicated: I’d imputed winners of uncontested elections at 75% of the two-party vote—that’s a reasonable choice, it’s based on some analysis we did many years ago of the votes in districts the election before or after they became uncontested—but then there was a tricky thing where I excluded some of these when fitting the regression and put them back in the imputation. In rewriting the example, it seemed simpler to just impute all those uncontested elections once and for all and then do the modeling and fitting on all the districts. Not perfect—and I can explain that in the text—but less of a distraction from the main point in this section, which is the use of simulation for nonlinear predictors, in this case the number of seats predicted to be won by each party in the next election.
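
In code, that one-shot imputation is just a couple of lines. Here’s a sketch, again with hypothetical variable names, treating a two-party vote share of exactly 1 or 0 as an uncontested seat:

impute_uncontested <- function(vote) {
  # Winners of uncontested seats are imputed at 75% of the two-party vote,
  # losers at 25%.
  vote[vote == 1] <- 0.75
  vote[vote == 0] <- 0.25
  vote
}
data_88$vote <- impute_uncontested(data_88$vote)
data_88$past_vote <- impute_uncontested(data_88$past_vote)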

Here’s what I had in the text: “Many of the elections were uncontested in 1988, so that y_i = 0 or 1 exactly; for simplicity, we exclude these from our analysis. . . . We also exclude any elections that were won by third parties. This leaves us with n = 343 congressional elections for the analysis.” So I went back to the R script and put the (suitably imputed) uncontested elections back in. This left me with 411 elections in the dataset, out of 435. The rest were NA’s. And I rewrote the paragraph to simply say: “We exclude any elections that were won by third parties in 1986 or 1988. This leaves us with $n=411$ congressional elections for the analysis.”

But . . . wait a minute! Were there really 24 districts won by third parties in those years? That doesn’t sound right. I go to one of the relevant data files, “1986.asc,” and scan down until I find some of the districts in question:

The first column’s the state (we were using “ICPSR codes,” and states 44, 45, and 46 are Georgia, Louisiana, and Mississippi, respectively), the second is the congressional district, third is incumbency (+1 for Democrat running for reelection, -1 for Republican, 0 for an open seat), and the last two columns are the votes received by the Democratic and Republican candidates. If one of those last two columns is 0, that’s an uncontested election. If both are 0, I was calling it a third-party victory.
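
Here’s roughly how one might read the file and flag those cases in R. The column names are my own labels, and I use <= 0 in the comparisons so that either a 0 or a -9 coding gets picked up:

d <- read.table("1986.asc",
                col.names = c("state", "district", "inc", "dem_votes", "rep_votes"))
# One missing vote column: an uncontested race. Both columns missing: the
# districts I had been treating as third-party victories.
uncontested <- xor(d$dem_votes <= 0, d$rep_votes <= 0)
both_missing <- d$dem_votes <= 0 & d$rep_votes <= 0
table(uncontested, both_missing)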

But can this be right?

Here’s the relevant section from the codebook:

Nothing about what to do if both columns are 0.

Also this:

For those districts with both columns -9, it says the election didn’t take place, or there was a third party victory, or there was an at-large election.

Whassup? Let’s check Louisiana (state 45 in the above display). Google *Louisiana 1986 House of Representatives Elections* and it’s right there on Wikipedia. I have no idea who went to the trouble of entering all this information (or who went to the trouble of writing a computer program to enter all this information), but here it is:

So it looks like the data table I had was just incomplete. I have no idea how this happened, but it’s kinda embarrassing that I never noticed. What with all those uncontested elections, I didn’t really look carefully at the data with 0’s or -9’s in both columns.

Also, the incumbency information isn’t all correct. Our file had LA-6 with a Republican incumbent running for reelection, but according to Wikipedia, the actual election was an open seat (but with the Republican running unopposed).

I’m not sure what’s the best way forward. Putting together a new dataset for all those decades of elections, that would be a lot of work. But maybe such a file now exists somewhere? The easiest solution would be to clean up the existing dataset just for the three elections I need for the example: 1986, 1988, 1990. On the other hand, if I’m going to do that anyway, maybe better to use some more recent data, such as 2006, 2008, 2010.

No big deal—it’s just one example in the book—but, still, it’s a mistake I should never have made.

This is all a good example of the benefits of a reproducible workflow. It was through my efforts to put together clean, reproducible code that I discovered the problem.

Also, errors in this dataset could have propagated into errors in these published articles:

[2008] Estimating incumbency advantage and its variation, as an example of a before/after study (with discussion). Journal of the American Statistical Association 103, 437–451. (Andrew Gelman and Zaiying Huang)

[1991] Systemic consequences of incumbency advantage in U.S. House elections. American Journal of Political Science 35, 110–138. (Gary King and Andrew Gelman)

[1990] Estimating incumbency advantage without bias. American Journal of Political Science 34, 1142–1164. (Andrew Gelman and Gary King)

I’m guessing that the main conclusions won’t change, as the total number of these excluded cases is small. Of course those papers were all written before the era of reproducible analyses, so it’s not like the data and code are all there for you to re-run.

Problems with the jargon “statistically significant” and “clinically significant”

Someone writes:

After listening to your EconTalk episode a few weeks ago, I have a question about interpreting treatment effect magnitudes, effect sizes, SDs, etc. I studied Econ/Math undergrad and worked at a social science research institution in health policy as a research assistant, so I have a good amount of background.

At the institution where I worked we started adopting the jargon “statistically significant” AND “clinically significant.” The latter describes the importance of the magnitude in the real world. However, my understanding of standard t-testing and p-values is that since the null hypothesis is treatment == 0, if we can reject the null at p < .05, then this is only evidence that the treatment is > 0. Because the test was against 0, we cannot make any additional claims about the magnitude. If we wanted to make claims about the magnitude, then we would need to test against the null hypothesis of treatment effect == [whatever threshold we assess as clinically significant]. So, what do you think? Were we always over-interpreting the magnitude results or am I missing something here?

My reply:

Section 2.4 of this recent paper with John Carlin explains the problem with talking about “practical” (or “clinical”) significance.

More generally, that’s right, the hypothesis test is, at best, nothing more than the rejection of a null hypothesis that nobody should care about. In real life, treatment effects are not exactly zero. A treatment will help some people and hurt others; it will have some average benefit which will in turn depend on the population being studied and the settings where the treatment is being applied.

But, no, I disagree with your statement that, if we wanted to make claims about the magnitude, then we would need to test other hypotheses. The whole “hypothesis” thing just misses the point. There are no “hypotheses” here in the traditional statistical sense. The hypothesis is that some intervention helps more than it hurts, for some people in some settings. The way to go, I think, is to just model these treatment effects directly. Estimate the treatment effect and its variation, and go from there. Forget the hypotheses and p-values entirely.
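
As a purely illustrative sketch of what that looks like in practice (all names here are invented): fit the model, then use the simulations to make direct statements about the magnitude of the effect, including how it compares to whatever threshold you consider clinically meaningful.

library(rstanarm)

# Estimate the treatment effect directly; no null hypothesis test anywhere.
fit <- stan_glm(outcome ~ treatment + baseline, data = trial_data)

# Posterior simulations of the treatment effect.
draws <- as.matrix(fit, pars = "treatment")

quantile(draws, c(0.025, 0.5, 0.975))  # plausible range for the effect
mean(draws > clinical_threshold)       # Pr(effect exceeds whatever threshold you care about)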

Analyze all your comparisons. That’s better than looking at the max difference and trying to do a multiple comparisons correction.

[cat picture]

The following email came in:

I’m in a PhD program (poli sci) with a heavy emphasis on methods. One thing that my statistics courses emphasize, but that doesn’t get much attention in my poli sci courses, is the problem of simultaneous inferences. This strikes me as a problem.

I am a bit unclear on exactly how this works, and it’s something that my stats professors have been sort of vague about. But I gather from your blog that this is a subject near and dear to your heart.

For purposes of clarification, I’ll work under the frequentist framework, since for better or for worse, that’s what almost all poli sci literature operates under.

But am I right that any time you want to claim that two things are significant *at the same time* you need to halve your alpha? Or use Scheffé or whatever multiplier you think is appropriate if you think Bonferroni is too conservative?

I’m thinking in particular of this paper [“When Does Negativity Demobilize? Tracing the Conditional Effect of Negative Campaigning on Voter Turnout,” by Yanna Krupnikov].

In particular the findings on page 803.

Setting aside the 25+ predictors, which smacks of p-hacking to me, to support her conclusions she needs it to simultaneously be true that (1) negative ads themselves don’t affect turnout, (2) negative ads for a disliked candidate don’t affect turnout; (3) negative ads against a preferred candidate don’t affect turnout; (4) late ads for a disliked candidate don’t affect turnout AND (5) negative ads for a liked candidate DO affect turnout. In other words, her conclusion is valid iff she finds a significant effect at #5.

This is what she finds, but it looks like it just *barely* crosses the .05 threshold (again, p-hacking concerns). But am I right that since she needs to make inferences about five tests here, her alpha should be .01 (or whatever if you use a different multiplier)? Also, that we don’t care about the number of predictors she uses (outside of p-hacking concerns) since we’re not really making inferences about them?

My reply:

First, just speaking generally: it’s fine to work in the frequentist framework, which to me implies that you’re trying to understand the properties of your statistical methods in the settings where they will be applied. I work in the frequentist framework too! The framework where I don’t want you working is the null hypothesis significance testing framework, in which you try to prove your point by rejecting straw-man nulls.

In particular, I have no use for statistical significance, or alpha-levels, or familywise error rates, or the .05 threshold, or anything like that. To me, these are all silly games, and we should just cut to the chase and estimate the descriptive and causal population quantities of interest. Again, I am interested in the frequentist properties of my estimates—I’d like to understand their bias and variance—but I don’t want to do it conditional on null hypotheses of zero effect, which are hypotheses of zero interest to me. That’s a game you just don’t need to play anymore.

When you do have multiple comparisons, I think the right way to go is to analyze all of them using a hierarchical model—not to pick one or two or three out of context and then try to adjust the p-values using a multiple comparisons correction. Jennifer Hill, Masanao Yajima, and I discuss this in our 2011 paper, Why we (usually) don’t have to worry about multiple comparisons.

To put it another way, the original sin is selection. The problem with p-hacked work is not that p-values are uncorrected for multiple comparison, it’s that some subset of comparisons is selected for further analysis, which is wasteful of information. It’s better to analyze all the comparisons of interest at once. This paper with Steegen et al. demonstrates how many different potential analyses can be present, even in a simple study.

OK, so that’s my general advice: look at all the data and fit a multilevel model allowing for varying baselines and varying effects.
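
To make that concrete, here’s a rough sketch of what analyzing all the comparisons at once can look like in rstanarm; the data frame and variable names are invented for illustration, not taken from the paper under discussion:

library(rstanarm)

# Varying intercepts and varying effects across all the subgroups, estimated
# jointly; partial pooling does the work usually asked of multiple-comparisons
# corrections.
fit <- stan_lmer(y ~ exposure + (1 + exposure | subgroup), data = survey_data)

# Look at every subgroup's estimated effect, rather than selecting the largest
# comparison and adjusting its p-value.
effects <- as.matrix(fit, regex_pars = "b\\[exposure subgroup")
t(apply(effects, 2, quantile, probs = c(0.1, 0.5, 0.9)))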

What about the specifics?

I took a look at the linked paper. I like the title. “When Does Negativity Demobilize?” is much better than “Does Negativity Demobilize.” The title recognizes that (a) effects are never zero, and (b) effects vary. I can’t quite buy this last sentence of the abstract, though: “negativity can only demobilize when two conditions are met: (1) a person is exposed to negativity after selecting a preferred candidate and (2) the negativity is about this selected candidate.” No way! There must be other cases when negativity can demobilize. That said, at this point the paper could still be fine: even if a paper is working within a flawed inferential framework, it could still be solid empirical work. After all, it’s completely standard to estimate constant treatment effects—we did this in our first paper on incumbency advantage and I still think most of our reported findings were basically correct.

Reading on . . . Krupnikov writes, “The first section explores the psychological determinants that underlie the power of negativity leading to the focal hypothesis of this research. The second section offers empirical tests of this hypothesis.” For the psychological model, she writes that first a person decides which candidate to support, then he or she decides whether to vote. That seems a bit of a simplification, as sometimes I know I’ll vote even before I decide whom to vote for. Haven’t you ever heard of people making their decision inside the voting booth? I’ve done that! Even beyond that, it doesn’t seem quite right to identify the choice as being made at a single precise time. Again, though, that’s ok: Krupnikov is presenting a model, and models are inherently simplifications. Models can still help us learn from the data.

OK, now on to the empirical part of the paper. I see what you mean: there are a lot of potential explanatory variables running around: overall negativity, late negativity, state competitiveness, etc etc. Anything could be interacted with anything. This is a common concern in social science, as there is an essentially unlimited number of factors that could influence the outcome of interest (turnout, in this case). On one hand, it’s a poopstorm when you throw all these variables into your model at once; on the other hand, if you exclude anything that might be important, it can be hard to interpret any comparisons in observational data. So this is something we’ll have to deal with: it won’t be enough to just say there are too many variables and then give up. And it certainly won’t be a good idea to trawl through hundreds of comparisons, looking for something that’s significant at the .001 level or whatever. That would make no sense at all. Think of what happens: you grab the comparison with a z-score of 4, setting aside all those silly comparisons with z-scores of 3, or 2, or 1, but this doesn’t make much sense, given that these z-scores are so bouncy: differences of less than 3 in z-scores are not themselves statistically significant (the difference between two independent estimates has a standard error √2 times as large as the individual standard errors, so a gap of 2.5 between z-scores corresponds to a z-score of only about 1.8 for the difference).

To put it another way, “multiple comparisons” can be a valuable criticism, but multiple comparisons corrections are not so useful as a method of data analysis.

Getting back to the empirics . . . here I agree that there are problems. I don’t like this:

Estimating Model 1 shows that overall negativity has a null effect on turnout in the 2004 presidential election (Table 2, Model 1). While the coefficient on the overall negativity variable is negative, it does not reach conventional levels of statistical significance. These results are in line with Finkel and Geer (1998), as well as Lau and Pomper (2004), and show that increases in the negativity in a respondent’s media market over the entire duration of the campaign did not have any effect on his likelihood of turning out to vote in 2004.

Not statistically significant != zero.

Here’s more:

Going back to the conclusion from the abstract, “negativity can only demobilize when two conditions are met: (1) a person is exposed to negativity after selecting a preferred candidate and (2) the negativity is about this selected candidate,” I think Krupnikov is just wrong here in her application of her empirical results. She’s taking non-statistically-significant comparisons as zero, and she’s taking the difference between significant and non-significant as being significant. Don’t do that.

Given that the goal here is causal inference, I think it would have been better to set this up more formally as an observational study comparing treatment and control groups.

I did not read the rest of the paper, nor am I attempting to offer any evaluation of the work. I was just focusing on the part addressed by your question. The bigger picture, I think, is that it can be valuable for a researcher to (a) summarize the patterns she sees in data, and (b) consider the implications of these patterns for understanding recent and future campaigns, while (c) recognizing residual uncertainty.

Remember Tukey’s quote: “The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.”

The attitude I’m offering is not nihilistic: even if we have not reached anything close to certainty, we can still learn from data and have a clearer sense of the world after our analysis than before.

Stan®

It’s official

“Stan” is now a registered trademark. For those keeping score, it’s

How to refer to Stan

Please just keep writing “Stan”. We’ll be using the little ® symbol in prominent branding, but you don’t have to.

Thanks to NumFOCUS

Thanks to Leah Silen and NumFOCUS for shepherding the application through the registration process. NumFOCUS is the official trademark holder.

“Stan”, not “STAN”

We use “Stan” rather than “STAN”, because “Stan” isn’t an acronym. Stan is named after Stanislaw Ulam.

The mark is rendered as “STAN” on the USPTO site. Do not be fooled! The patent office capitalizes everything because the registrations are case insensitive.

What about the logo?

That should be through the process soon. More on that when we get a final decision.

Incentives Matter (Congress and Wall Street edition)

[cat picture]

Thomas Ferguson sends along this paper. From the summary:

Social scientists have traditionally struggled to identify clear links between political spending and congressional voting, and many journalists have embraced their skepticism. A giant stumbling block has been the challenge of measuring the labyrinthine ways money flows from investors, firms, and industries to particular candidates. Ferguson, Jorgensen, and Chen directly tackle that classic problem in this paper. Constructing new data sets that capture much larger swaths of political spending, they show direct links between political contributions to individual members of Congress and key floor votes . . .

They show that prior studies have missed important streams of political money, and, more importantly, they show in detail how past studies have underestimated the flow of political money into Congress. The authors employ a data set that attempts to bring together all forms of campaign contributions from any source— contributions to candidate campaign committees, party committees, 527s or “independent expenditures,” SuperPACs, etc., and aggregate them by final sources in a unified, systematic way. To test the influence of money on financial regulation votes, they analyze the U.S. House of Representatives voting on measures to weaken the Dodd-Frank financial reform bill. Taking care to control as many factors as possible that could influence floor votes, they focus most of their attention on representatives who originally voted in favor of the bill and subsequently to dismantle key provisions of it. Because these are the same representatives, belonging to the same political party, in substantially the same districts, many factors normally advanced to explain vote shifts are ruled out from the start. . . .

The authors test five votes from 2013 to 2015, finding the link between campaign contributions from the financial sector and switching to a pro-bank vote to be direct and substantial. The results indicate that for every $100,000 that Democratic representatives received from finance, the odds they would break with their party’s majority support for the Dodd-Frank legislation increased by 13.9 percent. Democratic representatives who voted in favor of finance often received $200,000–$300,000 from that sector, which raised the odds of switching by 25–40 percent. The authors also test whether representatives who left the House at the end of 2014 behaved differently. They find that these individuals were much more likely to break with their party and side with the banks. . . .

I had a quick question: how do you deal with the correlation/causation issue? The idea that Wall St is giving money to politicians who would already support them? That too is a big deal, of course, but it’s not quite the story Ferguson et al. are telling in the paper.

Ferguson responded:

We actually considered that at some length. That’s why we organized the main discussion on Wall Street and Dodd-Frank around looking at Democratic switchers — people who originally voted for passage (against Wall Street, that is), but then switched in one or more later votes to weaken. Nobody is in that particular regression who didn’t already vote against Wall Street once already, when it really counted.

I replied: Sure, but there’s still the correlation problem, in that one could argue that switchers are people whose latent preferences were closer to the middle, so they were just the ones who were more likely to shift following a change in the political weather.

Ferguson:

Conservatism is controlled for in the analysis, using a measure derived from that Congress. This isn’t going to the middle; it’s a tropism for money. The other obvious comment is that if they are really latent Wall Street lovers, they should be moving mostly in lockstep on the subsequent votes. If you look at our summary nos., you can see they weren’t. We could probably mine that point some more.
Short of administering the MMPI for banks in advance, are you prepared to accept any empirical evidence? Voting against banks in the big one is pretty good, I think.

Me: I’m not sure, I’ll have to think about it. One answer, I think, is that if it’s just $ given to pre-existing supporters of Wall St., it’s still an issue, as the congressmembers are then getting asymmetrically rewarded (votes for Wall St get the reward, votes against don’t get the reward), and, as economists are always telling us, Incentives Matter.

Ferguson:

Remember those folks who turned on Swaps Push Out didn’t necessarily turn out for the banks on other votes. If it’s “weather” it’s a pretty strange weather.

Stan Weekly Roundup, 23 June 2017

Lots of activity this week, as usual.

* Lots of people got involved in pushing Stan 2.16 and interfaces out the door; Sean Talts got the math library, Stan library (that’s the language, inference algorithms, and interface infrastructure), and CmdStan out, while Allen Riddell got PyStan 2.16 out and Ben Goodrich and Jonah Gabry are tackling RStan 2.16

* Stan 2.16 is the last series of releases that will not require C++11; let the coding fun begin!

* Ari Hartikainen (of Aalto University) joined the Stan dev team—he’s working with Allen Riddell on PyStan, where judging from the pull request traffic, he put in a lot of work on the 2.16 release. Welcome!

* Imad Ali’s working on adding more cool features to RStanArm including time series and spatial models; yesterday he and Mitzi were scheming to get intrinsic conditional autoregressive models in and I heard all those time series names flying around (like ARIMA)

* Michael Betancourt rearranged the Stan web site with some input from me and Andrew; Michael added more descriptive text and Sean Talts managed to get the redirects in so all of our links aren’t broken; let us know what you think

* Markus Ojala of Smartly wrote a case study on their blog, Tutorial: How We Productized Bayesian Revenue Estimation with Stan

* Mitzi Morris got in the pull request for adding compound assignment and arithmetic; this adds statements such as n += 1.

* lots of chatter about characterization tests and a pull request from Daniel Lee to update some of our existing performance tests

* Roger Grosse from U. Toronto visited to tell us about his 2016 NIPS paper with Siddharth Ancha and Daniel Roy on testing MCMC using bidirectional Monte Carlo sampling; we talked about how he modified Stan’s sampler to do annealed importance sampling

* GPU integration continues apace

* I got to listen in as Michael Betancourt and Maggie Lieu of the European Space Institute spent a couple of days hashing out astrophysics models; Maggie would really like us to add integrals.

* Speaking of integration, Marco Inacio has updated his pull request; Michael’s worried there may be numerical instabilities, because trying to calculate arbitrary bounded integrals is not so easy in a lot of cases

* Andrew continues to lobby for being able to write priors directly into parameter declarations; for example, here’s what a hierarchical prior for beta might look like

parameters {
  real mu ~ normal(0, 2);
  real sigma ~ student_t(4, 0, 2);
  vector[N] beta ~ normal(mu, sigma);
}

* I got the go-ahead on adding foreach loops; Mitzi Morris will probably be coding them. We’re talking about

real ys[N];
...
for (y in ys)
  target += log_mix(lambda, normal_lpdf(y | mu[1], sigma[1]),
                            normal_lpdf(y | mu[2], sigma[2]));

* Kalman filter case study by Jouni Helske was discussed on Discourse

* Rob Trangucci rewrote the Gaussian processes chapter of the Stan manual; I’m to blame for the first version, writing it as I was learning GPs. For some reason, it’s not up on the web page doc yet.

* This is a very ad hoc list. I’m sure I missed lots of good stuff, so feel free to either send updates to me directly for next week’s letter or add things to comments. This project’s now way too big for me to track all the activity!

Best correction ever: “Unfortunately, the correct values are impossible to establish, since the raw data could not be retrieved.”

Commenter Erik Arnesen points to this:

Several errors and omissions occurred in the reporting of research and data in our paper: “How Descriptive Food Names Bias Sensory Perceptions in Restaurants,” Food Quality and Preference (2005) . . .

The dog ate my data. Damn gremlins. I hate when that happens.

As the saying goes, “Each year we publish 20+ new ideas in academic journals, and we appear in media around the world.” In all seriousness, the problem is not that they publish their ideas, the problem is that they are “changing or omitting data or results such that the research is not accurately represented in the research record.” And of course it’s not just a problem with Mr. Pizzagate or Mr. Gremlins or Mr. Evilicious or Mr. Politically Incorrect Sex Ratios: it’s all sorts of researchers who (a) don’t report what they actually did, and (b) refuse to reconsider their flimsy hypotheses in light of new theory or evidence.

Question about the secret weapon

Micah Wright writes:

I first encountered your explanation of secret weapon plots while I was browsing your blog in grad school, and later in your 2007 book with Jennifer Hill. I found them immediately compelling and intuitive, but I have been met with a lot of confusion and some skepticism when I’ve tried to use them. I’m uncertain as to whether it’s me that’s confused, or whether my audience doesn’t get it. I should note that my formal statistical training is somewhat limited—while I was able to take a couple of stats courses during my masters, I’ve had to learn quite a bit on the side, which makes me skeptical as to whether or not I actually understand what I’m doing.

My main question is this: when using the secret weapon, does it make sense to subset the data across any arbitrary variable of interest, as long as you want to see if the effects of other variables vary across its range? My specific case concerns tree growth (ring widths). I’m interested to see how the effect of competition (crowding and other indices) on growth varies at different temperatures, and if these patterns change in different locations (there are two locations). To do this, I subset the growth data in two steps: first by location, then by each degree of temperature, which I rounded to the nearest integer. I then ran the same linear model on each subset. The model had growth as the response, and competition variables as predictors, which were standardized. I’ve attached the resulting figure [see above], which plots the change in effect for each predictor over the range of temperature.

My reply: I like these graphs! In future you might try a 6 x K grid, where K is the number of different things you’re plotting. That is, right now you’re wasting one of your directions because your 2 x 3 grid doesn’t mean anything. These plots are fine, but if you have more information for each of these predictors, you can consider plotting the existing information as six little graphs stacked vertically and then you’ll have room for additional columns. In addition, you should make the tick marks much smaller, put the labels closer to the axes, and reduce the number of axis labels, especially on the vertical axes. For example, (0.0, 0.3, 0.6, 0.9) can be replaced by labels at 0, 0.5, 1.

Regarding the larger issue of, what is the secret weapon, as always I see it as an approximation to a full model that bridges the different analyses. It’s a sort of nonparametric analysis. You should be able to get better estimates by using some modeling, but a lot of that smoothing can be done visually anyway, so the secret weapon gets you most of the way there, and in my view it’s much much better than the usual alternative of fitting a single model to all the data without letting all the coefficients vary.
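
For readers who haven’t seen it before, the secret weapon really is just a few lines of code: fit the same model separately within each slice of the data and plot the estimates side by side. Here’s a generic sketch with invented variable names (and ignoring the two-location split for simplicity), not Wright’s actual analysis:

# Fit the same model at each (rounded) temperature, then plot the estimated
# coefficient for one predictor against temperature.
temps <- sort(unique(round(trees$temp)))
fits <- lapply(temps, function(temp_i)
  lm(growth ~ crowding + comp_index, data = subset(trees, round(temp) == temp_i)))

est <- sapply(fits, function(f) coef(f)["crowding"])
se <- sapply(fits, function(f) summary(f)$coefficients["crowding", "Std. Error"])

plot(temps, est, ylim = range(est - 2 * se, est + 2 * se),
     xlab = "temperature", ylab = "coefficient for crowding", pch = 16)
segments(temps, est - se, temps, est + se)  # +/- 1 standard error bars

Repeat for each predictor (and each location) to get the grid of little plots.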

“Developers Who Use Spaces Make More Money Than Those Who Use Tabs”

Rudy Malka writes:

I think you’ll enjoy this nice piece of pop regression by David Robinson: developers who use spaces make more money than those who use tabs. I’d like to know your opinion about it.

At the above link, Robinson discusses a survey that allows him to compare salaries of software developers who use tabs to those who use spaces. The key graph is above. Robinson found similar results after breaking down the data by country, job title, or computer language used, and it also showed up in a linear regression controlling in a simple way for a bunch of factors.

As Robinson put it in terms reminiscent of our Why Ask Why? paper:

This is certainly a surprising result, one that I didn’t expect to find when I started exploring the data. . . . I tried controlling for many other confounding factors within the survey data beyond those mentioned here, but it was difficult to make the effect shrink and basically impossible to make it disappear.

Speaking with the benefit of hindsight—that is, seeing Robinson’s results and assuming they are a correct representation of real survey data—it all makes sense to me. Tabs seem so amateurish, I much prefer spaces—2 spaces, not 4, please!!!—so from that perspective it makes sense to me that the kind of programmers who use tabs tend to be programmers with poor taste and thus, on average, of lower quality.

I just want to say one thing. Robinson writes, “Correlation is not causation, and we can never be sure that we’ve controlled for all the confounding factors present in a dataset.” But this isn’t quite the point. Or, to put it another way, I think he has the right instinct here but isn’t quite presenting the issue precisely. To see why, suppose the survey had only 2 questions: How much money do you make? and Do you use spaces or tabs? And suppose we had no other information on the respondents. And, for that matter, suppose there was no nonresponse and that we had a simple random sample of all programmers from some specified set of countries. In that case, we’d know for sure that there are no other confounding factors in the dataset, as the dataset is nothing but those two columns of numbers. But we’d still be able to come up with a zillion potential explanations.

To put it another way, the descriptive comparison is interesting in its own right, and we just should be careful about misusing causal language. Instead of saying, “using spaces instead of tabs leads to an 8.6% higher salary,” we could say, “comparing two otherwise similar programmers, the one who uses spaces has, on average, an 8.6% higher salary than the one who uses tabs.” That’s a bit of a mouthful—but such a mouthful is necessary to accurately describe the comparison that’s being made.

Time-sharing Experiments for the Social Sciences

Jamie Druckman writes:

Time-sharing Experiments for the Social Sciences (TESS) is an NSF-funded initiative. Investigators propose survey experiments to be fielded using a nationally representative Internet platform via NORC’s AmeriSpeak® Panel (see http://tessexperiments.org for more information). In an effort to enable younger scholars to field larger-scale studies than what TESS normally conducts, we are pleased to announce a Special Competition for Young Investigators. While anyone can submit at any time through TESS’s regular proposal mechanism, this Special Competition is limited to graduate students and individuals who are no more than 3 years post-PhD. Winning projects will be allowed to be fielded at a size up to twice the usual budget of a regular TESS study. For more specifics on the special competition, see http://tessexperiments.org/yic.html. We will begin accepting proposals for the Special Competition on August 1, 2017, and the deadline is October 1, 2017. Full details about the competition are available at http://www.tessexperiments.org/yic.html. This page includes information about what is required of proposals and how to submit, and should be reviewed by anyone entering the competition.

After Peptidegate, a proposed new slogan for PPNAS. And, as a bonus, a fun little graphics project.

Someone pointed me to this post by “Neuroskeptic”:

A new paper in the prestigious journal PNAS contains a rather glaring blooper. . . . right there in the abstract, which states that “three neuropeptides (β-endorphin, oxytocin, and dopamine) play particularly important roles” in human sociality. But dopamine is not a neuropeptide. Neither are serotonin or testosterone, but throughout the paper, Pearce et al. refer to dopamine, serotonin and testosterone as ‘neuropeptides’. That’s just wrong. A neuropeptide is a peptide active in the brain, and a peptide in turn is the term for a molecule composed of a short chain of amino acids. Neuropeptides include oxytocin, vasopressin, and endorphins – which do feature in the paper. But dopamine and serotonin aren’t peptides, they’re monoamines, and testosterone isn’t either, it’s a steroid. This isn’t a matter of opinion, it’s basic chemistry.

The error isn’t just an isolated typo: ‘neuropeptide’ occurs 27 times in the paper, while the correct terms for the non-peptides are never used.

Neuroskeptic speculates on how this error got in:

It’s a simple mistake; presumably whoever wrote the paper saw oxytocin and vasopressin referred to as “neuropeptides” and thought that the term was a generic one meaning “signalling molecule.” That kind of mistake could happen to anyone, so we shouldn’t be too harsh on the authors . . .

The authors of the paper work in a psychology department, so I guess they’re rusty on their organic chemistry.

Fair enough; I haven’t completed a chemistry class since 11th grade, and I didn’t know what a peptide is, either. Then again, I’m not writing articles on peptides for the National Academy of Sciences.

But how did this get through the review process? Let’s take a look at the published article:

Ahhhh, now I understand. The editor is Susan Fiske, notorious as the person who opened the gates of PPNAS for the articles on himmicanes, air rage, and ages ending in 9. I wonder who were the reviewers of this new paper. Nobody who knows what a peptide is, I guess. Or maybe they just read it very quickly, flipped through to the graphs and the conclusions, and didn’t read a lot of the words.

Did you catch that? Neuroskeptic refers to “the prestigious journal PNAS.” That’s PPNAS for short. This is fine, I guess. Maybe the science is ok. Based on a quick scan of the paper, I don’t think we should take a lot of the specific claims seriously, as they seem to be based on the difference between “significant” and “non-significant.”

In particular, I’m not quite sure what is their support for the statement from the abstract that “each neuropeptide is quite specific in its domain of influence.” They’re rejecting various null hypotheses but I don’t know that this is supporting their substantive claims in the way that they’re saying.

I might be missing something here—I might be missing a lot—but in any case there seem to be some quality control problems at PPNAS. This should be no surprise: PPNAS is a huge journal, publishing over 3000 papers each year.

On their website they say, “PNAS publishes only the highest quality scientific research,” but this statement is simply false. I can’t really comment on this particular paper—it doesn’t seem like “the highest quality scientific research” to me, but, again, maybe I’m missing something big here. But I can assure you that the papers on himmicanes, air rage, and ages ending in 9 are not “the highest quality scientific research.” They’re not high quality research at all! What they are, is low-quality research that happens to be high-quality clickbait.

OK, let’s be fair. This is not a problem unique to PPNAS. The Lancet publishes crap papers, Psychological Science published crap papers, even JASA and APSR have their share of duds. Statistical Science, to its eternal shame, published that Bible Code paper in 1994. That’s fine, it’s how the system operates. Editors are only human.

But, really, do we have to make statements that we know are false? Platitudes are fine but let’s avoid intentional untruths.

So, instead of “PNAS publishes only the highest quality scientific research,” how about this: “PNAS aims to publish only the highest quality scientific research.” That’s fair, no?

P.S. Here’s a fun little graphics project: Redo Figure 1 as a lineplot. You’ll be able to show a lot more comparisons much more directly using lines rather than bars. The current grid of barplots is not the worst thing in the world—it’s much better than a table—but it could be much improved.
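
If you want to try it, the general recipe, sketched here with a made-up data frame rather than the paper’s numbers, is to put the categories on a common horizontal scale and draw one line per group instead of one panel of bars per group:

library(ggplot2)

# Invented data in the shape of a typical grid-of-barplots figure.
df <- expand.grid(domain = c("domain A", "domain B", "domain C"),
                  group = c("group 1", "group 2", "group 3"))
df$value <- c(2.1, 1.8, 1.5, 2.4, 2.0, 1.7, 2.8, 2.3, 1.9)

ggplot(df, aes(x = domain, y = value, group = group, color = group)) +
  geom_line() +
  geom_point()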

P.P.S. Just to be clear: (a) I don’t know anything about peptides so I’m offering no independent judgment of the paper in question; (b) whatever the quality is of this particular paper, does not affect my larger point that PPNAS publishes some really bad papers and so they should change their slogan to something more accurate.

P.P.P.S. The relevant Pubpeer page pointed to the following correction note that was posted on the PPNAS site after I wrote the above post but before it was posted:

The authors wish to note, “We used the term ‘neuropeptide’ in referring to the set of diverse neurochemicals that we examined in this study, some of which are not peptides; dopamine and serotonin are neurotransmitters and should be listed as such, and testosterone should be listed as a steroid. Our usage arose from our primary focus on the neuropeptides endorphin and oxytocin. Notwithstanding the biochemical differences between these neurochemicals, we note that these terminological issues have no implications for the significance of the findings reported in this paper.”

On deck through the rest of the year (and a few to begin 2018)

Here they are. I love seeing all the titles lined up in one place; it’s like a big beautiful poem about statistics:

  • After Peptidegate, a proposed new slogan for PPNAS. And, as a bonus, a fun little graphics project.
  • “Developers Who Use Spaces Make More Money Than Those Who Use Tabs”
  • Question about the secret weapon
  • Incentives Matter (Congress and Wall Street edition)
  • Analyze all your comparisons. That’s better than looking at the max difference and trying to do a multiple comparisons correction.
  • Problems with the jargon “statistically significant” and “clinically significant”
  • Capitalist science: The solution to the replication crisis?
  • Bayesian, but not Bayesian enough
  • Let’s stop talking about published research findings being true or false
  • Plan 9 from PPNAS
  • No, I’m not blocking you or deleting your comments!
  • “Furthermore, there are forms of research that have reached such a degree of complexity in their experimental methodology that replicative repetition can be difficult.”
  • “The Null Hypothesis Screening Fallacy”?
  • What is a pull request?
  • Turks need money after expensive weddings
  • Statisticians and economists agree: We should learn from data by “generating and revising models, hypotheses, and data analyzed in response to surprising findings.”
  • My unpublished papers
  • Bigshot psychologist, unhappy when his famous finding doesn’t replicate, won’t consider that he might have been wrong; instead he scrambles furiously to preserve his theories
  • Night Hawk
  • Why they aren’t behavioral economists: Three sociologists give their take on “mental accounting”
  • Further criticism of social scientists and journalists jumping to conclusions based on mortality trends
  • Daryl Bem and Arthur Conan Doyle
  • Classical statisticians as Unitarians
  • Slaying Song
  • What is “overfitting,” exactly?
  • Graphs as comparisons: A case study
  • Should we continue not to trust the Turk? Another reminder of the importance of measurement
  • “The ‘Will & Grace’ Conjecture That Won’t Die” and other stories from the blogroll
  • His concern is that the authors don’t control for the position of games within a season.
  • How does a Nobel-prize-winning economist become a victim of bog-standard selection bias?
  • “Bayes factor”: where the term came from, and some references to why I generally hate it
  • A stunned Dyson
  • Applying human factors research to statistical graphics
  • Recently in the sister blog
  • Adding a predictor can increase the residual variance!
  • Died in the Wool
  • “Statistics textbooks (including mine) are part of the problem, I think, in that we just set out ‘theta’ as a parameter to be estimated, without much reflection on the meaning of ‘theta’ in the real world.”
  • An improved ending for The Martian
  • Delegate at Large
  • Iceland education gene trend kangaroo
  • Reproducing biological research is harder than you’d think
  • The fractal zealots
  • Giving feedback indirectly by invoking a hypothetical reviewer
  • It’s hard to know what to say about an observational comparison that doesn’t control for key differences between treatment and control groups, chili pepper edition
  • PPNAS again: If it hadn’t been for the jet lag, would Junior have banged out 756 HRs in his career?
  • Look. At. The. Data. (Hollywood action movies example)
  • “This finding did not reach statistical sig­nificance, but it indicates a 94.6% prob­ability that statins were responsible for the symptoms.”
  • Wolfram on Golomb
  • Irwin Shaw, John Updike, and Donald Trump
  • What explains my lack of openness toward this research claim? Maybe my cortex is just too damn thick and wrinkled
  • I love when I get these emails!
  • Consider seniority of authors when criticizing published work?
  • Does declawing cause harm?
  • Bird fight! (Kroodsma vs. Podos)
  • The Westlake Review
  • “Social Media and Fake News in the 2016 Election”
  • Also holding back progress are those who make mistakes and then label correct arguments as “nonsensical.”
  • Just google “Despite limited statistical power”
  • It is somewhat paradoxical that good stories tend to be anomalous, given that when it comes to statistical data, we generally want what is typical, not what is surprising. Our resolution of this paradox is . . .
  • “Babbage was out to show that not only was the system closed, with a small group controlling access to the purse strings and the same individuals being selected over and again for the few scientific honours or paid positions that existed, but also that one of the chief beneficiaries . . . was undeserving.”
  • Irish immigrants in the Civil War
  • Mixture models in Stan: you can use log_mix()
  • Don’t always give ’em what they want: Practicing scientists want certainty, but I don’t want to offer it to them!
  • Cumulative residual plots seem like they could be useful
  • Sucker MC’s keep falling for patterns in noise
  • Nice interface, poor content
  • “From that perspective, power pose lies outside science entirely, and to criticize power pose would be a sort of category error, like criticizing The Lord of the Rings on the grounds that there’s no such thing as an invisibility ring, or criticizing The Rotter’s Club on the grounds that Jonathan Coe was just making it all up.”
  • Chris Moore, Guy Molyneux, Etan Green, and David Daniels on Bayesian umpires
  • Using statistical prediction (also called “machine learning”) to potentially save lots of resources in criminal justice
  • “Mainstream medicine has its own share of unnecessary and unhelpful treatments”
  • What are best practices for observational studies?
  • The Groseclose endgame: Getting from here to there.
  • Causal identification + observational study + multilevel model
  • All cause and breast cancer specific mortality, by assignment to mammography or control
  • Iterative importance sampling
  • Rosenbaum (1999): Choice as an Alternative to Control in Observational Studies
  • Gigo update (“electoral integrity project”)
  • How to design and conduct a subgroup analysis?
  • Local data, centralized data analysis, and local decision making
  • Too much backscratching and happy talk: Junk science gets to share in the reputation of respected universities
  • Selection bias in the reporting of shaky research: An example
  • Self-study resources for Bayes and Stan?
  • Looking for the bottom line
  • “How conditioning on post-treatment variables can ruin your experiment and what to do about it”
  • Trial by combat, law school style
  • Causal inference using data from a non-representative sample
  • Type M errors studied in the wild
  • Type M errors in the wild—really the wild!
  • Where does the discussion go?
  • Maybe this paper is a parody, maybe it’s a semibluff
  • As if the 2010s never happened
  • Using black-box machine learning predictions as inputs to a Bayesian analysis
  • It’s not enough to be a good person and to be conscientious. You also need good measurement. Cargo-cult science done very conscientiously doesn’t become good science, it just falls apart from its own contradictions.
  • Air rage update
  • Getting the right uncertainties when fitting multilevel models
  • Chess records page
  • Weisburd’s paradox in criminology: it can be explained using type M errors
  • “Cheerleading with an agenda: how the press covers science”
  • Automated Inference on Criminality Using High-tech GIGO Analysis
  • Some ideas on using virtual reality for data visualization: I don’t really agree with the details here but it’s all worth discussing
  • Contribute to this pubpeer discussion!
  • For mortality rate junkies
  • The “fish MRI” of international relations studies.
  • “5 minutes? Really?”
  • 2 quick calls
  • Should we worry about rigged priors? A long discussion.
  • I’m not on twitter
  • I disagree with Tyler Cowen regarding a so-called lack of Bayesianism in religious belief
  • “Why bioRxiv can’t be the Central Service”
  • Sudden Money
  • The house is stronger than the foundations
  • Please contribute to this list of the top 10 do’s and don’ts for doing better science
  • Partial pooling with informative priors on the hierarchical variance parameters: The next frontier in multilevel modeling
  • Does racquetball save lives?
  • When do we want evidence-based change? Not “after peer review”
  • “I agree entirely that the way to go is to build some model of attitudes and how they’re affected by recent weather and to fit such a model to “thick” data—rather than to zip in and try to grab statistically significant stylized facts about people’s cognitive illusions in this area.”
  • “Bayesian evidence synthesis”
  • Freelance orphans: “33 comparisons, 4 are statistically significant: much more than the 1.65 that would be expected by chance alone, so what’s the problem??”
  • Beyond forking paths: using multilevel modeling to figure out what can be learned from this survey experiment
  • From perpetual motion machines to embodied cognition: The boundaries of pseudoscience are being pushed back into the trivial.
  • Why I think the top batting average will be higher than .311: Over-pooling of point predictions in Bayesian inference
  • “La critique est la vie de la science”: I kinda get annoyed when people set themselves up as the voice of reason but don’t ever get around to explaining what’s the unreasonable thing they dislike.
  • How to discuss your research findings without getting into “hypothesis testing”?
  • Does traffic congestion make men beat up their wives?
  • The Publicity Factory: How even serious research gets exaggerated by the process of scientific publication and reporting
  • I think it’s great to have your work criticized by strangers online.
  • In the open-source software world, bug reports are welcome. In the science publication world, bug reports are resisted, opposed, buried.
  • If you want to know about basketball, who ya gonna trust, the Irene Blecker Rosenfeld Professor of Psychology at Cornell University and author of “The Wisest One in the Room: How You Can Benefit from Social Psychology’s Most Powerful Insights,” . . . or that poseur Phil Jackson??
  • Quick Money
  • An alternative to the superplot
  • Where the money from Wiley Interdisciplinary Reviews went . . .
  • Retract or correct, don’t delete or throw into the memory hole
  • Using Mister P to get population estimates from respondent driven sampling
  • “Americans Greatly Overestimate Percent Gay, Lesbian in U.S.”
  • “It all reads like a classic case of faulty reasoning where the reasoner confuses the desirability of an outcome with the likelihood of that outcome.”
  • Pseudoscience and the left/right whiplash
  • The time reversal heuristic (priming and voting edition)
  • The Night Riders
  • Why you can’t simply estimate the hot hand using regression
  • Stan to improve rice yields
  • When people proudly take ridiculous positions
  • “A mixed economy is not an economic abomination or even a regrettably unavoidable political necessity but a natural absorbing state,” and other notes on “Whither Science?” by Danko Antolovic
  • Noisy, heterogeneous data scoured from diverse sources make his metanalyses stronger.
  • What should this student do? His bosses want him to p-hack and they don’t even know it!
  • Fitting multilevel models when predictors and group effects correlate
  • I hate that “Iron Law” thing
  • High five: “Now if it is from 2010, I think we can make all sorts of assumptions about the statistical methods without even looking.”
  • “What is a sandpit?”
  • No no no no no on “The oldest human lived to 122. Why no person will likely break her record.”
  • Tips when conveying your research to policymakers and the news media
  • Graphics software is not a tool that makes your graphs for you. Graphics software is a tool that allows you to make your graphs.
  • Spatial models for demographic trends?
  • A pivotal episode in the unfolding of the replication crisis
  • We start by talking reproducible research, then we drift to a discussion of voter turnout
  • Wine + Stan + Climate change = ?
  • Stan is a probabilistic programming language
  • Using output from a fitted machine learning algorithm as a predictor in a statistical model
  • Poisoning the well with a within-person design? What’s the risk?
  • “Dear Professor Gelman, I thought you would be interested in these awful graphs I found in the paper today.”
  • I know less about this topic than I do about Freud.
  • Driving a stake through that ages-ending-in-9 paper
  • What’s the point of a robustness check?
  • Oooh, I hate all talk of false positive, false negative, false discovery, etc.
  • Trouble Ahead
  • A new definition of the nerd?
  • Orphan drugs and forking paths: I’d prefer a multilevel model but to be honest I’ve never fit such a model for this sort of problem
  • Popular expert explains why communists can’t win chess championships!
  • The four missing books of Lawrence Otis Graham
  • “There was this prevalent, incestuous, backslapping research culture. The idea that their work should be criticized at all was anathema to them. Let alone that some punk should do it.”
  • Loss of confidence
  • “How to Assess Internet Cures Without Falling for Dangerous Pseudoscience”
  • Ed Jaynes outta control!
  • A reporter sent me a Jama paper and asked me what I thought . . .
  • Workflow, baby, workflow
  • Two steps forward, one step back
  • Yes, you can do statistical inference from nonrandom samples. Which is a good thing, considering that nonrandom samples are pretty much all we’ve got.
  • The Night Riders
  • The piranha problem in social psychology / behavioral economics: The “take a pill” model of science eats itself
  • Ready Money
  • Stranger than fiction
  • “The Billy Beane of murder”?
  • Red doc, blue doc, rich doc, poor doc
  • Working Class Postdoc
  • “We wanted to reanalyze the dataset of Nelson et al. However, when we asked them for the data, they said they would only share the data if we were willing to include them as coauthors.”
  • UNDER EMBARGO: the world’s most unexciting research finding
  • Setting up a prior distribution in an experimental analysis
  • Walk a Crooked Mile
  • It’s . . . spam-tastic!
  • The failure of null hypothesis significance testing when studying incremental changes, and what to do about it
  • Robust standard errors aren’t for me
  • Stupid-ass statisticians don’t know what a goddam confidence interval is
  • Forking paths plus lack of theory = No reason to believe any of this.
  • Turn your scatterplots into elegant apparel and accessories!
  • Your (Canadian) tax dollars at work

And a few to begin 2018:

  • The Ponzi threshold and the Armstrong principle
  • I’m with Errol: On flypaper, photography, science, and storytelling
  • Politically extreme yet vital to the nation
  • How does probabilistic computation differ in physics and statistics?
  • “Each computer run would last 1,000-2,000 hours, and, because we didn’t really trust a program that ran so long, we ran it twice, and it verified that the results matched. I’m not sure I ever was present when a run finished.”

Enjoy.

We’ll also intersperse topical items as appropriate.

Not everyone’s aware of falsificationist Bayes

Stephen Martin writes:

Daniel Lakens recently blogged about philosophies of science and how they relate to statistical philosophies. I thought it may be of interest to you. In particular, this statement:

From a scientific realism perspective, Bayes Factors or Bayesian posteriors do not provide an answer to the main question of interest, which is the verisimilitude of scientific theories. Belief can be used to decide which questions to examine, but it can not be used to determine the truth-likeness of a theory.

My response, TLDR:
1) frequentism and NP require more subjectivity than they’re given credit for (assumptions, belief in perfectly known sampling distributions, Beta [and thus type-2 error ‘control’] requires subjective estimate of the alternative effect size)

2) Bayesianism isn’t inherently more subjective, it just acknowledges uncertainty given the data [still data-driven!]

3) Popper probably wouldn’t like the NHST ritual, given that we use p-values to support hypotheses, not to refute an accepted hypothesis [the nil-hypothesis of 0 is not an accepted hypothesis in most cases]

4) Refuting falsifiable hypotheses can be done in Bayes, which is largely what Popper cared about anyway

5) Even in a NP or LRT framework, people don’t generally care about EXACT statistical hypotheses, they care about substantive hypotheses, which map to a range of statistical/estimate hypotheses, and YET people don’t test the /range/, they test point values; bayes can easily ‘test’ the hypothesized range.

My [Martin’s] full response is here.

I agree with everything that Martin writes above. And, for that matter, I agree with most of what Lakens wrote too. The starting point for all of this is my 2011 article, Induction and deduction in Bayesian data analysis. Also relevant are my 2013 article with Shalizi, Philosophy and the practice of Bayesian statistics and our response to the ensuing discussion, and my recent article with Hennig, Beyond subjective and objective in statistics.

Lakens covers the same Popper-Lakatos ground that we do, although he (Lakens) doesn’t appear to be aware of the falsificationist view of Bayesian data analysis, as expressed in chapter 6 of BDA and the articles listed above. Lakens is stuck in a traditionalist view of Bayesian inference as based on subjectivity and belief, rather than what I consider a more modern approach of conditionality, where Bayesian inference works out the implications of a statistical model or system of assumptions, the better to allow us to reveal problems that motivate improvements and occasional wholesale replacements of our models.

Overall I’m glad Lakens wrote his post because he’s reminding people of important issues that are not handled well in traditional frequentist or subjective-Bayes approaches, and I’m glad that Martin filled in some of the gaps. The audience for all of this seems to be psychology researchers, so let me re-emphasize a point I’ve made many times, the distinction between statistical models and scientific models. A statistical model is necessarily specific, and we should avoid the all-too-common mistake of rejecting some uninteresting statistical model and taking this as evidence for a preferred scientific model. That way lies madness.

Breaking the dataset into little pieces and putting it back together again

Alex Konkel writes:

I was a little surprised that your blog post with the three smaller studies versus one larger study question received so many comments, and also that so many people seemed to come down on the side of three smaller studies. I understand that Stephen’s framing led to some confusion as well as practical concerns, but I thought the intent of the question was pretty straightforward.

At the risk of beating a dead horse, I wanted to try asking the question a different way: if you conducted a study (or your readers, if you want to put this on the blog), would you ever divide up the data into smaller chunks to see if a particular result appeared in each subset? Ignoring cases where you might want to examine qualitatively different groups, of course; would you ever try to make fundamentally homogeneous/equivalent subsets? Would you ever advise that someone else do so?

For those caught up in the details, assume an extremely simple design. A simple comparison of two groups ending in a (Bayesian) t-test with no covariates, nothing fancy. In a very short time period you collected 450 people in each group using exactly the same procedure for each one; there is zero reason to believe that the data were affected by anything other than your group assignment. Would you forego analyzing the entire sample and instead break them into three random chunks?

My personal experience is that empirically speaking, no one does this. Except for cases where people are interested in avoiding model overfitting and so use some kind of cross validation or training set vs testing set paradigm, I have never seen someone break their data into small groups to increase the amount of information or strengthen their conclusions. The blog comments, however, seem to come down on the side of this being a good practice. Are you (or your readers) going to start doing this?

My reply:

From a Bayesian standpoint, the result is the same, whether you consider all the data at once, or stir in the data one-third at a time. The problem would come if you make intermediate decisions that involve throwing away information, for example if you take parts of the data and just describe them as statistically significant or not.
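To make this concrete, here is a minimal sketch in R, assuming a conjugate normal model with known standard deviation and a normal(0, 10) prior; the data, prior, and chunk sizes are hypothetical, just for illustration. Updating on all 900 observations at once and updating on three chunks of 300 in sequence give the same posterior:

set.seed(123)
y <- rnorm(900, mean = 2, sd = 1)   # hypothetical data, collapsed to a single group for simplicity

update_normal <- function(prior_mean, prior_sd, y, sigma = 1) {
  post_prec <- 1 / prior_sd^2 + length(y) / sigma^2
  post_mean <- (prior_mean / prior_sd^2 + sum(y) / sigma^2) / post_prec
  list(mean = post_mean, sd = sqrt(1 / post_prec))
}

all_at_once <- update_normal(0, 10, y)      # all the data in one shot

chunks <- split(y, rep(1:3, each = 300))    # three "smaller studies"
post <- list(mean = 0, sd = 10)
for (chunk in chunks) post <- update_normal(post$mean, post$sd, chunk)

all_at_once   # same posterior either way, up to floating-point error
post

The equivalence breaks only if information is discarded between chunks, which is exactly the significance-thresholding case described above.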

Don’t say “improper prior.” Say “non-generative model.”

[cat picture]

In Bayesian Data Analysis, we write, “In general, we call a prior density p(θ) proper if it does not depend on data and integrates to 1.” This was a step forward from the usual understanding, which is that a prior density is improper if it has an infinite integral.

But I’m not so thrilled with the term “proper” because it has different meanings for different people.

Then the other day I heard Dan Simpson and Mike Betancourt talking about “non-generative models,” and I thought, Yes! this is the perfect term! First, it’s unambiguous: a non-generative model is a model for which it is not possible to generate data. Second, it makes use of the existing term, “generative model,” hence no need to define a new concept of “proper prior.” Third, it’s a statement about the model as a whole, not just the prior.

I’ll explore the idea of a generative or non-generative model through some examples:

Classical iid model, y_i ~ normal(theta, 1), for i=1,…,n. This is not generative because there’s no rule for generating theta.

Bayesian model, y_i ~ normal(theta, 1), for i=1,…,n, with uniform prior density, p(theta) proportional to 1 on the real line. This is not generative because you can’t draw theta from a uniform on the real line.

Bayesian model, y_i ~ normal(theta, 1), for i=1,…,n, with data-based prior, theta ~ normal(y_bar, 10), where y_bar is the sample mean of y_1,…,y_n. This model is not generative because to generate theta, you need to know y, but you can’t generate y until you know theta.

In contrast, consider a Bayesian model, y_i ~ normal(theta, 1), for i=1,…,n, with non-data-based prior, theta ~ normal(0, 10). This is generative: you draw theta from the prior, then draw y given theta.
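To see the distinction operationally, here is a small sketch in R with made-up numbers (not an example from BDA): the generative model above can simulate a complete dataset from scratch, while the non-generative variants get stuck at the very first step.

n     <- 100                             # fixed here; see the subtleties about n below
theta <- rnorm(1, mean = 0, sd = 10)     # draw the parameter from the prior theta ~ normal(0, 10)
y     <- rnorm(n, mean = theta, sd = 1)  # then draw the data given theta

# The non-generative variants fail at the first step: there is no way to
# draw theta from a "uniform on the real line," and a data-based prior such
# as theta ~ normal(mean(y), 10) needs y before y has been generated.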

Some subtleties do arise. For example, we’re implicitly conditioning on n. For the model to be fully generative, we’d need a prior distribution for n as well.

Similarly, for a regression model to be fully generative, you need a prior distribution on x.

Non-generative models have their uses; we should just recognize when we’re using them. I think the traditional classification of priors, labeling them as improper if they have an infinite integral, does not capture the key aspects of the problem.

P.S. Also relevant is this comment, regarding some discussion of models for the n:

As in many problems, I think we get some clarity by considering an existing problem as part of a larger hierarchical model or meta-analysis. So if we have a regression with outcomes y, predictors x, and sample size n, we can think of this as one of a larger class of problems, in which case it can make sense to think of n and x as varying across problems.

The issue is not so much whether n is a “random variable” in any particular study (although I will say that, in real studies, n typically is not precisely defined ahead of time, what with difficulties of recruitment, nonresponse, dropout, etc.) but rather that n can vary across the reference class of problems for which a model will be fit.

Where’d the $2500 come from?

Brad Buchsbaum writes:

Sometimes I read the New York Times “Well” articles on science and health. It’s a mixed bag, sometimes it’s quite good and sometimes not. I came across this yesterday:

What’s the Value of Exercise? $2,500

For people still struggling to make time for exercise, a new study offers a strong incentive: You’ll save $2,500 a year.

The savings, a result of reduced medical costs, don’t require much effort to accrue — just 30 minutes of walking five days a week is enough.

The findings come from an analysis of 26,239 men and women, published today in the Journal of the American Heart Association. . . .

I [Buchsbaum] thought: I wonder where the number came from? So I tracked down the paper referred to in the article (which was unhelpfully not linked or properly named).

I was horrified to find that the $2500 figure appears to be nowhere in the paper (see table 2). Moreover, the closest number I could find ($1900) was based on a regression model without covarying age, sex, ethnicity, income, or anything else. Of course older people exercise less and spend more on healthcare!

I sent the following email (see below) to the NYTimes author, but she has not responded.

At any rate, I thought this example of very high-profile science-blogging to be particularly egregious, so I thought I’d bring it to your attention.

The research article is Economic Impact of Moderate-Vigorous Physical Activity Among Those With and Without Established Cardiovascular Disease: 2012 Medical Expenditure Panel Survey, by Javier Valero-Elizondo, Joseph Salami, Chukwuemeka Osondu, Oluseye Ogunmoroti, Alejandro Arrieta, Erica Spatz, Adnan Younus, Jamal Rana, Salim Virani, Ron Blankstein, Michael Blaha, Emir Veledar, and Khurram Nasir.

And here’s Buchsbaum’s letter to Gretchen Reynolds, the author of that news article:

I very much enjoy your health articles for the New York Times. Sometimes I try and find the paper and examine the data, just for my own benefit.

After perusing the paper, I was not quite sure where the $2500 figure came from. In table 2 (see attached paper), the unadjusted expenditures are reported over all subjects.

non-optimal PA: $5397, optimal PA: $3443 for a difference of $1900.

This is close to $2500 but your number is higher.

However, remember, this is an *unadjusted model*. It does not account for age, sex, family income, race/ethnicity, insurance type, geographical location or comorbidity.

In other words, it’s a virtually useless model.

Let’s look at Model 3, which does account for the above factors.

non-optimal PA: $4867, optimal PA: $4153 for a difference of $714

So $714 is closer to the mark.

BUT, this includes ALL subjects, including those with cardiovascular disease (CVD).

If you look at people without CVD then the estimates depend on the cardiovascular risk profile (CRF). If you have an average or optimal profile then the difference is around $430 or $493. If you have a “poor” profile, then the difference is around $1060 (although the 95% confidence intervals overlapped, meaning the effect was not reliable).

What is my conclusion?

I’m afraid the title of your article is misleading since it is larger (by $600) than the $1900 estimate based on the meaningless unadjusted model! Even if the title was “What’s the Value of Exercise? $700”, it would still be misleading, because it implicitly assumes a causal relationship between exercise and expenditure.

Remember also that the adjusted variables are only the measures the authors happened to record. There are dozens of potentially other mediating variables which are related to both physical exercise and health expenditures. Including these other adjusting factors might further reduce the estimates.

Best Regards,

It’s just a news article so some oversimplification is perhaps unavoidable. But I do wonder where the $2500 number came from. I’m guessing it’s from some press release but I don’t know.

Also, I’m surprised the reporter didn’t respond to the email. But maybe New York Times reporters get too many emails to respond to, or even read. I should also emphasize that I did not read that news article or the scientific paper in detail, so I’m not endorsing (or disagreeing with) Buchsbaum’s claim. Here I’m just interested in the general challenge of tracking down numbers like that $2500 that have no apparent source.

Stan Weekly Roundup, 16 June 2017

We’re going to be providing weekly updates for what’s going on behind the scenes with Stan. Of course, it’s not really behind the scenes, because the relevant discussions are at

  • stan-dev GitHub organization: this is the home of all of our source repos; design discussions are on the Stan Wiki

  • Stan Discourse Groups: this is the home of our user and developer lists (they’re all open); feel free to join the discussion—we try to be friendly and helpful in our responses, and there is a lot of statistical and computational expertise in the wings from our users, who are increasingly joining the discussion. By the way, thanks for that—it takes a huge load off us to get great answers from users to other user questions. We’re up to about 15 active discussion threads a day (active topics in the last 24 hours include AR(K) models, web site reorganization, ragged arrays, order statistic priors, new R packages built on top of Stan, docker images for Stan on AWS, and many more!)

OK, let’s get started with the weekly review, though this is a special summer double issue, just like the New Yorker.

Your news here: If you have any Stan news you’d like to share, please let me know at carp@alias-i.com (we’ll probably get a more standardized way to do this in the future).

New web site: Michael Betancourt redesigned the Stan web site; hopefully this will be easier to use. We’re no longer trying to track the literature. If you want to see the Stan literature in progress, do a search for “Stan Development Team” or “mc-stan.org” on Google Scholar; we can’t keep up! Do let us know either in an issue on GitHub for the web site or in the user group on Discourse if you have comments or suggestions.

New user and developer lists: We’ve shuttered our Google group and moved to Discourse for both our user and developer lists (they’re consolidated now in categories on one list). It’s easy to sign up with GitHub or Google IDs and much easier to search and use online.
See Stan Discourse Groups and, for the old discussions, Stan’s shuttered Google group for users and Stan’s shuttered Google group for developers. We’re not removing any of the old content, but we are prohibiting new posts.

GPU support: Rok Cesnovar and Steve Bronder have been getting GPU support working for linear algebra operations. They’re starting with Cholesky decomposition because it’s a bottleneck for Gaussian process (GP) models and because it has the pleasant property of being quadratic in data and cubic in computation.
See math pull request 529

Distributed computing support: Sebastian Weber is leading the charge into distributed computing using the MPI framework (multi-core or multi-machine) by essentially coding up map-reduce for derivatives inside of Stan. Together with GPU support, distributed computing of derivatives will give us a TensorFlow-like flexibility to accelerate computations. Sebastian’s also looking into parallelizing the internals of the Boost and CVODES ordinary differential equation (ODE) solvers using OpenCL.
See math issue 101 and math issue 551.

Logging framework: Daniel Lee added a logging framework to Stan to allow finer-grained control over how Stan’s output and warning messages are handled.

Operands and partials: Sean Talts finished the refactor of our underlying operands and partials data structure, which makes it much simpler to write custom derivative functions

See pull request 547

Autodiff testing framework: Bob Carpenter finished the first use case for a generalized autodiff tester to test all of our higher-order autodiff thoroughly
See math pull request 562

C++11: We’re all working toward the 2.16 release, which will be our last release before we open the gates of C++11 (and some of C++14). This is going to make our code a whole lot easier to write and maintain, and will open up awesome possibilities like having closures to define lambdas within the Stan language, as well as consolidating many of our uses of Boost into the standard template library.

Append arrays: Ben Bales added signatures for append_array, to work like our appends for vectors and matrices.
See pull request 554 and pull request 550

ODE system size checks: Sebastian Weber pushed a bug fix that cleans up ODE system size checks to avoid seg faults at run time.
See pull request 559

RNG consistency in transformed data: A while ago we relaxed the generated-quantities-only restriction on _rng functions by allowing them in transformed data (so you can fit fake data generated wholly within Stan, or represent posterior uncertainty from some other process, allowing “cut”-like models to be formulated as a two-stage process). Mitzi Morris just cleaned these up so that the same RNG seed is used for all chains, which lets us perform convergence monitoring; multiple replications would then be done by running the whole multi-chain process multiple times.
See Stan pull request 2313

NSF Grant: CI-SUSTAIN: Stan for the Long Run: We (Bob Carpenter, Andrew Gelman, Michael Betancourt) were just awarded an NSF grant for Stan sustainability. This was a follow-on from the first Compute Resource Initiative (CRI) grant we got after building the system. Yea! This adds roughly a year of funding for the team at Columbia University. Our goal is to put in governance processes for sustaining the project as well as shore up all of our unit tests and documentation.

Hiring: We hired two full-time Stan staff at Columbia: Sean Talts joins as a developer and Breck Baldwin as business manager for the project. Sean had already been working as a contractor for us, hence all the pull requests. (Pro tip: The best way to get a foot in the door for an open-source project is to submit a useful pull request.)

SPEED: Parallelizing Stan using the Message Passing Interface (MPI)

Sebastian Weber writes:

Bayesian inference has to overcome tough computational challenges and thanks to Stan we now have a scalable MCMC sampler available. For a Stan model running NUTS, the computational cost is dominated by gradient calculations of the model log-density as a function of the parameters. While NUTS is scalable to huge parameter spaces, this scalability becomes more of a theoretical one as the computational cost explodes. Models which involve ordinary differential equations (ODE) are such an example, where the runtimes can be of the order of days.

The obvious speedup when using Stan is to run multiple chains at the same time on different computer cores. However, this cannot reduce the runtime of any single chain; that requires within-chain parallelization.

Hence, a viable approach is to parallelize the gradient calculation within a chain. Since many Bayesian models are hierarchical over groupings, we can often calculate the contributions to the log-likelihood separately for each group.

Therefore, the concept of an embarrassingly parallel program can be applied in this setting, i.e. one can calculate these independent work chunks on separate CPU cores and then collect the results.

For reasons implied by Stan’s internals (the gradient calculation must not run in a threaded program) we are restricted in the applicable techniques. One possibility is the Message Passing Interface (MPI), which uses multiple CPU cores by firing off independent processes. A root process sends packets of work (sets of parameters) to the child nodes, which do the work and then send back the results (function return values and gradients). A first toy example (3 ODEs, 7 parameters) shows dramatic speedups: a single-core runtime of 5.2 hours drops to just 17 minutes on a 20-core machine (an 18x speedup). MPI also scales across machines: throwing 40 cores at the problem brings us down to 10 minutes, which is “only” a 31x speedup (see the above plot).
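To make the map-reduce idea concrete, here is a rough sketch in R of the embarrassingly parallel step. This is not Stan’s MPI code: the normal model, the grouping, and the use of parallel::mclapply in place of an MPI scatter/gather are all stand-ins for illustration. Each worker computes its group’s log-likelihood and gradient contribution, and the results are summed:

library(parallel)

set.seed(1)
groups <- split(rnorm(1e5, mean = 3), rep(1:20, length.out = 1e5))  # hypothetical grouped data
theta  <- 2.5   # current parameter value at which the sampler needs the gradient

group_contrib <- function(y, theta) {
  c(loglik = sum(dnorm(y, theta, 1, log = TRUE)),  # group's log-likelihood contribution
    grad   = sum(y - theta))                       # d/dtheta of the normal(theta, 1) log-density
}

# "Map" across groups on separate cores (fork-based, so Unix-alikes only),
# then "reduce" by summing the per-group contributions.
pieces <- mclapply(groups, group_contrib, theta = theta, mc.cores = 4)
Reduce(`+`, pieces)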

Of course, the MPI approach works best on clusters with many CPU cores. Overall, this is fantastic news for big models, as it opens the door to scaling out large problems onto the clusters that are available nowadays in many research facilities.

The source code for this prototype is on our github repository. This code should be regarded as working research code and we are currently working on bringing this feature into the main Stan distribution.

Wow. This is a big deal. There are lots of problems where this method will be useful.

P.S. What’s with the weird y-axis labels on that graph? I think it would work better to just go 1, 2, 4, 8, 16, 32 on both axes. I like the wall-time markings on the line, though; that helped me follow what was going on.

Pizzagate gets even more ridiculous: “Either they did not read their own previous pizza buffet study, or they do not consider it to be part of the literature . . . in the later study they again found the exact opposite, but did not comment on the discrepancy.”

Background

Several months ago, Jordan Anaya, Tim van der Zee, and Nick Brown reported that they’d uncovered 150 errors in 4 papers published by Brian Wansink, a Cornell University business school professor who describes himself as a “world-renowned eating behavior expert for over 25 years.”

150 errors is pretty bad! I make mistakes myself and some of them get published, but one could easily go through an entire career publishing fewer than 150 mistakes. So many errors in just four papers is kind of amazing.

After the Anaya et al. paper came out, people dug into other papers of Wansink and his collaborators and found lots more errors.

Wansink later released a press release pointing to a website which he said contained data and code from the 4 published papers.

In that press release he described his lab as doing “great work,” which seems kinda weird to me, given that their published papers are of such low quality. Usually we would think that if a lab does great work, this would show up in its publications, but this did not seem to have happened in this case.

In particular, even if the papers in question had no data-reporting errors at all, we would have no reason to believe any of the scientific claims that were made therein, as these claims were based on p-values computed from comparisons selected from uncontrolled and abundant researcher degrees of freedom. These papers are exercises in noise mining, not “great work” at all, not even good work, not even acceptable work.

The new paper

As noted above, Wansink shared a document that he said contained the data from those studies. In a new paper, Anaya, van der Zee, and Brown analyzed this new dataset. They report some mistakes they (Anaya et al.) had made in their earlier paper, and many places where Wansink’s papers misreported his data and data collection protocols.

Some examples:

All four articles claim the study was conducted over a 2-week period, however the senior author’s blog post described the study as taking one month (Wansink, 2016), the senior author told Retraction Watch it was a two-month study (McCook, 2017b), a news article indicated the study was at least 3 weeks long (Lazarz, 2007), and the data release states the study took place from October 18 to December 8, 2007 (Wansink and Payne, 2007). Why the articles claimed the study only took two weeks when all the other reports indicate otherwise is a mystery.

Furthermore, articles 1, 2, and 4 all claim that the study took place in spring. For the Northern Hemisphere spring is defined as the months March, April, and May. However, the news report was dated November 18, 2007, and the data release states the study took place between October and December.

And this:

Article 1 states that the diners were asked to estimate how much they ate, while Article 3 states that the amount of pizza and salad eaten was unobtrusively observed, going so far as to say that appropriate subtractions were made for uneaten pizza and salad. Adding to the confusion Article 2 states:
“Unfortunately, given the field setting, we were not able to accurately measure consumption of non-pizza food items.”

In Article 3 the tables included data for salad consumed, so this statement was clearly inaccurate.

And this:

Perhaps the most important question is why did this study take place? In the blog post the senior author did mention having a “Plan A” (Wansink, 2016), and in a Retraction Watch interview revealed that the original hypothesis was that people would eat more pizza if they paid more (McCook, 2017a). The origin of this “hypothesis” is likely a previous study from this lab, at a different pizza buffet, with nearly identical study design (Just and Wansink, 2011). In that study they found diners who paid more ate significantly more pizza, but the released data set for the present study actually suggests the opposite, that diners who paid less ate more. So was the goal of this study to replicate their earlier findings? And if so, did they find it concerning that not only did they not replicate their earlier result, but found the exact opposite? Did they not think this was worth reporting?
Another similarity between the two pizza studies is the focus on taste of the pizza. Article 1 specifically states:

“Our reading of the literature leads us to hypothesize that one would rate pizza from an $8 pizza buffet as tasting better than the same pizza at a $4 buffet.”

Either they did not read their own previous pizza buffet study, or they do not consider it to be part of the literature, because in that paper they found ratings for overall taste, taste of first slice, and taste of last slice to all be higher in the lower price group, albeit with different levels of significance (Just and Wansink, 2011). However, in the later study they again found the exact opposite, but did not comment on the discrepancy.

Anaya et al. summarize:

Of course, there is a parsimonious explanation for these contradictory results in two apparently similar studies, namely that one or both sets of results are the consequence of modeling noise. Given the poor quality of the released data from the more recent articles . . . it seems quite likely that this is the correct explanation for the second set of studies, at least.

And this:

No good theory, no good data, no good statistics, no problem. Again, see here for the full story.

Not the worst of it

And, remember, those 4 pizzagate papers are not the worst things Wansink has published. They’re only the first four articles that anyone bothered to examine carefully enough to see all the data problems.

There was this example dug up by Nick Brown:

A further lack of randomness can be observed in the last digits of the means and F statistics in the three published tables of results . . . Here is a plot of the number of times each decimal digit appears in the last position in these tables:

These don’t look so much like real data, but they do seem consistent with someone making up numbers and not wanting them to seem too round, and not being careful to include enough 0’s and 5’s in the last digits.
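For readers who want to try this sort of last-digit check on their own, here is a minimal sketch in R; the reported values below are made up, not Wansink’s, and with genuine measurements the final digits should be roughly uniform:

reported   <- c(12.34, 5.67, 8.91, 3.21, 7.77, 4.56, 9.02, 6.48)  # hypothetical two-decimal table entries
last_digit <- round(reported * 100) %% 10
counts     <- table(factor(last_digit, levels = 0:9))
counts
chisq.test(counts)   # goodness-of-fit against uniform; a real check would use many more entries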

And this discovery by Tim van der Zee:

Wansink, B., Cheney, M. M., & Chan, N. (2003). Exploring comfort food preferences across age and gender. Physiology & Behavior, 79(4), 739-747.

Citations: 334

Using the provided summary statistics such as mean, test statistics, and additional given constraints it was calculated that the data set underlying this study is highly suspicious. For example, given the information which is provided in the article the response data for a Likert scale question should look like this:

Furthermore, although this is the most extreme possible version given the constraints described in the article, it is still not consistent with the provided information.

In addition, there are more issues with impossible or highly implausible data.

And:

Sığırcı, Ö, Rockmore, M., & Wansink, B. (2016). How traumatic violence permanently changes shopping behavior. Frontiers in Psychology, 7,

Citations: 0

This study is about World War II veterans. Given the mean age stated in the article, the distribution of age can only look very similar to this:

The article claims that the majority of the respondents were 18 to 18.5 years old at the end of WW2 while also having experienced repeated heavy combat. Almost no soldiers could have been any age other than 18.

In addition, the article claims over 20% of the war veterans were women, while women only officially obtained the right to serve in combat very recently.

There’s lots more at the link.

From the NIH guidelines on research misconduct:

Falsification: Manipulating research materials, equipment, or processes, or changing or omitting data or results such that the research is not accurately represented in the research record.