[66] Outliers: Evaluating A New P-Curve Of Power Poses

In a forthcoming Psych Science paper, Cuddy, Schultz, & Fosse, hereafter referred to as CSF, p-curved 55 power-posing studies (.pdf | SSRN), concluding that they contain evidential value [1]. Thirty-four of those studies were previously selected and described as “all published tests” (p. 657) by Carney, Cuddy, & Yap (2015; .htm). Joe and Uri p-curved those 34 studies and concluded that they lacked evidential value (.htm | Colada[37]). The two p-curve analyses – Joe & Uri's old p-curve and CSF's new p-curve – arrive at different conclusions not because the different sets of authors used different sets of tools, but rather because they used the same tool to analyze different sets of data.

In this post we discuss CSF's decision to include four studies with unusually small p-values (e.g., p < 1 in a quadrillion) in their analysis. The inclusion of these studies was sufficiently problematic that we stopped further evaluating their p-curve. [2].

Aside:
Several papers have replicated the effect of power posing on feelings of power and, as Joe and Uri reported in their Psych Science paper (.htm, pg.4), a p-curve of those feelings-of-power effects suggests they contain evidential value. CSF interpret this as a confirmation of the central power-posing hypothesis, whereas we are reluctant to interpret it as such for reasons that are both psychological and statistical. Fleshing out the arguments on both sides may be interesting, but it is not the topic of this post.

Evaluating p-curves
Evaluating any paper is time consuming and difficult. Evaluating a p-curve paper – which is in essence, a bundle of other papers – is necessarily more time consuming and more difficult.

We have, over time, found ways to do it more efficiently. We begin by preliminarily assessing three criteria. If the p-curve fails any of these criteria, we conclude that it is invalid and stop evaluating it. If the p-curve passes all three criteria, we evaluate the p-curve work more thoroughly.

Criterion 1: Study Selection Rule
Our first step is to verify that the authors followed a clear and reproducible study selection rule. CSF did not. That’s a problem, but it is not the focus of this post. Interested readers can check out this footnote: [3].

Criterion 2: Test Selection
Figure 4 (.pdf) in our first p-curve paper (SSRN) explains and summarizes which tests to select from the most common study designs. The second thing we do when evaluating a p-curve paper is to verify that the guidelines were followed by focusing on the subset of designs that are most commonly incorrectly treated by p-curvers. For example, we look at interaction hypotheses to make sure that the right test is included, and we look to see whether omnibus tests are selected (they should almost never be; see Colada[60]). CSF selected some incorrect test results (e.g., their smallest p-value comes from an omnibus test). See "Outlier 1" below.

Criterion 3. Outliers
Next we sort studies by p-value to identify possible outliers, and we carefully read the papers containing an outlier result. We do this both because outliers exert a disproportionate effect on the results of p-curve, and because outliers are much more likely to represent the erroneous inclusion of a study or the erroneous selection of a test result. This post focuses on outliers.

This figure presents the distribution of p-values in CSF’s p-curve analysis (see their disclosure table .xlsx). As you can see, there are four outliers:

Outlier 1
CSF's smallest p-value is from F(7, 140) = 19.47, approximately p = .00000000000000002, or 1 in 141 quadrillion. It comes from a 1993 experiment published in the journal The Arts in Psychotherapy (.htm).

In this within-subject study (N = 24), each participant held three “open” and three “closed” body poses. At the beginning of the study, and then again after every pose, they rated themselves on eight emotions. The descriptions of the analyses are insufficiently clear to us (and to colleagues we sent the paper to), but as far as we can tell, the following things are true:

(1) Some effects are implausibly large. For example, Figure 1 in their paper (.htm) suggests that the average change in happiness for those adopting the “closed” postures was ~24 points on a 0-24 scale. This could occur only if every participant was maximally happy at baseline and then maximally miserable after adopting every one of the 3 closed postures.

(2) The statistical analyses incorrectly treat multiple answers by the same participants as independent, across emotions and across poses.

(3) The critical test of an interaction between emotion valence and pose is not reported. Instead the authors report only an omnibus interaction: F(7, 140) = 19.47. Given the degrees-of-freedom of the test, we couldn’t figure out what hypothesis this analysis was testing, but regardless, no omnibus test examines the directional hypothesis of interest. Thus, it should not be included in a p-curve analysis.

Outlier 2
CSF's second smallest p-value is from F(1,58)=85.9, p = .00000000005, or 1 in 2 trillion. It comes from a 2016 study published in Biofeedback Magazine (.htm). In that study, 33 physical therapists took turns in dyads, with one of them (the “tester”) pressing down on the other’s arm, and the other (the “subject”) attempting to resist that pressure.

The p-value selected by CSF compares subjective arm strength when the subject is standing straight (with back support) vs. slouching (without support). As the authors of the original article explain, however, that has nothing to do with any psychological consequences of power posing, but rather, with its mechanical consequences. In their words: “Obviously, the loss of strength relates to the change in the shoulder/body biomechanics and affects muscle activation recorded from the trapezius and medial and anterior deltoid when the person resists the downward applied pressure” (p. 68-69; emphasis added) [4].

Outlier 3
CSF's third smallest p-value is from F(1,68)=26.25, p = .00000267, or 1 in ~370,000. It comes from a 2014 study published in Psychology of Women Quarterly (.htm).

This paper explores two main hypotheses, one that is quite nonintuitive, and one that is fairly straightforward. The nonintuitive hypothesis predicts, among other things, that women who power pose while sitting on a throne will attempt more math problems when they are wearing a sweatshirt but fewer math problems when they are wearing a tank-top; the prediction is different for women sitting in a child's chair instead of a throne [5].

CSF chose the p-value for the straightforward hypothesis, the prediction that people experience fewer positive emotions while slouching (“allowing your rib cage to drop and your shoulders to rotate forward”) than while sitting upright (“lifting your rib cage forward and pull[ing] your shoulders slightly backwards”).

Unlike the previous two outliers, one might be able to validly include this p-value in p-curve. But we have reservations, both about the inclusion of this study, and about the inclusion of this p-value.

First, we believe most people find power posing interesting because it affects what happens after posing, not what happens while posing. For example, in our opinion, the fact that slouching is more uncomfortable than sitting upright should not be taken as evidence for the power poses hypothesis.

Second, while the hypothesis is about mood, this study’s dependent variable is a principal component that combines mood with various other theoretically irrelevant variables that could be driving the effect, such as how "relaxed" or "amused" the participants were. We discuss two additional reservations in this footnote: [6].

Outlier 4
CSF's fourth smallest p-value is from F(2,44)=13.689, p=.0000238, or 1 in 42,000. It comes from a 2015 study published in the Mediterranean Journal of Social Sciences (.htm). Fifteen male Iranian students were all asked to hold the same pose for almost the entirety of each of nine 90-minute English instruction sessions, varying across sessions whether it was an open, ordinary, or closed pose. Although the entire class was holding the same position at the same time, and evaluating their emotions at the same time, and in front of all other students, the data were analyzed as if all observations were independent, artificially reducing the p-value.

Conclusion
Given how difficult and time consuming it is to thoroughly review a p-curve analysis or any meta-analysis (e.g., we spent hours evaluating each of the four studies discussed here), we preliminarily rely on three criteria to decide whether a more exhaustive evaluation is even warranted. CSF's p-curve analysis did not satisfy any of the criteria. In its current form, their analysis should not be used as evidence for the effects of power posing, but perhaps a future revision might be informative.

Author feedback.
Our policy (.htm) is to share, prior to publication, drafts of posts with original authors whose work we discuss, asking them to identify anything that is unfair, inaccurate, misleading, snarky, or poorly worded.

We contacted CSF and the authors of the four studies we reviewed.

Amy Cuddy responded to our email, but did not discuss any of the specific points we made in our post, or ask us to make any specific changes. Erik Peper, lead author of the second outlier study, helpfully noticed that we had the wrong publication date and briefly mentioned several additional articles of his own on how slouched positions affect emotions, memory, and energy levels (.htm; .pdf; htm; html; html). We also received an email from the second author of the first outlier study; he had "no recommended changes." He suggested that we try to contact the lead author but we were unable to find her current email address.

Subscribe to Blog via Email

Footnotes

When p-curve concludes that there is evidential value, it is simply saying that at least one of the analyzed findings was unlikely to have arisen from the mere combination of random noise and selective reporting. In other words, at least one of the studies would be expected to repeatedly replicate [↩]
After reporting the overall p-curve, CSF also split the 55 studies based on the type of dependent variable: (i) feelings of power, (ii) EASE (“Emotion, Affect, and Self-Evaluation”), and (iii) behavior or hormonal response (non-EASE). They find evidential value for the first two, but not the last. The p-curve for EASE includes all four of the studies described in this post. [↩]
To ensure that studies are not selected in a biased manner, and more generally to help readers and reviewers detect possible errors, the set of studies included in p-curve must be determined by a predetermined rule. The rule, in turn, should be concrete and precise enough that an independent set of researchers following the rule would generate the same, or virtually the same, set of studies. The rule, as described in CSF's paper, lacks the requisite concreteness and precision. In particular, the paper lists 24 search terms (e.g., “power”, “dominance”) that were combined (but the combinations are not listed). The resulting hits were then “filter[ed] out based on title, then abstract, and then the study description in the full text” in an unspecified manner. (Supplement: https://osf.io/5xjav/ | our archived copy .txt). In sum, though the authors provide some information about how they generated their set of studies, neither the search queries nor the filters are specified precisely enough for someone else to reproduce them. Joe and Uri's p-curve, on the other hand, followed a reproducible study selection rule: all studies that were cited by Carney et al. (2015) as evidence for power posing. [↩]
The paper also reports that upright vs. collapsed postures may affect emotions and the valence of memories, but these claims are supported by quotations rather than by statistics. The one potential exception is that the authors report a “negative correlation between perceived strength and severity of depression (r=-.4)." Given the sample size of the study, this is indicative of a p-value in the .03-.05 range. The critical effect of pose on feelings, however, is not reported. [↩]
The study (N = 80) employed a fully between-subjects 2(self-objectification: wearing a tanktop vs. wearing a sweatshirt) x 2(power/status: sitting in a “grandiose, carved wooden decorative antique throne” vs. a “small wooden child’s chair from the campus day-care facility”) x 2(pose: upright vs slumped) design. [↩]
First, for robustness, one would need to include in the p-curve the impact of posing on negative mood, which is also reported in the paper and which has a considerably larger p-value (F = 13.76 instead of 26.25). Second, the structure of the experiment is very complex, involving a three-way interaction which in turn hinges on a two-way reversing interaction and a two-way attenuated interaction. It is hard to know if the p-value distribution of the main effect is expected to be uniform under the null (a requirement of p-curve analysis) when the researcher is interested in these trickle-down effects. For example, it hard to know whether p-hacking the attenuated interaction effect would cause the p-value associated with the main effect to be biased downwards. [↩]