Front Psychol. 2020 Dec 22;11:609647. doi: 10.3389/fpsyg.2020.609647. eCollection 2020.

Statistical Significance Filtering Overestimates Effects and Impedes Falsification: A Critique of Endsley (2019)


Jonathan Z Bakdash et al. Front Psychol. 2020.

Abstract

Whether in meta-analysis or single experiments, selecting results based on statistical significance leads to overestimated effect sizes, impeding falsification. We critique a quantitative synthesis that used significance to score and select previously published effects for situation awareness-performance associations (Endsley, 2019). How much does selection using statistical significance quantitatively impact results in a meta-analytic context? We evaluate and compare results using significance-filtered effects versus analyses with all effects as-reported. Endsley reported high predictiveness scores and large positive mean correlations but used atypical methods: the hypothesis was used to select papers and effects. Papers were assigned the maximum predictiveness score if they contained at least one significant effect; yet most papers reported multiple effects, and the number of non-significant effects did not affect the score. Thus, the predictiveness score was rarely less than the maximum. In addition, only significant effects were included in Endsley's quantitative synthesis. Filtering excluded half of all reported effects, and the retained effects had guaranteed minimum sizes determined by each paper's sample size. Results for filtered versus as-reported effects clearly diverged: compared to the mean of as-reported effects, the filtered mean was overestimated by 56%, and 92% (222 out of 241) of the as-reported effects fell below the mean of filtered effects. We conclude that outcome-dependent selection of effects is circular, predetermining results and running contrary to the purpose of meta-analysis. Instead of using significance to score and filter effects, meta-analyses should follow established research practices.

Keywords: confirmation bias; falsification; meta-analysis; p-hacking; performance; selection bias; significance filter; situation awareness.
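The filtering mechanism is easy to see in simulation. The sketch below (our hypothetical illustration, not the authors' analysis code) draws many correlation studies at the empirical values cited in Figure 2 (ρ = 0.29, N = 24), applies a one-tailed significance filter at α = 0.05, and compares the mean of the surviving effects with the mean of all effects:

```python
# Hypothetical simulation of the significance filter: the true effect,
# sample size, and alpha are taken from Figure 2; everything else is assumed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rho, n, n_studies = 0.29, 24, 100_000

# Fisher z of an observed r is approximately Normal(arctanh(rho), 1/(N - 3)).
z = rng.normal(np.arctanh(rho), 1 / np.sqrt(n - 3), n_studies)
r = np.tanh(z)

# One-tailed t test for a positive correlation at alpha = 0.05.
t = r * np.sqrt((n - 2) / (1 - r**2))
p = stats.t.sf(t, df=n - 2)
keep = p < 0.05

print(f"mean of all effects as-reported: {r.mean():.2f}")       # ~0.29
print(f"mean of filtered effects:        {r[keep].mean():.2f}")  # inflated
print(f"share of effects filtered out:   {1 - keep.mean():.2f}")
```

Only the significant effects survive the filter, and because those are necessarily the larger ones, the filtered mean is inflated even though every simulated study estimates the same true effect.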


Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

FIGURE 1
Top panel: Distribution of the 241 effects from the 38 papers included in our dataset. Individual effect sizes are indicated by ticks above the x-axis, with a thick vertical line for the overall mean estimated using a meta-analytic model. Bottom panel: Distribution of the significance-filtered effects from the same dataset, with the overall mean from Table 1 (r = 0.46). Note the restriction of range under significance filtering.
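The overall mean in the top panel comes from a meta-analytic model fit to all 241 effects, which must account for effects being nested within papers. As a simplified stand-in that ignores that nesting, a standard DerSimonian-Laird random-effects pool of Fisher-z correlations can be sketched as follows (the input effects here are made up for illustration):

```python
# Simplified random-effects pooling (DerSimonian-Laird on Fisher-z values).
# This ignores the nesting of multiple effects within a paper, which the
# full analysis would need to model; the example effects are made up.
import numpy as np

def pooled_r(rs, ns):
    z = np.arctanh(rs)                       # Fisher z transform
    v = 1.0 / (np.asarray(ns) - 3)           # within-study variances
    w = 1.0 / v
    z_fixed = np.sum(w * z) / np.sum(w)      # fixed-effect mean
    q = np.sum(w * (z - z_fixed) ** 2)       # Cochran's Q
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(z) - 1)) / c)  # between-study variance
    w_star = 1.0 / (v + tau2)                # random-effects weights
    return np.tanh(np.sum(w_star * z) / np.sum(w_star))

print(pooled_r(np.array([0.10, 0.30, 0.50]), np.array([20, 30, 40])))
```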
FIGURE 2
Top: The probability that at least one effect in a given paper is significant or marginally significant as a function of the number of effects per paper (k), defined by 1 − (1 − δ)^k, where the power δ is determined using the significance level (α = 0.05, dark red, or α = 0.10, pink), the empirical sample size (N = 24), and the empirical effect size (ρ = 0.29). The median (blue) and mean (gray) number of effects per paper are shown as dotted lines. Individual k values for the papers in our dataset are indicated by ticks above the x-axis. Bottom: The expected value of the predictiveness score as a function of k, again using the empirical sample size (N = 24) and effect size (ρ = 0.29).
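The top panel's curve can be approximated by computing δ via the Fisher z transformation. The sketch below treats the k effects in a paper as independent, as the formula 1 − (1 − δ)^k does; the figure's exact power calculation may differ slightly, so these numbers are approximations:

```python
# At-least-one-significant probability, 1 - (1 - delta)^k, with power delta
# approximated via the Fisher z transformation (one-tailed positive test).
import numpy as np
from scipy import stats

def power_one_tailed_r(rho, n, alpha):
    se = 1.0 / np.sqrt(n - 3)
    z_crit = stats.norm.isf(alpha)           # critical value under H0
    return stats.norm.sf(z_crit - np.arctanh(rho) / se)

rho, n = 0.29, 24                            # empirical values from the caption
for alpha in (0.05, 0.10):
    delta = power_one_tailed_r(rho, n, alpha)
    for k in (1, 2, 4, 6, 10):
        print(f"alpha={alpha:.2f}, k={k:2d}: "
              f"P(at least one significant) = {1 - (1 - delta) ** k:.2f}")
```

With roughly six effects per paper on average (241 effects across 38 papers), even modest per-effect power makes at least one significant effect almost certain, which is why the predictiveness score was rarely below its maximum.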
FIGURE 3
For a one-tailed, positive correlation: the shaded areas are the possible ranges of effect sizes for significant (dark red), marginally significant (pink), and non-significant results, given the sample size reported in each paper/dataset. The lowest values of the red and pink shaded areas depict the guaranteed minimum effect sizes. Dark gray dots show the actual effect sizes (y-axis) and corresponding sample sizes (x-axis) as reported in the individual papers in our dataset. Red and pink dots indicate overfit results (excessive degrees of freedom; see section "Dataset" and Supplementary Material 1.2) that only reached two-tailed significance or marginal significance, respectively, due to overfitting. For example, one paper states a sample size of 10 participants but reports r(52) = 0.32, p < 0.02; a Pearson correlation has N − 2 degrees of freedom, so this should be r(8) (see Supplementary Material 1.2 for more information). Note that with one-tailed tests, all non-positive effects are non-significant and thus filtered out under Endsley's described method. One paper with N = 171 is not shown.
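The guaranteed minimums plotted as the lower edges of the shaded areas follow from inverting the correlation t test: for a given N, any r below the critical value cannot reach significance. A short sketch (assuming the one-tailed, α = 0.05 framing of the figure; the sample sizes other than the empirical N = 24 and N = 171 are arbitrary):

```python
# Guaranteed minimum (critical) correlation implied by one-tailed
# significance at sample size N: r_crit = t_crit / sqrt(t_crit^2 + df).
import numpy as np
from scipy import stats

def r_critical(n, alpha=0.05):
    df = n - 2                        # a Pearson correlation has N - 2 df
    t_crit = stats.t.isf(alpha, df)
    return t_crit / np.sqrt(t_crit ** 2 + df)

for n in (10, 24, 50, 171):
    print(f"N = {n:3d}: a significant r must be at least {r_critical(n):.2f}")
```

This is why significance filtering guarantees a minimum effect size that grows as sample size shrinks: small-N papers can only contribute large significant correlations.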
FIGURE 4
Forest plot depicting mean correlations between SA and performance, both overall and for individual SA measures. Meta-analytic means and the prediction interval using all 241 reported effects are shown in black; means of filtered effects (Table 1) are shown in red. For the reason described in section 2.1, no confidence interval could be calculated for the overall filtered mean.


