The scientific credibility crisis
Impugning the credibility of science as a whole
In our scientific bulletin from February this year (Reproducibility in Behavioral Neuroscience: Methods Matter), we focused on the findings of a recent survey reported by Baker (2016), which indicate a significant ‘crisis’ of reproducibility. In this survey of 1,576 researchers, more than 70% reported that they had tried and failed to reproduce another scientist’s experiments, and more than half had failed to reproduce their own experiments.
In April this year, Kafkafi and colleagues published a comprehensive review on the reproducibility and replicability of rodent phenotyping in preclinical studies (Kafkafi et al., 2018). This intriguing review summarizes the most prevalent opinions on the topic and also gives insights into more controversial positions. It is based on a conference on replicability and reproducibility in animal phenotyping held at Tel Aviv University in January 2015, with participating researchers from various disciplines including genetics, behavior genetics, behavioral neuroscience, ethology, statistics, bioinformatics and data science. The highly recommended proceedings of this conference are available as a set of video clips.
Kafkafi et al. start from the growing concern about the proportion of published ‘discoveries’ that could not be replicated in subsequent studies, and sometimes could not even be reproduced in a reanalysis of the original data. They report that this ‘crisis’ is increasingly seen as a problem with the scientific method itself, one that impugns the credibility of science as a whole.
Reproducibility vs. replicability
Since the terms “replicable, reproducible, repeatable, confirmable, stable, generalizable, reviewable, auditable, verifiable and validatable” can have orthogonal or even opposing meanings in different disciplines and fields of science, there is no scientific consensus on the definition of the basic terminology, or even on what to call this crisis. Leek and Peng offered a helpful distinction, at least for the basic terms “reproducibility” and “replicability” (Leek & Peng, 2015):
- Reproducibility: The ability to recompute results. The aim is to derive the same results, figures and conclusions reported in the publication from the same original data, through reanalysis.
- Replicability: The chances that other experimenters will achieve a consistent result. It is concerned with replicating the outcomes of another study in a similar but not necessarily identical way, for example at a different time and/or in a different laboratory, to arrive at similar conclusions on the same research question.
However, while Kafkafi et al. adopt the distinction by Leek & Peng in their review, other researchers suggest similar distinctions with the two definitions swapped, which confuses the discussion of this credibility crisis even further. In response to this ‘crisis’, prominent institutions and journals have recently reconsidered their policies (e.g., the NIH now uses the comprehensive term ‘rigor’ to describe adequacy of experimental design, metadata, and analytic methods that should hopefully lead to higher rates of replicability and reproducibility).
“Show me” vs. “trust me”
Another basic, and for that reason all the more important, issue that contributes to this crisis was recently described by Philip Stark, Professor of Statistics and Associate Dean of the Division of Mathematical and Physical Sciences at UC Berkeley (Stark, 2015). On top of the confusion caused by differing terminology, most scientific publications do not give access to the raw data, the code, and other details needed to confirm the reported results, and thereby demand mere trust in those results. Stark therefore demands that science should be based on “show me” instead of “trust me”. In 2011, Roger Peng likewise called for a ‘gold standard’ for scientific publications that would ultimately foster full replication (Peng, 2011; see Figure 2).
However, the almost omnipresent pressure to minimize the length of methods sections has resulted in drastically abbreviated descriptions of important details. In a study of over 700 recent publications on ChIP-seq, Marinov and colleagues even detected a trend that provides solid evidence of a negative relation between the impact factor of a journal and the likelihood of technical error (Marinov et al., 2014).
To distinguish science from one-time ‘miracles’, and to formulate valid scientific interpretations that do not rest on insights idiosyncratic to local and specific conditions, reproducibility and replicability are crucial in all fields of experimental research. Especially in animal research, the lives and welfare of subjects are valuable for ethical reasons and should not be wasted on inconclusive research!
At the same time, the automated and computerized high-throughput strategies, test batteries and pipelines used for phenotyping usually record 10²-10³ phenotypic measures per independent variable. Providing a full report on all measures in one publication, following Peng’s ‘gold standard’, is challenging to say the least, so inevitably only a few are highlighted (Peterson et al., 2016).
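To illustrate why highlighting only a few of many measures is risky, here is a minimal sketch in Python. It assumes independent tests and uses the standard Benjamini-Hochberg step-up procedure as one example of an error-controlling strategy; it is not claimed to be the specific method of Peterson et al. The function names and the ten p-values are hypothetical and serve only as an illustration.

```python
from typing import List, Sequence


def expected_false_positives(n_measures: int, alpha: float) -> float:
    """Expected number of 'significant' results if no measure has a true effect."""
    return n_measures * alpha


def benjamini_hochberg(p_values: Sequence[float], q: float = 0.05) -> List[bool]:
    """Flag which p-values are rejected by the Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# With 1,000 measures tested at alpha = 0.05 and no true effects,
# about 50 measures are expected to look 'significant' by chance alone.
chance_hits = expected_false_positives(1000, 0.05)

# Hypothetical p-values for ten phenotypic measures (illustrative, not real data).
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
naive_hits = [p <= 0.05 for p in example_p]       # uncorrected threshold
bh_hits = benjamini_hochberg(example_p, q=0.05)   # false discovery rate control

print(f"Expected chance hits among 1,000 null measures: {chance_hits:.0f}")
print(f"Uncorrected 'discoveries': {sum(naive_hits)}")
print(f"Discoveries after Benjamini-Hochberg: {sum(bh_hits)}")
```

In this toy example, five measures pass the uncorrected threshold but only two survive the correction, which is why reporting how many measures were recorded matters as much as reporting the ones that were highlighted.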
Biobserve has developed a range of fully automated (home-cage based) experiments, which reduce or eliminate ambiguity of interpretation in behavioral research and provide automated and consistent data handling as well as robust quality control.
We can help you design and establish robust and reproducible behavioral experiments, and we offer validated software and support for gathering and analysing your research data.
Please do not hesitate to contact us regarding your next behavioral study!
Baker M (2016) 1,500 scientists lift the lid on reproducibility. Nature. 533(7604):452-4. doi: 10.1038/533452a.
Leek JT, Peng RD (2015) Opinion: Reproducible research can still be wrong: adopting a prevention approach. Proc Natl Acad Sci U S A. 112(6):1645-6.
Marinov GK, Kundaje A, Park PJ, Wold BJ (2014) Large-scale quality analysis of published ChIP-seq data. G3 (Bethesda). 4(2):209-23.
Kafkafi N, Agassi J, Chesler EJ, Crabbe JC, Crusio WE, Eilam D, Gerlai R, Golani I, Gomez-Marin A, Heller R, Iraqi F, Jaljuli I, Karp NA, Morgan H, Nicholson G, Pfaff DW, Richter SH, Stark PB, Stiedl O, Stodden V, Tarantino LM, Tucci V, Valdar W, Williams RW, Würbel H, Benjamini Y (2018) Reproducibility and replicability of rodent phenotyping in preclinical studies. Neurosci Biobehav Rev. 87:218-232.
Peng RD (2011) Reproducible research in computational science. Science. 334(6060):1226-7.
Peterson CB, Bogomolov M, Benjamini Y, Sabatti C (2016) Many Phenotypes Without Many False Discoveries: Error Controlling Strategies for Multitrait Association Studies. Genet Epidemiol. 40(1):45-56.
Stark PB (2015) Science is “show me” not “trust me”. Berkeley Initiative for Transparency in the Social Sciences. http://www.bitss.org/2015/12/31/science-is-show-me-not-trust-me/