
The problem with the "datapipe" line: it's too focused on p-values

On 28 April 2015, a paper by Jeffrey Leek & Roger Peng, "Statistics: P values are just the tip of the iceberg", was published in Nature. While I agree that p-values are merely the last step in what Leek and Peng call the "data pipeline", I felt that some of their arguments were flawed. Below is my response in the Comments section. Click here for the Commentary and Comments.

2015-09-10 12:18 AM

Leek and Peng suggest that arguing about p-values is 'the tip of the iceberg sinking science'. They correctly identify that for decades the misuse of p-values has been an easy target for many authors. However, given the ubiquity of p-values, continued scrutiny should be welcome as misuse continues. Such scrutiny has led to a better understanding of p-values, the use of newer analytical paradigms like information-theoretic approaches, and the revival of Bayesian analyses. Leek and Peng also suggest that there has been little debate about the steps in the 'data pipeline'. I disagree. Over the last century there has been tremendous debate over some steps in the 'data pipeline', including study design, data collection, and statistical approaches. Scientists train in these areas from early in their careers, and these lessons are reinforced during formal peer review of grant proposals and manuscript submissions. The peer-review process is not foolproof, but the rejection rates in most journals suggest that a lack of scrutiny is not the problem. Where scrutiny remains lacking, and where scientists receive insufficient training, is in the intermediate steps of the 'data pipeline': from raw data to exploratory analyses. As data sets grow in size and complexity, so does the cause for concern. These steps are critical to research outcomes, but typically receive minimal attention from the peer-review process. While I agree with Leek and Peng's suggestion that training will help remedy this problem, I submit that reproducible research – whereby the ability to reproduce analyses is facilitated by authors making data, statistical code, and documentation (code or otherwise) regarding data cleaning and exploratory analyses available – is also required. Ergo, I hope that Leek and Peng's paper fosters more debate and more recognition of these steps in the 'data pipeline', facilitating better education and broader adoption of reproducible research practices.

