Since fall 2020, Robert Montgomerie has been leading a group of nascent Data Editors in designing a framework for monitoring compliance with Data Sharing requirements at The American Naturalist. This entails both setting up policies for a future board of Data Editors, whose job will be to evaluate compliance of manuscripts' data and metadata before acceptance, and identifying where problems have arisen in the past. What follows is a brief summary from Bob Montgomerie of his findings on the compliance of papers published in 2020. -Dan Bolnick
Data Transparency 2020
Bob Montgomerie, firstname.lastname@example.org
For the past decade, authors of papers published in The American Naturalist have been required to make their raw data publicly available (Whitlock et al. 2010), either as an online supplement or in a recognized public data repository. The American Naturalist was one of the first journals to make this commitment to reproducibility and transparency, and in the intervening 11 years many biology journals have followed suit. Despite this requirement, compliance has often been spotty (Roche et al. 2015), with data too often incomplete, unintelligible, inconsistent or non-existent, though by 2020 all papers in Am Nat have made some data available to readers.
The myriad forms of, and problems with, data associated with papers are hardly surprising, as journals rarely, if ever, provide guidelines for authors. For that reason, The American Naturalist now has a specific set of guidelines for providing data (https://www.journals.uchicago.edu/journals/an/instruct), much like the usual guidelines to authors for manuscript style, and a small team of data editors to help authors comply. Our policies and procedures for data will undoubtedly evolve in the coming months, as our goal is to help authors make their data as transparent as possible while also saving time for both authors and downstream users of those datasets.
To provide a summary of the current state of data available for Am Nat papers, I looked at 100 papers published in 2020 (issues 1-5). By my count, 78 of those papers analyzed data that I expected to be made available (e.g., they were not analytical theory or a synthetic review). The good news is that all but six of those papers made their raw data available: three of the six had embargoed their data for a reasonable period, and the other three had not yet made their data available, which we immediately rectified. The not so good news, and this applies to all journals that I use regularly, is that those data are too often incomplete or inscrutably difficult to understand (see graph).
The biggest issue, and one that is easy to resolve, is that only about 25% of those papers with data are what we would now call ‘complete’ in that they provide useful information about the datasets and variables provided. When trying to use data from a variety of journals in my statistics courses over the past decade, I often found that it would take me hours or even days to replicate analyses, too often involving correspondence with the authors to figure out cryptic variable names and complex data structures.
Those 75 data repositories that I looked at are remarkably diverse, involving 5 different repositories, from one to hundreds of files, some 34 different file types, and total repository size ranging from 20 KB to >13 GB. Anyone who has tried to open VisiCalc files from 1981, as I have, will appreciate the usefulness of simple file structures that will be accessible for years to come as the landscape of data-handling software evolves.
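For readers who want to take stock of their own data package before depositing it, here is a minimal Python sketch of that kind of tally: number of files, total size, and a count of file types. It is only an illustration, not the procedure used for the survey above; the function name summarize_repository and the idea of pointing it at a local folder holding a downloaded copy of the repository are assumptions made for the example.

```python
from collections import Counter
from pathlib import Path


def summarize_repository(root):
    """Return (file_count, total_bytes, Counter of file extensions) for a folder.

    Hypothetical helper for illustration: `root` is assumed to be a local
    folder containing a downloaded copy of one data repository.
    """
    root = Path(root)
    extensions = Counter()
    file_count = 0
    total_bytes = 0
    for path in root.rglob("*"):  # walk the folder and all subfolders
        if path.is_file():
            file_count += 1
            total_bytes += path.stat().st_size
            # Extensions are lower-cased so ".CSV" and ".csv" count as one type;
            # files without an extension are grouped under "(none)".
            extensions[path.suffix.lower() or "(none)"] += 1
    return file_count, total_bytes, extensions
```

Calling summarize_repository on a folder returns the three values together, so a long tail of unusual file types or an unexpectedly large total is easy to spot before the data are archived.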
The survey of papers published in 2020 provides a baseline to gauge our progress in making data associated with Am Nat papers useful and transparent, and our research optimally reproducible. We will revisit this sort of analysis in a year’s time and we welcome your comments and suggestions.
Roche DG, Kruuk LEB, Lanfear R, Binning SA. Public Data Archiving in Ecology and Evolution: How Well Are We Doing? PLOS Biol. 2015; 13 (11): e1002295. https://doi.org/10.1371/journal.pbio.1002295 PMID: 26556502
Whitlock MC, McPeek MA, Rausher MD, Rieseberg L, Moore AJ. Data archiving. American Naturalist. 2010; 175: 145-146.