March 2, 2022

Ensuring data and code archive quality: why and how?

This post was co-written by Dan Bolnick (The American Naturalist, Editor), Tim Vines (DataSeer.ai), and Bob Montgomerie (AmNat Data Editor). It addresses the value of, and flaws in, current open and reproducible science practices, and we hope it is of interest to authors, other editors, and students considering publishing in the biological sciences.


Part 1 articulates the value of data and code archiving, and can be skipped if you are already a convert. Part 2 explains what we see as weaknesses with the current system. Part 3 explains the new Data Editor role adopted by The American Naturalist and the associated workflow including DataSeer.ai's role. Part 4 addresses some remaining issues for consideration, and Part 5 provides some parting thoughts.


Part 1: The Value of Data and Code Archiving

In 2010, a group of journal editors announced that some of the leading journals in evolutionary biology would adopt a new policy requiring authors to archive all of the raw data needed to obtain published results (Whitlock et al. 2010, American Naturalist). Previously, some kinds of genetic data (e.g., DNA sequences) were often required to be archived in databases such as GenBank. However, researchers in ecology and evolution often work with system-specific types of data, and attempts to build standardized databases had typically failed. Sometimes data were provided in tables within a paper, or as online supplements once journals moved much of their material online in the early 2000s. But the most common approach was to simply state in the article that data were “available upon request”.

Relying on authors to provide data on request, however, was largely ineffective, because authors too often either ignored requests for data or responded that they no longer had access to the relevant data (“it was on a computer whose hard drive crashed years ago”). As time passes, it becomes ever more likely that those datasets are lost forever (see Vines et al. 2014). As an example, in 2012 one of us emailed 120 authors asking for the raw data underlying effect size estimates used in a previous meta-analysis, and received only five replies. This kind of poor response has been studied more rigorously (see, for instance, Savage and Vickers 2009, PLoS One, and Vines et al. 2014) and is a widespread problem. The result has been calls to adopt more rigorous systems (e.g., Langille 2018, Microbiome).

The Editors of some evolution journals sought to change this by requiring data sharing. The American Naturalist adopted this requirement in January 2011 for all new manuscripts and reached out to authors of previous papers to encourage them to archive their data retroactively. Notably, the biomedical sciences are only now following this move, in what has been dubbed a “seismic mandate” (Nature News, 2021).

Although the initial response from some members of the research community was hostile, the benefits of archiving data are by now widely recognized:

  1. Archives are useful as a back-up for your own data, in a secure location. Once archived, you will have access to your own data in perpetuity, safe from the vagaries of damaged hard drives or stolen laptops. Most data repositories also use version control to track changes to files so you can go back and retrieve earlier versions if necessary.


  2. Archived data can be re-used for new studies, providing a cost-effective framework for building upon or consolidating past results. Meta-analyses are an increasingly popular tool for combining discipline-wide findings from a large number of studies. Traditionally, meta-analyses have been based on aggregated summary statistics from many published studies, but far more effective approaches are possible when the raw data can be analyzed directly. Re-analyses by third parties do run the risk of (i) scooping an author who plans further publications from a dataset, or (ii) committing errors based on a poor understanding of the biology of the system or the experimental details. In practice, we are aware of few, if any, examples of (i) in evolution or ecology, but some clear examples of (ii). There has been debate in recent years about whether such re-use represents “research parasitism”, but as long as the findings of re-analyses are correct and provide added value, everyone wins.


  3. Data archives are citable products one should put on one’s CV. In practice, few people have done this in the past, and data archives have rarely been cited (see a small-scale personal analysis of this here, by one of us, or this analysis by Vannan et al). Nonetheless, there is increasing recognition that data archives are an important work product that might influence decisions about hiring, grant success, and promotion (Cousijn et al. 2018).


  4. Archives provide a valuable tool for teaching. Students learning statistics and the graphical presentation of data can benefit from downloading the data and code underlying published results, then trying to replicate or improve upon the published figures or statistical analyses. By examining the data and code together, students can learn how to conduct certain kinds of analyses or generate different kinds of graphs, which they can then apply to their own scientific work. Such learner-driven downloads may represent a large fraction of downloads from data repositories, and are surprisingly common (for data on download rates in a small case study, see this ecoevoevoeco blog). This may be especially valuable for graduate students starting work in a new lab, who can use data and code from their professors or lab alumni to learn more about the biology of the research system they are beginning to work on. Given the surprisingly large number of downloads that archives seem to get, we’d suggest that download numbers reflect an archive’s value better than citations do (with the caveat that archives containing suspect data may also be downloaded extensively).


  5. Reported results can be re-evaluated by other scientists who can re-run analyses. This helps to improve confidence in the veracity of published claims. Readers should therefore have more trust in, and thus be more inclined to cite, journals and papers with data archives that allow reproducible analyses. At times this can lead to discoveries of errors. For instance, The American Naturalist recently published a retraction (Gilbert and Weiss 2021) of a paper (Start et al. 2019) whose core finding was based on an incorrectly conducted statistical analysis. The retraction was made possible by a re-analysis by one of the co-authors, who gained access to the original data posted on Dryad by the lead author. We believe that voluntary self-retractions of honest mistakes should be lauded as a healthy correction process.


  6. Data archives provide key information for forensic evaluations in cases of alleged scientific fraud, misconduct, or sloppy practices. For instance, a prominent animal behavior researcher had numerous articles retracted in 2020-2021 after biologically implausible patterns were found in his data archives: forensic evaluations identified blocks of numbers that had been copied and pasted within files. In a separate case, data were found to be similar between two supposedly unrelated publications, resulting in retractions (McCauly and Gilbert 2021). Such discoveries would not have been possible without archived data. This point is highlighted by the experience of journals that lacked data archiving policies but became concerned by strings of retractions at other journals. The journal Ecology, for example, lacked an archive policy until relatively recently and has therefore been unable to evaluate papers it published by an author with multiple retractions elsewhere. We mention this not to criticize that journal, but to illustrate the value of archiving.


  7. The datasets obtained during any research are, in a sense, the property of the public whose taxes paid for the research. Thus, there is an ethical and political argument for making all raw data available. The failure to make data available has, in fact, been weaponized by politicians seeking to undermine environmental rules: in recent years, moves by a certain US President’s administration aimed to block any policies based on scientific results that did not follow open data practices. Under such rules, most research in toxicology or conservation biology from previous decades would be tossed out, undermining environmental protections. Data archiving protects your science against such political maneuvers.

Following the adoption of data archiving rules, the number of data repositories at major host sites has skyrocketed (see the figure from Evans et al. 2016, PLoS Biology, reproduced below).

The adoption of data archiving policies was not without resistance. Authors quite reasonably have a sense of ownership of their data (but see point 7 above) and worry about being scooped on further analyses of their own data. To allay that fear, journal editors can accommodate reasonable requests: long-term datasets that might yield multiple papers can be embargoed for a reasonable period, but must nevertheless be archived at the time of publication so that the data become visible to the public when the embargo ends. These concerns seem to have largely evaporated as the community has become habituated to the rules. For instance, in nearly five years as Editor of AmNat, Dan Bolnick received only a couple of requests for waivers or embargoes. A few requests were clearly from authors who did not want to bother archiving their data, but who readily provided the data when their excuses were questioned. Others had valid reasons, such as being bound by confidentiality rules about the precise geographic locations of threatened species. At journals that publish data on human subjects, individual confidentiality rules or cultural group requirements may legitimately bar the sharing of some forms of data.


Part 2: Problems with the Current State of Data Sharing

To realize the benefits outlined above, data archives must be complete, readable, and clear. Unfortunately, many are not. This problem first came to the attention of The American Naturalist’s editor (Dan Bolnick) with the retractions and corrections of Jonathan Pruitt’s papers. Most of the problems with those papers involved implausible patterns in the available data, but the investigating committee also found that some key data were simply missing. Shortly after this case was made public, a reader asked the AmNat Editor to request that a different author provide missing information for an incomplete archive. The Editor contacted the author, who (after several reminders) found the file but then claimed to be unable to figure out how to update their Dryad repository. This was eventually resolved, but it entailed a year of back-and-forth emails. On the heels of this case, an anonymous whistle-blower contacted the Editor to point out missing data files for several papers by author Denon Start. When contacted, Start explained that the data were no longer available, having been lost when a laptop was stolen, even though journal policy required that those data be archived before publication. The author provided a court case number, and the Editor confirmed that the case was legitimate, involved stolen material, and included Start among the victims, though the Editor could not confirm the specific involvement of a laptop, nor whether the laptop contained the only copy of the data (i.e., there were no backups; see the RetractionWatch article). Consultation with Dryad and the Committee on Publication Ethics (COPE) made it clear that data sharing rules are recent enough that COPE does not yet have formal recommendations for handling incomplete archives (such guidelines are now in preparation). Following recommendations from Dryad and COPE, the journal published a set of Editorial Expressions of Concern for Start’s missing data files (e.g., Bolnick 2021).

Missing data are not the only issue. In a recent survey of repositories for papers published in The American Naturalist in 2020, Bob Montgomerie found many cases where authors posted only summary statistics (means, standard errors) rather than the original data used to generate those results. Only a small fraction of repositories were complete, well documented, and useful for re-creating all of the published results. Typical problems included missing datasets, summary rather than raw data, no information about the variable names in data files (many of which were indecipherable), and, in a few cases, no data repository at all (eventually fixed by contacting the authors and Dryad; see the AmNat blog post by Montgomerie).

These observations led to a growing realization that requiring data archiving is not enough. We cannot simply trust that authors are creating complete and clear archives, because too often that is not what happens. It has become clear that we also need to proactively check archives to confirm that they meet our minimum standards. Some data archive checks do occur within the Dryad office, but these happen after a paper is accepted, whereas we believe it is crucial that acceptance be conditional on confirmation that archives are complete and useable. Checking before acceptance ensures that the journal retains leverage to encourage authors to adopt acceptable practices, and avoids the difficulty of authors who simply do not respond, claim data are missing, or take years to act because they have no incentive to make corrections once their paper is published. Moreover, Dryad does not currently check data archive contents against the text of the corresponding manuscript to confirm that all data files and variables represented in the published article are present in the archive.


Part 3: New Procedures to Check Data Archive Quality


In the summer of 2021, The American Naturalist began having a small team of data editors, led by Bob Montgomerie, examine the data repositories for all papers accepted for publication. Once a paper has been accepted by the handling editor, Montgomerie uses our Editorial Manager system to download the manuscript and its data repository, and immediately sends the manuscript to DataSeer for evaluation. DataSeer uses a machine-learning process to read the manuscript and identify all mentions of data and code, and within 48 hours provides a summary that matches those mentions to the data and code actually in the repository. One of the data editors then writes a summary of DataSeer’s findings and makes recommendations to the author(s) for making their repository complete and useful. The manuscript then goes to the editor in chief (Bolnick) for a final decision. In cases where the repository is seriously deficient, the data editor might ask to see the repository again before the manuscript is finally accepted for publication.
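For readers curious about what that matching step involves, here is a minimal sketch of the general idea, assuming a plain-text export of a manuscript and a local copy of its repository. It is purely illustrative, is not DataSeer’s actual pipeline, and the file names and extensions are hypothetical.

import re
from pathlib import Path

# Hypothetical inputs: a plain-text export of the manuscript and a local
# copy of the data repository being checked.
manuscript_text = Path("manuscript.txt").read_text(encoding="utf-8")
repository = Path("repository")

# Find file-like strings mentioned in the text, e.g. "growth_rates.csv" or "model_fits.R".
mention_pattern = re.compile(r"\b[\w\-]+\.(?:csv|tsv|txt|xlsx|R|py|m|nb)\b")
mentioned = set(mention_pattern.findall(manuscript_text))

# List every file actually present in the archive.
archived = {p.name for p in repository.rglob("*") if p.is_file()}

print("Mentioned in the manuscript but missing from the archive:")
print(sorted(mentioned - archived))
print("Present in the archive but never mentioned in the manuscript:")
print(sorted(archived - mentioned))

In practice, manuscripts rarely name every data file explicitly, which is why DataSeer works from descriptions of data and code rather than from file names alone; but even this crude comparison catches the most common problem we see, namely data discussed in the text that never made it into the archive.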


One thing that authors should be aware of: the Data Editors are not there to judge the elegance of your code. They do check that the code actually runs, and they may encourage you to annotate it more clearly. If they incidentally identify errors in the code (not their assigned task), they will let you know, so that you can correct the mistake before rather than after publication. But authors should not feel bashful about the ‘style’ of their code. Code is a tool, and if the tool gets the job done correctly, it is a useful tool. And if you don’t trust your code enough for someone else to see it, perhaps it is time to ask a collaborator to double-check it. Having a coding mistake identified after publication is far more embarrassing (one of us has been through such a retraction due to a coding error).
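For authors wondering what ‘runs on a clean computer’ means in practice, here is a small, purely illustrative sketch of a self-contained analysis script; the file and directory names are hypothetical and this is not a journal template. The habits that matter are relative paths, explicit imports, a fixed random seed, and recorded package versions.

# Illustrative only: a self-contained script that a data editor could run
# on a clean machine without editing any paths.
from pathlib import Path
import random
import sys

import pandas as pd

random.seed(42)  # make any stochastic steps repeatable

# Paths relative to the repository, not to one author's hard drive.
DATA_DIR = Path(__file__).resolve().parent / "data"
body_size = pd.read_csv(DATA_DIR / "body_size.csv")  # hypothetical raw data file

# Record software versions so they can be copied into the README.
print(f"Python {sys.version.split()[0]}, pandas {pd.__version__}")
print(body_size.describe())  # quick sanity check that the raw data loaded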


By the end of 2021, the data editors had evaluated the data repositories of 63 accepted manuscripts, all of which had been submitted to the journal before a data editor was appointed. Thus, those repositories provide a glimpse of what archives looked like before authors knew that anyone was going to look at them closely, at least not before their paper was published (as a reminder, repositories get more views and downloads than you might think).


For each of the 63 repositories evaluated in 2021, Montgomerie gave a subjective rating on a scale of 1-5, where 5 indicates a complete and informative repository, 3 a moderately useful one that is missing some data files or some raw data, and 1 a virtually incomprehensible, and thus almost useless, set of files. Expectations for a high-quality repository are not especially stringent, and are outlined in a recent blog post for authors. Here is a summary of his ratings of the data repositories for those manuscripts:



Clearly, the data repositories evaluated in 2021 were generally incomplete, and few were very good. The most frequent problems were (i) an absent or uninformative README file, (ii) missing raw data files, (iii) code that was absent, unannotated, or would not run on a clean computer, and (iv) broken links to repositories. Note that authors were not required to provide code at the time those manuscripts were submitted, so the provision of code was not a factor in the ratings shown above. Moreover, we did not set our standards particularly high, and our reports offered a few, usually simple, suggestions for improving each repository. We felt that this soft approach was desirable at the outset, until authors get used to the new (and old) data requirements and to the newly adopted (January 2022) requirement that authors archive code (if code is used) as well as data.




The perennial problem with promoting open data and open code is knowing exactly what needs to be done for each article: it’s relatively easy to write a broad policy (like the Joint Data Archiving Policy, JDAP), but it’s much, much harder to work out how that policy applies to each manuscript.


Authors are obviously best placed to know which datasets they collected and what code they wrote to analyze their data, but by the time a manuscript is accepted for publication the work itself is months or years in the past and their focus has shifted to writing up the next manuscript. As noted above, the incentives for sharing data and code are diffuse and take time to accrue, while the benefits of publishing just one more paper are more immediate. It is therefore difficult to persuade authors to spontaneously compile a list of their datasets and code objects and then put these all onto public repositories. Our experience has been that setting up data and code repositories, along with a useful README file, is best begun at the outset of a project rather than left until a final manuscript is ready to submit.


Back when Tim Vines was the Managing Editor of Molecular Ecology, he read each accepted manuscript and compiled a list of all the data it reported. One day in 2014, it occurred to him that this 20-30 minute job could be completed in seconds by a machine: Natural Language Processing (NLP) excels at picking out data-collection sentences (e.g., “We measured snout-vent length with Vernier calipers”) and determining what kind of data was collected. With that information in hand, authors can be led through the data and code sharing process for their articles without intensive attention from an Editor.
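As a toy illustration of the idea (this is emphatically not DataSeer’s actual model, and the verb list is just an example), a few lines of Python can already flag likely data-collection sentences:

import re

# A small set of verbs that often signal data collection; purely illustrative.
DATA_VERBS = re.compile(
    r"\b(measured|recorded|collected|sampled|sequenced|genotyped|surveyed|scored)\b",
    re.IGNORECASE,
)

def data_collection_sentences(text):
    """Return the sentences that probably describe data collection."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if DATA_VERBS.search(s)]

methods = ("We measured snout-vent length with Vernier calipers. "
           "Analyses were conducted in R. "
           "We recorded water temperature at each site every hour.")

for sentence in data_collection_sentences(methods):
    print(sentence)

DataSeer’s production system presumably relies on trained language models rather than a hand-written verb list, but the goal is the same: turn the Methods section into a checklist of the datasets and code that ought to appear in the archive.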


That NLP solution is now the basis for DataSeer, which has been integrated into the editorial workflow at The American Naturalist to help their Editors promote open data and open code. We processed our first article in July 2021 and have looked at over 100 articles in the intervening months. It’s been fascinating. The American Naturalist publishes a wide range of articles: some are entirely theoretical and make no use of empirically collected data, while others combine pre-existing datasets and novel analyses. Other articles collect and analyze entirely new datasets. 


We find that a few issues are common. First, many empirical datasets are vague about where and when samples were collected – a problem also noted by Pope et al. for Molecular Ecology. Ideally, a reader should be able to find out where and when a sample was taken. One immediate reaction might be ‘what possible use is that level of detail?’, to which the reasonable answer is ‘we have no idea’ – yet. However, by not recording detailed collection metadata, we deny our future selves (or our future colleagues) the opportunity to test as-yet unimagined hypotheses. At the simplest level, providing detailed metadata is the only way to allow anyone to re-sample at exactly the same spot, which is important for myriad reasons.
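As a concrete, entirely hypothetical illustration, collection metadata can be as simple as a few extra columns in the raw data file:

sample_id,site,latitude,longitude,collection_date,collector
fish_001,Lake A,50.2241,-125.5543,2019-06-14,ABC
fish_002,Lake A,50.2241,-125.5543,2019-06-14,ABC
fish_073,Lake B,50.0612,-125.5110,2019-06-21,ABC

Data documented at this level let a future reader return to the same coordinates, or link the samples to records that did not even exist when the data were collected.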


Second, authors working with existing datasets should make it clear which individuals and variables were included in their analyses. Two efficient ways to do this include (i) providing the code that accesses and then parses the existing data, or (ii) providing the subset of the dataset that was re-analyzed as a read-in file for the analysis code. This is particularly important when re-using data from sources that are continually updated (e.g. WorldClim), as without these details readers have no hope of reproducing the results in the article. A corollary is that when authors collect data that they do not use in a particular paper, at least some version of their data archive should be pared down to only the subset of data actually used to generate reported results, to avoid burdening end-users with extraneous information not pertinent to the study’s results.
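As a minimal sketch of option (ii), assuming a hypothetical extract from a continually updated source, the idea is simply to archive the exact rows and columns that the analysis used, alongside the short script that produced them:

import pandas as pd

# Hypothetical snapshot of a continually updated source, with the download
# date recorded in the file name.
full = pd.read_csv("worldclim_extract_2021-03-01.csv")

# Keep only the sites and variables analysed in the paper.
subset = full.loc[
    full["site_id"].isin(["A1", "A2", "B7"]),
    ["site_id", "bio1_mean_temp", "bio12_annual_precip"],
]

# Archive this file, and this script, with the rest of the repository.
subset.to_csv("data_used_in_analyses.csv", index=False)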


Finally, some manuscripts that focus on theory give the impression that the authors worked only with pencil and paper before transcribing their work into Word or LaTeX; the existence of a Mathematica or MATLAB notebook where they actually did the work is never mentioned. For readers to fully understand the research, they also need to see and interact with those notebooks.


DataSeer is privileged to be part of this experiment in open science at The American Naturalist, and we will be keeping a close eye on how the new approach affects the reproducibility and stature of their published articles. 



Part 4: Future Needs

Our experiences so far suggest some ways forward for both authors and journals.


For authors, the first and essential step is to make a plan. Many funding agencies now require such plans in grant applications, and these should specify when (now) and where you will set up the repository for the study. There are many free online repository services that you can use during project development and manuscript writing. Your plan should also cover who will manage the repository, its file structure and versioning, and what to include (everything possible, at least at the outset). The second, and we hope obvious, step is to stick to that plan. This is hard, because we all want to get on with data analysis, writing, and publishing, but all of those tasks, and especially the provision of a useful repository on manuscript submission, are made easier by sticking to your plan and keeping your repository up to date and well documented. There is no doubt that a complete, well-organized, and well-documented data repository can be a lot of work, and granting agencies should recognize this by covering the additional costs.
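As a purely illustrative example (the layout and file names are hypothetical, not a journal requirement), a repository set up at the start of a project might look something like this:

project_name/
  README.md            - what the project is, who collected the data, and how the files relate
  data/
    raw/               - unmodified field or lab data, never edited by hand
    processed/         - derived files, each produced by a script in code/
  code/
    01_clean_data.py   - reads data/raw/, writes data/processed/
    02_analyses.py     - fits the models and produces the figures and tables
  metadata/
    variables.csv      - one row per variable: name, units, and a plain-language description
  results/

Of the elements above, the README and the variable descriptions are the pieces most often missing from the repositories we have evaluated, and they are also the cheapest to maintain if written as the data come in.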


For journals, we think (of course) that having a data editor is a good idea, essential even. Editors and reviewers simply do not all have the time (or the inclination and expertise) to vet data repositories. Our experience so far, and especially with the use of DataSeer, is that a repository for an accepted manuscript can be fully evaluated in about 45 minutes. That’s not much time compared to the job that most reviewers and editors perform. As authors become more in tune with this process, repositories (we hope) will become much more complete and useful, moving most repositories into category 5 on the graph above, and taking much less time to evaluate. 


At The American Naturalist, data editors do not see repositories until a manuscript has been accepted by a handling editor, or has been resubmitted following a request for minor revisions where one might reasonably expect final acceptance. This policy limits the burden on the Data Editors to roughly 20-25% of manuscript submissions. In practice, it might be more efficient, and more useful to authors, if repositories were evaluated on first submission and papers were desk-rejected before review if their repositories are not up to scratch. Such a change in workflow is not yet practical for us, as it would require evaluating many more repositories and might deter some authors from submitting to the journal. We expect, though, that within a few years papers submitted to The American Naturalist will all have high-quality repositories, especially as more journals take seriously the need for useful data and code.


Conversations are ongoing with other journals to encourage wider adoption of data editing, but at present The American Naturalist is (we are proud to say) the only journal we know of that actually checks the content of data archives against the manuscript’s descriptions to ensure completeness. Ecology Letters has implemented a slightly different set of procedures but also typically checks archives. We have also been in conversation with Dryad, which conducts some quality-control checks of its own on repositories. As a result, some of what the journal’s Data Editors do duplicates Dryad’s efforts (one might say, it helps authors meet Dryad’s expectations in advance). The duplication persists, however, because Dryad’s quality-control steps do not consider what should be in the archive based on the manuscript text, and because Dryad’s evaluators do not necessarily have the expertise to evaluate data from every field of research and every form of code. Also, as noted above, we believe it is important that authors be asked to bring their repositories up to our standards before acceptance rather than after, as Dryad does.



Part 5: Conclusions

A decade after data sharing began in evolution and ecology, it is now widely accepted practice. Even the laggards (the biomedical sciences) are catching on, despite muttering about “data parasites”. A few years ago, The American Naturalist began “strongly encouraging” code archiving as well, and we quickly found that the vast majority of authors were doing so voluntarily. Since The American Naturalist began requiring code (if any is used) in January 2022, we have yet to hear a complaint or a request for an exception. The biological research community seems to have accepted the many benefits of data and code archiving. The final frontier in this move towards better reproducibility and open science is quality control to ensure adherence to the standards we say we require. The first decade of data archiving rules was effectively an honor system, and that honor system produced a shocking number of incomplete or unclear archives, so a bit more supervision is, we believe, warranted. Thanks to the streamlining provided by DataSeer, our Data Editors can check a given paper’s archive in less than an hour, including writing the report to the Editor and authors that identifies any steps needed. Where data archives are seriously deficient, we do delay acceptance. The end result, we hope, is that readers will have more confidence in the results reported in our journal, and thus be more likely to read, trust, and cite its papers.


We recognize that adherence to these rules takes time, effort, and training. The time and effort are part of the publication process; in our experience, if authors prepare their README files, organize their data well, and annotate their code as they collect and analyze the data, the additional time required is minimal. Graduate training programs should (we think) make time to train students in data archiving and code annotation, as a key part of ethical science education. Meanwhile, experienced Data Editors may be willing to advise authors struggling with how to comply with the ever-changing (and, we hope, improving) open science landscape.