March 2, 2022

Ensuring data and code archive quality: why and how?

This post was co-written by Dan Bolnick (Editor, The American Naturalist), Tim Vines (DataSeer.ai), and Bob Montgomerie (AmNat Data Editor). It addresses the value of, and flaws in, current open and reproducible science practices, and we hope it is of interest to authors, other editors, and students considering publishing in the biological sciences.


Part 1 articulates the value of data and code archiving, and can be skipped if you are already a convert. Part 2 explains what we see as weaknesses with the current system. Part 3 explains the new Data Editor role adopted by The American Naturalist and the associated workflow including DataSeer.ai's role. Part 4 addresses some remaining issues for consideration, and Part 5 provides some parting thoughts.


Part 1: The Value of Data and Code Archiving

In 2010, a group of journal editors announced that some of the leading journals in evolutionary biology would adopt a new policy requiring authors to archive all of the raw data needed to obtain published results (Whitlock et al. 2010, The American Naturalist). Previously, some kinds of genetic data (e.g., DNA sequences) were often required to be archived in databases such as GenBank. However, researchers in ecology and evolution often work with system-specific types of data, so attempts to build standardized databases typically failed. Sometimes data were provided in tables within a paper, or as online supplements once journals moved much of their material online in the early 2000s. But the most common approach was simply to state in the article that data were “available upon request”.

Relying on authors to provide data on request, however, was largely ineffective: authors too often either ignored requests for data or responded that they no longer had access to the relevant data (“it was on a computer whose hard drive crashed years ago”). As time passes, it becomes more and more likely that those datasets are lost forever (see Vines et al. 2014). As an example, in 2012 one of us emailed 120 authors asking for the raw data underlying effect size estimates used in a previous meta-analysis, and received only five replies. This kind of poor response has been studied more rigorously (see, for instance, Savage and Vickers 2009, PLoS One, and Vines et al. 2014) and is a widespread problem. The result has been calls to adopt more rigorous systems (e.g., Langille 2018, Microbiome).

The Editors of some evolution journals sought to change this by requiring data sharing. The American Naturalist adopted this requirement in January 2011 for all new manuscripts, and reached out to authors of previous papers to encourage them to archive their data retroactively. Notably, the biomedical sciences are only now following this move, in what has been dubbed a “seismic mandate” (Nature News, 2021).

Although the initial response from some members of the research community was hostile, the benefits of archiving data are by now widely recognized:

  1. Archives are useful as a back-up for your own data, in a secure location. Once archived, you will have access to your own data in perpetuity, safe from the vagaries of damaged hard drives or stolen laptops. Most data repositories also use version control to track changes to files so you can go back and retrieve earlier versions if necessary.


  2. Archived data can be re-used for new studies, providing a cost-effective framework for building upon or consolidating past results. Meta-analyses are an increasingly popular tool for combining discipline-wide findings from a large number of studies. Traditionally, meta-analyses have been based on an aggregated analysis of summary statistics from many published studies, but far more effective approaches are possible when the raw data can be analyzed directly. Re-analyses by third parties do run the risk of (i) scooping an author who plans further publications from a dataset, or (ii) committing errors based on a poor understanding of the biology of the system or experimental details. In practice, we are aware of few, if any, examples of (i) in evolution or ecology, but some clear examples of (ii). There has been some debate in recent years about whether such re-use represents “research parasitism”, but as long as the findings of re-analyses are correct and provide added value, everyone wins.


  3. Data archives are citable products one should put on one’s CV. In practice, few people have done this in the past, and data archives have rarely been cited (see a small-scale personal analysis of this here, by one of us, or this analysis by Vannan et al). Nonetheless, there is increasing recognition that data archives are an important work product that might influence decisions about hiring, grant success, and promotion (Cousijn et al 2018).


  4. Archives provide a valuable tool for teaching. Students learning statistics and the graphical presentation of data can benefit from downloading data and code underlying published results, then trying to replicate or improve upon published figures or statistical analyses. By simultaneously examining the data and code, students can learn how to conduct certain kinds of analyses or generate different kinds of graphs, which they can then apply to their own scientific work. Such learner-driven downloads may represent a large fraction of downloads from data repositories, and are surprisingly common (for data on download rates in a small case study, see this ecoevoevoeco blog). This may be especially valuable for graduate students starting work in a new lab, where they can access data and code from their professors or lab alumni to learn more about the biology of the research system they are beginning to work on. From the surprisingly large number of downloads that archives seem to get, we’d suggest that download numbers are better than citations as a reflection of an archive’s value (with the caveat that archives containing suspect data may also be downloaded extensively).


  5. Reported results can be re-evaluated by other scientists who can re-run analyses. This helps to improve confidence in the veracity of published claims. Readers should therefore have more trust in, and thus be more inclined to cite, journals and papers with data archives that allow reproducible analyses. At times this can lead to discoveries of errors. For instance, The American Naturalist recently published a retraction (Gilbert and Weiss 2021) of a paper (Start et al 2019) whose core finding was based on an incorrectly conducted statistical analysis. The retraction was made possible by a re-analysis by one of the co-authors, who gained access to the original data posted on Dryad by the lead author. We believe that voluntary self-retractions of honest mistakes should be lauded as a healthy correction process.


  6. Data archives provide key information for forensic evaluations in cases of alleged scientific fraud, misconduct, or sloppy practices. For instance, a prominent animal behavior researcher has had numerous articles retracted in 2020-2021 after biologically implausible patterns were found in his data archives. Forensic evaluations of archived data identified blocks of numbers that were copy-and-pasted within files. In a separate case, data were found to be similar between two supposedly unrelated publications, resulting in retractions (McCauly and Gilbert 2021). Such discoveries would not have been possible without archived data. This point is highlighted by the experiences of other journals that lacked data archiving policies but became concerned by strings of retractions at other journals. The journal Ecology, for example, lacked an archive policy until relatively recently and has therefore been unable to evaluate its papers authored by an author with multiple retractions elsewhere. We mention this not to criticize that journal, but rather to illustrate the value of archiving.


  7. The datasets obtained during any research are, in a sense, the property of the public whose taxes paid for the research. Thus, there is an ethical and political argument for making all raw data available. The failure to make data available has, in fact, been weaponized by politicians seeking to undermine environmental rules. In recent years, moves by a certain US President’s administration aimed to block any policies based on scientific results that did not follow open data practices. Under such a rule, most research in toxicology or conservation biology from previous decades would be tossed out, undermining environmental protections. Data archiving protects your science against such political maneuvers.

Following the adoption of data archiving rules, the number of data repositories at major host sites has skyrocketed (see the figure from Evans et al. 2016, PLoS Biology, reproduced below).

The adoption of data archiving policies was not without resistance. Authors quite reasonably have a sense of ownership of their data (but see point 7 above) and worry about being scooped on further analyses of their own data. To allay that fear, journal editors can accommodate reasonable requests in this regard: long-term datasets that might yield multiple papers can be embargoed for a reasonable period of time, but must nevertheless be archived at the time of publication to ensure that the data become visible to the public when the embargo ends. These concerns seem to have largely evaporated as the community has habituated to the rules. For instance, in nearly five years as Editor of AmNat, Dan Bolnick received only a couple of requests for waivers or embargoes. A few requests were clearly from authors not wanting to bother archiving their data, but who readily provided data when their excuses were questioned. Others had valid reasons, such as being bound by confidentiality rules about the precise geographic locations of threatened species. At other journals that publish data on human subjects, individual confidentiality rules or cultural group requirements may legitimately bar the sharing of some forms of data.


Part 2: Problems with the Current State of Data Sharing

To realize the benefits outlined above, data archives must be complete, readable, and clear. Unfortunately, many are not. This problem first came to the attention of The American Naturalist’s editor (Dan Bolnick) with the retractions and corrections of Jonathan Pruitt’s papers. Most of the problems with those papers involved implausible patterns in the available data, but the investigating committee also found that some key data were simply missing. Shortly after this case was made public, a reader reached out to the AmNat Editor to request that a different author provide missing information for an incomplete archive. The Editor contacted the author, who (after several contacts) found the file but then claimed to be unable to figure out how to update their Dryad repository. This was eventually resolved, but entailed a year of back-and-forth emails. On the heels of this case, an anonymous whistle-blower contacted the Editor to point out missing data files for several papers by author Denon Start. When contacted, Start explained that the data were no longer available, having been lost when a laptop was stolen, even though those data should have been archived before publication, as required by journal policy. The author provided a court case number; the Editor confirmed that the case was legitimate, involved stolen material, and included Start among the victims, but could not confirm the specific involvement of a laptop, nor whether the laptop contained the only copy of the data (i.e., there were no backups; see the RetractionWatch article). After consulting with Dryad and the Committee on Publication Ethics (COPE), it was clear that data sharing rules are recent enough that COPE does not have formal recommendations for handling incomplete archives (such guidelines are in preparation now). Following recommendations from Dryad and COPE, the journal published a set of Editorial Expressions of Concern for Start’s missing data files (e.g., Bolnick 2021).

Missing data are not the only issue. In a recent survey of repositories for papers published in The American Naturalist in 2020, Bob Montgomerie found many cases where authors posted only summary statistics (means, standard errors) rather than the original data used to generate those results. Only a small fraction of repositories was complete, well documented, and useful for re-creating all of the published results (see the AmNat blog post by Montgomerie). Typical problems with data repositories included missing datasets, summary rather than raw data, no information about variable names in those files (many of which were indecipherable), and, in a few cases, no data repository at all (eventually fixed by contacting the authors and Dryad; see the AmNat blog post by Montgomerie).

These observations led to a growing realization that requiring data archiving is not enough. We cannot simply trust that authors are creating complete and clear archives, because too often this is not what happens. It has become clear that we also need to proactively check archives to confirm that they meet our minimum standards. Some data archive checks do occur within the Dryad office, but these occur after a paper is accepted, whereas we believe it is crucial that acceptance be conditional on confirmation that archives are complete and usable. Checking before acceptance ensures that the journal retains leverage to encourage authors to adopt acceptable practices, and avoids the difficulty of authors who simply do not respond, claim data are missing, or take years to act because they have no incentive to make corrections once their paper is published. Moreover, Dryad does not currently check data archive contents against the text of the corresponding manuscript to confirm that all data files and variables represented in the published article are present in the archive.


Part 3: New Procedures to Check Data Archive Quality


In the summer of 2021, The American Naturalist began having a small team of data editors, led by Bob Montgomerie, look at the data repositories for all papers accepted for publication. Once a paper has been accepted by the handling editor, Montgomerie uses our Editorial Manager system to download the manuscript and its data repository and immediately sends the manuscript to DataSeer for evaluation. DataSeer uses a machine-learning process to read the manuscript and identify all mentions of data and code, providing within 48 hours a summary that matches those mentions to the data and code in the repository. One of the data editors then writes a summary of the findings from DataSeer and makes recommendations to the author(s) for making their repository complete and useful. The manuscript then goes to the editor in chief (Bolnick) for a final decision. In cases where the repository is seriously deficient, the data editor might ask to see the repository again before the manuscript is finally accepted for publication.


One thing that authors should be aware of: the Data Editors are not there to judge the elegance of your code. They do check that the code actually runs, and they may encourage you to make clearer annotations. If they incidentally identify errors in the code (not their assigned task), they will let you know, sparing you from having to correct a mistake after publication. But authors should not feel bashful about the ‘style’ of their code. Code is a tool, and if the tool gets the job done correctly, it is a useful tool. And if you don’t trust your code enough for someone to see it, then perhaps it is time to ask a collaborator to double-check it. Having a coding mistake identified after publication is far more embarrassing (one of us had to go through such a retraction due to a coding error).


By the end of 2021, the data editors had evaluated the data repositories of 63 accepted manuscripts, all of which were submitted to the journal before a data editor had been appointed. Thus, the analysis of those repositories provides a glimpse at what repositories looked like before authors knew that anyone was going to look at them very closely, at least not before their paper was published (as a reminder, repositories do get more views & downloads than you might think).


For each of the 63 repositories evaluated in 2021, Montgomerie gave a subjective rating on a scale of 1-5, where 5 is complete and informative, 3 is moderately useful but missing some data files or some raw data, and 1 indicates a virtually incomprehensible and thus almost useless set of files. Expectations for a high-quality repository are not especially stringent, and are outlined in a recent blog post for authors. Here is a summary of his ratings of the data repositories for those manuscripts:


[Figure: distribution of repository quality ratings (1-5) for the 63 data repositories evaluated in 2021.]
Clearly, the data repositories evaluated in 2021 were generally incomplete, and few were very good. The most frequent problems were (i) an absent or uninformative README file, (ii) missing raw data files, (iii) code that was absent, not annotated, or could not be run on a clean computer, and (iv) broken links to repositories. Note that authors were not required to provide code at the time those manuscripts were submitted, so the provision of code was not a factor in the ratings shown above. Moreover, we did not set our standards particularly high, and our reports offered authors a few, usually relatively simple, suggestions for improving their repositories. We felt that this soft approach was desirable at the outset, until authors get used to the new (and old) data requirements and the newly adopted (January 2022) requirement that authors archive code (if code is used) as well as data.




The perennial problem with promoting open data and open code is knowing exactly what needs to be done for each article: it’s relatively easy to write a broad policy (like the Joint Data Archiving Policy, JDAP), but it’s much, much harder to work out how that policy applies to each manuscript.


Authors are obviously best placed to know which datasets they collected and what code they wrote to analyze their data, but by the time the manuscript is accepted for publication the work itself is months or years in the past and their focus is on writing up their next manuscript. As noted above, the incentives for sharing data and code are diffuse and take time to accrue, while the benefits of publishing just one more paper are more immediate. It is therefore difficult to persuade authors to spontaneously compile a list of their datasets and code objects and then put these all onto public repositories. Our experience has been that setting up data and code repositories, along with a useful README file, is best begun at the outset of a project rather than waiting until a final manuscript is ready to submit.


Back when Tim Vines was the Managing Editor for Molecular Ecology, he read each manuscript and compiled a list of all the data for each article that was accepted for publication. One day in 2014, it occurred to him that this 20-30 minute job could be completed in seconds by a machine: Natural Language Processing (NLP) excels at picking out data collection sentences (e.g. “We measured snout-vent length with Vernier calipers”) and determining what kind of data was collected. With that information in hand, authors can then be led through the data and code sharing process for their articles without intensive attention from an Editor.


That NLP solution is now the basis for DataSeer, which has been integrated into the editorial workflow at The American Naturalist to help their Editors promote open data and open code. We processed our first article in July 2021 and have looked at over 100 articles in the intervening months. It’s been fascinating. The American Naturalist publishes a wide range of articles: some are entirely theoretical and make no use of empirically collected data, while others combine pre-existing datasets and novel analyses. Other articles collect and analyze entirely new datasets. 


We find that a few issues are common. First, many empirical datasets are vague about where and when samples were collected – a problem also noted by Pope et al for Molecular Ecology. Ideally, a reader should be able to find out where and when a sample was taken. One immediate reaction might be ‘what possible use is that level of detail?’, to which the reasonable answer is ‘we have no idea’ – yet. However, by not recording detailed collection metadata, we deny our future selves (or our future colleagues) the opportunity to test as-yet unimagined hypotheses. At the simplest level, providing detailed metadata is the only way to allow anyone to re-sample at exactly the same spot, which is important for myriad reasons.


Second, authors working with existing datasets should make it clear which individuals and variables were included in their analyses. Two efficient ways to do this include (i) providing the code that accesses and then parses the existing data, or (ii) providing the subset of the dataset that was re-analyzed as a read-in file for the analysis code. This is particularly important when re-using data from sources that are continually updated (e.g. WorldClim), as without these details readers have no hope of reproducing the results in the article. A corollary is that when authors collect data that they do not use in a particular paper, at least some version of their data archive should be pared down to only the subset of data actually used to generate reported results, to avoid burdening end-users with extraneous information not pertinent to the study’s results.
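As a minimal sketch of option (ii), with an entirely hypothetical file name and hypothetical variable names, the exact subset that was analyzed can be written out in R and archived alongside the analysis script:

    # Sketch: archive the exact subset of an existing dataset that was analyzed.
    # File and column names are hypothetical placeholders.
    library(dplyr)

    full_data <- read.csv("data/full_survey_data.csv")

    analysed_subset <- full_data %>%
      filter(year >= 2015, !is.na(wing_length)) %>%   # records used in the paper
      select(individual_id, site, year, wing_length)  # variables used in the paper

    # This pared-down file is what gets archived and read in by the analysis code
    write.csv(analysed_subset, "data/analysed_subset.csv", row.names = FALSE)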


Finally, some manuscripts that focus on theory give the impression that the authors only worked with pencil and paper before transcribing their work into Word or LaTeX – the existence of a Mathematica or Matlab notebook where they actually did the work is never mentioned. For readers to fully understand the research, they also need to see and interact with those notebooks.


DataSeer is privileged to be part of this experiment in open science at The American Naturalist, and we will be keeping a close eye on how the new approach affects the reproducibility and stature of their published articles. 



Part 4: Future Needs

Our experiences so far suggest some ways forward for both authors and journals.


For authors, the first and essential step is to make a plan. Many agencies now require data management plans in grant applications, and these should include when (now) and where to set up your repository for the study. There are many free online sites for data repositories that you can use during project development and manuscript writing. Your plan should also include details on who will manage the repository, file structure, and versioning, and what to include (everything possible, at least at the outset). The second, and we hope obvious, step is to stick to that plan. This is the hard one, as we all want to get on with data analysis, writing, and publishing, but all of those tasks, and especially the provision of a useful repository on manuscript submission, are made easier by sticking to your plan and keeping your repository up-to-date and well documented. There is no doubt that a complete, well-organized, and well-documented data repository can be a lot of work, and granting agencies should recognize this by covering the additional costs.


For journals, we think (of course) that having a data editor is a good idea, essential even. Editors and reviewers simply do not all have the time (or the inclination and expertise) to vet data repositories. Our experience so far, especially with the use of DataSeer, is that a repository for an accepted manuscript can be fully evaluated in about 45 minutes. That is not much time compared to the job that most reviewers and editors perform. As authors become more attuned to this process, repositories will (we hope) become much more complete and useful, moving most repositories into category 5 on the graph above and taking much less time to evaluate.


At The American Naturalist, data editors do not see repositories until a manuscript has been accepted by a handling editor, or is resubmitted following a request for minor revisions where one might reasonably expect final acceptance. This policy limits the burden on Data Editors to approximately 20-25% of manuscript submissions. In practice it might be more efficient, and more useful to authors, if repositories were evaluated on first submission, with papers desk-rejected before review if repositories are not up to scratch. Such a change in workflow is not yet practical for us, as it would require a large increase in the number of repositories to evaluate, and might deter some authors from submitting to the journal. We expect, though, that within a few years papers submitted to The American Naturalist will all have high-quality repositories, especially as more journals take seriously the need for useful data and code.


Conversations are ongoing with other journals to encourage wider adoption of data editing practices, but at present The American Naturalist is (we are proud to say) the only journal we know of that actually checks the contents of data archives against manuscript descriptions to ensure completeness. Ecology Letters has implemented a slightly different set of procedures but also typically checks archives. We have also been in conversation with Dryad, which conducts some quality control checks on its repositories. As a result, some of what the journal’s Data Editors do duplicates Dryad’s efforts (one might say, helping authors meet Dryad expectations in advance). The reason for this duplication, however, is that Dryad’s quality control steps do not consider what should be in the archive based on the manuscript text, nor do the evaluators at Dryad necessarily have the expertise to evaluate data from all fields of research and all forms of code. Also, as noted above, we believe it is important that authors be asked to bring their repositories up to our standards before acceptance, rather than after, as Dryad does.



Part 5: Conclusions

A decade after data sharing began in evolution and ecology, it is now widely accepted practice. Even those laggards—the biomedical sciences—are catching on despite muttering about “data parasites”. A few years ago, The American Naturalist began “strongly encouraging” code archiving as well, and we quickly found that the vast majority of authors were voluntarily complying. Since The American Naturalist began requiring code (if any is used) in January 2022, we have yet to hear a complaint or a request for an exception. The biological research community seems to have accepted the many benefits of data and code archiving. The final frontier in this move towards better reproducibility and open science is quality control to ensure adherence to the standards we say we require. The first decade of data archiving rules was effectively an honor system. That honor system has resulted in a shocking number of incomplete or unclear archives, so a bit more supervision is, we believe, warranted. Thanks to the streamlining provided by DataSeer, our Data Editors can efficiently check a given paper’s archive in less than an hour, including writing the report to the Editor and authors identifying the steps needed. Where data archives are seriously deficient, we do delay acceptance. The end result, we hope, is that readers will have more confidence in the results reported in our journal, and thus be more likely to read, trust, and cite its papers.


We recognize that adherence to these rules takes time, effort, and training. The time and effort are part of the publication process. In our experience, if authors prepare their README files, organize their data well, and annotate their code as they obtain and analyze their data, then the additional time required is minimal. Graduate training programs should (we think) take the time to train students in data archiving and code annotation practices, as a key part of ethical science education. Meanwhile, experienced Data Editors may be willing to advise authors struggling with how to comply with the ever-changing (and hopefully improving) open science landscape.




December 1, 2021

Guidelines for archiving Code with Data

The following is a cross-post from the Editor's blog of The American Naturalist, developed with input from various volunteers (credited below).

A CHECKLIST FOR REPRODUCIBLE ARCHIVING OF DATA AND CODE IN ECOLOGY, EVOLUTION, AND BEHAVIOR

 

December 1, 2021

Daniel I. Bolnick (daniel.bolnick@uconn.edu), Roger Schürch (rschurch@vt.edu), Daniel Vedder (daniel.vedder@idiv.de), Max Reuter (m.reuter@ucl.ac.uk), Leron Perez (leron@stanford.edu), Robert Montgomerie (mont@queensu.ca)




Starting January 1, 2022, The American Naturalist will require that any analysis and simulation code (R scripts, Matlab scripts, Mathematica notebooks) used to generate reported results be archived in a public repository (we specifically prefer Dryad; see below). This has been our recommendation for a couple of years, and author compliance has been high. As part of our commitment to Open and Reproducible Science, we are now making this a requirement. The following document, developed with input from a variety of volunteers, is intended to be a relatively basic guide to help authors comply with this new requirement.

 

 

RATIONALE

The fundamental question you should ask yourself is, “If a reader downloads my data and code, will my scripts be comprehensible, and will they run to completion and yield the same results on their computer?” Any computer code used to generate scientific results should be readily usable by reviewers or readers. Sharing this information is vital because it promotes: (i) the appropriate interpretation of results, (ii) checking of the validity of analyses and conclusions, (iii) future data synthesis, (iv) replication, and (v) the use of code as a teaching tool for anyone learning to do such analyses themselves. Shared code provides greater confidence in results.



The recommendations below are designed to help authors conduct a final check when finishing a research project, before submitting a manuscript for publication. In our experience, you will find it easier to build reusable code and data if you adhere to these recommendations from the start of your research project. We list requirements and recommendations separately in each category below.

 

1. CLEAR DOCUMENTATION

 

  A great template is available here: https://github.com/gchure/reproducible_research

 

REQUIRED:

 

 Prepare a single README file with important information about your data repository as a whole (code and data files). Text (.txt or .rtf) and Markdown (.md) README files are readable by a wider variety of free and open source software tools, so have greater longevity. The README file should simply be called README.txt (or .rtf or .md). That file should contain, in the following order:

Citation to the publication associated with the datasets and code 

Author names, contact details

A brief summary of what the study is about 

 A statement of who was responsible for collecting the data and writing the code

List of all folders and files by name, and a brief description of their contents. For each data file, list all variables (e.g., columns) with a clear description of each variable (e.g., units)

Information about versions of packages and software used (including operating system) and dependencies (if these are not installed by the script itself). An easy way to get this information is to use sessionInfo() in R, or 'pip list --format freeze' in Python.
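For example, one lightweight way to capture this information in R (a sketch, not a requirement; the output file name is arbitrary) is to write the output of sessionInfo() to a text file whose contents can be pasted into, or archived alongside, the README:

    # Record the R version, operating system, and loaded package versions
    writeLines(capture.output(sessionInfo()), "session_info.txt")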

 

RECOMMENDED (for inclusion in the README file):

  Provide workflow instructions for users to run the software (e.g., explain the project workflow, and any configuration parameters of your software)

 Use informative names for folders and files (e.g., “code”, “data”, “outputs”)

  Provide license information, such as Creative Commons open source license language granting readers the right to reuse code. For more information on how to choose and write a license, see choosealicense.com. This is not necessary for DRYAD repositories, as you choose licensing standards when submitting your files.

 If applicable, list funding sources used to generate the archived data, and include information about permits (collection, animal care, human research). This is not necessary for DRYAD repositories, as this information is also recorded when submitting your files.

  Link to Protocols.io or equivalent methods repositories where applicable




2. CLEAN CODE


REQUIRED:

  Scripts should start by loading required packages, then importing raw data from files archived in your data repository.

  Use relative paths to files and folders (e.g. avoid setwd() with an absolute path in R), so other users can replicate your data input steps on their own computers. 

 Annotate your code with comments indicating the purpose of each set of commands (i.e., “why?”). If the functioning of the code (i.e., “how”) is unclear, strongly consider re-writing it to be clearer/simpler. In-line comments can provide specific details about a particular command.

 Annotate code to indicate how commands correspond to figure numbers, table numbers, or subheadings of results within the manuscript.

  If you are adapting other researchers’ published code for your own purposes, acknowledge and cite the sources you are using. Likewise, cite the authors of packages that you use in your published article.
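As an illustrative sketch of these requirements (the file names, variables, and analyses below are hypothetical, not a template you must follow), the opening of an R analysis script might look like this:

    ## Analysis script for the associated manuscript; see README.txt for
    ## descriptions of all files and variables.

    # Load required packages first
    library(ggplot2)

    # Import raw data using relative paths within the repository
    morph <- read.csv("data/morphology.csv")

    ## ---- Figure 1: body size by site ---------------------------------------
    # Why: visualize among-site variation in body size reported in the Results
    fig1 <- ggplot(morph, aes(x = site, y = body_mass_g)) +
      geom_boxplot()
    ggsave("outputs/figure_1.pdf", fig1)

    ## ---- Table 1: linear model of body size by site -------------------------
    # Why: test for site differences (Results, "Body size variation")
    model1 <- lm(body_mass_g ~ site, data = morph)
    summary(model1)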

 

RECOMMENDED:

  Test code before uploading to your repository, ideally on a pristine machine without any packages installed, but at least using a new session.

  Use informative names for input files, variables, and functions (and describe them in the README file).

  Any data manipulations (merging, sorting, transforming, filtering) should be done in your script, for fully transparent documentation of any changes to the data.

  Organise your code by splitting it into logical sections, such as importing and cleaning data, transformations, analysis, and graphics and tables. Sections can be separate script files run in order (as explained in your README), blocks of code within one script separated by clear breaks (e.g., comment lines, #--------------), or a series of function calls (which can facilitate reuse of code).

  Label code sections with headers that match the figure number, table number, or text subheading of the paper.

  Omit extraneous code not used for generating the results of your publication, or place any such code in a Coda at the end of your script.

  Where useful, save and deposit intermediate steps as their own files. In particular, if your scripts include computationally intensive steps, it can be helpful to provide their output as an extra file, giving readers an alternative entry point that does not require re-running those steps.

  If your code contains any stochastic process (e.g., random number generation, bootstrap re-sampling), set a random number seed at least once at the start of the script or, better, for each random sampling task. This will allow other users to reproduce your exact results (see the sketch following this list).

  Include clear error messages as annotations in your code that explain what might go wrong (e.g., if the user gave a text input where a numeric input was expected) and what the effect of the error or warning is.
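Continuing the hypothetical script sketched above, a minimal illustration of these last two recommendations (the seed value, column name, and messages are placeholders):

    # Stop with a clear message if the input is not what the script expects
    # (e.g., text entries in a column that should be numeric)
    if (!is.numeric(morph$body_mass_g)) {
      stop("body_mass_g must be numeric; check for text entries in the data file")
    }

    # Set a seed so stochastic steps (here, a bootstrap) yield reproducible results
    set.seed(42)
    boot_means <- replicate(1000, mean(sample(morph$body_mass_g, replace = TRUE)))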

 

3. CLEAN DATA

Checklist for preparing data to upload to DRYAD (or other repository)

 

Repository contents 

REQUIRED: 

  All data used to generate a published result should be included in the repository. For papers with multiple experiments or sets of observations, this may mean more than one data file.

 Save each file with a short, meaningful file name and extension (see DRYAD recommendations here).

 Prepare a README text file to accompany the data. Our recommendation is to put this in the single README file described above. For complex repositories where this README becomes unmanageably long, you may opt to create a set of separate README files, with one master README and more specific README files for code and for data, but our preference is a single README. The README file(s) should provide a brief overall description of each data file’s contents, and a list of all variable names with explanations (e.g., units). This should allow a new reader to understand what the entries in each column mean and relate this information to the Methods and Results of your paper. Alternatively, this may be a “Codebook” file in table format, with each variable as a row and columns providing variable names (as used in the file), descriptions (e.g., for axis labels), units, etc.

 Save the README file(s) as text (.txt) or Markdown (.md) files.

 Save all of the data files as comma-separated values (.csv) files. If your data are in Excel spreadsheets you are welcome to submit those as well (to be able to use colour coding and provide additional information, such as formulae), but each worksheet of data should also be saved as a separate .csv file.
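As one possible way to do this in R (a sketch; the readxl package and the workbook name are assumptions, not requirements), each worksheet can be exported to its own .csv file:

    # Export every worksheet of an Excel workbook as a separate .csv file
    library(readxl)

    workbook <- "data/field_measurements.xlsx"   # hypothetical file name
    for (sheet in excel_sheets(workbook)) {
      sheet_data <- read_excel(workbook, sheet = sheet)
      write.csv(sheet_data, file.path("data", paste0(sheet, ".csv")), row.names = FALSE)
    }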

RECOMMENDED:

 We recommend also archiving any digital material used to generate data (e.g., photos, sound recordings, videos), but this may require too much storage space for some repository sites. At a minimum, upload a few example files illustrating the nature of the material and the range of outcomes. We recognize that some projects entail too much raw material to archive all of the photos, videos, etc. in their original state.


Data file contents and formatting  

REQUIREMENTS: 

 Archived files should include the raw data that you used when you first began analyses, not group means or other summary statistics; for convenience, summary statistics can be provided in a separate file, or generated by code archived with the data.

 Identify each variable (column names) with a short name. Variable names should preferably be <10 characters long and not contain any spaces or special characters that could interfere with reading the data and running analysis code. Use an underscore (e.g., wing_length) or camel case (e.g., WingLength) to distinguish words if needed.

RECOMMENDATIONS: 

 Omit variables not analyzed in your code.

 A common data structure is to ensure that every observation is a row and every variable is a column.

 Each column should contain only one data type (e.g., do not mix numerical values and comments or categorical scores in a single column).

  Use “NA” or equivalent to indicate missing data (and specify what you use in the README file).
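Putting several of these recommendations together, here is a minimal sketch (all variable names and values are hypothetical) of a data frame formatted for archiving and written out as a .csv file:

    # One row per observation, short variable names, one data type per column,
    # and NA for missing values
    clean <- data.frame(
      indiv_id = c("F01", "F02", "F03"),
      site     = c("lake", "lake", "stream"),
      wing_mm  = c(52.1, NA, 49.8),        # numeric only; missing value coded as NA
      infected = c("yes", "no", "no")      # categorical scores in their own column
    )

    write.csv(clean, "data/wing_measurements.csv", row.names = FALSE, na = "NA")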



4. COMPLETING YOUR ARCHIVE FOR UPLOAD TO DRYAD

REQUIREMENTS:

 Upload your data and code to a curated, version-controlled repository (e.g., DRYAD, Zenodo). Your own GitHub account (or another privately or agency-controlled website) does not qualify as a public archive, because you control access and might take down the data at a later date.

 Provide all the metadata and information requested by the repository, even if this is optional and redundant with information contained in the README files. Metadata makes your archived material easier to find and understand.

 From the repository, get a private URL and provide this on submission of your manuscript so that editors and reviewers can access your archive before your data are made public.

 

RECOMMENDED:
 Prepare your data, code, and README files before or during manuscript preparation (analysis and writing).



 

5. FOR MORE INFORMATION


More detailed guides to reproducible code principles can be found here:

Documenting Python Code: A Complete Guide - https://realpython.com/documenting-python-code/


A guide to reproducible code in Ecology and Evolution, British Ecological Society: https://www.britishecologicalsociety.org/wp-content/uploads/2019/06/BES-Guide-Reproducible-Code-2019.pdf?utm_source=web&utm_medium=web&utm_campaign=better_science 


Dockta tool for building code repositories: https://github.com/stencila/dockta#readme


Version management for python projects:  https://python-poetry.org/


Principles of Software Development - an Introduction for Computational Scientists (https://doi.org/10.5281/zenodo.5721380), with an associated code inspection checklist (https://doi.org/10.5281/zenodo.5284377).




Style Guide for Data Files

 See the Google R style guide (https://google.github.io/styleguide/Rguide.html) and the Tidyverse style guide (https://style.tidyverse.org/syntax.html#object-names) for more information

Google style guide for Python: https://google.github.io/styleguide/pyguide.html

 

6. WHY DRYAD OR ZENODO?

 

The American Naturalist requests that authors use DRYAD or Zenodo for their archives when possible.

· DRYAD and Zenodo are curated, which means that there is some initial checking by DRYAD for completeness and consistency in both the data files and the metadata. DRYAD requires some compliance before allowing a submission.

· We find it much easier and more convenient to download repositories from DRYAD/Zenodo than to search the manuscript and supplementary materials for the files or repository.

· Files on DRYAD/Zenodo cannot be arbitrarily deleted or changed by authors or others after publication. DRYAD will allow changes if a good case can be made; all changes are documented and all versions are retained.

· DRYAD is free for AmNat authors, we have a good working relationship with them, and they take our suggestions for improvement seriously.

· Editors, reviewers, and authors will all become familiar with the workings of DRYAD.

· DRYAD and Zenodo are now linked: DRYAD is for data files and Zenodo for everything else (code, PDFs, etc.). You need only upload all files to DRYAD, and they will separate your archive into the appropriate parts. Your DRYAD repository will provide a link to the files on Zenodo, and vice versa.