Best practices for creating reusable data publications
So, you want to share your research data in Dryad, but are unsure where to start or what you 'should' share? Don't worry, it's not always clear how to craft a dataset with reusability in mind.
We want to help you share your research with the scientific community to increase its visibility and foster collaborations. The following guidelines will help make your Dryad datasets as Findable, Accessible, Interoperable, and Reusable (FAIR) as possible.
No time to dig into the details? Check out our quickstart guide to data sharing.
Gather all relevant data needed for reanalysis
- Consider all of the information someone would need to reuse your dataset and replicate the analyses in your publication. Gather and organize everything; this may include experimental method details, raw data files, organized data tables, scripts, data visualizations, and statistical output. There are often several levels of data processing involved in a project, and it is important to provide adequate detail. That said, don't hesitate to edit out superfluous or ambiguous content that would confuse others.
- Unprocessed and processed data: Providing both unprocessed and processed data can be valuable for re-analysis, assuming the data are of a reasonable size. Including unprocessed raw digital data from a recording instrument or database ensures that no details are lost, and any issues in the processing pipeline can be discovered and rectified. Processed data are cleaned, formatted, organized and ready for reuse by others.
- Code: Programming scripts communicate to others all of the steps in processing and analysis. Including them ensures that your results are reproducible by others. Informative comments throughout your code will help future users understand its logic.
- External resources: Links to associated data stored in other data repositories, code in software repositories, and associated publications can be included in "Related works".
Make sure your data are shareable
- All files submitted to Dryad must abide by the terms of the Creative Commons Zero (CC0 1.0). Under these terms, the author releases the data to the public domain.
- Review all files and ensure they conform to CC0 terms and are not covered by copyright claims or other terms-of-use. We cannot archive any files that contain licenses incompatible with CC0 (GNU GPL, MIT, CC-BY, etc.), but we can link to content in a dedicated software repository (GitHub, Zenodo, Bitbucket, CRAN, etc.).
- For more information on why Dryad uses CC0, and some dos and don'ts for
- Human subjects data must be properly anonymized and prepared under applicable legal and ethical guidelines (see tips for human subjects data).
- If you work with vulnerable or endangered species, it may be necessary to mask locations to prevent any further threat to the population. Please review our recommendations for responsibly sharing data collected from vulnerable species (see tips for endangered species data).
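As an illustration of location masking, the sketch below coarsens coordinates by rounding them to fewer decimal places (one decimal place is roughly 11 km in latitude). This is only one possible approach and is not a Dryad requirement; the function names and the `latitude`/`longitude` field names are hypothetical, and the appropriate precision depends on the species and the applicable guidelines.

```python
def generalize_coordinate(value, decimals=1):
    """Round a latitude or longitude to a fixed number of decimal places."""
    return round(value, decimals)


def mask_locations(records, decimals=1):
    """Return copies of the records with coarsened coordinates.

    Each record is assumed (for this sketch) to be a dict with
    'latitude' and 'longitude' keys; other keys are copied unchanged.
    """
    masked = []
    for record in records:
        copy = dict(record)  # leave the original record untouched
        copy["latitude"] = generalize_coordinate(record["latitude"], decimals)
        copy["longitude"] = generalize_coordinate(record["longitude"], decimals)
        masked.append(copy)
    return masked


# Hypothetical observation used only to show the call.
observations = [{"latitude": 41.87811, "longitude": -87.62979, "count": 3}]
masked_observations = mask_locations(observations, decimals=1)
```

If you mask locations this way, describe the procedure and the resulting precision in your README so that others can judge how it affects reuse.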
Make sure your data are accessible
- To maximize accessibility, reusability and preservability, share data in non-proprietary open formats when possible (see preferred formats). This ensures your data will be accessible by most people.
- Review files for errors. Common errors include missing data, misnamed files, mislabeled variables, incorrectly formatted values, and corrupted file archives. It may be helpful to run data validation tools before sharing. For example, if you are working with tabular datasets, a service like goodTables can identify missing data and data type formatting problems.
- File compression may be necessary to reduce large file sizes or to bundle directories of files. Files can be bundled together in compressed file archives (.zip, .7z, .tar.gz). If you have a large directory of files, and there is a logical way to split it into subdirectories and compress those, we encourage you to do so. We recommend that each archive not exceed 10 GB.
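The error review described above can be partly automated. The following is a minimal sketch, not a replacement for a dedicated validation service: it uses only Python's standard `csv` module to flag empty cells and rows whose field count differs from the header. The function name `find_csv_problems` is hypothetical, and treating only empty cells as missing is an assumption; adjust `missing_markers` to match your own missing-data codes.

```python
import csv
import io


def find_csv_problems(text, missing_markers=("",)):
    """Return a list of human-readable problems found in CSV text."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        return ["file is empty"]

    problems = []
    for row in reader:
        line_number = reader.line_num  # physical line where this row ended
        if len(row) != len(header):
            problems.append(
                f"line {line_number}: expected {len(header)} fields, found {len(row)}"
            )
            continue
        for name, value in zip(header, row):
            if value.strip() in missing_markers:
                problems.append(
                    f"line {line_number}: missing value in column '{name}'"
                )
    return problems


# Hypothetical table with one empty cell and one row with an extra field.
sample = "site,count\nA,3\nB,\nC,4,extra\n"
for problem in find_csv_problems(sample):
    print(problem)
```

To check a file on disk, read it as text first, for example with `pathlib.Path("your_file.csv").read_text(encoding="utf-8")`, and pass the result to the function.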
Preferred file formats
Dryad welcomes the submission of data in multiple formats to enable various reuse scenarios. For instance, Dryad's preferred format for tabular data is CSV; however, an Excel spreadsheet may optimize reuse in some cases. Thus, Dryad accepts more than just the preservation-friendly formats listed below.
- README files should be in plain text format
- Comma-separated values (CSV) for tabular data
- Semi-structured plain text formats for non-tabular data (e.g., protein sequences)
- Structured plain text
- Images: PDF, JPEG, PNG, TIFF, SVG
- Audio: FLAC, AIFF, WAV, MP3, OGG
- Video: AVI, MPEG, MP4
- Compressed file archive: TAR.GZ, 7Z, ZIP
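For the compressed archive formats, splitting a large directory into one archive per subdirectory can be sketched with Python's standard `tarfile` module. This is an illustrative sketch under stated assumptions: the function name `archive_subdirectories` is hypothetical, and the 10 GB recommendation is interpreted here as decimal gigabytes.

```python
import tarfile
from pathlib import Path

# Assumption: "10 GB" is read as 10 * 1000**3 bytes.
MAX_ARCHIVE_BYTES = 10 * 1000**3


def archive_subdirectories(parent, output_dir):
    """Create one .tar.gz per immediate subdirectory of `parent`.

    Returns a list of (archive_path, size_in_bytes, within_limit) tuples.
    `output_dir` should live outside `parent` so that the archives
    themselves are not swept into a later archive.
    """
    parent = Path(parent)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    results = []
    for subdir in sorted(p for p in parent.iterdir() if p.is_dir()):
        archive_path = output_dir / f"{subdir.name}.tar.gz"
        with tarfile.open(archive_path, "w:gz") as archive:
            # arcname keeps paths inside the archive relative to the subdirectory
            archive.add(subdir, arcname=subdir.name)
        size = archive_path.stat().st_size
        results.append((archive_path, size, size <= MAX_ARCHIVE_BYTES))
    return results
```

Any archive reported as exceeding the limit is a candidate for further splitting along whatever logical boundary your project has, such as site, year, or processing stage.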
Organize files in a logical schema
Name files and directories in a consistent and descriptive manner. Avoid vague and ambiguous filenames. Filenames should be concise, informative, and unique (see Stanford's best practices for file naming).
Avoid blank spaces and special characters (' '!@#$%^&") in filenames because they can be problematic for computers to interpret. Use a common letter-case pattern, such as kebab-case, CamelCase, or snake_case, because these are easily read by both machines and people.
Include the following information when naming files:
- Author surname
- Date of study
- Project name
- Type of data or analysis
- File extension (.csv, .txt, .R, .xls, .tar.gz, etc.)
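The naming advice above can be sketched as a small helper that assembles a filename from those components and checks that it contains no spaces or special characters. The function names, the underscore-and-hyphen layout, and the example values are all hypothetical; any consistent pattern works equally well.

```python
import re

# Letters, digits, dot, underscore, and hyphen only.
SAFE_NAME = re.compile(r"[A-Za-z0-9._-]+")


def build_filename(surname, date, project, data_type, extension):
    """Join the naming components with underscores.

    Runs of spaces or special characters inside a component are
    replaced by a single hyphen.
    """
    parts = [surname, date, project, data_type]
    cleaned = [
        re.sub(r"[^A-Za-z0-9]+", "-", part.strip()).strip("-") for part in parts
    ]
    return "_".join(cleaned) + "." + extension.lstrip(".")


def is_safe_filename(name):
    """Return True if the name has no spaces or special characters."""
    return SAFE_NAME.fullmatch(name) is not None


# Hypothetical values used only to show the call.
name = build_filename("Smith", "2020-06", "pond survey", "raw counts", ".csv")
print(name)  # Smith_2020-06_pond-survey_raw-counts.csv
```

Running `is_safe_filename` over every file before upload is a quick way to catch names such as `my data!.csv` that slipped through.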
Describe your dataset in a README file
Provide a clear and concise description of all relevant details about data collection, processing, and analysis in a README document. This will help others interpret and reanalyze your dataset.
Plain text README files are recommended; however, PDF is acceptable when formatting is important.
If you included a README in a compressed archive of files, please also upload it externally in the README section so that users are aware of the contents before downloading.
Cornell University's Research Data Management Service Group has created an excellent README template.
Details to include:
- Citation(s) of your published research derived from these data
- Citation(s) of associated datasets stored elsewhere (include URLs)
- Project name and executive summary
- Contact information regarding analyses
- Methods of data processing and analysis
- Describe details that may influence reuse or replication efforts
- De-identification procedures for sensitive human subjects or endangered species data
- Specialized software (include version and developer's web address) used for analyses and file compression. If proprietary, include open source alternatives.
- Description of file(s):
  - file/directory structure
  - type(s) of data included (categorical, time-series, human subjects, etc.)
  - relationship to the tables, figures, or sections within associated publication
  - key of definitions of variable names, column headings and row labels, data codes (including missing data), and measurement units
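One way to avoid forgetting any of these details is to start from a generated skeleton. The sketch below builds a plain-text README outline from the items listed above plus a per-file variable key to fill in. It is a hypothetical starting point, not an official Dryad or Cornell template; the function name, the `TODO` placeholders, and the layout are assumptions.

```python
SECTIONS = [
    "Citation(s) of published research derived from these data",
    "Citation(s) of associated datasets stored elsewhere (include URLs)",
    "Project name and executive summary",
    "Contact information regarding analyses",
    "Methods of data processing and analysis",
    "Details that may influence reuse or replication",
    "Specialized software (version, developer's web address)",
]


def readme_skeleton(variables_by_file):
    """Return README text with one placeholder section per detail.

    `variables_by_file` maps each filename to a list of its
    variable names (column headings).
    """
    lines = []
    for title in SECTIONS:
        lines += [title.upper(), "TODO", ""]

    lines.append("DESCRIPTION OF FILES")
    for filename, variables in variables_by_file.items():
        lines.append(f"File: {filename}")
        for name in variables:
            lines.append(
                f"  {name}: TODO definition; units: TODO; missing data code: TODO"
            )
    return "\n".join(lines) + "\n"


# Hypothetical file and column names used only to show the call.
text = readme_skeleton({"counts.csv": ["site", "count"]})
print(text)
```

Writing the result to `README.txt` and replacing each `TODO` by hand keeps the file in plain text, as recommended above.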
> Log in and go to "My Datasets" to start your submission now!
Examples of good reusability practices
- Gallo T, Fidino M, Lehrer E, Magle S (2017) Data from: Mammal diversity and metacommunity dynamics in urban green spaces: implications for urban wildlife conservation. Dryad Digital Repository. https://doi.org/10.5061/dryad.9mf02
- Rajon E, Desouhant E, Chevalier M, Débias F, Menu F (2014) Data from: The evolution of bet hedging in response to local ecological conditions. Dryad Digital Repository. https://doi.org/10.5061/dryad.g7jq6
- Drake JM, Kaul RB, Alexander LW, O'Regan SM, Kramer AM, Pulliam JT, Ferrari MJ, Park AW (2015) Data from: Ebola cases and health system demand in Liberia. Dryad Digital Repository. https://doi.org/10.5061/dryad.17m5q
- Wall CB, Mason RAB, Ellis WR, Cunning R, Gates RD (2017) Data from: Elevated pCO2 affects tissue biomass composition, but not calcification, in a reef coral under two light regimes. Dryad Digital Repository. https://doi.org/10.5061/dryad.5vg70.3
- Kriebel R, Khabbazian M, Sytsma KJ (2017) Data from: A continuous morphological approach to study the evolution of pollen in a phylogenetic context: an example with the order Myrtales. Dryad Digital Repository. https://doi.org/10.5061/dryad.j17pm.2