Why Don’t We Cite Data and Code?
Data citation infrastructure exists and has considerable support from funders and publishers. Simple rules to ensure best practice in data citation have been proposed and, in principle, widely agreed: “Primary data should be robustly archived and directly cited as support for findings, just as literature is cited” (Cousijn, Kenall et al. 2018). However, an audit of 600 open access publications in multiple disciplines (Zhao et al. 2018) found that formal data attribution by DOI has not been widely adopted, and that fewer than 30% of these articles reused existing data. In my opinion, publishers can address the six actions recommended by Cousijn, Kenall et al. (2018) in two simple steps, by:
- using a suitable article template that asks authors to select a ‘Data Availability Statement’ and a suitable repository and to structure their data citation within the reference list, and providing the typesetter with XML for an updated DTD that includes citation tags; and then
- delivering the data citation metadata to Crossref.
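To make "structuring a data citation within the reference list" concrete, here is a minimal sketch in Python. It assumes the element order commonly recommended for dataset references (creators, year, title, repository, resolvable identifier). The class name, field names, and the example dataset are hypothetical and are not taken from any publisher's template.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataCitation:
    """Hypothetical container for the parts of a dataset reference."""
    creators: List[str]
    year: int
    title: str
    repository: str
    doi: str  # bare DOI without a resolver prefix

    def reference(self) -> str:
        """Render the citation as a single reference-list entry."""
        authors = ", ".join(self.creators)
        return (
            f"{authors} ({self.year}). {self.title}. "
            f"{self.repository}. https://doi.org/{self.doi}"
        )


# Placeholder values only: this is not a real dataset or a real DOI.
example = DataCitation(
    creators=["Doe, J.", "Roe, R."],
    year=2018,
    title="Example dataset",
    repository="Example Repository",
    doi="10.1234/example",
)
print(example.reference())
```

Keeping the parts as separate fields, rather than as one free-text string, is what would let a typesetter tag each element and a publisher pass the identifier on as citation metadata.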
We are all in agreement that data and code are as important to research as the research articles that provide the context and meaning of the research. We have the infrastructure to ensure that all of these research objects are uniquely identified with appropriate handles. Indeed, the DOI system used by Crossref has been widely extended to identify datasets via https://datacite.org.
The most frequently reused datasets for biomedical and life science communities reside in NCBI and EMBL-EBI databases that rely on local accession codes rather than DOIs. For these, a compact identifier is resolved in much the same way as a DOI, through a resolver service run by EBI (identifiers.org) or CDL (n2t.net), for example https://identifiers.org/GEO:GDS5157 or https://n2t.net/GEO:GDS5157 (Cousijn, Kenall et al. 2018). There is even a nice API, Event Data, for retrieving and counting all citations and mentions of DOI-indexed data from Crossref and DataCite.
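The resolution pattern described above can be sketched in a few lines of Python. This is an illustration only: it builds resolver URLs from the two services named in the text plus https://doi.org/ for DOIs, and it makes no network requests. The function name and the simple `prefix:accession` parsing are assumptions for the example, not part of either resolver's software.

```python
# Resolver base URLs taken from the examples in the text.
RESOLVERS = {
    "identifiers.org": "https://identifiers.org/",
    "n2t.net": "https://n2t.net/",
}
DOI_RESOLVER = "https://doi.org/"


def resolver_url(identifier: str, resolver: str = "identifiers.org") -> str:
    """Return a resolvable URL for a DOI or a compact identifier.

    Hypothetical helper: DOIs (optionally written as 'doi:10.x/...') go to
    doi.org; anything else must look like 'PREFIX:ACCESSION' and is passed
    to the chosen compact-identifier resolver.
    """
    identifier = identifier.strip()
    if identifier.lower().startswith("doi:"):
        identifier = identifier[4:]
    if identifier.startswith("10."):
        return DOI_RESOLVER + identifier

    prefix, sep, accession = identifier.partition(":")
    if not sep or not prefix or not accession:
        raise ValueError(f"not a compact identifier: {identifier!r}")
    if resolver not in RESOLVERS:
        raise ValueError(f"unknown resolver: {resolver!r}")
    return f"{RESOLVERS[resolver]}{prefix}:{accession}"


print(resolver_url("GEO:GDS5157"))
print(resolver_url("GEO:GDS5157", resolver="n2t.net"))
```

The point of the sketch is that an accession code becomes as citable as a DOI once it carries its namespace prefix, because the same string can then be handed to either resolver.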
Funders are also making friendly statements about the importance of FAIR data and measuring data citation to the future of open research:
“This strategic plan commits to ensuring that all data science activities and products supported by the agency adhere to the FAIR principles, meaning that data be Findable, Accessible, Interoperable, and Reusable.”
“NIH will establish procedures and metrics to monitor data usage and impact—including usage patterns within individual datasets. The above principles and approaches are well aligned with those being implemented by the European inter-governmental data-resource coordinating organization.”
The original incentives for publishers to develop the Crossref system of DOI registration were reference linking and the resolution of licensed access for subscribed readers. In the open research world, this existing infrastructure and the dedicated people behind it represent an enormous “peace dividend” with the potential to solve the problems of giving credit for data stewardship and encouraging data reuse.
While a few publishers have recognized that data publication requires dedicated journals in order to give credit to the curation and stewardship of high-value datasets via peer review and publication, the majority still struggle with implementing a small number of essential steps that would make a huge difference to the value of data and “make the magic happen”.
About the Author: Myles Axton