Crossref is a not-for-profit membership organization, working with over 7,500 member publishers to make content easy to find, cite, link, and assess. Our member organizations are from a range of countries, vary greatly in size, have differing business models, provide different types of content hosted on different websites, and publish in a whole host of subject areas. In short, it’s a really interesting and diverse landscape to be part of.
However, if you’re a researcher interested in text mining content from a selection of these publishers, that diversity causes problems.
Say I’m interested in mining content in a specific subject area. How do I go to even 100 of these publishers to get the specific selection of papers I’m interested in so that I can extract just a particular type of fact? Going to each publisher individually to ask for the content isn’t a workable solution for either the researcher or the publisher, who may in turn be dealing with many requests from researchers asking for similar feeds of content. That’s the problem our pilot set out to solve back in 2012: we worked with a group of researchers, publishers, and hosting platforms to figure out how to bridge this gap and bring more automation to the process.
How to solve it
We launched our support for text and data mining (TDM) in May 2014 to try to solve this problem. Crossref already collects metadata from its member publishers – they register it with us when assigning Digital Object Identifiers (DOIs) to their journal articles, books, datasets, components such as figures, preprints, and lots of other publication types. The metadata we collect has been expanding to include useful information like abstracts, funding data, and ORCID iDs.
To support researchers who need to mine full-text content, we added two extra pieces to the metadata we collect: links directly to the full-text content on the publisher sites (so that researchers know where the content is located) and license information in the form of a URL (so that researchers know what they can do with the content). When publishers give Crossref that information, it’s made available via our REST API (Application Programming Interface). That way, anyone who is interested can find it, use it, and even build tools on top of it.
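As a rough illustration of what that looks like from the researcher's side, here is a minimal Python sketch that looks up one DOI and pulls out the two extra pieces of metadata. It assumes the public endpoint at https://api.crossref.org/works/ and assumes the record exposes the full-text links and licenses under "link" and "license" keys, each entry carrying a "URL" field. The function names are our own, chosen for illustration, and records without this metadata simply yield empty lists.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL of the public Crossref REST API.
API_BASE = "https://api.crossref.org/works/"


def work_url(doi):
    """Build the lookup URL for one DOI, escaping unusual characters."""
    return API_BASE + urllib.parse.quote(doi)


def fetch_work(doi):
    """Fetch the metadata record for a DOI (makes a network request)."""
    with urllib.request.urlopen(work_url(doi)) as response:
        return json.load(response)["message"]


def full_text_links(record):
    """Return the full-text URLs the publisher registered, if any."""
    return [link["URL"] for link in record.get("link", []) if "URL" in link]


def license_urls(record):
    """Return the license URLs the publisher registered, if any."""
    return [lic["URL"] for lic in record.get("license", []) if "URL" in lic]
```

Calling fetch_work() with a DOI and passing the result to full_text_links() and license_urls() shows where the content lives and which license applies to it. Keeping the network call separate from the two parsing helpers means the helpers can be tried out on a saved record without going online.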
How does that work for content mining?
Researchers can query the REST API to ask if the DOIs they’re interested in mining have those full-text links and the license information. If they do, researchers can look at the license and decide if it suits their needs (they may already have a whitelist of licenses that they know serve their purposes). They can also actively agree to some specific publisher licenses via the Crossref click-through service, which hosts these. Once they’ve done that, they can request the full-text content from the publisher websites, provided they have access to the content. This is all done programmatically, which makes it possible to work with a large volume of documents from different publishers.
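The license check described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a reference implementation: it assumes a metadata record shaped like the REST API's output, with "license" and "link" lists whose entries carry a "URL" field, and it assumes that links meant for mining are marked with an "intended-application" value of "text-mining". The whitelist contents and the function names are hypothetical, and the click-through agreement and the download itself are left out.

```python
# Hypothetical whitelist of licenses a researcher has already vetted.
ALLOWED_LICENSES = {"https://creativecommons.org/licenses/by/4.0/"}


def normalise(url):
    """Make license URLs comparable: same case, same scheme, no trailing slash."""
    url = url.strip().lower()
    if url.startswith("http://"):
        url = "https://" + url[len("http://"):]
    return url.rstrip("/")


def mining_links(record, allowed=ALLOWED_LICENSES):
    """Return text-mining links only if the record has a whitelisted license."""
    licenses = {
        normalise(lic["URL"]) for lic in record.get("license", []) if "URL" in lic
    }
    if not licenses & {normalise(url) for url in allowed}:
        return []  # no acceptable license, so nothing to fetch
    return [
        link["URL"]
        for link in record.get("link", [])
        if link.get("intended-application") == "text-mining" and "URL" in link
    ]
```

Run over a list of records, mining_links() gives the set of URLs a researcher could go on to request from the publisher sites, subject to having access to the content. Normalising the URLs before comparing them avoids missing a match just because one copy of a license URL uses http or lacks the trailing slash.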
The service already supports over 20 million items from over 700 different publishers, including Wiley, Elsevier, and Hindawi, so the range of content researchers can access is really starting to expand. We’re keen to see it grow further and to see the outcomes from the researchers using it to streamline their text mining workflows or to start exploring this promising field.
Editor’s note: Wiley works with Crossref metadata to facilitate mining. For more information, read our policy.