Towards a global comprehensive dataset of open access papers for text analytics
Research literature contains some of the most important information we have assembled as a human species, such as how to treat diseases, solve difficult engineering problems and answer many of the challenges the world faces today. The entire body of research literature is currently estimated at over 100 million publications (Khabsa & Giles, 2014), with around 1 million new publications each year (Bjork & Roos, 2009) and an estimated 10% year-on-year increase in the annual number of these outputs (Bornmann & Mutz, 2014). Systematically reading and analysing the full body of knowledge is now beyond the capacity of any human being.

This work analyses and documents the challenges in systematically gathering research papers from repositories and publishers’ systems, and assembles a global dataset of 10.5 million full texts to facilitate text and data analytics tasks. We also offer new solutions for harvesting full texts from the non-standardised systems of major publishers, creating a seamless machine access layer over this content using ResourceSync. The key innovations of this work are:
- A downloadable dataset of more than 10 million open access full texts, i.e. several times larger than any other existing legally downloadable set of Open Access (OA) papers, such as the PubMed OA subset and arXiv.org.
- The first solution for large-scale aggregation of hybrid-Gold OA papers from the non-standardised systems of key publishers, including Elsevier and Springer.
- The first implementation and application of ResourceSync that scales to millions of items.

Our 10.5 million figure refers to the number of OA documents that we identified, downloaded, extracted text from and validated against their metadata records, and whose full texts we host on the CORE servers and make available to others.
In contrast, BASE and OADOI do not aggregate the full texts of the resources they “provide access to” and consequently offer neither the means to interact with those full texts nor a bulk download capability for text analytics over scholarly literature. The dataset we have developed is also integrated with the OpenMinTeD infrastructure, a European Commission-funded project that aims to provide a platform for text mining of scholarly literature in the cloud.
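
To give a rough sense of what a ResourceSync-based machine access layer looks like from a client's side, the sketch below parses a ResourceSync Resource List, which is a Sitemap-style XML document listing each resource with optional metadata. This is only an illustration under stated assumptions and not the CORE implementation: the sample document, its example.org URLs, dates and sizes are invented, and the function name `parse_resource_list` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespaces used by the Sitemap protocol and by ResourceSync metadata elements.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# Hypothetical Resource List: every URL, date and size here is made up.
SAMPLE_RESOURCE_LIST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" at="2017-01-03T09:00:00Z"/>
  <url>
    <loc>https://example.org/fulltext/1.pdf</loc>
    <lastmod>2017-01-02T13:00:00Z</lastmod>
    <rs:md type="application/pdf" length="8876"/>
  </url>
  <url>
    <loc>https://example.org/fulltext/2.pdf</loc>
    <lastmod>2017-01-02T14:00:00Z</lastmod>
  </url>
</urlset>"""


def parse_resource_list(xml_text):
    """Return one dict per <url> entry of a Resource List (assumed layout)."""
    root = ET.fromstring(xml_text)
    resources = []
    for url in root.findall("sm:url", NS):
        md = url.find("rs:md", NS)  # per-resource metadata is optional
        length = md.get("length") if md is not None else None
        resources.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "type": md.get("type") if md is not None else None,
            "length": int(length) if length is not None else None,
        })
    return resources


if __name__ == "__main__":
    for resource in parse_resource_list(SAMPLE_RESOURCE_LIST):
        print(resource["loc"], resource["lastmod"], resource["length"])
```

A harvester built along these lines could compare each `lastmod` value with the time of its previous run and fetch only the resources that changed, which is one reason a synchronisation protocol is attractive at the scale of millions of items.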