NLM Leverages Data, Text Mining to Sharpen COVID-19 Research Databases
Data-mining techniques will allow health researchers to sharpen COVID-19 literature and clinical trial database search results.

The National Library of Medicine is leveraging its database resources and artificial intelligence capabilities to rapidly provide COVID-19 literature and resources to researchers and scientists as the world races to understand and respond to the pandemic.
The White House in March tapped NLM, under the National Institutes of Health, to join a public-private partnership called the COVID-19 Open Research Dataset (CORD-19) to develop data-mining techniques that could help the science community answer critical questions pertaining to COVID-19. Leveraging its existing infrastructure and establishing processes for content submission, NLM has quickly brought access to COVID-19 literature and clinical trial content on its PubMed Central (PMC) and ClinicalTrials.gov databases.
“As of May 1, about 46,000 articles had been deposited by publishers to PMC or updated in PMC to have a license that allows for text and data mining, of which more than 5,600 articles specifically focus on the current novel coronavirus,” said NLM National Center for Biotechnology Information Acting Director Stephen Sherry. “Some 49 publishers are now included in the PMC COVID-19 initiative.”
In the first few weeks after launching the project, PMC saw significant COVID-19 download and data-sharing rates, said PMC Program Manager Kathryn Funk in an NIH webinar. As part of the project, Funk’s team worked to standardize submission data in a machine-readable format.
“The early results have been encouraging,” Funk said. “Articles in the Public Health Emergency Collection and PMC were retrieved more than 2 million times in the first two to three weeks of the initiative, and the CORD-19 dataset has been downloaded more than 75,000 times at this time. It’s our hope that through expanded access and machine learning, NLM will be able to help accelerate scientific research on COVID-19.”
NLM has also leveraged ClinicalTrials.govโs existing infrastructure to scale up and provide quick access to information about trials related to COVID-19. Teams conducting trials around the world can submit standardized and structured information about their trials directly through an online submission portal called the Protocol Registration and Results system, where trial information is then posted to ClinicalTrials.gov within a couple of days of initial submission, Sherry said.
The data standardization and structure are critical to enabling AI technologies like machine learning and natural-language processing, which can help users more effectively mine and analyze the databases’ resources and literature to generate knowledge and support research that assists in responding to COVID-19, Sherry said.
“ClinicalTrials.gov also leverages NLM resources such as the biomedical vocabularies and standards integrated in the Unified Medical Language System (UMLS) to support its search capabilities,” Sherry said, citing the database’s complete list of registered COVID-19 studies. “Users can filter the search results further by different study design characteristics, recruitment status, location information and other factors to identify trials of interest. All of these search capabilities are also available through the ClinicalTrials.gov API.”
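The programmatic access Sherry mentions can be illustrated with a short sketch. Note that the endpoint path, parameter names and response shape below are assumptions for illustration only, not details confirmed by the article; the official ClinicalTrials.gov API documentation is the authority on the real interface.

```python
# Hedged sketch: composing a COVID-19 study search for the ClinicalTrials.gov
# API. The BASE path and the "expr"/"fields" parameter names are assumed for
# illustration; check the official API docs before relying on them.
import json
from urllib.parse import urlencode

BASE = "https://clinicaltrials.gov/api/query/study_fields"  # assumed endpoint

def build_query_url(condition, fields, max_rank=20):
    """Compose a search URL filtering registered studies by condition."""
    params = {
        "expr": condition,            # search expression, e.g. a condition name
        "fields": ",".join(fields),   # which study fields to return
        "min_rnk": 1,
        "max_rnk": max_rank,
        "fmt": "json",
    }
    return BASE + "?" + urlencode(params)

url = build_query_url("COVID-19", ["NCTId", "BriefTitle", "OverallStatus"])

# Parsing a response of the assumed shape; an inline sample stands in for a
# live network call here.
sample = json.loads("""{"StudyFieldsResponse": {"StudyFields": [
  {"NCTId": ["NCT04280705"], "BriefTitle": ["Adaptive COVID-19 Treatment Trial"],
   "OverallStatus": ["Recruiting"]}]}}""")
studies = sample["StudyFieldsResponse"]["StudyFields"]
```

The same filters Sherry describes in the web interface (status, location, study design) would map to additional query parameters in such a request.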
Sherry described the ClinicalTrials.gov infrastructure as an “information scaffold” for discovering information about clinical trials, as the platform assigns a unique identifier called a National Clinical Trial (NCT) number to each trial so that individuals can label and identify trials.
“As a result, different resources with information about particular trials can be linked and discovered through the use of unique NCT numbers, [such as] ClinicalTrials.gov records, press releases, journal articles, protocol document[s], informed consent forms, systematic reviews, reports, regulatory documents, individual participant-level data,” Sherry said.
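The linking Sherry describes relies on NCT numbers following a fixed format, “NCT” followed by eight digits, which makes them easy to pull out of any of the document types he lists. A minimal sketch:

```python
# Minimal sketch: extracting the NCT identifiers that link documents to trials.
# NCT numbers are "NCT" followed by eight digits, e.g. NCT04280705.
import re

NCT_PATTERN = re.compile(r"\bNCT\d{8}\b")

def extract_nct_ids(text):
    """Return the unique NCT numbers mentioned in a document, in order."""
    seen = []
    for match in NCT_PATTERN.findall(text):
        if match not in seen:
            seen.append(match)
    return seen

# Hypothetical press-release text for illustration:
press_release = ("Results from NCT04280705 were consistent with the protocol "
                 "registered as NCT04280705; see also NCT04292899.")
ids = extract_nct_ids(press_release)  # ['NCT04280705', 'NCT04292899']
```

Running the same extraction over press releases, journal articles and regulatory documents yields the shared keys needed to join them to their ClinicalTrials.gov records.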
Creating an open data repository ecosystem like ClinicalTrials.gov requires integrating different data contributors in a way that enables interoperability and usability of data, said NIH Director of Data Science Strategy Susan Gregurick, who helped establish the agency’s data science office in 2018.
“NIH strongly encourages open-access, data-sharing repositories as your first go-to choice when you’re looking for a repository to share your data and your information,” Gregurick said during an agency webinar last month.
Although NLM had already pledged in its strategic plan to modernize its databases, support data-driven science, collaborate with relevant stakeholders and build a future-ready workforce, including a multi-year effort to modernize ClinicalTrials.gov, COVID-19 has sparked a number of new data-backed initiatives and digital resources, Sherry and Gregurick said.
These are not just on PMC and ClinicalTrials.gov, but also on new platforms and resources, including:
- LitCovid, a COVID-19-specific open-resource literature hub that curates and disseminates a constantly growing comprehensive collection of international research papers relevant to public health. “This resource builds on NLM research to develop new approaches to locating and indexing the literature related to COVID-19, including a text classification algorithm for screening and ranking relevant documents, topic modeling for suggesting relevant research categories and information extraction for obtaining geographic locations found in the abstract,” Sherry said.
- COVID-19 genetic sequence additions to GenBank, the world’s largest genetic sequence database, which released the first COVID-19 sequence to the public Jan. 12 and the first sequence collected in America, in collaboration with the Centers for Disease Control and Prevention, Jan. 25. “As of April 9, we have 579 SARS-CoV-2 sequences from 26 different countries publicly available,” Sherry said, adding that NLM has created a data hub on GenBank for individuals to search, retrieve and analyze COVID-19 sequences that have been submitted.
- The Sequence Read Archive, a 14-petabyte archive of high-throughput genetic sequence data that became available on commercial cloud-computing platforms in February, which Sherry said significantly expanded the discovery potential of the data to help identify mutational patterns and inform drug and vaccine development.
- PubChem, an open chemistry database that contains compounds used in COVID-19 clinical trials and found in COVID-19-related protein database structures.
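The screening-and-ranking step Sherry describes for LitCovid can be sketched in miniature. The toy keyword scorer below is an illustration of the general idea only, not NLM’s actual text classification algorithm, and the keyword list and abstracts are invented for the example.

```python
# Hedged sketch of screening and ranking literature for a topic-specific hub:
# score each abstract against a set of topic keywords and rank by score.
# This is a toy stand-in, not NLM's actual LitCovid classifier.
from collections import Counter

KEYWORDS = {"covid-19", "sars-cov-2", "coronavirus", "pandemic"}  # illustrative

def relevance_score(abstract):
    """Count keyword hits in a lowercased, lightly normalized tokenization."""
    tokens = Counter(abstract.lower().replace(",", " ").replace(".", " ").split())
    return sum(tokens[k] for k in KEYWORDS)

def rank_abstracts(abstracts):
    """Return abstracts sorted from most to least relevant, dropping non-hits."""
    scored = [(relevance_score(a), a) for a in abstracts]
    return [a for score, a in sorted(scored, key=lambda p: -p[0]) if score > 0]

docs = [
    "SARS-CoV-2 transmission dynamics during the COVID-19 pandemic.",
    "A survey of soil bacteria in temperate forests.",
    "Coronavirus spike protein structure and SARS-CoV-2 entry.",
]
ranked = rank_abstracts(docs)  # the soil-bacteria paper is screened out
```

A production system would replace the keyword counts with a trained classifier, but the pipeline shape, screen then rank, is the same.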