NLM Leverages Data, Text Mining to Sharpen COVID-19 Research Databases

Mon, 05/11/2020

Data-mining techniques will allow health researchers to sharpen COVID-19 literature and clinical trial database search results.


Written by:

Melissa Harris


The National Library of Medicine is leveraging its database resources and artificial intelligence capabilities to rapidly provide COVID-19 literature and resources to researchers and scientists as the world races to understand and respond to the pandemic.

The White House in March tapped NLM, under the National Institutes of Health, to join a public-private partnership called the COVID-19 Open Research Dataset (CORD-19) to develop data-mining techniques that could help the science community answer critical questions pertaining to COVID-19. Leveraging its existing infrastructure and establishing processes for content submission, NLM has quickly brought access to COVID-19 literature and clinical trial content on its PubMed Central (PMC) and ClinicalTrials.gov databases.

“As of May 1, about 46,000 articles had been deposited by publishers to PMC or updated in PMC to have a license that allows for text and data-mining, of which more than 5,600 articles specifically focus on the current novel coronavirus,” said NLM National Center for Biotechnology Information Acting Director Stephen Sherry. “Some 49 publishers are now included in the PMC COVID-19 initiative.”

Within the first few weeks of the project's launch, PMC saw significant COVID-19 download and data-sharing rates, said PMC Program Manager Kathryn Funk in an NIH webinar. As part of the project, Funk’s team worked to standardize submission data in a machine-readable format.

“The early results have been encouraging,” Funk said. “Articles in the Public Health Emergency Collection and PMC were retrieved more than 2 million times in the first two to three weeks of the initiative, and the CORD-19 dataset has been downloaded more than 75,000 times at this time. It’s our hope that through expanded access and machine learning, NLM will be able to help accelerate scientific research on COVID-19.”

NLM has also leveraged ClinicalTrials.gov’s existing infrastructure to scale up and provide quick access to information about trials related to COVID-19. Teams conducting trials around the world can submit standardized and structured information about their trials directly through an online submission portal called the Protocol Registration and Results system, where trial information is then posted to ClinicalTrials.gov within a couple of days of initial submission, Sherry said.

The data standardization and structure are critical to enabling AI technologies like machine learning and natural-language processing, which can help users more effectively mine and analyze the databases’ resources and literature to generate knowledge and support research that assists in responding to COVID-19, Sherry said.

“ClinicalTrials.gov also leverages NLM resources such as the biomedical vocabularies and standards integrated in the Unified Medical Language System (UMLS) to support its search capabilities,” Sherry said, citing the database’s complete list of registered COVID-19 studies. “Users can filter the search results further by different study design characteristics, recruitment status, location information and other factors to identify trials of interest. All of these search capabilities are also available through the ClinicalTrials.gov API.”

Sherry likened the ClinicalTrials.gov infrastructure to an “information scaffold” for discovering information about clinical trials, as the platform assigns a unique identifier called a National Clinical Trial (NCT) number to each trial so that individuals can label and identify trials.

“As a result, different resources with information about particular trials can be linked and discovered through the use of unique NCT numbers, [such as] ClinicalTrials.gov records, press releases, journal articles, protocol document[s], informed consent forms, systematic reviews, reports, regulatory documents, individual participant-level data,” Sherry said.
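The linking Sherry describes can be illustrated with a minimal Python sketch. It assumes only that an NCT number is written as "NCT" followed by eight digits; the function name `link_by_nct`, the document names and the NCT numbers shown are hypothetical placeholders, not real records or part of any NLM tool.

```python
import re
from collections import defaultdict

# An NCT number is the letters "NCT" followed by eight digits.
NCT_PATTERN = re.compile(r"\bNCT\d{8}\b")


def link_by_nct(documents):
    """Map each NCT number to the sorted names of documents that mention it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for nct_id in NCT_PATTERN.findall(text):
            index[nct_id].add(name)
    return {nct_id: sorted(names) for nct_id, names in index.items()}


# Hypothetical documents with placeholder NCT numbers (illustration only).
documents = {
    "press_release.txt": "Enrollment has begun for the trial (NCT00000001).",
    "journal_article.txt": "Results are reported for NCT00000001 and NCT00000002.",
    "consent_form.txt": "This form applies to study NCT00000002.",
}

links = link_by_nct(documents)
for nct_id in sorted(links):
    print(nct_id, "->", ", ".join(links[nct_id]))
```

Because every resource carries the same identifier, a press release, a journal article and a consent form about one trial end up grouped under a single key without any manual matching.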

Creating an open data repository ecosystem like ClinicalTrials.gov requires integrating different data contributors in a way that enables interoperability and usability of data, said NIH Director of Data Science Strategy Susan Gregurick, who helped establish the agency’s data science office in 2018.

“NIH strongly encourages open-access, data-sharing repositories as your first go-to choice when you’re looking for a repository to share your data and your information,” Gregurick said during an agency webinar last month.

Although NLM had already pledged in its strategic plan to modernize its databases, support data-driven science, collaborate with relevant stakeholders and build a future-ready workforce, including through a multi-year effort to modernize ClinicalTrials.gov, the pandemic has sparked a number of new data-backed initiatives and digital resources around COVID-19, said Sherry and Gregurick.

These are not just on PMC and ClinicalTrials.gov, but also on new platforms and resources, including:

LitCovid, a COVID-19-specific open-resource literature hub that curates and disseminates a constantly growing comprehensive collection of international research papers relevant to public health. “This resource builds on NLM research to develop new approaches to locating and indexing the literature related to COVID-19, including a text classification algorithm for screening and ranking relevant documents, topic modeling for suggesting relevant research categories and information extraction for obtaining geographic locations found in the abstract,” Sherry said.
COVID-19 genetic sequence information additions to GenBank, the world’s largest genetic sequence database, which released the first COVID-19 sequence to the public Jan. 12 and, in collaboration with the Centers for Disease Control and Prevention, the first sequence collected in America Jan. 25. “As of April 9, we have 579 SARS-CoV-2 sequences from 26 different countries publicly available,” Sherry said, adding that NLM has created a data hub on GenBank for individuals to search, retrieve and analyze COVID-19 sequences that have been submitted.
The Sequence Read Archive, a 14-petabyte archive of high-throughput genetic sequence data that became available on commercial cloud-computing platforms in February, which Sherry said significantly expanded the discovery potential of the data to help identify mutational patterns and inform drug and vaccine development.
PubChem, an open chemistry database that contains compounds used in COVID-19 clinical trials and found in COVID-19-related protein database structures.
