Data is the bread and butter of biomedical research, but how are researchers building the right infrastructure to maximize the power of that data?
New York University Langone Health has spent the past few years addressing this question for the Perlmutter Cancer Center (PCC) by building a central cancer data hub. NYU launched the hub in 2021, creating a data lake that supports better analytic capabilities and streamlines data processes.
NYU Langone Health Management and Architecture Director Rajan Chandras and NYU Long Island School of Medicine Surgery Assistant Professor Dr. Megan Winner explained the process behind establishing the PCC Data Hub and its impact across the cancer center on Thursday at the 2022 HIMSS Conference.
Before building the data hub, the cancer center faced several obstacles with its data collection and use. There was a lack of integration between tissue bank resources, and clinical data collection was duplicated and manual. Externally sourced genome data was collected manually as well. Overall, these issues hindered data access and usability, undermined sustainability and scalability, and left the center unable to use better analytics tools.
“Traditional architectures cannot handle modern analytic requirements,” Chandras said, so his team opted to establish the PCC Data Hub.
In setting up the data hub, Chandras and Winner aimed to make it an enduring analytic resource that collects and maintains data assets of shared utility and supports clinical and population science research for the cancer center. They opted to build the hub on a Hadoop big data platform, leveraging open-source software and the flexibility, security, scalability, resiliency and longevity it would bring to PCC.
Chandras and Winner took repositories like radiation oncology software, radiation treatment and research databases, tumor registries and mutation data, legacy pathology reports, and dashboards for lung cancer quality and the clinical trials office and linked them all to the new hub.
In doing this, PCC has transformed a number of workflows. Winner highlighted that, to collect treatment data in the past, researchers used unsupported software to collect and store data, documented additional datapoints in semi-structured notes in individual charts and downloaded single-use datasets that were manually annotated.
Once the data hub was up, PCC migrated treatment-specific data from the unsupported database to the new hub, which has since fixed flaws in the old workflow. The hub enabled semi-structured notes to become discrete data, and annotations and raw data can now go hand in hand.
“We redesigned the data collection at the point of care in order to then feed the remaining fields in the discrete data feed for this research group,” Winner said. “Step number two, … it was the perfect moment to change documentation at the point of care.”
The hub also changed workflows for PCC’s clinical trials office, Winner added. The office used to circulate reports from third-party labs in spreadsheet form by email and saved potential clinical trial participants in a list for periodic re-reviews for eligibility.
After migrating to the PCC Data Hub, Winner said, the clinical trials office can now directly retrieve third-party labs’ raw data and store it centrally in the hub. The data lake can also pair key clinical information with mutation details, while enabling users to preserve annotations and view data in a dashboard format.
While PCC has seen crosscutting improvements from the data hub, Chandras and Winner are looking to further improve the platform. Moving forward, they hope to integrate the data hub into clinical operations and point-of-care change management, prepare data for machine learning, implement multi-level access protocols for better security and continue democratizing data and analytics. They also hope to eventually replicate the approach they took with the data hub in other clinical areas.