Patrick McMullen, Director of Computational Toxicology
I have often joked that, as a computational biologist in an applied field that has been slow to embrace bioinformatics, my goal is to work myself out of a job. That is, if I can empower the scientific experts responsible for safety decisions with the right tools, then they can ask their questions directly of the data instead of going through me or my colleagues. This short circuit allows experts to establish a more intimate relationship with their experiments and data, driving a faster “hypothesis-test-repeat” cycle. Mel Andersen highlighted the value that emerging intuitive bioinformatics tools are bringing in a recent webinar on the use of transcriptomics for advancing safety decision making. While the implementation of such tools continues to grow, we as a community have an opportunity to build an ecosystem that allows easier access to toxicology data sets.
Over the past several years, there have been increasingly public calls for implementation of better data access systems, such as those following the FAIR Data Principles [1]. Such data should be Findable, Accessible, Interoperable, and Reusable [2]. Implementation of these concepts has proved challenging for a number of reasons. Emerging technologies being used to inform environmental health decisions, such as high-throughput screening, transcriptomics, and high-content imaging, generate large volumes of data from complex experimental designs that pose practical challenges for storage and organization. Best practices for data generation and processing are a moving target for technologies that are themselves rapidly evolving. Additional constraints arising from patient confidentiality or confidential business information complicate the situation further in some applications.
Because of these challenges, “big data” are often made available as flat files for download. While this may satisfy some of the FAIR principles in letter, such data are a far cry from “actionable”: their use often requires downloading large files, installing specialized software, and a scientific understanding of how to apply them appropriately. Presenting data this way does not facilitate use. Instead, it is imperative that we provide the scientific community with both data and the tools to use them effectively. This means providing data cataloged in defined ways, with well-documented application programming interfaces (APIs) that allow external groups to link data sources and tools together in applications that extend beyond their original scope. Technical implementation of such a system is made tractable by cloud infrastructure systems, such as Amazon Web Services, that are increasingly easy to implement.
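To make this concrete, here is a minimal sketch of what API-based access could look like. The service, route, and field names below are hypothetical placeholders invented for illustration, not a real toxicology API; the point is that a well-documented interface lets downstream tools consume cataloged records directly rather than parsing downloaded flat files.

```python
# A minimal sketch of API-based data access. The service, route, and
# field names (api.example.org, /expression, "dose", etc.) are
# hypothetical placeholders, not an existing toxicology API.
import requests

BASE_URL = "https://api.example.org/v1"  # hypothetical data service

def fetch_gene_results(chemical: str, gene: str) -> list:
    """Retrieve per-dose expression records for one chemical/gene pair."""
    response = requests.get(
        f"{BASE_URL}/expression",
        params={"chemical": chemical, "gene": gene},
        timeout=30,
    )
    response.raise_for_status()  # surface HTTP errors early
    return response.json()["results"]

# Downstream dashboards or workflows consume the same structured records,
# with no large downloads or specialized file formats required.
for record in fetch_gene_results("acetaminophen", "CYP1A1"):
    print(record["dose"], record["log2_fold_change"])
```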
There are templates already in place for this paradigm. Two decades ago, frustration about the availability of gene expression microarray data led to the development of the MIAME standard [3], which describes a basic ontology for reporting the details of data collection. The National Center for Biotechnology Information’s Gene Expression Omnibus (GEO) has embraced this standard for hosting transcriptomic datasets, and GEO has since become the de facto repository for toxicogenomics data; most scientific journals require studies involving transcriptomics to deposit their data in GEO upon publication. GEO has subsequently sponsored tools that take primary data directly from its database and apply different analysis workflows. This system has closed the loop between the data in the repository and its interpretation by whoever might make use of it.
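As one illustration of that closed loop, the sketch below pulls a GEO series straight into an analysis session using the open-source GEOparse package (a community Python wrapper around GEO, not an NCBI tool). The accession number is a placeholder rather than a reference to a specific study.

```python
# Fetch a GEO series directly into a Python session using GEOparse.
# "GSE00000" is a placeholder accession; substitute a real series ID.
import GEOparse

gse = GEOparse.get_GEO(geo="GSE00000", destdir="./geo_cache")

# MIAME-structured metadata travels with the data, so each sample is
# self-describing instead of being locked in an ad hoc README.
for gsm_name, gsm in gse.gsms.items():
    title = gsm.metadata.get("title", ["<untitled>"])[0]
    print(gsm_name, title)
```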
There are many opportunities today to embrace this philosophy. The surfeit of toxicogenomics data and the rapid development of interpretation tools invite their combination. Our own MoAviz platform [4] provides a practical example, serving as a bridge between publicly available toxicogenomics data and a platform for interactively interpreting the outputs. There are additional opportunities in toxicology, particularly in the use of ‘omics data to identify biological effect levels, to make data more FAIR. At ScitoVation, we specialize in building tools and dashboards that maximize data usefulness and accessibility.
1. Wilkinson, M. D. et al. (2016) Sci Data 3, 160018.
2. https://factor.niehs.nih.gov/2019/9/science-highlights/data/index.htm
3. Brazma, A. et al. (2001) Nat Genet 29, 365-371.
4. https://moaviz.scitostage.wpengine.com