A machine-learning platform to illuminate the chemical dark matter in mass spectrometry-based metabolomics
Project Number: 1DP5OD036960-01
Contact PI/Project Leader: SKINNIDER, MICHAEL ALEXANDER
Awardee Organization: PRINCETON UNIVERSITY
Description
Abstract Text
7. PROJECT SUMMARY/ABSTRACT
The human body contains thousands of small molecules, and is exposed to thousands more during daily life.
This complex chemical ecosystem reflects both the endogenous metabolism of human cells and
xenobiotic exposures from our diets, our gut flora, and our natural and built environments. At present, however,
the vast majority of these small molecules remain unknown. Remarkably, this gap is not due to a lack of
appropriate experimental technology: mass spectrometry-based metabolomics routinely detects thousands of
distinct chemical signals in any biological sample. However, only a small fraction of these signals are routinely
identified. The remaining profusion of unidentified chemical entities has been dubbed the “dark matter” of the
metabolome. Computational tools to shed light on this chemical dark matter could transform our understanding
of disease pathobiology, open new avenues for personalized medicine, and increase the scope and efficiency
of any metabolomic study. At the same time, true chemical dark matter must be differentiated from the variety
of technical artefacts, contaminants, and redundant forms of the same biomolecules that are also detected by
mass spectrometry. This project proposes to establish a suite of computational tools that will dramatically
advance our ability to interpret mass spectrometry-based metabolomic datasets, and thereby begin to unlock
the dark metabolome. These tools will apply emerging techniques from the field of natural language
processing, including the same large language model (LLM) architectures that power tools like ChatGPT, to
address two of the most important unmet needs in small molecule mass spectrometry. In Aim 1, we will
develop DecipherMS, a computational tool for de novo annotation of both known and unknown chemical
structures from MS/MS spectra. Despite decades of work in computational mass spectrometry, de novo
annotation of unknown molecules remains a critical gap, with virtually all existing tools designed to search in a
database of known structures. DecipherMS will overcome this gap by using language models to decode
unknown chemical structures directly from MS/MS spectra, using a novel data augmentation strategy to learn
effectively from limited training data. In Aim 2, we will develop FoundationMS, a foundation model for mass
spectrometry-based metabolomics. FoundationMS will standardize the data preprocessing workflows that
determine which mass spectrometric signals should be brought forward for annotation in the first place.
It will do so by learning from a repository-scale corpus of metabolomic data in a self-supervised
manner. The resulting model will be fine-tuned to perform common preprocessing tasks including peak picking,
retention time alignment, adduct removal, and chemical formula assignment. Both DecipherMS and
FoundationMS will be rigorously benchmarked using appropriate datasets. Implementing these approaches in
well-documented, user-friendly, and computationally efficient software will address central gaps in our ability to
measure small molecules and shift existing paradigms in metabolomic data analysis.
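To make one of the preprocessing tasks named in the abstract concrete, the sketch below illustrates the arithmetic behind adduct handling: converting an observed m/z into candidate neutral monoisotopic masses under a few common singly charged adducts. It is not part of DecipherMS or FoundationMS, whose implementations are not described in this record. The function name `neutral_mass_candidates`, the small adduct table, and the example m/z value are assumptions chosen for illustration only.

```python
# Illustrative sketch only; not the project's method.
# Mass shifts (Da) that common singly charged adducts add to a neutral molecule M.
PROTON = 1.007276  # mass of a proton (hydrogen atom minus one electron)

ADDUCT_SHIFTS = {
    "[M+H]+": PROTON,
    "[M+Na]+": 22.989218,
    "[M+NH4]+": 18.033823,
    "[M-H]-": -PROTON,
}


def neutral_mass_candidates(mz, polarity):
    """Return {adduct: neutral mass} for adducts matching the ion polarity.

    `polarity` is "+" for positive mode or "-" for negative mode.
    Only singly charged adducts are considered, so m/z equals ion mass.
    """
    if polarity not in ("+", "-"):
        raise ValueError("polarity must be '+' or '-'")
    return {
        adduct: mz - shift
        for adduct, shift in ADDUCT_SHIFTS.items()
        if adduct.endswith(polarity)
    }


# Hypothetical positive-mode signal, chosen near protonated glucose (C6H12O6).
candidates = neutral_mass_candidates(181.070665, "+")
for adduct, mass in candidates.items():
    print(f"{adduct}: {mass:.4f}")
```

Each candidate neutral mass could then be passed to formula assignment or database lookup. Real workflows typically also consider multiply charged ions, in-source fragments, and co-elution evidence before collapsing redundant signals.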
Public Health Relevance Statement
8. PROJECT NARRATIVE
Whereas scientists can now readily measure the DNA, RNA, and protein content within any biological sample,
measuring the small molecules in that same sample has remained more challenging. This project aims to
develop a suite of machine-learning tools that will allow us to more comprehensively measure both known and
unknown small molecules in any biological sample by analyzing mass spectrometry data. The ability to
identify unknown small molecules in high throughput would increase the scope and efficiency of many
publicly funded studies and open up new perspectives on the chemical world around us.
Project Terms
Address; Architecture; Benchmarking; Biological; Cells; ChatGPT; Chemical Structure; Chemicals; Collaborations; Communities; Complex; Computational Biology; Computer software; Creativeness; DNA; Darkness; Data; Data Analyses; Data Set; Databases; Development; Diet; Disease; Ecosystem; Educational process of instructing; Environment; Excision; Exposure to; Foundations; Funding; Genomics; Goals; High Performance Computing; Human; Human body; Individual; Infrastructure; Institution; Language; Learning; Life; Link; Machine Learning; Manuals; Mass Spectrum Analysis; Measures; Mentors; Mentorship; Metabolism; Modeling; Morphologic artifacts; Natural Language Processing; Positioning Attribute; Principal Investigator; Productivity; Proteins; Publications; RNA; Recording of previous events; Research; Resources; Sampling; Scientist; Signal Transduction; Structure; Students; Technical Expertise; Techniques; Technology; Time; Training; Universities; Vision; Work; Xenobiotics; adduct; built environment; computational suite; computerized tools; computing resources; dark matter; data standards; design; experience; experimental study; graduate student; gut microbiota; large language model; metabolome; metabolomics; multidisciplinary; novel; novel strategies; personalized medicine; professor; programs; recruit; repository; self supervised learning; small molecule; tool; training data; user-friendly; virtual
Year | Funding IC | FY Total Cost by IC
2024 | National Institute of Dental and Craniofacial Research | $1
2024 | NIH Office of the Director | $391,851