Welcome to the first installment of Agentic Omics: When AI Reads the Book of Life. In this 24-part series, we will systematically review the state of the art of Artificial Intelligence (AI) across all major omics disciplines. We will explore how large language models, foundational transformer architectures, and eventually fully autonomous “Agentic Omics” systems are orchestrating domain-specific models to accelerate drug discovery, personalized medicine, and our fundamental understanding of biology.
Before we can dive deeply into the AI systems transforming these fields, however, we must first establish a shared vocabulary and a clear map of the territory. What exactly are the “omics,” and why are they so critical to the future of healthcare and biology? More importantly for computational scientists and ML engineers, what is the data landscape like, and how “AI-ready” is each of these distinct disciplines?
Introduction: Biology’s Layered Complexity
For much of the 20th century, biology was an observational science, deeply reliant on low-throughput experiments and reductionist thinking. Scientists would isolate a single gene, a single protein, or a single pathway, and spend years, sometimes entire careers, characterizing its function. While this approach yielded immense foundational knowledge, it struggled to capture the interconnected, holistic reality of living systems.
The suffix “-omics” denotes the comprehensive, large-scale study of a set of biological molecules. The “omics revolution” represents a paradigm shift from studying isolated biological components to analyzing entire systems simultaneously. Instead of looking at one gene, we look at all of them (Genomics). Instead of one protein, we look at the entire protein repertoire of a cell (Proteomics).
This shift was catalyzed by the Human Genome Project, completed in 2003, which provided the first comprehensive map of human DNA. Since then, technological advancements have driven an unprecedented data explosion. We can now sequence entire genomes in hours, capture the RNA expression of millions of individual cells, and predict the 3D structures of virtually all known proteins.
Living systems operate through a layered, interconnected flow of information, famously described by Francis Crick as the “Central Dogma of Molecular Biology”: DNA is transcribed into RNA, which is translated into proteins. While the reality is vastly more complex, involving intricate feedback loops, epigenetic modifications, and metabolic networks, this layered model provides a useful scaffold for understanding the various omics disciplines.
Let us explore each of these distinct layers, what they measure, why they matter, and how ready they are for the AI revolution.
1. Genomics: The Foundational Blueprint
What it measures: The complete set of DNA instructions (the genome) within an organism. Genomics focuses on the sequence of nucleotides (A, C, T, G), identifying genetic variations (SNPs, indels, structural variants), and understanding how these variations influence traits and disease susceptibility.
Why it matters: The genome is the fundamental blueprint. Variations in our DNA dictate everything from eye color to our risk for complex diseases like Alzheimer’s, cancer, and heart disease. Genomics is the bedrock of precision medicine. By identifying disease-causing mutations, we can develop targeted therapies and preventative strategies.
The Data Landscape: Genomics is arguably the most mature omics field in terms of data generation and standardization. The cost of sequencing a human genome has plummeted from billions of dollars to less than $200. Projects like the NIH All of Us Research Program (whose early-2026 genomic data release places it alongside the UK Biobank and the Million Veteran Program in scale) have generated petabytes of genomic data linked to electronic health records. Formats like FASTQ, BAM, and VCF are heavily standardized.
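To make that standardization concrete: a VCF file is plain tab-delimited text, and its core fields can be pulled apart in a few lines. This is an illustrative sketch only; production pipelines should use a dedicated parser such as pysam or cyvcf2, since real VCFs carry typed headers, INFO schemas, and per-sample genotype columns.

```python
# Minimal sketch: parsing one VCF body line into a variant record.
# Illustrative only -- ignores the header, INFO typing, and sample columns.

def parse_vcf_line(line: str) -> dict:
    """Split a tab-delimited VCF data line into its core fields."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    return {
        "chrom": chrom,
        "pos": int(pos),              # 1-based coordinate
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),        # ALT may list several alleles
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": dict(
            kv.split("=", 1) if "=" in kv else (kv, True)
            for kv in info.split(";")
        ),
    }

record = parse_vcf_line("chr1\t12345\trs123\tA\tG,T\t50\tPASS\tDP=100;DB")
print(record["pos"], record["alt"], record["info"]["DP"])  # 12345 ['G', 'T'] 100
```

Even this toy parser shows why the format is AI-friendly: every variant reduces to a small, fully structured record.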
AI Readiness: High. Genomics is exceptionally well-suited for AI. DNA is essentially a linear sequence, making it highly amenable to sequence-based deep learning architectures, particularly Transformers. As we will explore in Post 5, DNA foundation models (like DNABERT-2, Nucleotide Transformer, and Evo) are already achieving remarkable success in predicting gene expression and classifying variants directly from raw sequence data. The massive scale of available, standardized data makes genomics the current vanguard of the AI-omics revolution.
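To illustrate why a linear DNA sequence maps so naturally onto language models: early DNA models such as the original DNABERT tokenized sequences into overlapping k-mers (DNABERT-2 later switched to byte-pair encoding). A minimal sketch of that k-mer tokenization:

```python
# Sketch: turning a DNA sequence into overlapping k-mer "tokens", the
# scheme used by the original DNABERT (successors use byte-pair encoding).

def kmer_tokenize(seq: str, k: int = 6) -> list[str]:
    """Slide a window of size k over the sequence, one base at a time."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

tokens = kmer_tokenize("ACGTACGTAC", k=6)
print(tokens)  # ['ACGTAC', 'CGTACG', 'GTACGT', 'TACGTA', 'ACGTAC']
```

From here, the pipeline looks exactly like NLP: each k-mer gets a vocabulary index and an embedding, and the transformer never needs to know it is reading biology rather than English.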
2. Transcriptomics: The Dynamic Messengers
What it measures: The complete set of RNA transcripts (the transcriptome) produced by the genome under specific circumstances or in a specific cell. While the genome is largely static across all cells in an organism, the transcriptome is highly dynamic. It tells us which genes are actively being “read” and at what levels.
Why it matters: If the genome is the blueprint, the transcriptome is the construction site manager’s log. It reveals what the cell is actually doing at a given moment. Transcriptomics is crucial for understanding how cells differentiate, how they respond to environmental stress, and how gene expression goes awry in diseases like cancer.
The Data Landscape: The advent of RNA sequencing (RNA-seq) revolutionized transcriptomics. More recently, Single-Cell RNA Sequencing (scRNA-seq) has allowed researchers to measure gene expression at the resolution of individual cells, revealing immense cellular heterogeneity that bulk sequencing obscures. Initiatives like the Human Cell Atlas are mapping the transcriptomes of all human cell types, generating massive, high-dimensional datasets.
AI Readiness: Very High. Like DNA, RNA is a linear sequence, making it prime territory for language models. However, transcriptomics AI often focuses on the count matrix—the expression levels of thousands of genes across thousands of cells. Foundation models like scGPT and Geneformer, trained on tens of millions of single-cell transcriptomes, are proving highly effective at cell type annotation, gene network inference, and perturbation prediction. The field is rapidly maturing, driven by high-quality, abundant single-cell data.
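To make the count matrix concrete, here is the standard preprocessing applied before such matrices reach a model: library-size normalization followed by a log1p transform. Real workflows typically use scanpy (normalize_total plus log1p); this sketch reproduces the same math in plain NumPy.

```python
import numpy as np

# Sketch: standard preprocessing for a single-cell count matrix
# (cells x genes) -- scale every cell to the same total count
# ("counts per 10k"), then log-transform to tame the dynamic range.

def normalize_counts(counts: np.ndarray, target_sum: float = 1e4) -> np.ndarray:
    """Scale each cell to the same library size, then apply log1p."""
    per_cell_total = counts.sum(axis=1, keepdims=True)  # sequencing depth
    scaled = counts / per_cell_total * target_sum
    return np.log1p(scaled)

# Toy matrix: 3 cells x 4 genes. Cells 0 and 1 have identical gene
# proportions but were sequenced at different depths.
counts = np.array([[10, 0, 5, 5],
                   [100, 0, 50, 50],
                   [0, 20, 0, 0]], dtype=float)
X = normalize_counts(counts)
print(np.allclose(X[0], X[1]))  # True -- depth difference removed
```

Removing sequencing-depth effects like this is exactly what lets models such as scGPT compare cells sampled across thousands of different experiments.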
3. Proteomics: The Functional Machinery
What it measures: The entire set of proteins (the proteome) produced or modified by an organism or system. Proteins are the actual workhorses of the cell, carrying out the instructions encoded in the DNA and RNA. They form structural components, catalyze metabolic reactions, and facilitate cellular communication.
Why it matters: Because proteins execute cellular functions, they are the primary targets for most drugs. Understanding protein expression levels, post-translational modifications (PTMs), and, crucially, their 3D structures and interactions is essential for uncovering disease mechanisms and designing novel therapeutics.
The Data Landscape: Proteomics data is fundamentally more complex than genomic or transcriptomic data. Proteins are built from an alphabet of 20 amino acids, fold into complex 3D structures, and undergo dynamic modifications. Technologies like Mass Spectrometry (MS) are the workhorses of proteomics, but analyzing MS spectra remains a formidable challenge. Databases like UniProt and the Protein Data Bank (PDB) are central repositories.
AI Readiness: Extremely High (Structure) / Moderate (Expression/Modifications). AI has already revolutionized structural proteomics. DeepMind’s AlphaFold 2 effectively solved the single-chain protein structure prediction problem, a breakthrough that earned the 2024 Nobel Prize in Chemistry. The recent AlphaFold 3 extends this to predicting complexes of proteins with DNA, RNA, and ligands. Protein language models like Meta’s ESM-2 and its successor ESM3 (from EvolutionaryScale) are enabling de novo protein design. However, AI for quantifying protein expression levels and understanding dynamic PTMs from mass spectrometry data, while advancing, still lags behind the structure prediction revolution.
4. Metabolomics: The Chemical Fingerprint
What it measures: The complete set of small-molecule chemicals (metabolites) found within a biological sample, including sugars, amino acids, lipids, and vitamins. The metabolome represents the downstream end products of gene expression and environmental influences.
Why it matters: Metabolomics provides the most direct readout of an organism’s physiological state. It is the chemical fingerprint of what is happening in the body right now. Metabolites serve as excellent clinical biomarkers for disease diagnosis, monitoring, and predicting drug toxicity.
The Data Landscape: Metabolomics is arguably the most chemically diverse and analytically challenging of the omics. It relies heavily on Nuclear Magnetic Resonance (NMR) spectroscopy and high-resolution Mass Spectrometry. The data generated is incredibly complex, consisting of thousands of spectral peaks. A major bottleneck is identifying which specific chemical compound corresponds to a given spectral peak.
AI Readiness: Low to Moderate. Metabolomics is the least “language-model-ready” omics field. The data is not a sequence, but rather a complex physical measurement of chemical properties. While machine learning and deep learning tools (like SIRIUS and MS2DeepScore) are increasingly being used to improve spectral matching and metabolite identification, the lack of massive, standardized, perfectly annotated datasets—coupled with the sheer chemical diversity of the metabolome—makes foundation model approaches difficult. AI progress here is steady, but it has not yet experienced an “AlphaFold moment.”
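To ground the spectral-matching problem, here is the classic cosine score used to compare an observed MS/MS spectrum against a library entry. This is a deliberately simplified sketch: tools in this space use modified cosine scores with m/z tolerances and intensity weighting, whereas here peaks are simply binned onto an integer m/z grid.

```python
import numpy as np

# Sketch: cosine similarity between two MS/MS spectra, each given as a
# list of (m/z, intensity) peaks. Simplified -- peaks are binned to the
# nearest integer m/z rather than matched within a tolerance window.

def cosine_score(spec_a, spec_b, max_mz=1000):
    """Cosine similarity between two binned peak lists."""
    def to_vector(spec):
        v = np.zeros(max_mz)
        for mz, intensity in spec:
            v[int(round(mz))] += intensity
        return v
    a, b = to_vector(spec_a), to_vector(spec_b)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

query = [(91.05, 100.0), (119.08, 40.0), (147.04, 25.0)]
library_hit = [(91.06, 95.0), (119.07, 45.0), (147.05, 20.0)]
score = cosine_score(query, library_hit)
print(score > 0.95)  # True -- near-identical peak patterns
```

The hard part, and the reason deep learning helps, is that two different compounds can produce deceptively similar peak patterns while the same compound can fragment differently across instruments.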
5. Metagenomics: Decoding the Microbiome
What it measures: The collective genomic material of an entire community of microorganisms sampled directly from their natural environment (e.g., the human gut, the soil, the ocean), without the need for prior culturing in a laboratory.
Why it matters: The human body is home to trillions of microbes (the microbiome) that play a profound role in our health, influencing digestion, immunity, and even neurological function. Metagenomics allows us to identify which microbes are present and what functional genes they carry, linking microbiome composition to diseases like Inflammatory Bowel Disease (IBD), obesity, and cancer.
The Data Landscape: Metagenomic sequencing generates massive amounts of short DNA reads originating from hundreds or thousands of different species simultaneously. Assembling these fragmented reads into contigs, grouping the contigs by organism of origin (binning), and accurately classifying their taxonomy are monumental computational tasks.
AI Readiness: Moderate to High. AI is significantly improving metagenomic workflows. Transformer-based approaches and deep learning are being deployed to enhance sequence assembly, taxonomic classification, and the prediction of antibiotic resistance genes. Furthermore, machine learning models are heavily utilized to discover associations between microbial signatures and human diseases. However, challenges remain regarding the compositional nature of the data and achieving strain-level resolution.
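The compositional issue mentioned above has a standard (if imperfect) remedy: the centered log-ratio (CLR) transform, which expresses each taxon relative to the sample’s geometric mean so that models see ratios rather than raw relative abundances. A minimal sketch, using the common pseudocount convention for zeros:

```python
import numpy as np

# Sketch: centered log-ratio (CLR) transform for compositional
# microbiome data. Relative abundances only carry ratio information,
# so each sample is log-transformed relative to its own geometric
# mean. The pseudocount for zeros is a common (and debated) convention.

def clr(abundances: np.ndarray, pseudocount: float = 0.5) -> np.ndarray:
    """CLR-transform a (samples x taxa) count matrix."""
    x = abundances + pseudocount        # avoid log(0)
    log_x = np.log(x)
    geometric_mean = log_x.mean(axis=1, keepdims=True)
    return log_x - geometric_mean

counts = np.array([[120, 30, 0, 50],
                   [10, 300, 5, 85]], dtype=float)
Z = clr(counts)
# Each CLR-transformed sample sums to zero by construction.
print(np.allclose(Z.sum(axis=1), 0.0))  # True
```

Transforms like this are typically applied before feeding microbial abundance profiles into the disease-association models mentioned above.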
6. Phenomics: When Images Meet Molecules
What it measures: The systematic study of phenotypes—the observable physical and biochemical characteristics of an organism, as determined by the interaction of its genetic makeup and the environment.
Why it matters: Phenomics is the bridge between the molecular world (genomes, proteomes) and the visible world. In clinical settings, it involves analyzing high-throughput imaging, digital pathology, and electronic health records (EHRs) to connect a patient’s genetic profile with their actual clinical presentation.
The Data Landscape: Phenomics data is highly multimodal and complex. It encompasses high-resolution microscopy images, full-slide histopathology scans, medical imaging (MRI, CT), and unstructured clinical text from patient records. Large-scale biobanks (like the UK Biobank) are increasingly linking genomic data with deep phenotypic records.
AI Readiness: Very High. Because phenomics relies heavily on images and text, it perfectly intersects with the most mature areas of modern AI: Computer Vision and Natural Language Processing (NLP). Deep learning models are routinely achieving expert-level performance in classifying cellular phenotypes from microscopy, diagnosing cancer from digital pathology slides (e.g., FDA-approved tools like Paige Prostate), and extracting complex phenotypes from EHR narratives at scale.
The Data Explosion and the Limits of Traditional Bioinformatics
The combined output of these omics disciplines has created a data explosion unparalleled in the history of science. Over the past two decades, biological data generation has significantly outpaced Moore’s Law.
Historically, computational biology relied on “traditional bioinformatics”—a suite of algorithmic and statistical tools developed largely in the 1990s and 2000s. These tools, such as BLAST for sequence alignment, Hidden Markov Models for gene prediction, and differential equation models for pathway analysis, were brilliant for their time. They were handcrafted, biologically intuitive, and mathematically rigorous.
However, as a recent 2025 report by IntuitionLabs on “AI Compute Demand in Biotech” highlights, traditional bioinformatics is hitting profound scaling limits. The root causes are twofold: the sheer volume of data and the exponential complexity of biological interactions.
- The Curse of Dimensionality: When you are analyzing 20,000 genes across millions of single cells, traditional statistical methods struggle to separate signal from noise, often failing to capture non-linear relationships.
- The Integration Bottleneck: Traditional tools are highly specialized. A tool built for genomics cannot speak to a tool built for proteomics. Understanding disease requires multi-omics integration, which is practically intractable with legacy methods.
- The Handcrafted Feature Limit: Traditional models rely on human scientists defining the features to look for (e.g., “look for this specific k-mer frequency”). Deep learning, conversely, learns its own representations directly from the raw data, discovering subtle patterns that human intuition completely misses.
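To make the handcrafted-feature limit concrete, here is what such a feature looks like in practice: a fixed k-mer frequency vector, chosen by a human analyst and typically fed to a linear model or random forest. A deep model would instead learn its own representation directly from the raw sequence.

```python
from collections import Counter
from itertools import product

# Sketch: a classic handcrafted feature -- the frequency of every
# possible 3-mer in a DNA sequence, in a fixed alphabetical order.
# The scientist, not the model, decided these 64 numbers are what
# matters about the sequence.

def kmer_frequencies(seq: str, k: int = 3) -> list[float]:
    """Frequency of each possible k-mer over the ACGT alphabet."""
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]  # 64 3-mers
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(len(seq) - k + 1, 1)
    return [counts[kmer] / total for kmer in vocab]

features = kmer_frequencies("ACGTACGTACGT")
print(len(features), round(sum(features), 6))  # 64 1.0
```

The vector is interpretable and cheap, but it discards positional information entirely, which is exactly the kind of signal a learned representation can retain.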
This is precisely why AI, particularly deep learning and foundation models, is taking over. These models thrive on massive, high-dimensional datasets. They excel at learning complex, non-linear representations, and they are increasingly capable of multimodal integration.
The Importance of FAIR Data
The success of AI in omics is entirely dependent on the quality and accessibility of the underlying data. As the old adage goes: “Garbage in, garbage out.”
The biological community is increasingly recognizing the critical importance of the FAIR data principles—guidelines aimed at ensuring data is Findable, Accessible, Interoperable, and Reusable.
As noted in recent systematic evaluations (e.g., the 2025 ScienceDirect review on metadata for human pluripotent stem cells), while massive datasets exist, inconsistent metadata standardization and a lack of cross-platform coordination severely hinder data reuse. A model trained on poorly annotated data will confidently make entirely useless predictions. The future of Agentic Omics requires not just bigger models, but a fanatical commitment to FAIR data practices and rigorous community standards.
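In practice, FAIR compliance starts with checks as mundane as verifying that required metadata fields are present before a dataset enters a training corpus. The field list below is illustrative only, not any official standard; real efforts rely on community checklists in the spirit of MIAME-style minimal-information standards.

```python
# Sketch: a lightweight metadata audit of the kind FAIR practices imply.
# REQUIRED_FIELDS is a hypothetical example set, not an official schema.

REQUIRED_FIELDS = {"organism", "tissue", "assay", "platform", "sample_id"}

def audit_metadata(samples: list[dict]) -> dict:
    """Report which required fields are missing from each sample record."""
    report = {}
    for sample in samples:
        missing = sorted(REQUIRED_FIELDS - sample.keys())
        if missing:
            report[sample.get("sample_id", "<no id>")] = missing
    return report

samples = [
    {"sample_id": "S1", "organism": "Homo sapiens", "tissue": "liver",
     "assay": "RNA-seq", "platform": "Illumina NovaSeq"},
    {"sample_id": "S2", "organism": "Homo sapiens", "assay": "RNA-seq"},
]
print(audit_metadata(samples))  # {'S2': ['platform', 'tissue']}
```

Trivial as it looks, automated checks like this are what separate a dataset that can be pooled into a foundation-model corpus from one that silently corrupts it.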
The Path Forward: Toward Agentic Systems
We now have a map of the omics territory. We see that genomics and transcriptomics are highly mature for sequence-based AI; proteomics has been revolutionized by AI structure prediction; phenomics is being transformed by computer vision; while metabolomics and metagenomics present unique, ongoing challenges.
Currently, AI in biology largely operates in silos. One team trains a model to predict protein structure; another trains a separate model to annotate cell types.
The ultimate promise of “Agentic Omics”—the driving vision of this series—is moving beyond these isolated tasks. It is about building autonomous AI agents powered by Large Language Models that can reason over biological problems, and dynamically call upon these various domain-specific omics models (AlphaFold, scGPT, DNABERT) as tools.
Imagine an agent that observes a genomic variant, queries an AI tool to predict its structural impact on a protein, uses another tool to assess how that structural change alters the cell’s transcriptomic profile, and finally designs a candidate molecule to rectify the defect.
That is the future we are building toward. In our next post, “Foundation Models Meet Biology: The Transformer Revolution in Life Sciences,” we will dive into the technical architecture that makes this possible, exploring how the very same algorithms that power ChatGPT are learning to speak the language of life.
Glossary
- Omics: An umbrella term for fields of biology, such as genomics, proteomics, or metabolomics, that comprehensively analyze a specific layer of biological molecules.
- Central Dogma: The framework describing the flow of genetic information: DNA is transcribed into RNA, which is translated into proteins.
- Foundation Model: A large-scale AI model trained on a vast quantity of unlabelled data (usually via self-supervised learning) that can be adapted (fine-tuned) to a wide range of downstream tasks.
- Transcriptome: The complete set of RNA transcripts produced by the genome under specific circumstances.
- Proteome: The entire complement of proteins that is or can be expressed by a cell, tissue, or organism.
- Metabolome: The complete set of small-molecule chemicals found within a biological sample.
- FAIR Principles: Guidelines ensuring data is Findable, Accessible, Interoperable, and Reusable.
- BAM/VCF/FASTQ: Standardized file formats for storing genomic sequence data and identified variants.
References
- IntuitionLabs. (2025). AI Compute Demand in Biotech: 2025 Report & Statistics. https://intuitionlabs.ai/articles/ai-compute-demand-biotech
- Lin, Z., et al. (2023). “Evolutionary-scale prediction of atomic-level protein structure with a language model.” Science, 379(6637), 1123-1130.
- Abramson, J., et al. (2024). “Accurate structure prediction of biomolecular interactions with AlphaFold 3.” Nature, 630, 493–500.
- Zhou, Z., et al. (2024). “DNABERT-2: Efficient Foundation Model and Benchmark for Genomic Data Analysis.” ICLR 2024.
- Cui, H., et al. (2024). “scGPT: toward building a foundation model for single-cell multi-omics using generative AI.” Nature Methods.
- University of Washington Medicine Newsroom. (2026). NIH All of Us Research Program releases genomic dataset. https://newsroom.uw.edu/news-releases/nih-all-us-research-program-releases-genomic-dataset
- ScienceDirect. (2025). “How FAIR is metadata for human pluripotent stem cells?”. https://www.sciencedirect.com/science/article/pii/S2213671125002486
- Springer Nature. (2025). “Bioinformatics and artificial intelligence in genomic data analysis: current advances and future directions.” Molecular Genetics and Genomics.