In the first post of this series, we mapped the omics landscape: genomics, transcriptomics, proteomics, metabolomics, metagenomics, phenomics. The next question is obvious: why did AI suddenly get so good at several of these fields at once?
The short answer is that biology turned out to be unusually compatible with the same family of models that transformed natural language processing. DNA, RNA, proteins, and even single-cell expression matrices are not “language” in any literal sense, but they are structured symbol systems with long-range dependencies, rich context, and vast quantities of unlabeled data. That is exactly the setting where self-supervised foundation models thrive.
This post explains why transformer-era AI translated so effectively into life sciences, what had to change for biological data, and where the analogy to language starts to fail. The key point is not that biology is secretly English with four letters. It is that modern biology produces massive, partially structured, partially observed information streams — and transformer-style models are unusually good at learning compressed, reusable representations from such streams.
Why biology looked attractive to foundation-model researchers
Three things made biology ripe for foundation models.
First, biology has enormous unlabeled corpora. Public databases contain hundreds of millions of protein sequences, millions of genomes, and now tens to hundreds of millions of single-cell profiles. Unlike many clinical ML tasks, these raw sequences and matrices exist even when expert labels are sparse. That matters because self-supervised learning works best when there is abundant raw data but expensive downstream annotation. A 2025 review in National Science Review argues this is one reason foundation models gained traction so quickly in bioinformatics: they can exploit large-scale unlabeled repositories that classical supervised pipelines leave mostly untapped [1].
Second, biological systems exhibit long-range dependence. A nucleotide change hundreds of bases from a binding site can alter transcription-factor binding. Amino acids distant in sequence can be adjacent in the folded 3D structure. In single-cell data, the importance of one gene depends on the coordinated state of many others. Transformer architectures were built precisely to model such context-dependent interactions across sequences or token sets.
Third, biological data increasingly arrives in standardized digital formats that are reusable across tasks: FASTA for sequences, PDB/mmCIF for structures, AnnData for single-cell matrices, and so on. That standardization is not glamorous, but it makes pretraining possible.
The result is that biology now has what language modeling had a few years earlier: huge corpora, portable representations, and a backlog of downstream tasks suffering from limited labels.
The central analogy: DNA as code, proteins as sentences, cells as documents
The seductive story is that biology is a language problem.
There is some truth to that. DNA is a sequence over a four-character alphabet. Proteins are sequences over twenty amino acids. Regulatory motifs recur in patterns. Mutations change meaning depending on context. Single-cell models often treat a cell like a “document” and genes like “words,” then learn embeddings that capture latent state.
That framing has been productive because it suggests concrete technical moves:
- pretrain on huge unlabeled corpora
- learn contextual embeddings
- transfer those embeddings to many downstream tasks
- use masked-token or next-token objectives instead of full manual annotation
This recipe works. DNABERT-2, Nucleotide Transformer, Evo, ESM-family models, scGPT, and related systems are all downstream of that insight [2-6].
But the analogy is only useful if we do not oversell it.
Natural language is optimized for communication between agents. Biological sequences are not. They are products of evolution, constraint, accident, redundancy, and physics. A genome is not “trying” to be readable. Protein function is not determined by syntax alone but by folding, environment, cofactors, and cellular context. Single-cell matrices are not true sequences at all; gene ordering is mostly imposed by the analyst, not the biology.
So the right claim is narrower: biology is often learnable with language-model machinery, but it is not reducible to language. That distinction matters because it explains both the breakthroughs and the persistent blind spots.
What had to change: tokenization is not a side detail in biology
Tokenization was one of the earliest places where naïve NLP transfer ran into biological reality.
DNA tokenization: from k-mers to subword-like compression
Early DNA language models often used k-mers, fixed-length chunks such as 3-mers or 6-mers, because genomes have no spaces or obvious word boundaries. K-mers are simple, but they are also inefficient. Overlapping k-mers duplicate information, inflate the tokenized sequence length, and impose arbitrary boundaries that may not align with biological motifs.
DNABERT-2 made a strong case for replacing fixed k-mers with byte-pair encoding (BPE)-style tokenization adapted to genomic data [2]. The model also paired this with architectural changes such as ALiBi positional bias to improve efficiency and length generalization. On the GUE benchmark, DNABERT-2 achieved performance comparable to state-of-the-art genomic models while being substantially more efficient [2]. The point is broader than one paper: in biology, tokenization is not preprocessing trivia. It determines what patterns the model can express and how much sequence context it can afford to read.
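To make the contrast concrete, here is a minimal sketch of both schemes, using a toy 20-base sequence and a greedy merge loop. It illustrates the underlying idea only; it is not DNABERT-2's actual tokenizer.

```python
from collections import Counter

def kmer_tokenize(seq: str, k: int = 6) -> list[str]:
    """Overlapping k-mers: every position starts a token, so a
    length-L sequence yields L - k + 1 highly redundant tokens."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def bpe_merge_step(tokens: list[str]) -> list[str]:
    """One byte-pair-encoding merge: fuse the most frequent adjacent
    pair everywhere it occurs. Real tokenizers repeat this until a
    target vocabulary size is reached."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == (a, b):
            merged.append(a + b)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

seq = "ATGCGATACGCTTGCGATCG"
print(len(kmer_tokenize(seq)))   # 15 overlapping 6-mer tokens for 20 bases

tokens = list(seq)               # start from single nucleotides
for _ in range(8):               # a few merge rounds
    tokens = bpe_merge_step(tokens)
print(tokens)                    # fewer, non-overlapping, variable-length tokens
```

The practical payoff is compression: non-overlapping, variable-length tokens let the same context window cover far more raw sequence.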
Protein tokenization: simpler alphabet, harder semantics
Proteins are easier to tokenize superficially because amino acids already form a discrete alphabet of size 20. But the semantics are harder. In language, nearby words often capture local meaning. In proteins, residues far apart in sequence may define the binding pocket or structural core together. Protein language models therefore benefited quickly from transformer-style contextualization, but success depended on scale and the ability to recover information about structure and function from sequence statistics.
Single-cell tokenization: not truly sequential data
Single-cell models face a different problem: expression matrices are not naturally ordered sentences. scGPT and related models convert gene-expression profiles into token-value representations, often ranking genes or combining gene identity with expression information [4]. This is clever, but it also means the model’s “language” is partially constructed by design choices. The impressive performance of single-cell foundation models should therefore be interpreted as proof that transformers can learn reusable cellular representations — not proof that cells are literally language-like.
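For intuition, here is a minimal sketch of the rank-based idea, assuming a made-up six-gene expression vector. Real models such as scGPT use learned gene and expression embeddings and richer schemes; this only shows how an unordered profile becomes a token sequence.

```python
import numpy as np

# Toy expression profile for one cell; gene names are illustrative.
genes = np.array(["CD3D", "MS4A1", "NKG7", "LYZ", "GNLY", "CD79A"])
counts = np.array([12.0, 0.0, 55.0, 3.0, 40.0, 0.0])

# Rank-based "sentence": order expressed genes from highest to lowest
# expression and drop zeros. The ordering is imposed by the analyst,
# not by any biological sequence -- exactly the caveat above.
expressed = counts > 0
order = np.argsort(-counts[expressed])
cell_sentence = genes[expressed][order]
print(cell_sentence.tolist())    # ['NKG7', 'GNLY', 'CD3D', 'LYZ']

# Alternative: keep (gene, binned-expression) pairs, closer in spirit
# to models that combine gene identity with expression value.
bins = np.digitize(counts[expressed], bins=[1.0, 10.0, 50.0])
print(list(zip(genes[expressed], bins)))
```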
Attention, context length, and why long-range biology is still hard
One reason transformers matter in biology is their ability to model interactions across context. But biology immediately exposed a core limitation: many biologically important contexts are much longer than standard NLP windows.
Consider the scales involved:
- enhancers can regulate genes over long genomic distances
- chromatin organization creates 3D contacts not obvious in 1D sequence
- proteins can exceed the context lengths used by smaller language models
- whole genomes are vastly longer than the windows most transformers can process directly
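A back-of-the-envelope calculation shows how quickly dense attention becomes untenable at these scales. The head count and precision below are illustrative assumptions, not the configuration of any particular model.

```python
def attention_scores_gib(context_len: int, n_heads: int = 16,
                         bytes_per_score: int = 2) -> float:
    """Memory for the raw L x L attention-score matrices of a single
    layer and batch element in fp16. Ignores activations, KV cache,
    and every other cost, so this is a generous lower bound."""
    return n_heads * context_len ** 2 * bytes_per_score / 2 ** 30

for label, L in [("typical NLP window", 8_192),
                 ("one enhancer-gene span", 100_000),
                 ("a bacterial genome", 4_000_000)]:
    print(f"{label} (L={L:,}): {attention_scores_gib(L):,.1f} GiB")

# -> roughly 2 GiB, 298 GiB, and about 476,837 GiB respectively
```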
This is why biological foundation models increasingly diverged from vanilla transformer design.
DNABERT-2 improved positional handling and efficiency [2]. Nucleotide Transformer scaled transformer pretraining across large genomic corpora and showed strong transfer across a benchmark of 18 genomics tasks [3]. Evo went further by using the StripedHyena architecture rather than standard dense attention, specifically to support long-context genomic modeling at single-nucleotide resolution [5]. The Science paper is notable not only because of the 7B-parameter model size, but because it demonstrated that architecture choices matter as much as parameter count when the task is “read biology at genome scale” rather than “finish a paragraph” [5].
This is a useful corrective to hype. In life sciences, there is no single winning architecture. Transformers dominate the conversation, but hybrid and alternative long-context models may outperform them in some biological regimes.
How biological foundation models are trained
Most biological foundation models use one of three self-supervised strategies.
1. Masked modeling
This is the BERT-style recipe: hide part of the input and ask the model to recover it from context.
It works well when local context is informative and when the goal is to learn representations rather than open-ended generation. DNABERT-style models and many protein language models benefited from masked objectives [2, 3]. scGPT also borrows from this general idea while adapting it to single-cell settings [4].
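As a minimal sketch, assuming single-nucleotide tokens for readability (DNABERT-style models actually mask BPE tokens, and BERT's full 80/10/10 corruption rules are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"A": 0, "C": 1, "G": 2, "T": 3, "[MASK]": 4}

seq = "ATGCGATACGCTTGCA"
tokens = np.array([vocab[b] for b in seq])

# Hide roughly 15% of positions, BERT-style.
mask = rng.random(tokens.shape) < 0.15
inputs = np.where(mask, vocab["[MASK]"], tokens)   # what the model sees
targets = np.where(mask, tokens, -100)             # -100: conventional "ignore" label

# Training then minimizes, conceptually:
#   cross_entropy(model(inputs)[mask], tokens[mask])
# i.e., the model must reconstruct each hidden base from its context.
print(inputs)
print(targets)
```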
2. Autoregressive next-token prediction
This is the GPT-style recipe: predict the next token.
Autoregressive training is especially useful when generation matters. Evo is the clearest example in this post. Its 2024 Science paper showed that a long-context generative genomic model could perform competitive zero-shot function prediction across proteins, ncRNAs, and regulatory DNA while also generating functional CRISPR-Cas and transposon systems [5]. That is a strong result because it links representation learning and design in one framework.
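The corresponding autoregressive setup is mechanically simpler: inputs and targets are the same token stream shifted by one position. Again a toy sketch, not Evo's training code:

```python
import numpy as np

vocab = {"A": 0, "C": 1, "G": 2, "T": 3}
seq = "ATGCGATACGCTTGCA"
tokens = np.array([vocab[b] for b in seq])

# Next-token prediction: at position i the model may attend only to
# tokens[: i + 1] (enforced by a causal mask) and predicts tokens[i + 1].
inputs, targets = tokens[:-1], tokens[1:]
print(inputs, targets)

# Loss, conceptually: cross_entropy(model(inputs), targets), summed over
# positions. Generation samples one base at a time and feeds it back in,
# which is what makes sequence design possible in the same framework.
```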
3. Multimodal masked generation and alignment
The frontier is increasingly multimodal rather than sequence-only. Protein models now attempt to reason jointly over sequence, structure, and function. ESM3, released in 2024, is an important example: a multimodal generative model over protein sequence, structure, and function tracks, enabling promptable generation and editing workflows [6]. Although the field is still sorting out which of these systems are most reproducible and broadly accessible, the direction is clear. Biology is moving from single-modality pretraining toward foundation models that align several views of the same molecule or cell.
A 2025 Nature Methods perspective on multimodal foundation transformers for multiscale genomics makes a similar argument at the genomics level: future systems will have to integrate sequence, regulation, chromatin context, and higher-order biological information rather than treat raw sequence as sufficient on its own [7].
The major model families in 2026
By 2026, it is more useful to think in families than in single flagship models.
Genomic foundation models
The genomic family includes DNABERT-2, Nucleotide Transformer, HyenaDNA-like long-context systems, and Evo-class models [2, 3, 5, 8].
Their major use cases include:
- promoter and enhancer prediction
- splice-site and regulatory-element prediction
- variant effect prioritization
- gene-expression prediction from sequence
- representation learning for low-label genomic tasks
The 2025 Nature Communications benchmark comparing five DNA foundation models is especially important because it cuts through marketing [8]. Across sequence classification, expression prediction, variant effect quantification, and tasks involving topologically associating domains (TADs), performance varied substantially by task. No model dominated everything. That is a healthy sign for the field: we are past the stage where one leaderboard screenshot should settle the matter.
Protein foundation models
Protein AI remains the most mature branch of biological foundation modeling. Earlier systems such as ESM-2 and ProtTrans established that large unsupervised sequence models can recover structure and function signals. The 2024-2025 frontier is more ambitious: models like ESM3 aim to operate jointly over sequence, structure, and functional descriptions, blurring the line between representation model and design engine [6].
This matters because proteins are where biological language models showed their deepest empirical payoff first. Sequence-only learning unexpectedly captured structural information. That discovery helped convince the field that self-supervised pretraining on biology was not merely a metaphorical exercise.
Single-cell foundation models
Single-cell models are the fastest-moving category conceptually. scGPT, published in Nature Methods in 2024, pretrained on more than 33 million cells and showed transfer across cell-type annotation, batch integration, multi-omic integration, perturbation prediction, and gene-network inference [4]. Since then, the literature has exploded.
A 2025 review in Experimental & Molecular Medicine describes both the promise and the caveats of this space [9]. The promise is obvious: single-cell biology now has sufficiently large corpora to support reusable pretrained representations. The caveats are equally important: data quality is inconsistent, rare cell states are hard, interpretability remains weak, and the “sequence” formalism is partly artificial [9].
That frankness is welcome. Single-cell foundation models are real progress, but they are not yet a solved abstraction.
Where foundation models are already useful
The case for biological foundation models does not rest on chat demos. It rests on transfer.
These models are valuable when pretrained representations reduce the amount of labeled data needed for downstream biology. Nucleotide Transformer showed that large DNA models can be fine-tuned cheaply across multiple genomics tasks and can attend to meaningful genomic elements [3]. DNABERT-2 improved efficiency without giving up benchmark competitiveness [2]. scGPT demonstrated one pretrained model supporting several biologically distinct downstream tasks [4]. Evo showed zero-shot prediction and sequence design in one system [5].
That is the genuine breakthrough: one model family, many downstream assays.
This does not mean specialists disappear. It means the baseline changes. Instead of training a fresh model from scratch for every assay, lab, or cohort, researchers increasingly start from a pretrained biological prior.
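In code, that new baseline often looks like a linear probe: freeze a pretrained encoder, extract embeddings, and fit a small supervised head on the few labels available. The `embed` function below is a stand-in for calling any of the models above, not a real API, and the data are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def embed(sequences: list[str]) -> np.ndarray:
    """Stand-in for a pretrained encoder: in practice you would run a
    DNA or protein foundation model and mean-pool its hidden states.
    Here we fabricate fixed-size vectors purely for illustration."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(sequences), 256))

sequences = ["ATGC..."] * 200        # placeholder inputs
labels = np.arange(200) % 2          # placeholder binary labels

X = embed(sequences)                 # frozen representations
probe = LogisticRegression(max_iter=1000)
scores = cross_val_score(probe, X, labels, cv=5)
print(scores.mean())                 # ~chance here, since the embeddings are fake
```

With real embeddings, the entire supervised component has only a few hundred parameters, which is why limited-label tasks benefit so much from a pretrained prior.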
Where the analogy breaks down
This is the part that matters most if you want to stay honest.
Biology is physical, not just statistical
A sequence model sees symbols. Biology is executed in space, time, and chemistry. A protein’s function depends on folding, disorder, complexes, ligands, localization, and cellular environment. A genome’s effect depends on chromatin state, developmental timing, and 3D contact structure. Sequence models can infer some of this indirectly, but not all of it.
The same sequence can mean different things in different contexts
Cell state, species, assay protocol, and tissue all change interpretation. This is especially acute in transcriptomics, where identical gene programs can play different roles in different microenvironments. Foundation models help, but they do not erase context dependence.
Benchmarks can overstate generalization
Biology has a long history of evaluation leakage: homologous proteins in train and test, near-duplicate sequences across splits, tissue or batch confounding, easy negatives, and proxy labels that do not reflect biological utility. The 2025 DNA-foundation benchmark is helpful partly because it tries to evaluate across diverse tasks rather than one cherry-picked setting [8]. But the broader warning stands: biological AI is easy to over-claim if evaluation is weak.
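One concrete hygiene step is checking splits for near-duplicates before trusting a benchmark number. A minimal sketch, using a crude string-similarity proxy on made-up sequences (real pipelines cluster with dedicated tools such as MMseqs2 or CD-HIT):

```python
from difflib import SequenceMatcher

def identity(a: str, b: str) -> float:
    """Rough pairwise-similarity proxy; adequate for a demo only."""
    return SequenceMatcher(None, a, b).ratio()

train = ["MKTAYIAKQRQISFVKSHFSRQ", "MADEEKLPPGWEKRMSRSSGRV"]
test = ["MKTAYIAKQRQISFVKSHFSRK"]    # one substitution away from train[0]

for t in test:
    leaks = [s for s in train if identity(t, s) > 0.9]
    if leaks:
        print(f"possible leakage: {t!r} is a near-duplicate of {leaks}")
```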
Mechanism is not the same as prediction
A model can predict variant effect or cell identity without discovering mechanism. In practice, many biological foundation models are best seen as powerful priors and ranking engines, not as mechanistic theories of life. This is still useful. It is just different from understanding.
The emerging frontier: from foundation models to agentic biology
Why does this matter for the rest of this series? Because foundation models are the substrate on which agentic omics will be built.
An agentic system does not need one universal biology model that knows everything. It needs a strong portfolio of pretrained domain models that can be called, compared, and composed. A future biological agent might, as the sketch after this list illustrates:
- use a genomic model to score regulatory variants
- call a protein model to estimate structural consequences
- use a single-cell model to infer disease-state shifts
- consult literature and pathway databases
- synthesize the outputs into a hypothesis and experimental plan
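In code, that composition pattern might look like the following. Every helper here is hypothetical and stands in for a call to a pretrained domain model or a database; none of these are real APIs.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for pretrained domain models and databases.
def genomic_model_score(variant: str) -> float:
    return 0.87                                   # pretend regulatory-impact score

def protein_model_effect(variant: str) -> str:
    return "destabilizes binding pocket"          # pretend structural readout

def single_cell_model_shift(variant: str) -> str:
    return "shift toward inflammatory state"      # pretend expression delta

@dataclass
class Hypothesis:
    variant: str
    regulatory_score: float
    structural_effect: str
    cell_state_shift: str

def investigate(variant: str) -> Hypothesis:
    """Compose independent domain models into one testable hypothesis."""
    return Hypothesis(
        variant,
        genomic_model_score(variant),
        protein_model_effect(variant),
        single_cell_model_shift(variant),
    )

print(investigate("chr1:12345A>G"))
```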
That workflow only becomes plausible once reusable foundation models exist in each modality. In other words, the transformer revolution in life sciences is not the endpoint. It is the tooling layer that makes agentic omics thinkable.
The sober view
The field has earned excitement, but not surrender.
Biological foundation models are already changing representation learning, variant prioritization, protein modeling, and single-cell analysis. Yet the strongest systems still fail on causality, robustness, rare biology, and multi-scale context. The models are real; the hype is also real. The job in 2026 is to separate the two.
My view is that the most durable contribution of this wave will not be any one model name. It will be the normalization of a new workflow: pretrain once at scale, adapt many times, and treat biology as a space where useful latent representations can be learned before labels are available.
That is a profound shift from classical bioinformatics. It is also why the next bottlenecks are less about whether transformers “work” and more about data curation, benchmarking, multimodal integration, and tool orchestration.
In the next post, we move from models to plumbing: how raw biological data becomes AI-ready in the first place.
Glossary
ALiBi — Attention with Linear Biases, a positional-bias method that helps sequence models extrapolate to longer contexts.
Autoregressive model — A model trained to predict the next token in a sequence.
BPE (Byte Pair Encoding) — A tokenization strategy that merges frequent symbol pairs into variable-length tokens; adapted from NLP to genomic sequences.
Embedding — A learned numerical representation of a token, sequence, gene, protein, or cell in a latent space.
Foundation model — A large pretrained model trained on broad data and adapted to many downstream tasks.
GUE benchmark — Genome Understanding Evaluation benchmark used to assess genomic language models across multiple tasks.
k-mer — A substring of length k extracted from a biological sequence.
Masked modeling — Self-supervised training in which some input tokens are hidden and the model learns to recover them.
Multimodal model — A model trained on more than one data type, such as sequence plus structure, or transcriptomics plus spatial information.
Self-supervised learning — Training that derives learning targets from the raw data itself instead of requiring manual labels.
References
- Guo F, Guan R, Li Y, Liu Q, Wang X, Yang C, Wang J. Foundation models in bioinformatics. National Science Review. 2025;12(4):nwaf028. doi:10.1093/nsr/nwaf028.
- Zhou Z, Ji Y, Li W, et al. DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genomes. ICLR 2024. OpenReview: oMLQB4EZE1.
- Dalla-Torre H, Gonzalez L, Mendoza-Revilla J, Lopez Carranza N, et al. Nucleotide Transformer: building and evaluating robust foundation models for human genomics. Nature Methods. 2025;22(2):287-297. doi:10.1038/s41592-024-02523-z.
- Cui H, Wang C, Maan H, et al. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nature Methods. 2024;21(8):1470-1480. doi:10.1038/s41592-024-02201-0.
- Nguyen E, Poli M, Durrant MG, et al. Sequence modeling and design from molecular to genome scale with Evo. Science. 2024;386(6723):eado9336. doi:10.1126/science.ado9336.
- Hayes T, Rao R, Akin H, et al. Simulating 500 million years of evolution with a language model. bioRxiv. 2024. doi:10.1101/2024.07.01.600583.
- Multimodal foundation transformer models for multiscale genomics. Nature Methods. 2025. doi:10.1038/s41592-025-02918-6.
- Benchmarking DNA foundation models for genomic and genetic tasks. Nature Communications. 2025. doi:10.1038/s41467-025-65823-8.
- Single-cell foundation models: bringing artificial intelligence into cell biology. Experimental & Molecular Medicine. 2025. doi:10.1038/s12276-025-01547-5.