Welcome back to Agentic Omics: When AI Reads the Book of Life. In our first post, we mapped the complex, multi-layered territory of modern biological data. We saw that while fields like metabolomics are still wrangling with extreme chemical complexity, disciplines defined by sequences—genomics, transcriptomics, and proteomics—are experiencing a massive influx of AI-ready data.
But data alone isn’t enough. The true catalyst of the current biological AI revolution is a specific architectural breakthrough originally designed for machine translation (famously, translating English to French): the Transformer.
This post delves into the foundational models now dominating computational biology. We’ll explore why biological sequences are remarkably amenable to language model approaches, detail the key architectural adaptations required to make Transformers work on molecules, and provide a landscape of the biological foundation models currently reshaping the field.
Biology as a “Language” Problem
To understand why Transformers work so well in biology, we have to look at the central dogma of molecular biology through an algorithmic lens.
The core information systems of life are profoundly linear. DNA is a sequence of four nucleotides (A, C, T, G). RNA is transcribed as a similar four-letter sequence. Proteins are translated as linear strings of 20 distinct amino acids. In essence, DNA is code, proteins are sentences, and cells are complex documents expressing an intricate state of being.
For decades, bioinformaticians treated these sequences primarily as strings to be aligned (e.g., using BLAST) or searched for specific handcrafted motifs (like transcription factor binding sites).
Natural Language Processing (NLP) underwent a similar evolution. Early NLP relied on handcrafted rules, rigid grammars, and bag-of-words models that ignored the context of a sentence. Then came the Transformer architecture (introduced in the landmark 2017 paper “Attention Is All You Need”), followed by models like BERT and GPT. Transformers revolutionized NLP because their attention mechanism allows them to weigh the contextual importance of every word in a sequence relative to every other word, learning deep, hidden semantic relationships without human supervision.
Biological sequences possess a similar deep, hidden syntax. A mutation at position 100 in a gene might completely alter the function of a regulatory element at position 5,000 due to the 3D folding of the chromatin. Traditional algorithms struggle with these long-range dependencies. Transformers, however, are explicitly designed to capture them.
Adapting Transformers for the Book of Life
While the analogy holds conceptually, dropping a standard LLM onto a genomic sequence rarely works well out of the box. Biological “language” has distinct properties that necessitate critical architectural adaptations.
1. The Tokenization Challenge: From k-mers to BPE
In NLP, text is broken down into words or sub-words called “tokens.” How do you tokenize DNA, which has no spaces, punctuation, or obvious “words”?
Early genomic language models relied on k-mer tokenization, where the sequence is broken into overlapping chunks of length k (e.g., 3-mers or 6-mers). While intuitive, k-mers have severe limitations. They redundantly process overlapping information, drastically increasing computational overhead, and fail to capture biologically meaningful functional units of varying lengths.
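To make that redundancy concrete, here is a minimal sketch of overlapping k-mer tokenization (the function name and example sequence are ours, not from any specific library). With a stride of 1, an L-base sequence produces L − k + 1 tokens of length k, so the model re-reads nearly every base k times:

```python
def kmer_tokenize(seq: str, k: int = 6, stride: int = 1) -> list[str]:
    """Slide a window of length k across the sequence.

    With stride=1 the k-mers overlap, so each base is re-processed
    up to k times -- the redundancy described above. With stride=k
    the chunks are disjoint, but any trailing partial chunk is lost.
    """
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

seq = "ATGGCCATTGTAATGG"
overlapping = kmer_tokenize(seq, k=6, stride=1)      # 11 tokens for 16 bases
non_overlapping = kmer_tokenize(seq, k=6, stride=6)  # 2 tokens, tail discarded
```

Neither variant adapts token boundaries to the underlying biology, which is exactly the limitation BPE addresses.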
A major breakthrough—demonstrated vividly by models like DNABERT-2 (Zhou et al., ICLR 2024)—was the adaptation of Byte Pair Encoding (BPE) for genomes. BPE is a data compression technique (widely used in models like GPT-4) that iteratively calculates the frequency of adjacent characters and merges the most frequent pairs. In genomics, BPE constructs a vocabulary of the most frequently seen sub-sequences.
As highlighted in recent 2025 benchmarking studies on tokenizer selection, BPE achieves a 4–5x compression ratio on DNA, drastically reducing the sequence length and allowing the model’s attention mechanism to process much larger genomic regions simultaneously.
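The merge loop at the heart of BPE is simple enough to sketch in a few lines. This is a toy illustration of the algorithm on a short sequence, not DNABERT-2’s actual tokenizer (which is trained on genome-scale corpora with a vocabulary of thousands of entries):

```python
from collections import Counter

def train_dna_bpe(seq: str, num_merges: int):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent
    pair of tokens, starting from single nucleotides."""
    tokens = list(seq)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]   # most frequent adjacent pair
        merges.append((a, b))
        merged, i = [], 0
        while i < len(tokens):                # greedy left-to-right merge
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

merges, tokens = train_dna_bpe("ATATATGCGCATAT", num_merges=3)
# 14 single-nucleotide tokens compress to 5 multi-nucleotide tokens
```

Even on this tiny example the sequence length drops by roughly a factor of three; at genome scale, frequent motifs become single vocabulary entries.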
2. Attention and Long Sequences
Biological sequences are inherently long. The human genome is roughly 3 billion base pairs, and a single gene can span over two million base pairs (the dystrophin gene is the classic example). Standard transformer attention mechanisms scale quadratically with sequence length—meaning a sequence twice as long requires four times the compute. This creates a hard ceiling on context windows.
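The arithmetic behind that ceiling is easy to check: the self-attention score matrix has one entry per (query, key) pair, so its size is the square of the sequence length:

```python
def attention_scores(seq_len: int) -> int:
    """Number of entries in a single-head self-attention score matrix:
    every position attends to every other position."""
    return seq_len ** 2

# Doubling the input quadruples the work:
assert attention_scores(8_192) == 4 * attention_scores(4_096)

# Naive attention over a whole human genome (~3e9 bp, one token per
# base) would need on the order of 9e18 scores per layer per head --
# far beyond any hardware.
print(f"{attention_scores(3_000_000_000):.1e}")
```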
To overcome this, biological foundation models employ specialized attention mechanisms. DNABERT-2 introduced Attention with Linear Biases (ALiBi) to replace standard positional embeddings, allowing the model to extrapolate beyond its trained sequence length. Other models, like HyenaDNA, discard standard attention entirely in favor of implicitly parameterized long convolutions, achieving sub-quadratic scaling and context windows of up to 1 million base pairs.
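The core of ALiBi is a fixed, distance-proportional penalty added to the raw attention scores; nothing about it is learned per position, which is why it extrapolates. A minimal sketch (single head, one illustrative slope; the original ALiBi paper assigns a geometric sequence of slopes across heads):

```python
def alibi_bias(seq_len: int, slope: float) -> list[list[float]]:
    """ALiBi replaces learned positional embeddings with a fixed
    penalty added to each attention score, growing linearly with
    the query-key distance. Symmetric form for bidirectional models."""
    return [[-slope * abs(i - j) for j in range(seq_len)]
            for i in range(seq_len)]

bias = alibi_bias(5, slope=0.5)
# bias[i][j] depends only on |i - j|, so the same rule applies to
# positions beyond the training length -- the extrapolation property.
```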
3. Pre-training Strategies: Teaching the Model to Read
How do you train a biological language model when there are no “labels” for most of the sequence? You use self-supervised learning, leveraging the massive databases of raw sequences (like GenBank or UniProt) that have been accumulating for decades.
- Masked Language Modeling (MLM): The model is fed a sequence with random tokens hidden (masked) and tasked with predicting them based on the surrounding context. This forces the model to learn the intrinsic rules of biological grammar. Models like ESM (Meta’s protein language model) and DNABERT utilize MLM.
- Next-Token Prediction: Similar to GPT, the model learns to predict the next nucleotide or amino acid in a sequence. This is the foundation of generative models like ProGen, which can design novel, functional proteins that do not exist in nature.
- Contrastive Learning: Modern multi-modal models often use contrastive learning to align different types of biological data. For example, recent models align a protein’s sequence representation with its structural representation or its functional text description, teaching the model that a specific sequence “means” a specific biological function.
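As a concrete sketch of the MLM objective on a tokenized DNA sequence (simplified: production recipes like BERT’s also swap some masked positions for random tokens, which we omit here):

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens: list[str], mask_rate: float = 0.15, seed: int = 1):
    """BERT-style masking for MLM pre-training: hide a fraction of
    tokens and record the originals as prediction targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok      # ground truth the model must recover
            masked.append(MASK)
        else:
            masked.append(tok)
    return masked, targets

tokens = list("ATGGCCATTGTAATGGCCGC")
masked, targets = mask_tokens(tokens)
# The training loss is computed only at the masked positions.
```

Because the targets come from the sequence itself, no human annotation is needed, which is what makes decades of raw GenBank and UniProt depositions usable as training data.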
The Landscape of Biological Foundation Models
The rapid convergence of immense biological datasets and scaled transformer architectures has given rise to a rich ecosystem of foundation models. As detailed in the 2024 survey “Progress and opportunities of foundation models in bioinformatics” (Briefings in Bioinformatics), we can broadly categorize these into three domains:
- Protein Language Models (pLMs): The most mature category. The Evolutionary Scale Modeling (ESM) family—begun at Meta, with the recent ESM-3 released by the spinoff EvolutionaryScale—represents a pinnacle of this approach. Trained on billions of protein sequences, these models can predict protein structure, identify binding sites, and assess the impact of mutations with startling accuracy, often matching or complementing structural models like AlphaFold.
- Genomic Language Models (gLMs): Models like DNABERT-2, the Nucleotide Transformer, and the massively ambitious Evo (a 7B-parameter model from the Arc Institute, trained on millions of prokaryotic and phage genomes) are learning to read the genome. They can predict promoter regions, splice sites, and transcription factor binding affinities directly from unannotated DNA.
- Single-Cell Foundation Models: Perhaps the most exciting frontier is transcriptomics. Models like scGPT and Geneformer are trained on tens of millions of individual cell expression profiles. Instead of sequence tokens, their “tokens” are genes, and their “sentences” are the expression levels of those genes within a single cell. These models can predict how a cell will react to a drug perturbation without ever running a wet-lab experiment.
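To make the “genes as tokens” idea concrete, here is a simplified sketch of Geneformer-style rank value encoding: a cell becomes a sequence of gene symbols ordered by expression, highest first. (The real pipeline first normalizes each gene by its median expression across the whole corpus, which we omit; the gene names and values below are purely illustrative.)

```python
def rank_encode_cell(expression: dict[str, float],
                     max_genes: int = 2048) -> list[str]:
    """Simplified rank value encoding: order expressed genes by
    expression level, drop unexpressed genes, truncate to the
    model's context length."""
    expressed = {g: x for g, x in expression.items() if x > 0}
    ranked = sorted(expressed, key=expressed.get, reverse=True)
    return ranked[:max_genes]

cell = {"CD3E": 9.2, "GAPDH": 30.1, "INS": 0.0, "ACTB": 25.4}
tokens = rank_encode_cell(cell)
# -> ['GAPDH', 'ACTB', 'CD3E']; INS is dropped (not expressed)
```

The transformer then learns gene-gene context from these rankings in the same way a text model learns word-word context from sentences.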
Limitations: Where the Analogy Breaks Down
While the progress is staggering, we must remain honest about the limitations of treating biology purely as a language.
Biological sequences operate in a physical, 3-dimensional reality governed by thermodynamics, a factor text models completely lack. Two distant DNA sequences might be “adjacent” in 3D space due to chromatin looping—a relationship a purely linear 1D language model will struggle to infer unless specifically provided with epigenetic data.
Furthermore, unlike human language, which evolved for communication between minds built to parse it, biological “language” is the emergent result of billions of years of blind, unguided evolution, optimizing for fitness rather than interpretability. It is noisy, highly redundant, and filled with evolutionary “junk” that serves structural rather than informational purposes.
Finally, the evaluation gap remains a critical issue. An LLM that hallucinates a historical fact is easily corrected. A biological model that confidently hallucinates an incorrect protein-protein interaction might lead a pharmaceutical company down a multi-million-dollar dead end.
The Road Ahead
Foundation models have unequivocally proven that deep learning can capture the hidden syntax of biological sequences. We are no longer relying on handcrafted statistical tools; we are letting neural networks learn the rules of life directly from the data.
In our next post, “The Data Infrastructure Challenge: From Raw Reads to AI-Ready Datasets,” we will step back from the algorithms and look at the massive engineering pipelines required to actually feed these data-hungry models.
Glossary
- Transformer: A deep learning architecture relying entirely on an attention mechanism to draw global dependencies between input and output.
- Tokenization: The process of breaking down a sequence (text, DNA, proteins) into smaller units (tokens) that a machine learning model can process.
- k-mer: A sub-sequence of length k generated by sliding a window across a longer sequence (e.g., “ATCG” is a 4-mer).
- Byte Pair Encoding (BPE): A data compression algorithm adapted for tokenization that merges the most frequently occurring adjacent pairs of characters or bytes.
- Masked Language Modeling (MLM): A self-supervised training technique where parts of an input sequence are hidden, and the model must predict them using context.
- Contrastive Learning: A machine learning technique that teaches a model to distinguish between similar and dissimilar data points, often used to align multimodal data (like text and sequences).
References
- Li, Q., et al. (2024). “Progress and opportunities of foundation models in bioinformatics.” Briefings in Bioinformatics, 25(6).
- Zhou, Z., et al. (2024). “DNABERT-2: Efficient Foundation Model and Benchmark for Genomic Data Analysis.” ICLR 2024.
- National Science Review. (2025). “Foundation models in bioinformatics.”
- Cui, H., et al. (2024). “scGPT: toward building a foundation model for single-cell multi-omics using generative AI.” Nature Methods.
- Bioinformatics. (2025). “The Impact of Tokenizer Selection in Genomic Language Models.” Oxford Academic.
- Nguyen, E., et al. (2024). “Evo: DNA foundation modeling from molecular to genome scale.” Science.