The genome is the ultimate source code. For decades, computational biologists have relied on alignment algorithms, hidden Markov models, and specialized machine learning to decode it. Today, a new paradigm is taking hold: DNA foundation models. By treating the genome as a vast, continuous text and training large language models (LLMs) on billions of nucleotides, researchers are teaching AI to “read” the book of life in its native language.

In this fifth installment of our Agentic Omics series, we examine the state of the art in genomic AI. We explore how models like DNABERT-2, Nucleotide Transformer, Evo, and HyenaDNA are moving beyond sequence classification to predict gene expression, identify regulatory elements, and quantify variant effects. Crucially, we will dissect the architectural innovations that make this possible—and the biological complexities that still confound these models.


The Shift to Language Models in Genomics

Genomic sequences present unique challenges for traditional machine learning. They are incredibly long (the human genome is 3 billion base pairs), highly repetitive, and their “syntax” is poorly understood compared to human language. A mutation in an enhancer region might affect a gene millions of base pairs away through 3D chromosomal looping.

To capture these long-range dependencies and intricate motifs, researchers have turned to transformer architectures. However, applying a standard text transformer directly to DNA immediately runs into two major hurdles: tokenization and context length.

The Tokenization Problem: From k-mers to BPE

In natural language processing (NLP), text is split into subwords or words. In genomics, early models like the original DNABERT used k-mer tokenization, breaking the sequence into overlapping chunks of k nucleotides (e.g., 6-mers like ATCGTA). While intuitive, this approach is computationally expensive and sample-inefficient.
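Overlapping k-mer tokenization is simple enough to sketch in a few lines (a minimal illustration, not the original DNABERT tokenizer):

```python
def kmer_tokenize(seq: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mers (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

tokens = kmer_tokenize("ATCGTAGC", k=6)
print(tokens)  # ['ATCGTA', 'TCGTAG', 'CGTAGC']
```

Note the redundancy: an 8-bp input already yields three 6-nt tokens that share most of their bases. Nearly every nucleotide is re-encoded k times, which is exactly the sample-inefficiency that motivated a better tokenizer.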

DNABERT-2 (Zhou et al., ICLR 2024) revolutionized this by introducing Byte Pair Encoding (BPE) to DNA. Just as BPE learns common character combinations in text, DNABERT-2 learns common genomic “subwords” across multiple species. This dramatically increased efficiency, allowing the model to process sequences up to 2.5 times faster while achieving state-of-the-art results on the Genome Understanding Evaluation (GUE) benchmark, with superior performance across diverse tasks including epigenetic mark prediction, transcription factor binding prediction, and promoter detection.
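The core BPE loop is easy to illustrate on a toy DNA corpus. This is a deliberately simplified sketch: DNABERT-2's actual tokenizer is trained on multi-species genomes with a far larger merge budget.

```python
from collections import Counter

def train_bpe(seqs, num_merges):
    """Learn BPE merges over DNA: repeatedly fuse the most
    frequent adjacent token pair across the whole corpus."""
    corpus = [list(s) for s in seqs]  # start from single nucleotides
    merges = []
    for _ in range(num_merges):
        pairs = Counter(
            (toks[i], toks[i + 1])
            for toks in corpus for i in range(len(toks) - 1)
        )
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        for toks in corpus:
            i = 0
            while i < len(toks) - 1:
                if (toks[i], toks[i + 1]) == best:
                    toks[i:i + 2] = [merged]  # fuse in place
                else:
                    i += 1
    return merges, corpus

merges, corpus = train_bpe(["TATAAT", "TATACG", "GGTATA"], num_merges=2)
print(merges)     # [('T', 'A'), ('TA', 'TA')]
print(corpus[0])  # ['TATA', 'A', 'T']
```

On this toy corpus the algorithm discovers the recurring TATA motif as a single token after just two merges, which is the intuition behind learning genomic "subwords" from frequency statistics.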

The Context Length Challenge: Seeing the Whole Picture

Self-attention scales quadratically with sequence length: doubling the context window quadruples the compute and memory spent on the attention matrix. For DNA, where long-range regulatory interactions are essential, standard attention mechanisms max out far too early.
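A back-of-the-envelope calculation makes the problem concrete. The sketch below assumes a single 16-head layer storing fp16 attention scores; optimizations like FlashAttention avoid materializing this matrix, but the quadratic compute cost remains:

```python
def attention_matrix_bytes(seq_len, num_heads=16, bytes_per_value=2):
    """Memory to materialize one layer's L x L attention scores,
    summed over heads, in fp16."""
    return num_heads * seq_len * seq_len * bytes_per_value

for L in (4_096, 65_536, 1_048_576):
    gib = attention_matrix_bytes(L) / 2**30
    print(f"{L:>9} tokens -> {gib:,.1f} GiB of attention scores per layer")
```

At one million tokens, the naive attention matrix alone would occupy 32 TiB per layer, which is why genome-scale context windows demand a fundamentally different operator.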

HyenaDNA (Nguyen et al., 2023) addressed this by replacing the standard attention mechanism with a sub-quadratic operator based on the Hyena hierarchy. This allowed the model to handle unprecedented context lengths—up to 1 million tokens—enabling it to “see” entire genes and their distant regulatory enhancers simultaneously. This capability is vital for tasks like predicting long-range enhancer-promoter interactions, which have historically been a blind spot for genomic ML.


The Heavyweights: Evaluating the Models

The landscape of genomic foundation models has matured rapidly. Let’s examine the leading architectures and their distinct capabilities.

1. Nucleotide Transformer: Scale and Diversity

Developed by InstaDeep and collaborators (Dalla-Torre et al., 2023), the Nucleotide Transformer models range from 500 million to 2.5 billion parameters. Their key innovation was the scale and diversity of their pre-training data, encompassing thousands of genomes across diverse species, rather than just human sequences.

Recent evaluations of the Nucleotide Transformer (including V2) demonstrate robust zero-shot capabilities in variant effect quantification and sequence classification. By learning the evolutionary conservation of sequences across the tree of life, the model can infer whether a novel mutation is likely to be deleterious without ever having seen it in a clinical dataset.
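The zero-shot scoring recipe can be sketched as a log-likelihood ratio between the alternate and reference alleles at a masked position. Everything model-specific is stubbed out here: `toy_logprob` and its probability table are invented stand-ins for a real masked-token head such as the Nucleotide Transformer's.

```python
import math

def zero_shot_variant_score(logprob_fn, context, pos, ref, alt):
    """Zero-shot effect score: log P(alt | context) - log P(ref | context).
    Strongly negative scores mean the model disfavours the alternate
    allele, a proxy for deleteriousness."""
    masked = context[:pos] + "N" + context[pos + 1:]  # mask the variant site
    return logprob_fn(masked, pos, alt) - logprob_fn(masked, pos, ref)

# Hypothetical stand-in for a pre-trained model's masked-token head:
# at this (highly conserved) site, the model strongly expects an A.
toy_probs = {"A": 0.90, "C": 0.04, "G": 0.03, "T": 0.03}
def toy_logprob(masked_seq, pos, allele):
    return math.log(toy_probs[allele])

score = zero_shot_variant_score(toy_logprob, "GGAC", 2, ref="A", alt="T")
print(f"{score:.2f}")  # log(0.03) - log(0.90) ≈ -3.40
```

The key point is that no clinical labels enter the calculation: the score falls out of the model's learned expectations over conserved sequence alone.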

2. Evo: The Cross-Domain Generalist

Perhaps the most ambitious model to date is Evo (Nguyen et al., Science 2024), developed by the Arc Institute. With 7 billion parameters, Evo was trained on 300 billion nucleotides spanning all domains of life—bacteria, archaea, eukaryotes, and their viruses.

Unlike previous models that focused purely on DNA, Evo is a true biological generalist. It can predict the effects of mutations on protein function, design synthetic CRISPR-Cas systems, and even generate entirely novel sequences across DNA, RNA, and protein modalities. Its ability to perform zero-shot function prediction across the central dogma makes it a foundational component for future agentic workflows.

3. Variant Effect Prediction: The Clinical Frontier

For clinical genomics, the holy grail is accurately predicting the functional impact of any given variant (Variant Effect Prediction, or VEP). Historically, tools like CADD (Combined Annotation Dependent Depletion) relied on handcrafted features and conservation scores.

Today’s foundation models are absorbing this task. By leveraging their deep representations of genomic syntax, models can quantify variant effects directly from sequence. A recent Nature Communications study (2025) benchmarked models including DNABERT-2 and Nucleotide Transformer V2, showing that zero-shot embeddings from these models can effectively cluster variants by functional impact and predict disease associations. However, these models still struggle with complex structural variants (insertions, deletions, translocations), which disrupt genomic architecture in ways that point mutations do not.
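The embed-then-compare recipe behind such benchmarks can be sketched with a stand-in embedding. Here normalized k-mer counts replace real model hidden states, which is a deliberate simplification; the benchmarked studies use learned embeddings from models like DNABERT-2.

```python
import math
from collections import Counter

def kmer_embedding(seq, k=3):
    """Crude sequence embedding: L2-normalized k-mer counts.
    A real pipeline would use a foundation model's hidden states."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {kmer: v / norm for kmer, v in counts.items()}

def cosine(a, b):
    return sum(a[k] * b[k] for k in a.keys() & b.keys())

ref      = "ATGCGTATACGT"
snp      = "ATGCGTTTACGT"   # single substitution
distant  = "TTTTGGGGCCCC"   # very different composition

e_ref, e_snp, e_far = map(kmer_embedding, (ref, snp, distant))
print(round(cosine(e_ref, e_snp), 2), round(cosine(e_ref, e_far), 2))
```

The single-substitution variant stays close to the reference in embedding space while the compositionally different sequence does not; real foundation-model embeddings capture far subtler functional signals than raw k-mer composition, but the geometry of the comparison is the same.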


The Missing Pieces: Where Models Still Fall Short

Despite these breakthroughs, DNA foundation models are not yet a solved problem. We must be honest about their current limitations.

  1. Structural Variants (SVs): While models excel at interpreting single nucleotide polymorphisms (SNPs), they are remarkably brittle when faced with large-scale structural changes. A 10kb inversion or a chromosomal translocation breaks the learned spatial relationships that the transformer relies on.
  2. True Long-Range Regulatory Interactions: Even with 1-million-token context windows (like HyenaDNA), predicting the functional interaction between an enhancer and a promoter separated by 2 million base pairs remains challenging. The 3D folding of the genome (chromatin conformation) is crucial here, and purely linear sequence models struggle to infer it without multi-modal input (e.g., Hi-C data).
  3. The “Language” Analogy Breaking Down: BPE tokenization (as used in DNABERT-2 and GenomeOcean) is highly efficient, but it does not inherently respect biological boundaries. It may split a crucial regulatory motif or a codon in ways that hinder the model’s ability to learn functionally coherent sequences, particularly for generative tasks.
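The third limitation is easy to demonstrate with a greedy longest-match tokenizer and a hypothetical learned vocabulary. Because merges are chosen by corpus frequency, not biology, a token boundary can land in the middle of a motif:

```python
def greedy_tokenize(seq, vocab):
    """Greedy longest-match tokenization against a fixed vocabulary.
    Single nucleotides are always in vocab, so it cannot get stuck."""
    tokens, i = [], 0
    max_len = max(map(len, vocab))
    while i < len(seq):
        for size in range(max_len, 0, -1):
            piece = seq[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical learned vocabulary: "GCTA" and "TAAA" happened to be
# frequent in the training corpus, so they became tokens.
vocab = {"A", "C", "G", "T", "GC", "GCTA", "TAAA"}
seq = "GCTATAAA"  # contains the TATA-box motif "TATA" at positions 2-5
print(greedy_tokenize(seq, vocab))  # ['GCTA', 'TAAA']
```

The biologically meaningful unit, TATA, is sliced across two tokens, so the model never sees the motif as a single symbol. Motif-aware or nucleotide-resolution tokenization avoids this, at the cost of the efficiency BPE was adopted for.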

Toward Agentic Genomics

The true potential of these models will be realized when they are integrated into Agentic Omics frameworks. A DNA foundation model alone is a powerful calculator; an agentic system is a scientist.

Imagine an autonomous agent investigating a patient’s whole-genome sequencing data. It detects an unknown variant in a non-coding region. It queries Evo to predict the variant’s effect on RNA splicing, uses DNABERT-2 to assess its impact on local transcription factor binding, and orchestrates a multi-modal analysis involving protein structure prediction (if a coding region is affected). It then synthesizes these findings into a clinical hypothesis, complete with literature citations.
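Such a workflow might be orchestrated roughly as follows. Every tool below is a stub: the function names, return fields, and the variant string are invented placeholders standing in for real model endpoints, not actual APIs.

```python
from dataclasses import dataclass, field

@dataclass
class VariantHypothesis:
    variant: str
    evidence: dict = field(default_factory=dict)

def investigate(variant, tools):
    """Hypothetical agent loop: route one variant through several model
    'tools' and collect their outputs into a single hypothesis object."""
    hyp = VariantHypothesis(variant)
    hyp.evidence["splicing"] = tools["splice_effect"](variant)    # e.g. an Evo call
    hyp.evidence["tf_binding"] = tools["tf_binding"](variant)     # e.g. a DNABERT-2 call
    if hyp.evidence["tf_binding"]["region"] == "coding":
        # Only escalate to structure prediction when a protein is affected.
        hyp.evidence["structure"] = tools["protein_structure"](variant)
    return hyp

# Stub tools standing in for real model endpoints.
tools = {
    "splice_effect":     lambda v: {"delta_psi": -0.12},
    "tf_binding":        lambda v: {"region": "non-coding", "motif_disrupted": True},
    "protein_structure": lambda v: {"plddt": 0.0},
}

hyp = investigate("chr1:12345 G>A", tools)
print(sorted(hyp.evidence))  # ['splicing', 'tf_binding']
```

The interesting engineering lives in the conditional routing: the agent decides which model to consult next based on what the previous one returned, rather than running every tool on every variant.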

This is not science fiction. As we will explore in Part III of this series, the software infrastructure to build these biological tool-use agents is already being deployed.


Glossary

  • BPE (Byte Pair Encoding): A data compression technique adapted for NLP and genomics to find the most common sequences of characters (or nucleotides) and merge them into single tokens.
  • k-mer: A sequence of k adjacent nucleotides in a DNA strand.
  • Zero-shot evaluation: Testing a model on a task it was not explicitly trained to perform, relying purely on its general pre-trained knowledge.
  • Structural Variant (SV): A large-scale alteration in a chromosome, such as an insertion, deletion, inversion, or duplication, typically involving hundreds or thousands of base pairs.
  • GUE (Genome Understanding Evaluation): A comprehensive benchmark suite for evaluating the performance of genomic foundation models across various tasks like promoter detection and transcription factor binding prediction.

References

  1. Zhou, Z., et al. (2024). DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome. ICLR 2024.
  2. Nguyen, E., et al. (2024). Sequence modeling and design from molecular to genome scale with Evo. Science, 386(6722).
  3. Dalla-Torre, H., et al. (2023). The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics. bioRxiv.
  4. Nguyen, E., et al. (2023). HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution. Advances in Neural Information Processing Systems (NeurIPS).
  5. Recent Benchmark Study (2025). Benchmarking DNA foundation models for genomic and genetic tasks. Nature Communications.
  6. Review on Regulatory Genomics (2024). Perspective on recent developments and challenges in regulatory and systems genomics. Bioinformatics Advances.