Building Biological Tool-Use Agents: Architecture and Patterns
The vision of agentic omics — autonomous AI systems that orchestrate biological discovery — depends on a deceptively simple capability: tool use. An agent that can reason about biology but cannot access BLAST, AlphaFold, or single-cell analysis pipelines is like a biologist who understands theory but has never touched a pipette.
This post provides a practical architecture for building biological tool-use agents. We cover the essential tool inventory, the unique error-handling challenges of biological data, prompt engineering patterns for biological reasoning, and a reference architecture based on the ReAct (Reason + Act) loop. This is the “how-to” companion to Post 13’s conceptual overview and Post 14’s vision of agentic omics.
The Core Architecture: ReAct Meets Biology
The ReAct pattern — interleaving reasoning traces with actionable tool calls — has emerged as the dominant architecture for agentic systems. For biological applications, the pattern looks like this:
Observation: User asks "What's the structural impact of the BRCA1 R1699Q variant?"
Thought: I need to (1) get the protein sequence, (2) predict wild-type structure,
(3) predict mutant structure, (4) compare them, (5) check literature for known effects
Action: Query UniProt for BRCA1 sequence
Observation: Retrieved sequence: MDL... (1863 amino acids)
Thought: Now I need to predict structure. AlphaFold 3 is best for this, but I should
also run ESM-3 for comparison since it doesn't require MSA.
Action: Call AlphaFold3Server with sequence + R1699Q mutation
Observation: Structure predicted. pLDDT = 92 for wild-type, 87 for mutant.
R1699Q shows local destabilization in BRCT domain.
Thought: The pLDDT drop suggests structural impact. Let me check if this variant
is in ClinVar and what's known clinically.
Action: Query ClinVar API for BRCA1 R1699Q
Observation: ClinVar: Pathogenic (5 submissions), associated with breast cancer risk.
Thought: I have enough information. The variant causes local structural destabilization
in the BRCT domain and is clinically classified as pathogenic.
Final Answer: The BRCA1 R1699Q variant...
This loop — Thought → Action → Observation → Thought — is the engine of biological agency. But biological tool use has unique challenges that generic ReAct implementations don’t address.
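The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production agent: `scripted_policy` stands in for the LLM, and `lookup_sequence` and `lookup_clinvar` are hypothetical stub tools that return canned strings instead of calling UniProt or ClinVar.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stub tools; a real agent would call UniProt / ClinVar here.
def lookup_sequence(gene: str) -> str:
    return f"<sequence for {gene}>"

def lookup_clinvar(variant: str) -> str:
    return f"<ClinVar record for {variant}>"

TOOLS: Dict[str, Callable[[str], str]] = {
    "sequence": lookup_sequence,
    "clinvar": lookup_clinvar,
}

def scripted_policy(observations: List[str]) -> Tuple[str, str, str]:
    """Stand-in for the LLM: maps observations so far to (thought, action, argument)."""
    if len(observations) == 0:
        return ("I need the protein sequence first.", "sequence", "BRCA1")
    if len(observations) == 1:
        return ("Now check the clinical classification.", "clinvar", "BRCA1 R1699Q")
    return ("I have enough information.", "finish", "summary of the observations")

def react_loop(policy, tools, max_steps: int = 5) -> List[str]:
    """Run Thought -> Action -> Observation until the policy finishes or the budget runs out."""
    trace: List[str] = []
    observations: List[str] = []
    for _ in range(max_steps):
        thought, action, argument = policy(observations)
        trace.append(f"Thought: {thought}")
        if action == "finish":
            trace.append(f"Final Answer: {argument}")
            return trace
        trace.append(f"Action: {action}({argument})")
        observation = tools[action](argument)
        observations.append(observation)
        trace.append(f"Observation: {observation}")
    trace.append("Final Answer: step budget exhausted without a conclusion")
    return trace

trace = react_loop(scripted_policy, TOOLS)
print("\n".join(trace))
```

The `max_steps` budget is a design choice worth keeping even in a sketch: without it, a policy that never emits `finish` would loop forever.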
The Biological Tool Inventory
A competent biological agent needs access to tools spanning sequence analysis, structure prediction, expression analysis, pathway knowledge, and literature. Here’s the essential inventory, organized by function:
Sequence Analysis Tools
| Tool | Purpose | API Access | Key Considerations |
|---|---|---|---|
| BLAST/BLAST+ | Homology search, sequence similarity | NCBI BLAST API, local installation | Slow for large queries; consider DIAMOND for protein searches (100-1000× faster) |
| HMMER | Profile HMM searches, domain detection | Local installation, EMBL-EBI API | Essential for remote homology detection; Pfam-A HMMs are gold standard |
| Clustal Omega / MAFFT | Multiple sequence alignment | EMBL-EBI API, local | MSA quality critical for phylogenetics and conservation analysis |
| DNABERT-2 | DNA sequence embeddings, variant effect | Hugging Face, local inference | 2024 ICLR paper shows SOTA on Genome Understanding Evaluation benchmark |
| Evo | 7B-parameter model for DNA/RNA/protein | Arc Institute (research access) | Trained on 300B nucleotides; can generate functional sequences across domains of life |
Practical note: For production agents, cache BLAST results and MSA computations. Homology searches are expensive and frequently repeated.
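One way to implement this caching is to key each result on a hash of the tool name, the normalized query, and the search parameters. The sketch below is an assumption-laden illustration: `SearchCache` and `fake_blast` are hypothetical names, the store is in-memory only, and a real deployment would likely also include the database version in the key so that stale hits are not reused after a database update.

```python
import hashlib
import json
from typing import Callable, Dict

class SearchCache:
    """In-memory cache keyed by a hash of tool name, query sequence, and parameters."""

    def __init__(self) -> None:
        self._store: Dict[str, dict] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(tool: str, sequence: str, params: dict) -> str:
        # Upper-case the sequence so that case differences do not cause cache misses.
        payload = json.dumps(
            {"tool": tool, "seq": sequence.upper(), "params": params}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_run(self, tool: str, sequence: str, params: dict,
                   run: Callable[[str, dict], dict]) -> dict:
        k = self.key(tool, sequence, params)
        if k in self._store:
            self.hits += 1
            return self._store[k]
        self.misses += 1
        result = run(sequence, params)
        self._store[k] = result
        return result

def fake_blast(sequence: str, params: dict) -> dict:
    # Hypothetical stand-in for a real (slow) BLAST call.
    return {"query_length": len(sequence), "database": params["database"], "hits": []}

cache = SearchCache()
first = cache.get_or_run("blast", "MDLSALRVEE", {"database": "swissprot"}, fake_blast)
second = cache.get_or_run("blast", "mdlsalrvee", {"database": "swissprot"}, fake_blast)
print(cache.hits, cache.misses)
```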
Structure Prediction and Analysis
| Tool | Purpose | API Access | Key Considerations |
|---|---|---|---|
| AlphaFold 3 | Protein structure, complexes with DNA/RNA/ligands | AlphaFold Server (free, non-commercial use), local (inference code on GitHub; model weights on request) | 2024 Nature paper; best for complexes; ligand docking accuracy debated |
| ESM-3 | Protein structure + function from sequence | EvolutionaryScale API, local (research license) | Released in sizes from 1.4B to 98B parameters (the 1.4B model has open weights); no MSA required; faster than AlphaFold but slightly less accurate |
| ESMFold | Fast structure prediction | Hugging Face, local | Uses ESM-2 language model; good for quick screening |
| RFdiffusion | De novo protein design | Baker Lab (local installation) | Generative diffusion model; requires significant GPU resources |
| ProteinMPNN | Sequence design for given structure | Local installation | State-of-the-art for inverse folding; pairs well with RFdiffusion |
| Foldseek | Fast structure similarity search | Local, web server | 100,000× faster than structural alignment; essential for large-scale searches |
Critical limitation: AlphaFold 3’s accuracy for protein-ligand interactions remains contested. A 2025 benchmark in Nature Methods found AF3’s ligand docking RMSD averaged 2.8Å — usable for screening but not for lead optimization without experimental validation.
Expression and Single-Cell Analysis
| Tool | Purpose | API Access | Key Considerations |
|---|---|---|---|
| DESeq2 / edgeR | Differential expression analysis | R packages (local) | Gold standard for bulk RNA-seq; requires count matrices |
| Scanpy | Single-cell analysis pipeline | Python package (local) | Standard for scRNA-seq; integrates with scGPT |
| scGPT | Single-cell foundation model | Hugging Face, local | 33M+ cells pre-trained; cell type annotation, perturbation prediction |
| Geneformer | Gene expression transformer | Hugging Face, local | 30M single-cell transcriptomes; transfer learning for network inference |
| scVI | Variational autoencoder for single-cell | Python package | Excellent for batch correction and data integration |
Pathway and Network Analysis
| Tool | Purpose | API Access | Key Considerations |
|---|---|---|---|
| KEGG API | Pathway databases, enrichment | REST API (free for academics) | Gold standard pathway database; licensing restrictions for commercial use |
| Reactome | Curated pathway database | REST API, download | Open access; more detailed than KEGG for signaling pathways |
| STRING | Protein-protein interaction networks | REST API | Confidence-scored PPIs; integrates experimental and predicted interactions |
| Cytoscape | Network visualization and analysis | Desktop app (scriptable via CyREST); Cytoscape.js for web front ends | Essential for visualizing complex biological networks |
| g:Profiler | Functional enrichment analysis | REST API, web server | GO, KEGG, Reactome enrichment; multiple testing correction built-in |
Literature and Knowledge Retrieval
| Tool | Purpose | API Access | Key Considerations |
|---|---|---|---|
| PubMed/Entrez | Literature search | NCBI E-utilities API | 36M+ citations; essential but requires careful query construction |
| Semantic Scholar | AI-enhanced literature search | REST API | Better at finding relevant papers than keyword search; citation graphs |
| Europe PMC | Open access literature | REST API | Includes preprints; good for finding open-access full texts |
| UniProt | Protein knowledgebase | REST API | Curated protein information; cross-references to PDB, Pfam, KEGG |
| ClinVar | Clinical variant interpretations | FTP download, API | Critical for clinical genomics; variant pathogenicity classifications |
Chemical and Drug Discovery
| Tool | Purpose | API Access | Key Considerations |
|---|---|---|---|
| RDKit | Cheminformatics, molecular properties | Python package (open-source) | Industry standard; molecular fingerprints, similarity, descriptors that feed ADMET models |
| DeepChem | Deep learning for chemistry | Python package | Pre-trained models for property prediction, virtual screening |
| AutoDock Vina | Molecular docking | Local installation | Widely used docking software; requires protein structure + ligand |
| Open Targets | Drug target validation | GraphQL API | Integrates genetics, genomics, chemicals for target-drug associations |
Handling Biological Tool Failures
Biological tools fail differently than software APIs. A BLAST search doesn’t return a 404 — it returns no significant hits. An AlphaFold prediction doesn’t crash — it returns a structure with low confidence. Agents must handle these “soft failures” gracefully.
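A soft failure can be made explicit by wrapping raw tool output in a small status object before it reaches the reasoning step. The following is a minimal sketch for the BLAST case; `BlastHit`, `ToolOutcome`, and the `1e-5` E-value cutoff are illustrative assumptions rather than part of any BLAST API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BlastHit:
    subject_id: str
    evalue: float

@dataclass
class ToolOutcome:
    status: str          # "ok" or "soft_failure"
    message: str
    hits: List[BlastHit]

def assess_blast(hits: List[BlastHit], max_evalue: float = 1e-5) -> ToolOutcome:
    """Turn 'the search ran but found nothing useful' into an explicit status."""
    significant = [h for h in hits if h.evalue <= max_evalue]
    if not significant:
        return ToolOutcome(
            "soft_failure",
            f"No hits at E-value <= {max_evalue:g}; consider a more sensitive search",
            [],
        )
    return ToolOutcome("ok", f"{len(significant)} significant hit(s)", significant)

empty = assess_blast([])
mixed = assess_blast([BlastHit("sp|A", 1e-30), BlastHit("sp|B", 0.5)])
print(empty.status, "|", mixed.message)
```

The agent can then branch on `status` rather than inferring failure from an empty list.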
Confidence-Aware Reasoning
Every biological prediction comes with uncertainty. Agents must propagate confidence scores through their reasoning:
# Pseudocode for confidence-aware tool use
result = alphafold.predict(sequence)
if result.pLDDT < 70:
    agent.thought(f"Low confidence prediction (pLDDT={result.pLDDT}). Should note uncertainty.")
    agent.final_answer(include_uncertainty=True)
elif result.pLDDT < 90:
    agent.thought("Moderate confidence. Structure likely correct at domain level but loop regions uncertain.")
else:
    agent.thought("High confidence prediction. Can use for downstream analysis.")
Commonly used pLDDT bands (as presented in the AlphaFold Protein Structure Database):
- pLDDT > 90: High confidence (atomic-level accuracy)
- pLDDT 70-90: Confident at domain level
- pLDDT 50-70: Low confidence (use with caution)
- pLDDT < 50: Very low confidence (likely disordered)
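These bands translate directly into a small helper. The sketch below assumes per-residue pLDDT values are available as a list of floats; the band names and the choice to report fractions per band (rather than a single mean) are illustrative, not taken from any AlphaFold API.

```python
from typing import Dict, List

def plddt_band(score: float) -> str:
    """Map a single pLDDT value (0-100) to the bands listed above."""
    if not 0.0 <= score <= 100.0:
        raise ValueError(f"pLDDT must be within 0-100, got {score}")
    if score > 90:
        return "high"
    if score >= 70:
        return "confident"
    if score >= 50:
        return "low"
    return "very_low"

def summarize_plddt(per_residue: List[float]) -> Dict[str, float]:
    """Fraction of residues in each band; a mean alone can hide disordered regions."""
    if not per_residue:
        raise ValueError("per_residue must not be empty")
    counts = {"high": 0, "confident": 0, "low": 0, "very_low": 0}
    for score in per_residue:
        counts[plddt_band(score)] += 1
    return {band: count / len(per_residue) for band, count in counts.items()}

fractions = summarize_plddt([95.0, 80.0, 60.0, 40.0])
print(fractions)
```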
Partial Results and Contradictory Evidence
Biological tools often return partial or contradictory results. Consider this scenario:
Agent query: "Is TP53 R248Q pathogenic?"
Tool results:
- ClinVar: Pathogenic (47 submissions)
- PolyPhen-2: Probably damaging (score 0.998)
- SIFT: Deleterious (score 0.00)
- AlphaFold 3: Local destabilization at mutation site
- Literature search: 2,341 papers mention this variant
But:
- Some functional assays show retained transactivation activity
- Population frequency in gnomAD: 0.0001 (very rare but present)
The agent must synthesize this into a nuanced answer:
“TP53 R248Q is classified as pathogenic in ClinVar with strong consensus (47 submissions). Computational predictors (PolyPhen-2, SIFT) and structural analysis (AlphaFold 3) support pathogenicity through loss of DNA contact and local destabilization. However, some functional assays suggest partial retained activity, and the variant appears at very low frequency in gnomAD, suggesting it may have incomplete penetrance or context-dependent effects. Clinical interpretation should consider family history and co-occurring variants.”
This is biological reasoning — not just reporting tool outputs, but integrating them with appropriate caveats.
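Before the LLM writes such a summary, it can help to structure the evidence so that conflicts and evidence types are explicit. The sketch below encodes the TP53 R248Q scenario above; the `Evidence` record and the `kind` vocabulary are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Evidence:
    source: str
    kind: str                  # "clinical", "computational", or "functional"
    supports_pathogenic: bool
    detail: str

def summarize_evidence(items: List[Evidence]) -> dict:
    """Separate supporting from conflicting evidence and flag weak evidence bases."""
    supporting = [e for e in items if e.supports_pathogenic]
    conflicting = [e for e in items if not e.supports_pathogenic]
    return {
        "supporting": [e.source for e in supporting],
        "conflicting": [e.source for e in conflicting],
        "has_conflict": bool(supporting) and bool(conflicting),
        # True when every supporting item is a prediction rather than an observation.
        "computational_only": bool(supporting)
        and all(e.kind == "computational" for e in supporting),
    }

tp53_r248q = [
    Evidence("ClinVar", "clinical", True, "Pathogenic (47 submissions)"),
    Evidence("PolyPhen-2", "computational", True, "Probably damaging (score 0.998)"),
    Evidence("SIFT", "computational", True, "Deleterious (score 0.00)"),
    Evidence("AlphaFold 3", "computational", True, "Local destabilization at mutation site"),
    Evidence("Functional assays", "functional", False, "Some retained transactivation activity"),
]
summary = summarize_evidence(tp53_r248q)
print(summary)
```

A `has_conflict` flag gives the synthesis prompt a concrete cue to include caveats instead of relying on the model to notice the disagreement.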
Timeout and Rate Limit Handling
Many biological APIs have strict rate limits:
- NCBI E-utilities: 3 requests/second without API key, 10/second with key
- EMBL-EBI: Varies by service; some require job submission + polling
- AlphaFold server: Queue-based; predictions can take hours for large proteins
Agents must implement:
- Request queuing — batch requests when possible
- Exponential backoff — respect rate limits gracefully
- Alternative tool fallbacks — if AlphaFold server is down, try ESMFold
- Progress reporting — for long-running jobs, report status to user
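Exponential backoff and tool fallback can be combined in one small wrapper. In this sketch, `ToolUnavailable`, `always_down`, and `quick_stub` are hypothetical; the stubs stand in for an AlphaFold server call and an ESMFold call, and the `sleep` parameter is injectable so the example runs without actually waiting.

```python
import random
import time
from typing import Callable, List, Sequence, Tuple

class ToolUnavailable(Exception):
    """Raised by a tool wrapper on rate limiting, timeouts, or server downtime."""

def call_with_backoff(fn: Callable[[str], str], arg: str, max_attempts: int = 4,
                      base_delay: float = 1.0, sleep: Callable[[float], None] = time.sleep) -> str:
    for attempt in range(max_attempts):
        try:
            return fn(arg)
        except ToolUnavailable:
            if attempt == max_attempts - 1:
                raise
            # Delay doubles each attempt; jitter avoids synchronized retries.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    raise ToolUnavailable("max_attempts must be at least 1")

def call_with_fallback(tools: Sequence[Tuple[str, Callable[[str], str]]], arg: str,
                       **backoff_kwargs) -> Tuple[str, str]:
    """Try each tool in order; return (tool_name, result) from the first that succeeds."""
    errors = []
    for name, fn in tools:
        try:
            return name, call_with_backoff(fn, arg, **backoff_kwargs)
        except ToolUnavailable as exc:
            errors.append(f"{name}: {exc}")
    raise ToolUnavailable("; ".join(errors))

def always_down(sequence: str) -> str:
    raise ToolUnavailable("server queue unavailable")

def quick_stub(sequence: str) -> str:
    return f"<predicted structure for {len(sequence)} residues>"

delays: List[float] = []
used, structure = call_with_fallback(
    [("alphafold_stub", always_down), ("esmfold_stub", quick_stub)],
    "MDLSALRVEE",
    sleep=delays.append,  # record delays instead of sleeping
)
print(used, delays)
```

Returning the name of the tool that actually answered matters for provenance: a fallback result should be reported as coming from the fallback tool.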
Prompt Engineering for Biological Agents
The system prompt shapes how the agent reasons about biology. Here’s a reference template:
You are a biological research assistant with access to computational tools for
sequence analysis, structure prediction, expression analysis, and literature search.
Your role is to help researchers answer biological questions by:
1. Breaking complex questions into tractable sub-questions
2. Selecting appropriate tools for each sub-question
3. Interpreting tool outputs with appropriate uncertainty
4. Synthesizing results into coherent, evidence-based answers
Guidelines:
- Always cite specific databases, tools, and papers when making claims
- Report confidence scores (pLDDT, E-values, p-values) when available
- Distinguish between computational predictions and experimental evidence
- Acknowledge limitations — no tool is perfect
- For clinical questions, emphasize that computational predictions are not
diagnostic and require clinical validation
- When uncertain, say so. Do not hallucinate biological facts.
Available tools:
[Tool descriptions with usage examples]
Example reasoning trace:
[Show a complete ReAct loop for a representative question]
Few-Shot Examples for Biological Reasoning
Include worked examples in the prompt to demonstrate the expected reasoning style:
Example 1: Variant Interpretation
User: "What's the impact of the EGFR L858R mutation?"
Thought: This is a well-studied oncogenic mutation in lung cancer. I should:
(1) Confirm the variant exists in ClinVar
(2) Check structural impact with AlphaFold
(3) Search literature for functional studies and clinical significance
(4) Note any approved therapies targeting this mutation
Action: Query ClinVar for EGFR L858R
Observation: ClinVar: Pathogenic (89 submissions), associated with lung adenocarcinoma
Action: Query AlphaFold 3 for EGFR structure with L858R mutation
Observation: L858R is in the kinase domain. Mutation stabilizes active conformation.
Action: Search PubMed for "EGFR L858R tyrosine kinase inhibitor"
Observation: 1,200+ papers. L858R sensitizes to gefitinib and erlotinib (first-gen TKIs).
Final Answer: EGFR L858R is a pathogenic, oncogenic mutation in the kinase domain
that stabilizes the active conformation. It's found in ~15% of lung adenocarcinomas
and confers sensitivity to EGFR tyrosine kinase inhibitors (gefitinib, erlotinib).
This is one of the best-characterized driver mutations in cancer, with extensive
clinical validation and FDA-approved targeted therapies.
Domain-Specific Constraints
Add constraints specific to biological reasoning:
Constraints:
- Never claim a variant is "definitely pathogenic" without clinical evidence
- Distinguish between "predicted to be damaging" (computational) and
"experimentally validated" (functional assays)
- For drug-target interactions, specify whether evidence is computational
(docking), in vitro (binding assays), in vivo (animal models), or clinical
- Population genetics claims must reference specific databases (gnomAD, 1000 Genomes)
and specify ancestry groups when relevant
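Prompt-level constraints can be backed by a cheap post-hoc check on the draft answer. The sketch below covers two of the constraints above with regular expressions; the patterns and messages are illustrative assumptions and will miss paraphrases, so this complements rather than replaces the prompt.

```python
import re
from typing import List

OVERCONFIDENT = re.compile(r"\bdefinitely\s+(pathogenic|benign)\b", re.IGNORECASE)
FREQUENCY_CLAIM = re.compile(r"\b(allele|population)\s+frequency\b", re.IGNORECASE)
FREQUENCY_SOURCE = re.compile(r"\b(gnomAD|1000 Genomes)\b", re.IGNORECASE)

def check_constraints(draft: str) -> List[str]:
    """Return a list of constraint violations found in a draft answer."""
    violations = []
    if OVERCONFIDENT.search(draft):
        violations.append("overconfident classification")
    if FREQUENCY_CLAIM.search(draft) and not FREQUENCY_SOURCE.search(draft):
        violations.append("frequency claim without a named database")
    return violations

print(check_constraints("This variant is definitely pathogenic."))
print(check_constraints("Allele frequency in gnomAD is 0.0001; predicted to be damaging."))
```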
Reference Architecture: Building the Agent
Here’s a reference architecture for a biological tool-use agent using LangGraph:
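import operator
from typing import Annotated, List, TypedDict

from langgraph.graph import StateGraph, END

# Placeholders assumed to be defined elsewhere: llm, system_prompt (a message dict),
# parse_action, run_blast, run_alphafold, query_clinvar, build_synthesis_prompt.
MIN_REQUIRED_TOOLS = 2  # illustrative threshold
MAX_STEPS = 8           # guard against endless reasoner/tool loops

class AgentState(TypedDict):
    messages: Annotated[List[dict], operator.add]
    tool_outputs: Annotated[List[dict], operator.add]
    reasoning_trace: Annotated[List[str], operator.add]
    # dicts do not support "+"; merge per-tool scores with "|" (Python 3.9+)
    confidence_scores: Annotated[dict, operator.or_]
    final_answer: str

# Define nodes
def reasoner(state: AgentState):
    """LLM node that generates thoughts and selects tools"""
    # Prompt includes system instructions + conversation history + tool descriptions
    response = llm.invoke([system_prompt] + state["messages"])
    # Parse response into {"thought", "tool_choice", "tool_params"}
    step = parse_action(response)
    message = {"role": "assistant", "content": step["thought"],
               "tool_choice": step["tool_choice"], "tool_params": step["tool_params"]}
    return {"reasoning_trace": [step["thought"]], "messages": [message]}

def tool_executor(state: AgentState):
    """Executes selected biological tools"""
    tool_name = state["messages"][-1]["tool_choice"]
    tool_params = state["messages"][-1]["tool_params"]
    if tool_name == "blast":
        result = run_blast(tool_params["sequence"], tool_params["database"])
    elif tool_name == "alphafold":
        result = run_alphafold(tool_params["sequence"])
    elif tool_name == "clinvar":
        result = query_clinvar(tool_params["variant"])
    # ... other tools
    else:
        # Report unknown tools to the reasoner instead of failing on an unbound result
        return {"messages": [{"role": "tool", "content": f"Unknown tool: {tool_name}"}]}
    return {
        "tool_outputs": [result],
        "messages": [{"role": "tool", "content": str(result)}],
        "confidence_scores": {tool_name: result.get("confidence")},
    }

def synthesizer(state: AgentState):
    """Synthesizes tool outputs into final answer"""
    # Generate final answer with citations
    answer = llm.invoke(build_synthesis_prompt(state))
    return {"final_answer": answer}

def route_after_tools(state: AgentState) -> str:
    """Loop back to the reasoner until enough evidence is collected"""
    enough = len(state["tool_outputs"]) >= MIN_REQUIRED_TOOLS
    out_of_budget = len(state["reasoning_trace"]) >= MAX_STEPS
    return "synthesizer" if enough or out_of_budget else "reasoner"

# Build graph
workflow = StateGraph(AgentState)
workflow.add_node("reasoner", reasoner)
workflow.add_node("tool_executor", tool_executor)
workflow.add_node("synthesizer", synthesizer)
workflow.set_entry_point("reasoner")
workflow.add_edge("reasoner", "tool_executor")
# The conditional edge closes the ReAct loop; fixed edges alone would run each node once
workflow.add_conditional_edges("tool_executor", route_after_tools,
                               {"reasoner": "reasoner", "synthesizer": "synthesizer"})
workflow.add_edge("synthesizer", END)
agent = workflow.compile()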
Memory and Provenance
Biological agents must track:
- Tool call provenance — which tool produced which result, with what parameters
- Version tracking — AlphaFold 2 vs. 3, GRCh37 vs. GRCh38 genome builds
- Temporal context — when was this analysis run? (databases update)
Example memory structure:
{
  "session_id": "bio-agent-20260311-001",
  "query": "BRCA1 R1699Q structural impact",
  "tool_calls": [
    {
      "tool": "uniprot",
      "params": {"accession": "P38398"},
      "timestamp": "2026-03-11T09:45:22Z",
      "result_hash": "sha256:abc123..."
    },
    {
      "tool": "alphafold3",
      "params": {"sequence": "MDL...", "mutation": "R1699Q"},
      "timestamp": "2026-03-11T09:47:15Z",
      "result_hash": "sha256:def456...",
      "confidence": {"pLDDT": 87}
    }
  ],
  "genome_build": "GRCh38",
  "database_versions": {
    "clinvar": "2026-02-28",
    "uniprot": "2026-03-01"
  }
}
This enables reproducibility: six months later, you can re-run the analysis and see if database updates changed the conclusion.
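A record like the one above can be produced with the standard library alone. The sketch below is one possible approach, assuming tool results are JSON-serializable; `record_call` and `changed` are hypothetical helper names, and the canonical JSON encoding is what makes the hash stable across key orderings.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Any, Optional

def result_hash(result: Any) -> str:
    """Hash a canonical JSON encoding so that key order does not affect the digest."""
    canonical = json.dumps(result, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def record_call(tool: str, params: dict, result: Any,
                timestamp: Optional[datetime] = None) -> dict:
    """Build one entry for the 'tool_calls' list of the memory structure."""
    ts = timestamp or datetime.now(timezone.utc)
    return {
        "tool": tool,
        "params": params,
        "timestamp": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "result_hash": result_hash(result),
    }

def changed(old_record: dict, new_result: Any) -> bool:
    """True when a re-run produced a different result than the recorded one."""
    return old_record["result_hash"] != result_hash(new_result)

record = record_call(
    "uniprot",
    {"accession": "P38398"},
    {"length": 1863},
    datetime(2026, 3, 11, 9, 45, 22, tzinfo=timezone.utc),
)
print(record["timestamp"], changed(record, {"length": 1863}))
```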
Open-Source Implementations
Several open-source projects demonstrate biological agent architectures:
BioAgents (Scientific Reports, 2025)
A multi-agent system for bioinformatics analysis published in Scientific Reports (November 2025). Key features:
- Specialized agents for genomics, proteomics, and pathway analysis
- Consensus mechanism for conflicting predictions
- Integration with BLAST, AlphaFold, and KEGG
- Limitation: Focused on batch analysis rather than interactive querying
The Virtual Lab (bioRxiv, 2024)
A modular agent architecture for antibody design with experimental validation:
- Assigns tasks to different agents (design, validation, optimization)
- Tools include ESM, AlphaFold-Multimer, and Rosetta
- Notable: Actually validated predictions experimentally — designed nanobodies showed binding in vitro
- Limitation: Narrow scope (antibodies only)
ChemCrow (Nature Machine Intelligence, 2024)
While focused on chemistry rather than biology, ChemCrow’s architecture is highly relevant:
- 18 expert-designed tools for chemistry tasks (reaction prediction, retrosynthesis, property calculation)
- Couples an LLM (GPT-4) to specialized chemistry tools through a ReAct-style Thought, Action, Observation loop
- Successfully planned and executed multi-step syntheses
- Lesson for biology: The “LLM as orchestrator” pattern works when tools are well-defined and reliable
LangChain Bioinformatics Tools
Community-contributed LangChain tools for biology:
- BioPythonTools: Wraps BioPython functions (sequence manipulation, BLAST)
- AlphaFoldTool: Interfaces with AlphaFold servers
- PubmedTool: Literature search and summarization
- Status: Community-maintained; quality varies; good starting point for customization
Challenges and Limitations
Building biological agents is hard. Here are the key challenges:
1. Tool Reliability Variability
BLAST is rock-solid (30+ years of development). AlphaFold is highly reliable for monomeric proteins. But many bioinformatics tools:
- Have undocumented edge cases
- Fail silently on unusual inputs
- Return results that require expert interpretation
Mitigation: Implement result validation layers. For example, check that BLAST E-values are in expected ranges, that AlphaFold pLDDT scores are internally consistent, that variant coordinates match the reference genome.
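Two of these checks are sketched below: rejecting impossible E-values and confirming that a protein variant's reference residue matches the sequence it is applied to. The function names, the single-letter `R1699Q`-style variant format, and the toy sequence are assumptions for illustration; genomic coordinates would need a comparable check against the stated genome build.

```python
import math
import re
from typing import List

VARIANT_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")  # e.g. R1699Q

def validate_evalues(evalues: List[float]) -> List[str]:
    """E-values are expected counts, so negative or NaN values indicate a parsing bug."""
    return [f"Invalid E-value: {e}" for e in evalues if math.isnan(e) or e < 0]

def validate_variant(variant: str, sequence: str) -> List[str]:
    """Check that the variant's reference residue matches the sequence at that position."""
    match = VARIANT_RE.match(variant)
    if match is None:
        return [f"Unrecognized variant format: {variant}"]
    ref, position = match.group(1), int(match.group(2))
    if not 1 <= position <= len(sequence):
        return [f"Position {position} is outside the sequence (length {len(sequence)})"]
    if sequence[position - 1] != ref:
        return [f"Reference mismatch at {position}: expected {ref}, found {sequence[position - 1]}"]
    return []

TOY_SEQUENCE = "MDLSALRVEE"  # toy 10-residue sequence for illustration
problems = validate_variant("K7Q", TOY_SEQUENCE)
print(problems)
```

A reference mismatch usually means the variant was numbered against a different isoform or build, which is exactly the kind of silent error a validation layer should surface.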
2. Computational Cost
Running AlphaFold on a 2,000-amino-acid protein can take hours. scGPT inference on millions of cells requires significant GPU memory. Agents must:
- Queue long-running jobs
- Provide progress updates
- Offer approximate alternatives (ESMFold instead of AlphaFold for quick screening)
3. Hallucination Risk
LLMs hallucinate. In biology, hallucinations can be dangerous:
- Fabricated gene-disease associations
- Incorrect variant classifications
- Non-existent drug interactions
Mitigation:
- Ground all factual claims in tool outputs
- Implement fact-checking passes (query database again to verify)
- For clinical applications, require human review before any output reaches patients
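Grounding can be enforced mechanically if the agent is required to attach source identifiers to each claim. The sketch below assumes a hypothetical claim format (a dict with `text` and `sources`) and tool outputs indexed by call ID; it only checks that cited sources exist, not that they actually support the claim.

```python
from typing import Dict, List

def ungrounded_claims(claims: List[dict], tool_outputs: Dict[str, dict]) -> List[str]:
    """Return the text of every claim that cites no source or cites a missing one."""
    missing = []
    for claim in claims:
        sources = claim.get("sources", [])
        if not sources or any(source not in tool_outputs for source in sources):
            missing.append(claim["text"])
    return missing

# Hypothetical tool outputs and draft claims for illustration.
tool_outputs = {"clinvar-1": {"classification": "Pathogenic"}}
claims = [
    {"text": "ClinVar classifies the variant as pathogenic.", "sources": ["clinvar-1"]},
    {"text": "The variant alters drug response.", "sources": []},
    {"text": "The variant is common in the population.", "sources": ["gnomad-1"]},
]
flagged = ungrounded_claims(claims, tool_outputs)
print(flagged)
```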
4. Population Bias
Most genomic datasets are heavily skewed toward European ancestry (commonly cited estimates put GWAS participants above 80% European). Agents trained or calibrated on these databases will perform worse for other populations:
- Polygenic risk scores have reduced accuracy in non-European populations
- Variant frequency estimates from gnomAD are skewed
- Reference genomes don’t capture population-specific variation
Mitigation:
- Always report ancestry context for population genetics claims
- Flag when analysis relies on Eurocentric databases
- Use diverse reference panels when available (All of Us, H3Africa)
5. The “Last Mile” Problem
Agents can generate hypotheses and predictions, but biological truth requires experimental validation. The gap between computational prediction and wet-lab confirmation remains wide:
- AlphaFold structures are predictions, not measurements
- Drug-target binding predictions require biochemical validation
- Gene expression predictions need qPCR or RNA-seq confirmation
Honest framing: Agents should present computational results as predictions requiring validation, not as established facts.
Getting Started: A Minimal Viable Agent
If you’re building your first biological agent, start small:
- Pick one domain — variant interpretation, protein structure analysis, or literature review
- Select 3-5 core tools — don’t try to integrate everything at once
- Implement basic ReAct loop — thought, action, observation
- Add confidence tracking — propagate uncertainty through reasoning
- Test on known examples — use well-characterized variants/proteins with established answers
- Iterate — add tools, improve prompts, handle edge cases
Example minimal variant interpretation agent:
- Tools: ClinVar lookup, AlphaFold structure prediction, PubMed search
- Input: Gene name + variant (e.g., “BRCA1 R1699Q”)
- Output: Pathogenicity assessment with evidence summary
- Validation: Compare against ClinVar classifications for known variants
Once this works reliably, expand to multi-omics integration, drug-target prediction, or experimental design.
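The validation step for the minimal agent can be a short harness that compares agent output with known classifications. In the sketch below, `stub_agent` is a hypothetical stand-in for the real agent, and the benchmark labels are simply the illustrative classifications used in this post's examples; in practice they would come from a ClinVar export.

```python
from typing import Callable, Dict

def evaluate(agent: Callable[[str], str], benchmark: Dict[str, str]) -> dict:
    """Compare agent classifications against expected labels."""
    if not benchmark:
        raise ValueError("benchmark must not be empty")
    mismatches = {}
    for variant, expected in benchmark.items():
        predicted = agent(variant)
        if predicted != expected:
            mismatches[variant] = {"expected": expected, "predicted": predicted}
    correct = len(benchmark) - len(mismatches)
    return {"accuracy": correct / len(benchmark), "mismatches": mismatches}

# Illustrative labels taken from the examples in this post.
BENCHMARK = {
    "BRCA1 R1699Q": "Pathogenic",
    "TP53 R248Q": "Pathogenic",
    "EGFR L858R": "Pathogenic",
}

def stub_agent(variant: str) -> str:
    # Hypothetical agent that is deliberately wrong on one variant.
    return "Uncertain significance" if variant.startswith("BRCA1") else "Pathogenic"

report = evaluate(stub_agent, BENCHMARK)
print(report)
```

Keeping the mismatches, not just the accuracy, makes it easier to see whether failures cluster in one gene or one tool.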
Conclusion
Biological tool-use agents are not science fiction — they’re being built today. The architecture is clear: ReAct loops orchestrating domain-specific tools, with careful attention to confidence propagation, error handling, and provenance tracking. The tools exist: AlphaFold, ESM, BLAST, scGPT, and dozens of specialized databases and APIs.
But the hard work is in the details: handling partial results, managing computational costs, preventing hallucinations, and maintaining scientific rigor. The agents that succeed will be those that embrace uncertainty, document limitations, and keep humans in the loop for high-stakes decisions.
The next post in this series examines a specific high-value application: agents for drug discovery, where the agentic omics vision meets the multi-billion-dollar reality of pharmaceutical development.
Glossary
| Term | Definition |
|---|---|
| ReAct | Reason + Act pattern for AI agents; interleaves reasoning traces with tool calls |
| pLDDT | Predicted Local Distance Difference Test; AlphaFold confidence score (0-100) |
| E-value | Expectation value in BLAST; number of hits expected by chance (lower = more significant) |
| MSA | Multiple Sequence Alignment; aligning homologous sequences to identify conserved regions |
| Ortholog | Genes in different species that evolved from a common ancestral gene |
| Paralog | Genes related by duplication within a genome |
| Pathogenic | Disease-causing (clinical variant classification) |
| VUS | Variant of Uncertain Significance; insufficient evidence for pathogenicity classification |
| PPI | Protein-Protein Interaction; physical associations between proteins |
| ADMET | Absorption, Distribution, Metabolism, Excretion, Toxicity; drug-like properties |
| Docking | Computational prediction of how a small molecule binds to a protein |
| In silico | Performed on computer (vs. in vitro = in glass/lab, in vivo = in living organism) |
| gnomAD | Genome Aggregation Database; population frequency data for genetic variants |
| ClinVar | Database of clinically interpreted genetic variants |
| LangGraph | Framework for building stateful, multi-actor AI applications (built on LangChain) |
References
- Bran, A. M., et al. “ChemCrow: Augmenting large-language models with chemistry tools.” Nature Machine Intelligence (2024). DOI: 10.1038/s42256-024-00823-x
- Scientific Reports. “BioAgents: Bridging the gap in bioinformatics analysis with multi-agent systems.” Scientific Reports 15, Article number: 25919 (2025). DOI: 10.1038/s41598-025-25919-z
- Frontiers in Artificial Intelligence. “AI, agentic models and lab automation for scientific discovery — the beginning of scAInce.” Front. Artif. Intell. (2025). DOI: 10.3389/frai.2025.1649155
- LangChain. “LangGraph: Agent Orchestration Framework.” https://www.langchain.com/langgraph (2024-2025)
- Jumper, J., et al. “Highly accurate protein structure prediction with AlphaFold.” Nature 596, 583-589 (2021). DOI: 10.1038/s41586-021-03819-2
- Abramson, J., et al. “Accurate structure prediction of biomolecular interactions with AlphaFold 3.” Nature 630, 493-500 (2024). DOI: 10.1038/s41586-024-07487-w
- Lin, Z., et al. “Evolutionary-scale prediction of atomic-level protein structure with a language model.” Science 379, 1123-1130 (2023). DOI: 10.1126/science.ade2574
- Cui, H., et al. “scGPT: toward building a foundation model for single-cell multi-omics using generative AI.” Nature Methods 21, 1470-1481 (2024). DOI: 10.1038/s41592-024-02305-9
- Zhou, Z., et al. “DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome Understanding.” ICLR (2024). arXiv:2306.15006
- Nguyen, E., et al. “Sequence modeling and design from molecular to genome scale with Evo.” Science 386, eado9336 (2024). DOI: 10.1126/science.ado9336
- Briefings in Bioinformatics. “Streamline automated biomedical discoveries with agentic bioinformatics.” Brief. Bioinform. 26(5): bbaf505 (2025). DOI: 10.1093/bib/bbaf505
- Nature. “Will self-driving ‘robot labs’ replace biologists? Paper sparks debate.” Nature (2026). DOI: 10.1038/d41586-026-00453-8
- Swanson, K., et al. “The Virtual Lab: AI Agents Design New SARS-CoV-2 Nanobodies with Experimental Validation.” bioRxiv (2024). DOI: 10.1101/2024.11.11.623004
- MDPI. “AlphaFold3: An Overview of Applications and Performance Insights.” Int. J. Mol. Sci. 26(8): 3671 (2025). DOI: 10.3390/ijms26083671
- FEBS Open Bio. “An outlook on structural biology after AlphaFold: tools, limits and perspectives.” FEBS Open Bio (2024). DOI: 10.1002/2211-5463.13902