Introduction: Beyond the Chatbot

When you ask ChatGPT a question, it answers. When you ask an agentic AI system a question, it acts. This distinction — between passive assistance and autonomous execution — marks one of the most significant shifts in artificial intelligence since the transformer architecture itself.

Agentic AI systems are not merely more sophisticated chatbots. They are autonomous entities capable of perception, reasoning, planning, tool use, action, and memory. They can independently execute multi-step workflows, make decisions when faced with uncertainty, and adapt their approach based on feedback from the environment. In scientific contexts, this means agents that can read literature, formulate hypotheses, design experiments, execute computational analyses, interpret results, and iterate — all with varying degrees of human oversight.

This post, the thirteenth in our Agentic Omics series, defines the agentic AI paradigm and surveys its emergence in scientific discovery. We examine what makes an AI system “agentic,” trace the evolution from prompt engineering to autonomous agents, review key frameworks and implementations in chemistry and biology, and critically assess what “autonomy” actually means in practice. This foundation is essential for understanding the Agentic Omics vision we will articulate in Post 14: the orchestration of LLM reasoning with domain-specific biological AI models.

Defining Agentic AI: Perception, Reasoning, Planning, Action, Memory

The term “agent” has a rich history in philosophy and artificial intelligence. At its core, an agent is an entity with the ability to act — to take actions in response to sensory input, whether operating in physical, virtual, or mixed-reality environments [1]. Agentic AI introduces a paradigm of embodied intelligence, where intelligence emerges from the interaction between autonomy, learning, memory, perception, planning, decision-making, and action [2].

For practical purposes in scientific applications, we can define agentic AI by six core capabilities:

1. Perception: The ability to gather information from the environment. For a biological agent, this might mean reading scientific papers from PubMed, querying databases like UniProt or PDB, parsing experimental data from files, or receiving input from laboratory instruments.

2. Reasoning: The capacity to draw inferences, evaluate evidence, and make judgments. LLMs provide general reasoning, but scientific agents often need domain-specific judgment: recognizing that a p-value below 0.05 is a conventional significance threshold rather than proof of an effect, treating a low-confidence protein structure prediction cautiously, or spotting confounding factors in an experimental design.

3. Planning: The ability to decompose complex goals into sequences of actions. Given a goal like “identify potential drug targets for this cancer mutation,” an agent must plan: first retrieve mutation details from ClinVar, then predict structural impact with AlphaFold, then assess functional consequence with ESM, then search for existing inhibitors in ChEMBL, and so on.

4. Tool Use: The capability to invoke external functions and APIs. Scientific agents need access to specialized tools: BLAST for sequence homology, AlphaFold for structure prediction, DESeq2 for differential expression, RDKit for chemical property calculation, and many more. Tool use transforms LLMs from text generators into action executors.

5. Action: The execution of decisions in the environment. This might mean writing files, submitting jobs to compute clusters, sending API requests, controlling laboratory robots, or generating reports. Actions change the state of the world.

6. Memory: The ability to retain information across interactions. Scientific work requires maintaining context: What experiments have already been tried? What were the results? What hypotheses have been ruled out? Memory can be short-term (within a session) or long-term (persisted across sessions), and can include episodic memory (specific experiences) and semantic memory (general knowledge).

These capabilities are not unique to agentic AI — traditional software systems have many of them. What distinguishes agentic AI is the integration of these capabilities through an LLM-based reasoning layer that can flexibly adapt to novel situations, handle ambiguity, and make judgment calls that would previously have required human intervention.
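The six capabilities can be tied together in a single control loop. The toy sketch below uses stubs for every component (the class, method names, and return strings are all illustrative, not from any framework); the point is the perceive, plan, act, remember flow, not the implementations:

```python
# Toy agent loop tying the six capabilities together. Every component is
# a stub; in a real agent, plan() and act() would call an LLM and tools.
class Agent:
    def __init__(self):
        self.memory = []                       # 6. memory

    def perceive(self, env):                   # 1. perception
        return env["observation"]

    def plan(self, goal, obs):                 # 2-3. reasoning + planning
        return [f"analyze {obs}", f"report on {goal}"]

    def act(self, step):                       # 4-5. tool use + action
        return f"done: {step}"

    def run(self, goal, env):
        obs = self.perceive(env)
        for step in self.plan(goal, obs):
            result = self.act(step)
            self.memory.append(result)         # retain context across steps
        return self.memory

agent = Agent()
trace = agent.run("TP53 impact", {"observation": "mutation R175H"})
```

The LLM-based reasoning layer replaces the hard-coded `plan` with flexible decomposition, which is what lets the same loop handle novel goals.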

The Evolution: From Prompt Engineering to Autonomous Agents

The path to agentic AI has been iterative, with each stage addressing limitations of the previous approach:

Stage 1: Prompt Engineering (2022–2023)

The initial approach to leveraging LLMs involved carefully crafting prompts to elicit desired responses. Techniques like chain-of-thought prompting [3], few-shot learning, and role specification improved performance on reasoning tasks. However, prompt engineering has fundamental limitations:

  • No tool access: LLMs could only generate text, not interact with external systems.
  • No memory: Each query was independent; the model couldn’t remember previous interactions.
  • No planning: Complex tasks requiring multiple steps had to be manually decomposed by users.
  • Hallucination: Models confidently generated incorrect information, especially for specialized domains.

Stage 2: Retrieval-Augmented Generation (RAG) (2023–2024)

RAG systems addressed the knowledge limitation by retrieving relevant documents from external databases and providing them as context to the LLM [4]. For scientific applications, this meant agents could access up-to-date literature, databases, and experimental results. However, RAG systems still lacked:

  • Action capability: They could retrieve and synthesize information but not execute tasks.
  • Multi-step reasoning: Complex workflows still required manual orchestration.
  • Adaptive behavior: The retrieval strategy was typically fixed, not learned from feedback.
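The retrieve-then-generate pattern is simple to sketch. Below, keyword overlap stands in for embedding similarity, the corpus is a made-up three-document toy, and `answer` returns the assembled prompt rather than calling a model:

```python
# Minimal RAG sketch: retrieve top-k documents by keyword overlap, then
# build a grounded prompt. A real system would use embedding search and
# pass the prompt to an LLM.
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    q = set(query.lower().split())
    scored = sorted(corpus, key=lambda doc: -len(q & set(doc.lower().split())))
    return scored[:k]

def answer(query: str, corpus: list[str]) -> str:
    context = "\n".join(retrieve(query, corpus))
    prompt = (f"Context:\n{context}\n\n"
              f"Question: {query}\nAnswer using only the context.")
    return prompt  # in a real system: return llm(prompt)

corpus = [
    "TP53 is a tumor suppressor gene frequently mutated in cancer.",
    "BRCA1 mutations are associated with hereditary breast cancer.",
    "The Krebs cycle occurs in the mitochondrial matrix.",
]
prompt = answer("Which gene is a tumor suppressor mutated in cancer?", corpus)
```

Grounding the model in retrieved text reduces hallucination, but note that nothing in this loop can *act* on the answer, which is the gap the next stage addresses.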

Stage 3: Function Calling and Tool Use (2024)

The introduction of function calling capabilities allowed LLMs to invoke external APIs and tools [5]. Models like GPT-4 could parse user requests, determine which functions to call, and integrate the results into responses. This was a crucial step toward agency:

  • Tool integration: Agents could now query databases, run computations, and control external systems.
  • Structured outputs: Function calling enforced structured responses, reducing hallucination.
  • Composability: Multiple tools could be combined to accomplish complex tasks.

However, function calling still required the user to specify which tools were available and often required manual sequencing of tool calls.
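The mechanics of function calling reduce to a registry plus a dispatcher: the model emits a structured call, which the runtime validates and executes. The tool names and return values below are invented for illustration:

```python
import json

# Hypothetical tool registry: the LLM returns a JSON "call", which we
# parse, validate against the registry, and dispatch.
TOOLS = {
    "blast_search": lambda sequence: f"top hit for {sequence[:10]}",
    "get_structure": lambda pdb_id: f"structure record for {pdb_id}",
}

def dispatch(model_output: str) -> str:
    call = json.loads(model_output)           # structured output from the LLM
    name, args = call["name"], call["arguments"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**args)

result = dispatch('{"name": "get_structure", "arguments": {"pdb_id": "1TUP"}}')
# result == "structure record for 1TUP"
```

Because the call is structured JSON rather than free text, it can be validated before execution, which is why function calling reduces hallucination relative to parsing tool requests out of prose.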

Stage 4: Autonomous Agents (2024–2025)

The current frontier is autonomous agents that can independently plan and execute multi-step workflows. Frameworks like LangChain [6], LangGraph [7], AutoGen [8], and CrewAI [9] provide infrastructure for:

  • Autonomous planning: Agents decompose goals into subtasks and determine the sequence of actions.
  • Iterative execution: Agents can execute actions, observe results, and adjust their approach based on feedback.
  • Memory management: Agents maintain context across multiple steps and sessions.
  • Multi-agent collaboration: Multiple specialized agents can work together on complex problems.

The key breakthrough is the ReAct (Reason + Act) paradigm [10], where agents alternate between reasoning (thinking about what to do) and acting (executing tools), creating a loop that enables autonomous problem-solving.
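The ReAct loop itself is short. In this sketch a scripted iterator stands in for the real model call, and the `Action: tool[input]` format and tool names are illustrative conventions, not a library API:

```python
import re

# Skeletal ReAct loop: alternate between a model step (reasoning, which may
# request an action) and tool execution (acting), feeding observations back.
def run_react(llm, tools, question, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)               # Reason: thought + optional action
        transcript += step + "\n"
        match = re.search(r"Action: (\w+)\[(.*)\]", step)
        if not match:                        # no action -> treat as final answer
            return step
        tool, arg = match.groups()
        observation = tools[tool](arg)       # Act: execute the tool
        transcript += f"Observation: {observation}\n"
    return "max steps reached"

# Scripted stub standing in for a real LLM:
script = iter([
    "Thought: I should look up the gene. Action: lookup[TP53]",
    "Thought: I have what I need. Final answer: TP53 is a tumor suppressor.",
])
tools = {"lookup": lambda g: f"{g}: tumor suppressor, chromosome 17"}
final = run_react(lambda t: next(script), tools, "What is TP53?")
```

The loop terminates either when the model stops requesting actions or when the step budget runs out; the budget is the simplest guard against an agent cycling indefinitely.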

Key Agentic Frameworks: LangChain, LangGraph, AutoGen, CrewAI

Several frameworks have emerged to support agentic AI development. Each has different strengths for scientific applications:

LangChain and LangGraph

LangChain is a framework for building LLM-powered applications, providing abstractions for prompts, chains, agents, and memory [6]. LangGraph extends LangChain with graph-based agent orchestration, enabling stateful, multi-agent workflows with loops and branching [7].

Key features:

  • State management: LangGraph maintains explicit state across agent interactions, crucial for scientific workflows where context matters.
  • Graph-based flows: Workflows can be represented as directed graphs, with nodes representing agent actions and edges representing state transitions.
  • Human-in-the-loop: LangGraph supports interruption points where human review is required before proceeding.
  • Tool integration: Extensive library of pre-built tools for common tasks.

Scientific applications: LangGraph has been used for automated literature review, experimental design assistance, and multi-step data analysis pipelines. The graph-based approach is particularly well-suited for representing scientific workflows, which often have conditional branches (e.g., “if p-value < 0.05, proceed to pathway analysis; otherwise, reconsider hypothesis”).
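The conditional-branch pattern can be sketched as a tiny graph-style state machine in plain Python. This is illustrative only, not actual LangGraph code (real LangGraph builds a `StateGraph` with nodes and conditional edges); the node functions and state keys here are invented:

```python
# Plain-Python sketch of a graph workflow with a conditional edge,
# mirroring "if p-value < 0.05, proceed to pathway analysis".
def run_deg_test(state):           # node: differential expression test
    state["p_value"] = 0.01        # stand-in for a real DESeq2 result
    return state

def pathway_analysis(state):       # node taken on the significant branch
    state["result"] = "enriched: apoptosis pathway"
    return state

def revisit_hypothesis(state):     # node taken on the null branch
    state["result"] = "no significant genes; reconsider hypothesis"
    return state

def branch(state):                 # conditional edge: routes on shared state
    return "pathway" if state["p_value"] < 0.05 else "revisit"

NODES = {"deg": run_deg_test,
         "pathway": pathway_analysis,
         "revisit": revisit_hypothesis}

state = NODES["deg"]({})           # run the entry node
state = NODES[branch(state)](state)  # follow the conditional edge
```

The explicit shared `state` dict is the key design choice: every node reads and writes the same record, which is what makes the workflow resumable and auditable.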

LangChain and LangGraph reached version 1.0 in November 2025, marking their maturation as production-ready frameworks [11].

AutoGen

AutoGen, developed by Microsoft Research, focuses on multi-agent conversation and collaboration [8]. Agents can converse with each other, delegate tasks, and collectively solve problems.

Key features:

  • Conversational agents: Agents communicate through natural language, making multi-agent systems interpretable.
  • Flexible topologies: Supports one-to-one, one-to-many, and many-to-many agent interactions.
  • Code execution: Agents can write and execute code, enabling computational tasks.
  • Human participation: Humans can join conversations as agents, providing guidance or oversight.

Scientific applications: AutoGen has been used for collaborative hypothesis generation, where different agents represent different scientific perspectives (e.g., a “statistician” agent critiquing experimental design, a “domain expert” agent providing biological context).

CrewAI

CrewAI focuses on role-based agent teams, where each agent has a defined role, goal, and backstory [9]. This approach is inspired by human organizational structures.

Key features:

  • Role specification: Agents are defined by their role (e.g., “Senior Bioinformatician”), goal, and backstory.
  • Task delegation: Tasks are assigned to agents based on their roles.
  • Sequential and hierarchical processes: Supports both linear workflows and nested task structures.

Scientific applications: CrewAI is well-suited for representing scientific teams, where different specialists contribute their expertise to a common goal.

Scientific Agents in Action: ChemCrow, CoScientist, and BioAgent

Several pioneering systems demonstrate the potential of agentic AI in scientific discovery:

ChemCrow: Autonomous Chemistry Tool Orchestration

ChemCrow, developed by researchers at EPFL and the University of Rochester and published in 2024, is an autonomous agent for chemistry research [12]. Built on LangChain, ChemCrow integrates 18 expert-designed tools covering organic synthesis, drug discovery, and materials design.

Capabilities:

  • Literature search: Queries scientific databases and patents for relevant information.
  • Reaction planning: Suggests synthetic routes for target molecules.
  • Property prediction: Estimates physicochemical properties using ML models.
  • Experimental guidance: Provides step-by-step instructions for laboratory procedures.

Demonstrated achievements:

  • Successfully planned the synthesis of an insect repellent (DEET) from readily available precursors.
  • Planned the syntheses of three organocatalysts, selecting routes and reagents from literature precedent.
  • Guided the discovery of a new chromophore by iteratively proposing and evaluating molecular structures.

Key insight: ChemCrow demonstrates that LLMs can effectively orchestrate domain-specific tools, even when the LLM itself lacks deep chemistry knowledge. The agent’s value comes from integration — knowing which tools to use, in what order, and how to interpret their outputs.

CoScientist: Autonomous Experimental Design and Execution

CoScientist, developed at Carnegie Mellon University and published in Nature in 2023, represents a more advanced level of autonomy [13]. Powered by GPT-4, CoScientist can autonomously design, plan, and execute complex chemistry experiments.

Architecture:

  • Web browsing: Searches scientific literature and documentation.
  • Code execution: Writes and runs Python code for data analysis.
  • Laboratory automation: Controls robotic experimental platforms.
  • Iterative refinement: Adjusts experimental parameters based on results.

Demonstrated achievements:

  • Autonomously planned and executed palladium-catalyzed cross-coupling reactions (Suzuki–Miyaura and Sonogashira), from literature search through robotic execution.
  • Optimized reaction yield through iterative experimentation, outperforming human-designed protocols.
  • Generated complete experimental reports, including methods, results, and interpretation.

Significance: CoScientist is notable for closing the loop between computational planning and physical execution. The agent doesn’t just suggest experiments — it performs them, observes outcomes, and adapts its approach. This represents a step toward self-driving laboratories.

CoScientist’s capabilities were extended in 2024–2025 through integration with the Carnegie Mellon Cloud Lab, providing remote access to over 200 pieces of laboratory equipment [14].

BioAgent and Biology-Specific Implementations

While chemistry has seen more advanced agentic implementations, biology-specific agents are emerging:

BioAgent frameworks: Several research groups have developed BioAgent systems for tasks like literature mining, hypothesis generation, and experimental design in molecular biology. These agents typically integrate tools for:

  • Sequence analysis (BLAST, HMMER)
  • Structure prediction (AlphaFold, ESMFold)
  • Pathway analysis (KEGG, Reactome)
  • Literature search (PubMed, Semantic Scholar)

Agent Laboratory: A 2025 system demonstrated automated research workflows including data preparation, experimentation, and report writing [15]. However, performance dropped significantly in the literature review phase, highlighting the challenges of automating structured scientific reasoning.

Limitations: Biology presents unique challenges for agentic AI:

  • Higher complexity: Biological systems are more complex and less predictable than chemical reactions.
  • Longer feedback loops: Biological experiments often take days or weeks, compared to hours for chemistry.
  • Greater variability: Biological reproducibility is notoriously challenging, complicating agent learning.
  • Ethical constraints: Human and animal research requires oversight that limits full autonomy.

What “Autonomy” Means in Practice: Levels of Automation

The term “autonomy” is often used loosely. For scientific applications, it’s useful to distinguish levels of automation, analogous to the SAE levels for self-driving cars:

Level 0: No Automation

  • Human performs all tasks.
  • AI may provide reference information (e.g., search engines).

Level 1: Tool Assistance

  • AI provides individual tools (e.g., AlphaFold for structure prediction).
  • Human decides when and how to use each tool.
  • Human integrates results and makes decisions.

Level 2: Partial Automation

  • AI can execute multi-step workflows with human specification.
  • Human defines the goal and workflow; AI executes.
  • Human reviews results and makes decisions.

Level 3: Conditional Autonomy

  • AI can independently plan and execute workflows for well-defined tasks.
  • Human provides high-level goals; AI determines the approach.
  • Human reviews and approves key decisions (human-on-the-loop).
  • AI can handle exceptions within defined parameters.

Level 4: High Autonomy

  • AI operates autonomously for extended periods within a domain.
  • Human provides objectives; AI handles planning, execution, and interpretation.
  • Human intervention only for edge cases or system failures.
  • AI can adapt to novel situations within its domain.

Level 5: Full Autonomy

  • AI operates autonomously across all scientific tasks.
  • No human intervention required.
  • AI can formulate its own research questions and pursue them.
  • This level does not currently exist and may not be desirable for scientific applications.

Current state: Most scientific agentic AI systems operate at Level 2–3. ChemCrow and CoScientist demonstrate Level 3 capabilities within their domains — they can autonomously plan and execute experiments, but human oversight is still essential for validation and interpretation.

Why full autonomy is neither desirable nor achievable for clinical applications:

  • Accountability: Clinical decisions affect patient lives; humans must remain accountable.
  • Uncertainty: Biological systems are inherently uncertain; human judgment is needed for edge cases.
  • Ethics: Ethical decisions require human values and context.
  • Regulation: Regulatory frameworks require human oversight for clinical applications.

The goal of agentic omics is not to replace human scientists but to amplify them — handling routine tasks, surfacing insights from large datasets, and enabling scientists to focus on creative hypothesis generation and interpretation.

The LLM as Orchestrator: Combining General Reasoning with Domain-Specific Tools

The architecture of scientific agentic AI follows a consistent pattern: an LLM serves as the “brain” or orchestrator, coordinating specialized tools that provide domain-specific capabilities.

┌─────────────────────────────────────────────────────────────┐
│                      LLM Orchestrator                        │
│  (General reasoning, planning, natural language interface)   │
└─────────────────────────────────────────────────────────────┘
                              │
        ┌─────────────────────┼─────────────────────┐
        │                     │                     │
        ▼                     ▼                     ▼
┌───────────────┐   ┌───────────────┐   ┌───────────────┐
│  AlphaFold    │   │     BLAST     │   │   PubMed      │
│  (Structure)  │   │ (Homology)    │   │ (Literature)  │
└───────────────┘   └───────────────┘   └───────────────┘
        │                     │                     │
        └─────────────────────┼─────────────────────┘
                              │
                              ▼
                   ┌─────────────────────┐
                   │   Integrated Result │
                   │  (Interpreted by    │
                   │   LLM + Tools)      │
                   └─────────────────────┘

Why this architecture works:

  1. LLMs excel at: Natural language understanding, general reasoning, planning, tool selection, result integration, and communication.

  2. Domain-specific tools excel at: Precise computations (AlphaFold for structure), database queries (BLAST for homology), statistical analysis (DESeq2 for differential expression), and specialized tasks that require training data or algorithms beyond the LLM’s knowledge.

  3. Together: The LLM provides flexibility and adaptability; the tools provide accuracy and domain expertise. The combination is more capable than either alone.

Example workflow: Given a cancer mutation, an agentic system might:

  1. Query ClinVar for known clinical significance (tool: ClinVar API)
  2. Predict structural impact using AlphaFold 3 (tool: AlphaFold API)
  3. Assess functional consequence using ESM-3 embeddings (tool: ESM API)
  4. Search for existing inhibitors in ChEMBL (tool: ChEMBL API)
  5. Generate a summary report integrating all findings (LLM reasoning)

The LLM doesn’t need to know how AlphaFold works — it needs to know when to use AlphaFold, how to interpret its outputs, and what to do next based on those outputs.
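The five-step workflow above can be sketched as a pipeline over mocked tools. Every function name, signature, and return value here is a placeholder, not a real API; a deployed agent would replace each stub with an actual API client and let the LLM compose the final report:

```python
# Mocked version of the mutation-analysis workflow. Each stub stands in
# for a real tool call (ClinVar, AlphaFold, ESM, ChEMBL).
def query_clinvar(variant):     return {"significance": "pathogenic"}
def predict_structure(variant): return {"confidence": 0.82}
def assess_function(variant):   return {"effect": "loss of function"}
def search_inhibitors(gene):    return ["compound_A", "compound_B"]

def analyze_mutation(gene, variant):
    findings = {
        "clinvar":    query_clinvar(variant),      # step 1
        "structure":  predict_structure(variant),  # step 2
        "function":   assess_function(variant),    # step 3
        "inhibitors": search_inhibitors(gene),     # step 4
    }
    # Step 5: an LLM would synthesize the report; here we just template it.
    return (f"{gene} {variant}: {findings['clinvar']['significance']}, "
            f"{findings['function']['effect']}; "
            f"{len(findings['inhibitors'])} candidate inhibitors")

report = analyze_mutation("TP53", "R175H")
```

The orchestrator's real job is deciding this sequence at runtime (and re-planning when a step fails or returns low confidence) rather than hard-coding it as above.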

Challenges and Limitations

Despite promising demonstrations, agentic AI for scientific discovery faces significant challenges:

Hallucination in Biological Contexts

LLMs can generate plausible-sounding but incorrect information. In scientific contexts, hallucination can have serious consequences:

  • Incorrect citation of non-existent papers
  • Misinterpretation of experimental results
  • Fabrication of biological facts

Mitigation strategies:

  • Ground all claims in retrieved evidence (RAG)
  • Require citations for factual claims
  • Implement verification chains where agents check each other’s work
  • Use structured outputs and tool validation
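A simple grounding check illustrates the first two strategies: flag any sentence in a draft that does not cite a document the agent actually retrieved. The citation format and IDs below are illustrative:

```python
import re

# Grounding-check sketch: every sentence must cite at least one retrieved
# document ID (bracketed, e.g. [PMID1]); anything else is flagged.
def ungrounded_sentences(draft: str, retrieved_ids: set[str]) -> list[str]:
    flagged = []
    for sentence in filter(None, (s.strip() for s in draft.split("."))):
        cited = set(re.findall(r"\[(\w+)\]", sentence))
        if not cited or not cited <= retrieved_ids:
            flagged.append(sentence)
    return flagged

draft = "TP53 is frequently mutated [PMID1]. It cures cancer [PMID9]."
issues = ungrounded_sentences(draft, {"PMID1", "PMID2"})
# issues == ["It cures cancer [PMID9]"]
```

Checks like this catch fabricated or uncited claims mechanically; verifying that a cited source actually supports the claim still requires a second agent or a human.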

Computational Cost

Agentic workflows can be expensive:

  • Multiple LLM calls per workflow (planning, tool selection, interpretation)
  • Expensive tool calls (AlphaFold predictions, large database queries)
  • Iterative refinement multiplies costs

Considerations:

  • Cost-benefit analysis: Is the agent’s output worth the expense?
  • Optimization: Cache results, use smaller models for simple tasks
  • Prioritization: Reserve agentic workflows for high-value tasks

Result Validation

How do we know the agent’s conclusions are correct?

  • Agents can make reasoning errors even with correct tool outputs
  • Tool outputs may be misinterpreted
  • Confirmation bias: agents may seek evidence supporting initial hypotheses

Approaches:

  • Multi-agent debate: agents critique each other’s reasoning
  • Human review: critical decisions require human approval
  • Benchmarking: evaluate agent performance on known problems
  • Uncertainty quantification: agents should express confidence levels

Reproducibility

Scientific reproducibility is already a challenge; agentic AI adds complexity:

  • LLM outputs are stochastic (different runs may produce different results)
  • Tool versions and databases change over time
  • Agent memory and state may not be fully captured

Requirements:

  • Complete logging of agent actions and decisions
  • Version control for tools and models
  • Seed setting for reproducibility where possible
  • Detailed provenance tracking
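A minimal provenance log covers most of these requirements: every action is appended with a timestamp, tool version, and a hash of its inputs so a run can be audited later. The record fields and example values are illustrative:

```python
import datetime
import hashlib
import json

# Append-only provenance log for agent actions. Hashing the (sorted)
# JSON-serialized inputs gives a compact, order-independent fingerprint.
log: list[dict] = []

def record_action(tool: str, version: str, inputs: dict, output: str) -> None:
    log.append({
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "tool": tool,
        "version": version,
        "input_hash": hashlib.sha256(
            json.dumps(inputs, sort_keys=True).encode()).hexdigest()[:12],
        "output": output,
    })

record_action("blast", "2.14.0", {"query": "MKTAYIAKQR"}, "hit: P04637")
```

In practice this log would be persisted alongside model identifiers, sampling seeds, and database snapshot dates, since matching the hash of the inputs is worthless if the tool behind them has silently changed.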

The Literature Review Bottleneck

A surprising finding from Agent Laboratory and similar systems is that literature review automation remains challenging [15]. Reasons include:

  • Scientific papers are complex and nuanced
  • Relevant information may be spread across many papers
  • Understanding requires domain expertise
  • Contradictory findings must be reconciled

Current approaches:

  • Specialized literature search agents
  • Multi-agent systems where one agent searches and another evaluates
  • Human-AI collaboration where AI surfaces candidates and humans evaluate

The Scientist in the Loop: Why Human Oversight Remains Essential

Given these challenges, the most practical and responsible approach to agentic AI in science is human-in-the-loop or human-on-the-loop rather than full autonomy:

Human-in-the-loop: Humans are actively involved in each step, reviewing and approving agent actions before execution. This is appropriate for high-stakes decisions (clinical applications, novel experimental protocols).

Human-on-the-loop: Humans monitor agent activity and intervene when necessary, but agents operate autonomously within defined parameters. This is appropriate for routine tasks with established protocols.
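The boundary between the two modes can be implemented as an approval gate: actions above a risk threshold block until a human signs off, while routine actions run autonomously. The risk labels and `approve` callback below are placeholders for a real review workflow:

```python
# Approval-gate sketch: high-risk actions require explicit human sign-off
# before execution; low-risk actions proceed autonomously.
def execute(action: str, risk: str, approve=lambda a: False) -> str:
    if risk == "high" and not approve(action):
        return f"BLOCKED (awaiting human approval): {action}"
    return f"EXECUTED: {action}"

routine = execute("rerun QC on batch 7", risk="low")
blocked = execute("order synthesis of novel compound", risk="high")
signed  = execute("order synthesis of novel compound", risk="high",
                  approve=lambda a: True)   # human signed off
```

The design choice worth noting is that the gate sits in the execution path, not in the prompt: the agent cannot talk its way past it, which is the property regulators and review boards tend to ask for.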

Benefits of human involvement:

  • Accountability: Humans remain responsible for scientific conclusions.
  • Creativity: Humans excel at creative hypothesis generation and recognizing unexpected patterns.
  • Ethics: Humans provide ethical judgment and contextual understanding.
  • Error correction: Humans can catch agent mistakes before they propagate.

The future: As agentic AI matures, the human role will shift from executor to supervisor to collaborator. Scientists will spend less time on routine data analysis and more time on creative problem-solving, experimental design, and interpretation.

Conclusion: The Agentic Turn in Scientific Discovery

Agentic AI represents a fundamental shift in how we think about AI in science. Rather than passive tools that respond to queries, agentic systems are active participants in the scientific process — planning, executing, and iterating on research workflows with varying degrees of autonomy.

Key takeaways:

  1. Agentic AI is defined by capabilities: Perception, reasoning, planning, tool use, action, and memory distinguish agents from chatbots.

  2. The evolution has been iterative: From prompt engineering to RAG to function calling to autonomous agents, each stage addressed limitations of the previous approach.

  3. Frameworks enable development: LangChain, LangGraph, AutoGen, and CrewAI provide infrastructure for building agentic systems.

  4. Scientific agents exist today: ChemCrow and CoScientist demonstrate autonomous experimental design and execution in chemistry; biology-specific agents are emerging.

  5. Autonomy exists on a spectrum: Most scientific agents operate at Level 2–3 autonomy, with human oversight remaining essential.

  6. The LLM orchestrates domain tools: The architecture combines general LLM reasoning with specialized biological tools (AlphaFold, ESM, BLAST, etc.).

  7. Challenges remain: Hallucination, cost, validation, reproducibility, and literature review automation are active areas of research.

  8. Humans remain essential: The goal is amplification, not replacement — freeing scientists to focus on creativity and insight.

In Post 14, we will build on this foundation to articulate the Agentic Omics vision: integrating LLM reasoning with the domain-specific biological AI models we’ve surveyed throughout this series (AlphaFold, scGPT, ESM, DNABERT-2, and others) to create autonomous systems capable of end-to-end biological discovery.


Glossary

Agentic AI: AI systems capable of autonomous perception, reasoning, planning, tool use, action, and memory to execute multi-step workflows.
LLM Orchestrator: A large language model that coordinates specialized tools, determining when and how to invoke each tool and integrating their outputs.
ReAct Paradigm: A reasoning approach where agents alternate between Reasoning (thinking about what to do) and Acting (executing tools) in a loop.
Function Calling: An LLM capability to invoke external APIs and tools, returning structured outputs that can be integrated into responses.
RAG (Retrieval-Augmented Generation): A technique where relevant documents are retrieved from external databases and provided as context to an LLM, improving accuracy and reducing hallucination.
Multi-Agent System: A system where multiple AI agents interact, collaborate, or debate to solve complex problems that span multiple domains.
Human-in-the-Loop: An approach where humans actively review and approve agent actions before execution, maintaining oversight and accountability.
Human-on-the-Loop: An approach where humans monitor agent activity and intervene when necessary, but agents operate autonomously within defined parameters.
LangGraph: A graph-based agent orchestration framework that enables stateful, multi-agent workflows with loops, branching, and human-in-the-loop support.
Self-Driving Laboratory: A laboratory where AI agents design experiments, control robotic platforms, analyze results, and iterate with minimal human intervention.

References

[1] Schlosser, M. (2019). Agency. Stanford Encyclopedia of Philosophy.

[2] Huang, W. et al. (2024). Embodied Intelligence: A Survey of Agentic AI Systems. arXiv:2401.XXXXX.

[3] Wei, J. et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022.

[4] Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.

[5] Schick, T. et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS 2023.

[6] LangChain. (2025). LangChain Framework Documentation. https://github.com/langchain-ai/langchain

[7] LangGraph. (2025). LangGraph: Build Resilient Language Agents as Graphs. https://github.com/langchain-ai/langgraph

[8] Wu, Q. et al. (2023). AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. arXiv:2308.08155.

[9] CrewAI. (2025). CrewAI: Role-Based Agent Teams. https://github.com/joaomdmoura/crewAI

[10] Yao, S. et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023.

[11] LangChain Blog. (2025). LangChain and LangGraph Agent Frameworks Reach v1.0 Milestones. https://blog.langchain.com/langchain-langgraph-1dot0/

[12] Bran, A. et al. (2024). ChemCrow: Augmenting Large Language Models with Chemistry Tools. Nature Machine Intelligence, 6, 365–376.

[13] Boiko, D. et al. (2023). Autonomous Chemical Research with Large Language Models. Nature, 624, 570–578.

[14] Carnegie Mellon University. (2025). CMU Cloud Lab: Remote Access to Laboratory Automation. https://engineering.cmu.edu/news-events/news/2023/12/20-ai-coscientist.html

[15] Schmidgall, S. et al. (2025). Agent Laboratory: Autonomous Research Workflows. arXiv:2501.XXXXX.

[16] Gridach, M. et al. (2025). Agentic AI for Scientific Discovery: A Survey of Progress, Challenges, and Future Directions. arXiv:2503.08979.

[17] PMC. (2025). The Agentic Era: Why Biopharma Must Embrace Artificial Intelligence That Acts, Not Just Informs. PMC12048886.

[18] Wei, J. et al. (2025). From AI for Science to Agentic Science: A Survey on Autonomous Scientific Discovery. arXiv:2508.14111.


This is Post 13 of 24 in the “Agentic Omics: When AI Reads the Book of Life” series. Next: Post 14 — “The Agentic Omics Vision: LLMs Meet Domain-Specific AI”.