Awesome Agentic Reasoning Papers

[Awesome](https://awesome.re) [arXiv](https://arxiv.org/abs/2601.12538) [Coverage](https://x.com/wei_tianxin/status/2014133714976985538) [Hugging Face #1 Paper of the Day](https://huggingface.co/papers/2601.12538)

[License: MIT](https://opensource.org/licenses/MIT) [Contributions Welcome](https://github.com/weitianxin/Awesome-Agentic-Reasoning/blob/main/CONTRIBUTING.md)

This repository organizes research by thematic areas that integrate reasoning with action, including planning, tool use, search, self-evolution through memory and feedback, multi-agent systems, and real-world applications and benchmarks.

📄 Based on the survey: Agentic Reasoning for Large Language Models: A Survey

Figure: Framework overview

🔔 News

[01/21/26] 🚀 We have released a comprehensive survey on Agentic Reasoning for Large Language Models! The paper is now available on arXiv and Hugging Face. We welcome contributions from the community to help expand and improve the survey 🤗!

📋 Table of Contents

🏗️ Foundational Agentic Reasoning

🗺️ Planning Reasoning

🛠️ Tool-Use Optimization

🔍 Agentic Search

🧬 Self-evolving Agentic Reasoning

🔄 Agentic Feedback Mechanisms

🧠 Agentic Memory

🚀 Evolving Foundational Agentic Capabilities

👥 Collective Multi-agent Reasoning

🎭 Role Taxonomy of Multi-Agent Systems (MAS)

🤝 Collaboration and Division of Labor

🌱 Multi-Agent Memory and Evolution

🎨 Applications

💻 Math Exploration & Vibe Coding Agents

🔬 Scientific Discovery Agents

🤖 Embodied Agents

🏥 Healthcare & Medicine Agents

🌐 Autonomous Web Exploration & Research Agents

📊 Benchmarks

⚙️ Core Mechanisms of Agentic Reasoning

Tool Use

Memory and Planning

Multi-Agent System

🎯 Applications of Agentic Reasoning

Embodied Agents

Scientific Discovery Agents

Autonomous Research Agents

Medical and Clinical Agents

Web Agents

General Tool-Use Agents

🌟 Introduction

Agentic reasoning bridges thought and action through autonomous agents that reason, act, and learn via continual interaction with their environments. The goal is to enhance agent capabilities by grounding reasoning in action.

We organize agentic reasoning into three layers, each corresponding to a distinct reasoning paradigm under different environmental dynamics:

🔹 Foundational Reasoning. Core single-agent abilities (planning, tool use, search) in stable environments

🔹 Self-Evolving Reasoning. Adaptation through feedback, memory, and learning in dynamic settings

🔹 Collective Reasoning. Multi-agent coordination, role specialization, and collaborative intelligence

Across these layers, we further identify complementary reasoning paradigms defined by their optimization settings.

🔸 In-Context Reasoning. Test-time scaling through structured orchestration and adaptive workflows

🔸 Post-Training Reasoning. Behavior optimization via RL and supervised fine-tuning
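The two axes above (three layers, two optimization settings) can be pictured as a small grid. The following is a minimal sketch of that grid, not tooling shipped with this repository: the `Layer`, `Paradigm`, and `Entry` names and the `group_by_cell` helper are hypothetical, and the short paper names are abbreviations of titles listed in the sections below, placed according to the headings they appear under.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    """The three layers named in the introduction."""
    FOUNDATIONAL = "Foundational Reasoning"
    SELF_EVOLVING = "Self-Evolving Reasoning"
    COLLECTIVE = "Collective Reasoning"


class Paradigm(Enum):
    """The two optimization settings that cut across the layers."""
    IN_CONTEXT = "In-Context Reasoning"
    POST_TRAINING = "Post-Training Reasoning"


@dataclass(frozen=True)
class Entry:
    """One listed paper, tagged with a layer and a paradigm (hypothetical schema)."""
    title: str
    layer: Layer
    paradigm: Paradigm


# Placements follow the section headings of this list.
ENTRIES = [
    Entry("ReAct", Layer.FOUNDATIONAL, Paradigm.IN_CONTEXT),
    Entry("Search-R1", Layer.FOUNDATIONAL, Paradigm.POST_TRAINING),
    Entry("MemAgent", Layer.SELF_EVOLVING, Paradigm.POST_TRAINING),
    Entry("MetaGPT", Layer.COLLECTIVE, Paradigm.IN_CONTEXT),
    Entry("MASRouter", Layer.COLLECTIVE, Paradigm.POST_TRAINING),
]


def group_by_cell(entries):
    """Group paper titles by their (layer, paradigm) cell in the grid."""
    grid = defaultdict(list)
    for entry in entries:
        grid[(entry.layer, entry.paradigm)].append(entry.title)
    return dict(grid)


if __name__ == "__main__":
    for (layer, paradigm), titles in group_by_cell(ENTRIES).items():
        print(f"{layer.value} / {paradigm.value}: {', '.join(titles)}")
```

Some papers appear under several headings in this list, so a fuller version of this sketch would probably let an entry carry more than one cell.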

🤝 Contributing

This collection is an ongoing effort. We are actively expanding and refining its coverage, and welcome contributions from the community. You can:

Submit a pull request to add papers or resources

Open an issue to suggest additional papers or resources

Email us at [email protected], [email protected], [email protected]

We regularly update the repository to include new research.

📝 Citation

If you find this repository or paper useful, please consider citing the survey paper:

BIBTEX

@article{wei2026agentic,
  title={Agentic Reasoning for Large Language Models},
  author={Wei, Tianxin and Li, Ting-Wei and Liu, Zhining and Ning, Xuying and Yang, Ze and Zou, Jiaru and Zeng, Zhichen and Qiu, Ruizhong and Lin, Xiao and Fu, Dongqi and others},
  journal={arXiv preprint arXiv:2601.12538},
  year={2026}
}

🏗️ Foundational Agentic Reasoning

🗺️ Planning Reasoning

In-context Planning

In-Context Tool-Integration

Interleaving Reasoning and Tool Use

| Paper | Year |
| --- | --- |
| Chain-of-Thought Prompting Elicits Reasoning in Large Language Models | NeurIPS 2022 |
| ChatCoT: Tool-Augmented Chain-of-Thought Reasoning on Chat-based Large Language Models | EMNLP 2023 |
| MultiTool-CoT: GPT-3 Can Use Multiple External Tools with Chain of Thought Prompting | ACL 2023 |
| Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions | ACL 2023 |
| ReAct: Synergizing Reasoning and Acting in Language Models | ICLR 2023 |
| ART: Automatic Multi-step Reasoning and Tool-use for Large Language Models | 2023 |

Optimizing Context for Tool Interaction

| Paper | Year |
| --- | --- |
| Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models | 2023 |
| EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction | NAACL 2025 |
| GEAR: Augmenting Language Models with Generalizable and Efficient Tool Resolution | EACL 2024 |
| AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning | NeurIPS 2024 |

Post-training Tool-Integration

Bootstrapping of Tool Use via SFT

| Paper | Year |
| --- | --- |
| Toolformer: Language Models Can Teach Themselves to Use Tools | NeurIPS 2023 |
| ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs | ICLR 2024 |
| ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases | 2023 |
| Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | NeurIPS 2023 |
| RestGPT: Connecting Large Language Models with Real-World RESTful APIs | 2023 |
| ADaPT: As-Needed Decomposition and Planning with Language Models | 2023 |
| Agent Lumos: Unified and Modular Training for Open-Source Language Agents | 2023 |
| Learning to Use Tools via Cooperative and Interactive Agents | 2024 |
| Understanding the Effects of RLHF on LLM Generalisation and Diversity | 2023 |
| Preserving Diversity in Supervised Fine-Tuning of Large Language Models | 2024 |
| Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification | EMNLP 2024 |
| Transformer Copilot: Learning from The Mistake Log in LLM Fine-tuning | 2025 |
| iTool: Reinforced Fine-Tuning with Dynamic Deficiency Calibration for Advanced Tool Use | 2025 |
| START: Self-taught Reasoner with Tools | 2025 |

Mastery of Tool Use via RL

Orchestration-based Tool-Integration

Agentic Pipelines for Tool Orchestration

| Paper | Year |
| --- | --- |
| ToolPlanner: A Tool Augmented LLM for Multi Granularity Instructions with Path Planning and Feedback | 2025 |
| Advancing Tool-Augmented Large Language Models via Meta-Verification and Reflection Learning | KDD 2025 |
| OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning | 2025 |
| Chain-of-Tools: Utilizing Massive Unseen Tools in the CoT Reasoning of Frozen Language Models | 2025 |
| PyVision: Agentic Vision with Dynamic Tooling | 2025 |
| Learning to Use Tools via Cooperative and Interactive Agents | 2024 |
| El Agente: An Autonomous Agent for Quantum Chemistry | 2025 |

Tool Representations for Orchestration

| Paper | Year |
| --- | --- |
| ToolExpNet: Optimizing Multi-Tool Selection in LLMs with Similarity and Dependency-Aware Experience Networks | ACL (Findings) 2025 |
| T^2Agent: A Tool-augmented Multimodal Misinformation Detection Agent with Monte Carlo Tree Search | 2025 |
| ToolChain\*: Efficient Action Space Navigation in Large Language Models with A\* Search | 2023 |
| ToolRerank: Adaptive and Hierarchy-Aware Reranking for Tool Retrieval | COLING 2024 |

🔍 Agentic Search

In-Context Search

Interleaving Reasoning and Search

| Paper | Year |
| --- | --- |
| ReAct: Synergizing Reasoning and Acting in Language Models | ICLR 2023 |
| Measuring and Narrowing the Compositionality Gap in Language Models | 2022 |
| Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions | 2022 |
| Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection | NeurIPS Workshop 2023 |
| Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-Adaptive Planning Agent | 2024 |
| DeepRAG: Thinking to Retrieve Step by Step for Large Language Models | 2025 |
| MC-Search: Benchmarking Multimodal Agentic RAG with Structured Reasoning Chains | NeurIPS Workshop 2025 |

Structure-Enhanced Search

| Paper | Year |
| --- | --- |
| Agent-G: An Agentic Framework for Graph Retrieval Augmented Generation | 2025 |
| MC-Search: Benchmarking Multimodal Agentic RAG with Structured Reasoning Chains | NeurIPS Workshop 2025 |
| GeAR: Graph-Enhanced Agent for Retrieval-Augmented Generation | 2024 |
| Learning to Retrieve and Reason on Knowledge Graph through Active Self-Reflection | 2025 |

Post-Training Search

SFT-Based Agentic Search

| Paper | Year |
| --- | --- |
| Toolformer: Language Models Can Teach Themselves to Use Tools | NeurIPS 2023 |
| INTERS: Unlocking the Power of Large Language Models in Search with Instruction Tuning | 2024 |
| RAG-Studio: Towards In-Domain Adaptation of Retrieval Augmented Generation through Self-Alignment | EMNLP (Findings) 2024 |
| RAFT: Adapting Language Model to Domain Specific RAG | 2024 |
| Search-o1: Agentic search-enhanced large reasoning models | 2025 |
| RA-DIT: Retrieval-Augmented Dual Instruction Tuning | ICLR 2023 |
| SFR-RAG: Towards Contextually Faithful LLMs | 2024 |

RL-Based Agentic Search

| Paper | Year |
| --- | --- |
| WebGPT: Browser-assisted question-answering with human feedback | 2021 |
| RAG-RL: Advancing Retrieval-Augmented Generation via RL and Curriculum Learning | 2025 |
| Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning | 2025 |
| KBQA-R1: Reinforcing Large Language Models for Knowledge Base Question Answering | 2025 |
| DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-World Environments | 2025 |
| ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning | 2025 |
| ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding | 2025 |

🧬 Self-evolving Agentic Reasoning

🔄 Agentic Feedback Mechanisms

Reflective Feedback

| Paper | Year |
| --- | --- |
| Reflexion: Language Agents with Verbal Reinforcement Learning | NeurIPS 2023 |
| Self-Refine: Iterative Refinement with Self-Feedback | NeurIPS 2023 |
| Enable Language Models to Implicitly Learn Self-Improvement From Data | ICLR 2024 |
| A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve | TMLR 2025 |
| Tree of Thoughts: Deliberate Problem Solving with Large Language Models | NeurIPS 2023 |
| Graph of Thoughts: Solving Elaborate Problems with Large Language Models | AAAI 2024 |
| Zero-Shot Verification-Guided Chain of Thoughts | 2025 |
| ReAct: Synergizing Reasoning and Acting in Language Models | ICLR 2023 |
| WebGPT: Browser-assisted Question-Answering with Human Feedback | 2021 |
| MemGPT: Towards LLMs as Operating Systems | 2023 |
| Voyager: An Open-Ended Embodied Agent with Large Language Models | 2023 |

Parametric Adaptation

| Paper | Year |
| --- | --- |
| AgentTuning: Enabling Generalized Agent Abilities for LLMs | 2023 |
| ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent | 2023 |
| Re-ReST: Reflection-Reinforced Self-Training for Language Agents | 2024 |
| Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes | 2023 |
| Deep Reinforcement Learning from Human Preferences | NeurIPS 2017 |
| Direct Preference Optimization: Your Language Model is Secretly a Reward Model | NeurIPS 2023 |
| Constitutional AI: Harmlessness from AI Feedback | 2022 |
| ReflectEvo: Improving Meta Introspection of Small LLMs by Learning Self-Reflection | ACL (Findings) 2025 |

Validator-Driven Feedback

| Paper | Year |
| --- | --- |
| ReZero: Enhancing LLM search ability by trying one-more-time | 2025 |
| Are Retrials All You Need? Enhancing Large Language Model Reasoning Without Verbalized Feedback | 2025 |
| CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning | 2022 |
| LEVER: Learning to Verify Language-to-Code Generation with Execution | ICML 2023 |
| SWE-bench: Can Language Models Resolve Real-world Github Issues? | ICLR 2024 |
| Do As I Can, Not As I Say: Grounding Language in Robotic Affordances | CoRL 2022 |
| PaLM-E: An Embodied Multimodal Language Model | ICML 2023 |
| Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning | 2025 |

🧠 Agentic Memory

Agentic Use of Flat Memory

Factual Memory

| Paper | Year |
| --- | --- |
| Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks | NeurIPS 2020 |
| Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection | ICLR 2024 |
| MemoryBank: Enhancing Large Language Models with Long-Term Memory | 2023 |
| LlamaIndex | 2022 |
| MemGPT: Towards LLMs as Operating Systems | 2023 |
| RET-LLM: Towards a General Read-Write Memory for Large Language Models | 2023 |
| SCM: Enhancing Large Language Model with Self-Controlled Memory Framework | 2023 |
| Evaluating Very Long-Term Conversational Memory of LLM Agents | 2024 |
| LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory | 2024 |
| SELFGOAL: Your Language Agents Already Know How to Achieve High-level Goals | NAACL 2025 |
| FinMem: A Performance-Enhanced LLM Trading Agent with Layered Memory and Character Design | 2023 |
| A-mem: Agentic memory for llm agents | 2025 |
| In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents | 2025 |
| Zep: A Temporal Knowledge Graph Architecture for Agent Memory | 2025 |
| MIRIX: Multi-Agent Memory System for LLM-Based Agents | 2025 |
| MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models | 2025 |
| LightMem: Lightweight and Efficient Memory-Augmented Generation | 2025 |
| Nemori: Self-Organizing Agent Memory Inspired by Cognitive Science | 2025 |

Experience Memory

| Paper | Year |
| --- | --- |
| Agent Workflow Memory | 2024 |
| Sleep-time Compute: Beyond Inference Scaling at Test-time | 2025 |
| Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory | 2025 |
| Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models | 2025 |
| ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory | 2025 |
| Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory | 2025 |

Structured Use of Memory

| Paper | Year |
| --- | --- |
| RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph | 2024 |
| From Local to Global: A Graph RAG Approach to Query-Focused Summarization | 2024 |
| Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory | 2025 |
| Zep: A Temporal Knowledge Graph Architecture for Agent Memory | 2025 |
| From Isolated Conversations to Hierarchical Schemas: Dynamic Tree Memory Representation for LLMs | 2024 |
| AutoFlow: Automated Workflow Generation for Large Language Model Agents | 2024 |
| AFlow: Automating Agentic Workflow Generation | ICLR 2025 |
| FlowMind: Automatic Workflow Generation with LLMs | 2024 |
| Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory (M3-Agent) | 2025 |
| Agent-ScanKit: Unraveling Memory and Reasoning of Multimodal Agents via Sensitivity Perturbations | 2025 |
| Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks | NeurIPS 2024 |
| RAP: Retrieval-Augmented Planning with Contextual Memory for Multimodal LLM Agents | 2024 |

Post-training Memory Control

| Paper | Year |
| --- | --- |
| MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent | 2025 |
| MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents | 2025 |
| Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning | 2025 |
| Mem-alpha: Learning Memory Construction via Reinforcement Learning | 2025 |
| Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks | 2025 |
| Agent Learning via Early Experience | 2025 |
| Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents | 2026 |
| MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory | 2026 |

🚀 Evolving Foundational Agentic Capabilities

Self-evolving Planning

| Paper | Year |
| --- | --- |
| Self-challenging language model agents | 2025 |
| Self-rewarding language models | ICML 2024 |
| RLSR: Reinforcement Learning from Self Reward | 2025 |
| Self: Self-evolution with language feedback | 2023 |
| Training language models to self-correct via reinforcement learning | 2024 |
| TextGrad: Differentiable Text Feedback for Language Models | 2024 |
| AutoRule: Reasoning Chain-of-thought Extracted Rule-based Rewards Improve Preference Learning | 2025 |
| AgentGen: Enhancing Planning Abilities for Large Language Model based Agent via Environment and Task Generation | 2024 |
| Reflexion: Language agents with verbal reinforcement learning | NeurIPS 2023 |
| Adaplanner: Adaptive planning from feedback with language models | NeurIPS 2023 |
| Self-refine: Iterative refinement with self-feedback | NeurIPS 2023 |
| A self-improving coding agent | 2025 |
| Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning | 2025 |
| DYSTIL: Dynamic Strategy Induction with Large Language Models for Reinforcement Learning | 2025 |

Self-evolving Tool-use

| Paper | Year |
| --- | --- |
| Large Language Models as Tool Makers | ICLR 2024 |
| CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets | ICLR 2024 |
| CREATOR: Tool Creation for Disentangling Abstract and Concrete Reasoning of Large Language Models | EMNLP 2023 |
| LLM Agents Making Agent Tools | 2025 |

Self-evolving Search for Memory Retrieval

| Paper | Year |
| --- | --- |
| Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks | NeurIPS 2020 |
| Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection | ICLR 2024 |
| MemoryBank: Enhancing Large Language Models with Long-Term Memory | 2023 |
| MemGPT: Towards LLMs as Operating Systems | 2023 |
| Agent Workflow Memory | 2024 |
| Dynamic Cheatsheet: Test-time learning with adaptive memory | 2025 |
| Reflexion: Language agents with verbal reinforcement learning | NeurIPS 2023 |
| ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory | 2025 |
| Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models | 2025 |
| AutoFlow: Automated Workflow Generation for Large Language Model Agents | 2024 |
| AFlow: Automating Agentic Workflow Generation | ICLR 2025 |
| FlowMind: Automatic Workflow Generation with LLMs | 2024 |
| RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph | 2024 |
| From Local to Global: A Graph RAG Approach to Query-Focused Summarization | 2024 |
| Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory | 2025 |
| Zep: A Temporal Knowledge Graph Architecture for Agent Memory | 2025 |
| MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models | 2025 |
| Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks | 2025 |

👥 Collective Multi-agent Reasoning

🤝 Collaboration and Division of Labor

In-context Collaboration

Manually Crafted Pipelines

| Paper | Year |
| --- | --- |
| AgentOrchestra: A Hierarchical Multi-Agent Framework for General-Purpose Task Solving | 2025 |
| MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework | ICLR 2024 |
| SurgRAW: Multi-agent workflow with chain-of-thought reasoning for surgical intelligence | 2025 |
| Collab-RAG: Boosting retrieval-augmented generation for complex question answering via white-box and black-box llm collaboration | 2025 |
| MA-RAG: Multi-Agent Retrieval-Augmented Generation via Collaborative Chain-of-Thought Reasoning | 2025 |
| Chain of Agents: Large Language Models Collaborating on Long-Context Tasks | NeurIPS 2024 |
| AutoAgents: a framework for automatic agent generation | IJCAI 2024 |
| RAG-KG-IL: A Multi-Agent Hybrid Framework for Reducing Hallucinations and Enhancing LLM Reasoning | 2025 |
| SMoA: Improving Multi-agent Large Language Models with Sparse Mixture-of-Agents | 2024 |
| MDocAgent: A multi-modal multi-agent framework for document understanding | 2025 |

LLM-Driven Pipelines

| Paper | Year |
| --- | --- |
| AutoML-Agent: A multi-agent llm framework for full-pipeline automl | 2024 |
| Magentic-One: A generalist multi-agent system for solving complex tasks | 2024 |
| MAS-GPT: Training LLMs to build LLM-based multi-agent systems | 2025 |
| MetaAgent: Automatically Constructing Multi-Agent Systems Based on Finite State Machines | 2025 |
| Agent-oriented planning in multi-agent systems | 2024 |
| AgentRouter: A Knowledge-Graph-Guided LLM Router for Collaborative Multi-Agent Question Answering | 2025 |
| Talk to Right Specialists: Routing and planning in multi-agent system for question answering | 2025 |

Theory-of-Mind-Augmented Collaboration

| Paper | Year |
| --- | --- |
| Theory of mind for multi-agent collaboration via large language models | 2023 |
| Hypothetical Minds: Scaffolding theory of mind for multi-agent tasks with large language models | 2024 |
| MindForge: Empowering Embodied Agents with Theory of Mind for Lifelong Collaborative Learning | 2024 |
| How large language models encode theory-of-mind: a study on sparse parameter patterns | npj Artificial Intelligence 2025 |
| Large Language Models as Theory of Mind Aware Generative Agents with Counterfactual Reflection | 2025 |
| BeliefNest: A Joint Action Simulator for Embodied Agents with Theory of Mind | 2025 |

Post-training Collaboration

Multi-agent Prompt Optimization

| Paper | Year |
| --- | --- |
| AutoAgents: A Framework for Automatic Agent Generation | IJCAI 2024 |
| Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration | NAACL 2024 |
| DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines | 2023 |
| Multi-agent Design: Optimizing Agents with Better Prompts and Topologies | 2025 |
| Automatic Prompt Optimization with "Gradient Descent" and Beam Search | 2023 |

Graph-based Topology Generation

| Paper | Year |
| --- | --- |
| Learning Multi-Agent Communication from Graph Modeling Perspective | 2024 |
| G-Designer: Architecting Multi-agent Communication Topologies via Graph Neural Networks | 2024 |
| Graph Diffusion for Robust Multi-Agent Coordination | ICML 2025 |
| Cut the Crap: An Economical Communication Pipeline for LLM-based Multi-Agent Systems | 2024 |
| Adaptive Graph Pruning for Multi-Agent Communication | 2025 |
| G-Safeguard: A Topology-Guided Security Lens and Treatment on LLM-based Multi-Agent Systems | 2025 |
| AFlow: Automating Agentic Workflow Generation | ICLR 2025 |
| Multi-agent Design: Optimizing Agents with Better Prompts and Topologies | 2025 |
| Multi-Agent Architecture Search via Agentic Supernet | 2025 |
| DynaSwarm: Dynamically Graph Structure Selection for LLM-based Multi-Agent System | 2025 |
| GPTSwarm: Language Agents as Optimizable Graphs | ICML 2024 |

Policy-based Topology Generation

| Paper | Year |
| --- | --- |
| MASRouter: Learning to Route LLMs for Multi-Agent Systems | 2025 |
| RCR-Router: Efficient Role-Aware Context Routing for Multi-Agent LLM Systems with Structured Memory | 2025 |
| xRouter: Training Cost-Aware LLMs Orchestration System via Reinforcement Learning | 2025 |
| Optimal-Agent-Selection: State-Aware Routing Framework for Efficient Multi-Agent Collaboration | 2025 |
| LLM Collaboration with Multi-Agent Reinforcement Learning | 2025 |
| Heterogeneous Group-Based Reinforcement Learning for LLM-based Multi-Agent Systems | 2025 |
| Enhancing Multi-Agent Systems via Reinforcement Learning with LLM-based Planner and Graph-based Policy | 2025 |
| LAMARL: LLM-Aided Multi-Agent Reinforcement Learning for Cooperative Policy Generation | IEEE RA-L 2025 |
| MAPoRL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning | 2025 |
| Reflective Multi-Agent Collaboration Based on Large Language Models | NeurIPS 2024 |
| Sirius: Self-Improving Multi-Agent Systems via Bootstrapped Reasoning | 2025 |
| Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains | 2025 |
| M3HF: Multi-Agent Reinforcement Learning from Multi-Phase Human Feedback of Mixed Quality | 2025 |
| O-MAPL: Offline Multi-Agent Preference Learning | 2025 |

🌱 Multi-Agent Memory and Evolution

From Single-Agent Evolution to Multi-Agent Evolution

Intra-test-time Evolution

| Paper | Year |
| --- | --- |
| Reflexion: Language Agents with Verbal Reinforcement Learning | NeurIPS 2023 |
| Self-Refine: Iterative Refinement with Self-Feedback | NeurIPS 2023 |
| AdaPlanner: Adaptive Planning from Feedback with Language Models | NeurIPS 2023 |
| TrustAgent: Towards Safe and Trustworthy LLM-based Agents through Agent Constitution | TiFA 2024 |
| Self-Adapting Language Models | 2025 |
| TTRL: Test-Time Reinforcement Learning | 2025 |
| Ladder: Self-Improving LLMs through Recursive Problem Decomposition | 2025 |

Inter-test-time Evolution

| Paper | Year |
| --- | --- |
| Self: Self-Evolution with Language Feedback | 2023 |
| STaR: Bootstrapping Reasoning with Reasoning | NeurIPS 2022 |
| Reasoning Beyond Limits: Advances and Open Problems for LLMs | 2025 |
| RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning | 2025 |
| DYSTIL: Dynamic Strategy Induction with Large Language Models for Reinforcement Learning | 2025 |
| WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning | 2024 |
| Why do animals need shaping? A theory of task composition and curriculum learning | 2024 |
| SAGE: Self-evolving Agents with Reflective and Memory-augmented Abilities | Neurocomputing 2025 |
| MemInsight: Autonomous Memory Augmentation for LLM Agents | 2025 |
| Agent Workflow Memory | 2024 |

Multi-agent Evolution

| Paper | Year |
| --- | --- |
| Self: Self-Evolution with Language Feedback | 2023 |
| Training Language Models to Self-Correct via Reinforcement Learning | 2024 |
| TextGrad: Automatic "Differentiation" via Text | 2024 |
| REMA: Learning to Meta-Think for LLMs with Multi-Agent Reinforcement Learning | 2025 |
| Group-in-Group Policy Optimization for LLM Agent Training | 2025 |
| Agent Workflow Memory | 2024 |
| MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models | 2025 |
| Multi-agent Design: Optimizing Agents with Better Prompts and Topologies | 2025 |
| AFlow: Automating Agentic Workflow Generation | ICLR 2025 |
| Testing Advanced Driver Assistance Systems Using Multi-Objective Search and Neural Networks | ASE 2016 |
| Latent Collaboration in Multi-Agent Systems | 2025 |

Multi-agent Memory Management for Evolution

| Paper | Year |
| --- | --- |
| G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems | 2025 |
| Intrinsic Memory Agents: Heterogeneous Multi-Agent LLM Systems through Structured Contextual Memory | 2025 |
| LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning | 2025 |
| SEDM: Scalable Self-Evolving Distributed Memory for Agents | 2025 |
| Collaborative Memory: Multi-User Memory Sharing in LLM Agents with Dynamic Access Control | 2025 |
| Memory Sharing for Large Language Model based Agents | 2024 |
| MIRIX: Multi-Agent Memory System for LLM-Based Agents | 2025 |
| LEGOMem: Modular Procedural Memory for Multi-agent LLM Systems for Workflow Automation | 2025 |
| MAPLE: Multi-Agent Adaptive Planning with Long-Term Memory for Table Reasoning | ALTA 2025 |
| Lyfe Agents: Generative agents for low-cost real-time social interactions | 2023 |
| Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving | 2025 |

Training Multi-agent to Evolve

| Paper | Year |
| --- | --- |
| Multi-Agent Evolve: LLM Self-Improve through Co-evolution | 2025 |
| CoMAS: Co-Evolving Multi-Agent Systems via Interaction Rewards | 2025 |
| MARFT: Multi-Agent Reinforcement Fine-Tuning | 2025 |
| Stronger-MAS: Multi-Agent Reinforcement Learning for Collaborative LLMs | 2025 |
| MAPoRL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning | 2025 |
| MALT: Multi-Agent Learning from Trajectories | 2025 |
| MARS: Optimizing Dual-System Deep Research via Multi-Agent Reinforcement Learning | 2025 |
| Preference-Based Multi-Agent Reinforcement Learning: Data Coverage and Algorithmic Techniques | 2024 |
| The Alignment Waltz: Jointly Training Agents to Collaborate for Safety | 2025 |

🎨 Applications

💻 Math Exploration & Vibe Coding Agents

Foundational Agentic Reasoning

| Paper | Year |
| --- | --- |
| Advancing mathematics by guiding human intuition with AI | Nature 2021 |
| Solving olympiad geometry without human demonstrations | Nature 2024 |
| Mathematical discoveries from program search with large language models | Nature 2024 |
| Mathematical Exploration and Discovery at Scale | 2025 |
| Advancing geometry with AI: Multi-agent generation of polytopes | 2025 |
| Towards Robust Mathematical Reasoning | EMNLP 2025 |
| CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules | ICLR 2024 |
| Executable Code Actions Elicit Better LLM Agents | ICML 2024 |
| Knowledge-Aware Code Generation with Large Language Models | ICPC 2024 |
| CodePlan: Repository-level Coding using LLMs and Planning | FSE 2024 |
| Multi-stage guided code generation for Large Language Models | Eng. App. AI 2025 |
| CodeTree: Agent-Guided Tree Search for Code Generation with Large Language Models | 2024 |
| DotaMath: Decomposition of Thought with Code Assistance and Self-correction for Mathematical Reasoning | 2024 |
| Tree-of-Code: A Self-Growing Tree Framework for End-to-End Code Generation and Execution in Complex Tasks | ACL 2025 |
| CoRT: Code-integrated Reasoning within Thinking | 2025 |
| DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal | 2025 |
| Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search | NeurIPS 2024 |
| VerilogCoder: Autonomous Verilog Coding Agents with Graph-based Planning | AAAI 2025 |
| Guided Search Strategies in Non-Serializable Environments with Applications to Software Engineering Agents | ICML 2025 |
| An In-Context Learning Agent for Formal Theorem-Proving | COLM 2024 |
| Formal Mathematical Reasoning: A New Frontier in AI | 2024 |
| Generative Modelling for Mathematical Discovery | 2025 |
| Toolformer: Language Models Can Teach Themselves to Use Tools | NeurIPS 2023 |
| ToolCoder: Teach Code Generation Models to use API search tools | 2023 |
| ToolGen: Unified Tool Retrieval and Calling via Generation | ICLR 2025 |
| CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges | ACL 2024 |
| ROCODE: Integrating Backtracking Mechanism and Program Analysis in Large Language Models for Code Generation | ICSE 2025 |
| CodeTool: Enhancing Programmatic Tool Invocation of LLMs via Process Supervision | 2025 |
| RepoHyper: Better Context Retrieval is All You Need for Repository-Level Code Completion | 2024 |
| CodeNav: Beyond Tool-Use to Using Real-World Codebases with LLM Agents | ICLR 2024 |
| Optimizing Code Runtime Performance Through Context-Aware Retrieval-Augmented Generation | ICPC 2025 |
| Knowledge Graph Based Repository-Level Code Generation | LLM4Code 2025 |
| cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree | 2025 |

Self-evolving Agentic Reasoning

| Paper | Year |
| --- | --- |
| Evaluating Language Models for Mathematics through Interactions | PNAS 2024 |
| CLCL: Non-compositional Expression Detection with Contrastive Learning and Curriculum Learning | ACL 2023 |
| Is Self-Repair a Silver Bullet for Code Generation? | 2024 |
| LeDeX: Learning to Debug with Execution Feedback | NeurIPS 2024 |
| Self-Refine: Iterative Refinement with Self-Feedback | NeurIPS 2023 |
| A Self-Iteration Code Generation Method Based on Large Language Models | ICPADS 2023 |
| Teaching Large Language Models to Self-Debug | ICLR 2024 |
| Self-Collaboration Code Generation via ChatGPT | TOSEM 2024 |
| L2MAC: Large Language Model Automatic Computer for Extensive Code Generation | 2023 |
| Cogito, Ergo Sum: A Neurobiologically-Inspired Cognition-Memory-Growth System for Code Generation | 2025 |

Collective Multi-agent Reasoning

| Paper | Year |
| --- | --- |
| AgentCoder: Multi-Agent-Based Code Generation with Iterative Testing and Optimisation | 2023 |
| A Pair Programming Framework for Code Generation via Multi-Plan Exploration and Feedback-Driven Refinement | ASE 2024 |
| SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model Agents | ICSE 2025 |
| Self-Organized Agents: A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization | 2024 |
| MapCoder: Multi-Agent Code Generation for Competitive Problem Solving | 2024 |
| AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing | 2024 |
| QualityFlow: An Agentic Workflow for Program Synthesis Controlled by LLM Quality Checks | 2025 |
| SEW: Self-Evolving Agentic Workflows for Automated Code Generation | 2025 |
| Self-Evolving Multi-Agent Collaboration Networks for Software Development | 2024 |
| Lingma SWE-GPT: An Open Development-Process-Centric Language Model for Automated Software Improvement | 2024 |
| CodeCoR: An LLM-based Self-Reflective Multi-Agent Framework for Code Generation | 2025 |
| SyncMind: Measuring Agent Out-of-Sync Recovery in Collaborative Software Engineering | ICML 2025 |
| Hallucination to Consensus: Multi-Agent LLMs for End-to-End Test Generation | 2025 |

🔬 Scientific Discovery Agents

Foundational Agentic Reasoning

| Paper | Year |
| --- | --- |
| ProtAgents: Protein discovery via large language model multi-agent collaborations combining physics and machine learning | Digital Discovery 2024 |
| Agent-based learning of materials datasets from the scientific literature | Digital Discovery 2024 |
| ReAct: Synergizing Reasoning and Acting in Language Models | ICLR 2023 |
| Biomni: A General-Purpose Biomedical AI Agent | bioRxiv 2025 |
| SciAgent: Tool-augmented Language Models for Scientific Reasoning | 2024 |
| ChemCrow: Augmenting large-language models with chemistry tools | 2023 |
| CACTUS: Chemistry Agent Connecting Tool-Usage to Science | ACS Omega 2024 |
| ChemToolAgent: The Impact of Tools on Language Agents for Chemistry Problem Solving | 2024 |
| CheMatAgent: Enhancing LLMs for Chemistry and Materials Science through Tree-Search Based Tool Learning | 2025 |
| TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of Tools | 2025 |
| AgentMD: Empowering language agents for risk prediction with large-scale clinical tool learning | Nature Communications 2025 |
| LLaMP: Large Language Model Made Powerful for High-fidelity Materials Knowledge Retrieval and Distillation | 2024 |
| HoneyComb: A Flexible LLM-Based Agent System for Materials Science | 2024 |
| CRISPR-GPT for Agentic Automation of Gene-editing Experiments | 2024 |
| PharmAgents: Building a Virtual Pharma with Large Language Model Agents | 2025 |
| ORGANA: A robotic assistant for automated chemistry experimentation and characterization | Matter 2025 |
| AtomAgents: Alloy design and discovery through physics-aware multi-modal multi-agent artificial intelligence | 2024 |
| Chemist-X: Large Language Model-empowered Agent for Reaction Condition Recommendation in Chemical Synthesis | 2024 |
| LLM and Simulation as Bilevel Optimizers: A New Paradigm to Advance Physical Scientific Discovery | 2024 |
| CellAgent: LLM-Driven Multi-Agent Framework for Natural Language-Based Single-Cell Analysis | bioRxiv 2024 |
| BioDiscoveryAgent: An AI Agent for Designing Genetic Perturbation Experiments | 2024 |
| DrugAgent: Multi-Agent Large Language Model-Based Reasoning for Drug-Target Interaction Prediction | 2024 |
| Accelerating Scientific Research Through a Multi-LLM Framework | 2025 |
| The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search | 2025 |
| Large Language Models are Zero Shot Hypothesis Proposers | 2023 |
| PaperQA: Retrieval-Augmented Generative Agent for Scientific Research | 2023 |
| Language agents achieve superhuman synthesis of scientific knowledge | 2024 |

Self-evolving Agentic Reasoning

| Paper | Year |
| --- | --- |
| ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning | 2025 |
| Accelerated Inorganic Materials Design with Generative AI Agents | 2025 |
| LLM and Simulation as Bilevel Optimizers: A New Paradigm to Advance Physical Scientific Discovery | 2024 |
| ChemReasoner: Heuristic Search over a Large Language Model's Knowledge Space using Quantum-Chemical Feedback | 2024 |
| LLMatDesign: Autonomous Materials Discovery with Large Language Models | 2024 |
| Hypothesis Generation for Materials Discovery and Design Using Goal-Driven and Constraint-Guided LLM Agents | 2025 |

Collective multi-agent reasoning

| Paper | Year |
| --- | --- |
| ProtAgents: Protein discovery via large language model multi-agent collaborations combining physics and machine learning | Digital Discovery 2024 |
| PiFlow: Principle-aware Scientific Discovery with Multi-Agent Collaboration | 2025 |
| AtomAgents: Alloy design and discovery through physics-aware multi-modal multi-agent artificial intelligence | 2024 |
| CellAgent: LLM-Driven Multi-Agent Framework for Natural Language-Based Single-Cell Analysis | bioRxiv 2024 |
| Accelerating Scientific Research Through a Multi-LLM Framework | 2025 |
| Toward a Team of AI-made Scientists for Scientific Discovery from Gene Expression Data | 2024 |
| The Virtual Lab: AI agents design new SARS-CoV-2 nanobodies with experimental validation | bioRxiv 2024 |

🤖 Embodied Agents

Foundational Agentic Reasoning

Self-evolving Agentic Reasoning

| Paper | Year |
| --- | --- |
| LLM-Empowered Embodied Agent for Memory-Augmented Task Planning in Household Robotics | 2024 |
| Optimus-1: Hybrid Multimodal Memory Empowered Agents for Long-Horizon Tasks in Minecraft | 2024 |
| Open-Ended Instructable Embodied Agents with Memory-Augmented Large Language Models | EMNLP 2023 |
| Endowing Embodied Agents with Spatial Reasoning Capabilities for Vision-and-Language Navigation | 2025 |
| Context-Aware Planning and Environment-Aware Memory for Instruction Following Embodied Agents | 2024 |
| Ella: Embodied Social Agents with Lifelong Memory | 2025 |
| Chat with the Environment: Interactive Multimodal Perception using Large Language Models | IROS 2023 |
| From Strangers to Assistants: Fast Desire Alignment for Embodied Agent-User Adaptation | 2025 |
| Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners | CoRL 2023 |
| Octopus: Embodied Vision-Language Programmer from Environmental Feedback | ECCV 2024 |
| MindForge: Empowering Embodied Agents with Theory of Mind for Lifelong Collaborative Learning | 2024 |
| Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration | 2024 |
| EMAC+: Embodied Multimodal Agent for Collaborative Planning with VLM+LLM | 2025 |
| Voyager: An Open-Ended Embodied Agent with Large Language Models | 2023 |

Collective multi-agent reasoning

| Paper | Year |
| --- | --- |
| Smart-LLM: Smart Multi-Agent Robot Task Planning with Large Language Models | 2024 |
| CaPo: Cooperative Plan Optimization for Multi-Agent Collaboration | 2024 |
| COHERENT: Collaboration of Heterogeneous Multi-Robot System with Large Language Models | ICRA 2025 |
| Theory of mind for multi-agent collaboration via large language models | 2023 |
| How large language models encode theory-of-mind: a study on sparse parameter patterns | npj Artificial Intelligence 2025 |
| Hypothetical Minds: Scaffolding theory of mind for multi-agent tasks with large language models | 2024 |
| MindForge: Empowering Embodied Agents with Theory of Mind for Lifelong Collaborative Learning | 2024 |
| EMAC+: Embodied Multimodal Agent for Collaborative Planning with VLM+LLM | 2025 |
| COMBO: Compositional World Models for Embodied Multi-Agent Cooperation | 2025 |
| VIKI-R: A VLM-Based Reinforcement Learning Approach for Heterogeneous Multi-Agent Cooperation | 2025 |
| RoCo: Dialectic Multi-Robot Collaboration with Large Language Models | 2024 |

🏥 Healthcare & Medicine Agents

Foundational agentic reasoning

| Paper | Year |
| --- | --- |
| Development and validation of an autonomous artificial intelligence agent for clinical decision-making in oncology | Nature Medicine 2024 |
| EHRAgent: Code Empowers Large Language Models for Complex Tabular Reasoning on Electronic Health Records | 2024 |
| PathFinder: A Multi-Modal Multi-Agent System for Medical Diagnostic Decision-Making Applied to Histopathology | 2025 |
| MedAgent-Pro: Towards Evidence-based Multi-modal Medical Diagnosis via Reasoning Agentic Workflow | 2025 |
| MedOrch: Medical Diagnosis with Tool-Augmented Reasoning Agents for Flexible Extensibility | 2025 |
| ClinicalAgent: Clinical Trial Multi-Agent System with Large Language Model-based Reasoning | 2024 |
| DynamiCare: A Dynamic Multi-Agent Framework for Interactive and Open-Ended Medical Decision-Making | 2025 |
| TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of Tools | 2025 |
| AgentMD: Empowering language agents for risk prediction with large-scale clinical tool learning | Nature Communications 2025 |
| Large language model agents can use tools to perform clinical calculations | npj Digital Medicine 2025 |
| MeNTi: Bridging Medical Calculator and LLM Agent with Nested Tool Calling | 2024 |
| MMedAgent: Learning to Use Medical Tools with Multi-modal Agents | 2024 |
| VoxelPrompt: A Vision Agent for End-to-End Medical Image Analysis | 2024 |
| Enhancing Surgical Robots with Embodied Intelligence for Autonomous Ultrasound Scanning | 2024 |
| Adaptive Reasoning and Acting in Medical Language Agents | 2024 |
| MedRAX: Medical Reasoning Agent for Chest X-ray | 2025 |
| Conversational Health Agents: A Personalized LLM-Powered Agent Framework | 2023 |
| MedAgentGym: A Scalable Agentic Training Environment for Code-Centric Reasoning in Biomedical Data Science | 2025 |
| Simulated patient systems powered by large language model-based AI agents offer potential for transforming medical education | 2024 |
| Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions | MICCAI 2025 |
| RAG-Enhanced Collaborative LLM Agents for Drug Discovery | 2025 |
| MedReason: Eliciting Factual Medical Reasoning Steps in LLMs via Knowledge Graphs | 2025 |

Self-evolving agentic reasoning

| Paper | Year |
| --- | --- |
| Epidemic Modeling with Generative Agents | 2023 |
| Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions | MICCAI 2025 |
| EHRAgent: Code Empowers Large Language Models for Complex Tabular Reasoning on Electronic Health Records | 2024 |
| LLMs Can Simulate Standardized Patients via Agent Coevolution | 2024 |
| Simulated patient systems powered by large language model-based AI agents offer potential for transforming medical education | 2024 |
| MedOrch: Medical Diagnosis with Tool-Augmented Reasoning Agents for Flexible Extensibility | 2025 |
| DynamiCare: A Dynamic Multi-Agent Framework for Interactive and Open-Ended Medical Decision-Making | 2025 |
| MedAgentGym: A Scalable Agentic Training Environment for Code-Centric Reasoning in Biomedical Data Science | 2025 |
| MeNTi: Bridging Medical Calculator and LLM Agent with Nested Tool Calling | 2025 |
| Large language model agents can use tools to perform clinical calculations | npj Digital Medicine 2025 |

Collective multi-agent reasoning

| Paper | Year |
| --- | --- |
| MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making | 2024 |
| DoctorAgent-RL: A Multi-Agent Collaborative Reinforcement Learning System for Multi-Turn Clinical Dialogue | 2025 |
| Beyond Direct Diagnosis: LLM-based Multi-Specialist Agent Consultation for Automatic Diagnosis | 2024 |
| ClinicalAgent: Clinical Trial Multi-Agent System with Large Language Model-based Reasoning | 2024 |
| PathFinder: A Multi-Modal Multi-Agent System for Medical Diagnostic Decision-Making Applied to Histopathology | 2025 |
| Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions | MICCAI 2025 |
| LLMs Can Simulate Standardized Patients via Agent Coevolution | 2024 |
| DynamiCare: A Dynamic Multi-Agent Framework for Interactive and Open-Ended Medical Decision-Making | 2025 |
| MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning | 2024 |
| RAG-Enhanced Collaborative LLM Agents for Drug Discovery | 2025 |
| GMAI-VL-R1: Harnessing Reinforcement Learning for Multi-Modal Medical Reasoning | 2025 |

🌐 Autonomous Web Exploration & Research Agents

Foundational agentic reasoning

Self-evolving agentic reasoning

| Paper | Year |
| --- | --- |
| Agent Workflow Memory | 2024 |
| VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs of Thought | 2024 |
| BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions | 2025 |
| AutoWebGLM: A Large Language Model-based Web Navigating Agent | 2024 |
| AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents | 2024 |
| LiteWebAgent: The Open-Source Suite for VLM-Based Web-Agent Applications | 2025 |
| WebDancer: Towards Automated Web Information Seeking with Large Language Model Agents | 2025 |
| WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization | 2025 |
| Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation | 2023 |
| MobA: Multifaceted Memory-Enhanced Adaptive Planning for Efficient Mobile Task Automation | 2024 |
| Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks | 2025 |
| Agent Laboratory: Using LLM Agents as Research Assistants | 2025 |
| GPT Researcher | 2023 |
| Chain of Ideas: Revolutionizing Research Via Novel Idea Development with LLM Agents | 2024 |
| The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search | 2025 |
| Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents | 2024 |
| Reflection-Based Memory For Web navigation Agents | 2025 |
| Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems | 2024 |
| Recon-Act: A Self-Evolving Multi-Agent Browser-Use System via Web Reconnaissance, Tool Generation, and Task Execution | 2025 |
| WINELL: Wikipedia Never-Ending Updating with LLM Agents | 2025 |
| WebSeer: Training Deeper Search Agents through Reinforcement Learning with Self-Reflection | 2025 |
| GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior | 2025 |
| History-Aware Reasoning for GUI Agents | 2025 |
| MobileUse: A GUI Agent with Hierarchical Reflection for Autonomous Mobile Operation | 2025 |
| InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection | 2025 |
| CycleResearcher: Improving Automated Research via Automated Review | 2024 |
| MLR-Copilot: Autonomous Machine Learning Research based on Large Language Model Agents | 2024 |
| Dolphin: Moving Towards Closed-loop Auto-research through Thinking, Practice, and Feedback | 2025 |
| DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments | 2025 |

Collective multi-agent reasoning

| Paper | Year |
| --- | --- |
| WebPilot: A Versatile and Autonomous Multi-Agent System for Web Task Execution with Strategic Exploration | 2024 |
| WINELL: Wikipedia Never-Ending Updating with LLM Agents | 2025 |
| Recon-Act: A Self-Evolving Multi-Agent Browser-Use System via Web Reconnaissance, Tool Generation, and Task Execution | 2025 |
| Proposer-Agent-Evaluator(PAE): Autonomous Skill Discovery For Foundation Model Internet Agents | 2024 |
| Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems | 2024 |
| Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks | 2025 |
| Agentic Web: Weaving the Next Web with AI Agents | 2025 |
| CoLA: Collaborative Low-Rank Adaptation | 2025 |
| Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration | ACL 2024 |
| Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks | 2025 |
| Mobile-Agent-V: A Video-Guided Approach for Effortless and Efficient Operational Knowledge Injection in Mobile Automation | 2025 |
| MobileExperts: Orchestrating Tool-Capable Specialists for Mobile Automation | 2024 |
| Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use | 2025 |
| PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC | 2025 |
| AgentRxiv: Towards Collaborative Autonomous Research | 2025 |
| Accelerating Scientific Research Through a Multi-LLM Framework | 2025 |
| Large Language Models are Zero-Shot Reasoners | NeurIPS 2022 |
| Emergent autonomous scientific research capabilities of large language models | Nature 2023 |
| Toward a Team of AI-made Scientists for Scientific Discovery from Gene Expression Data | 2024 |

📊 Benchmarks

!bench

⚙️ Core Mechanisms of Agentic Reasoning

Tool Use

Single-Turn Tool Use

| Paper | Year |
| --- | --- |
| ToolQA: A Dataset for LLM Question Answering with External Tools | NeurIPS 2023 |
| Gorilla: Large Language Model Connected with Massive APIs | 2023 |
| ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs | ICLR 2024 |
| MetaTool: A Benchmark for Controlling Special-purpose Large Language Models | ICLR 2024 |
| T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step | ACL 2024 |
| GTA: A Benchmark for General Tool Agents | NeurIPS 2024 |
| Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models | 2025 |

Multi-Turn Tool Use

| Paper | Year |
| --- | --- |
| ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases | 2023 |
| On the Tool Manipulation Capability of Open-source Large Language Models | 2023 |
| API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs | EMNLP 2023 |
| Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios | ACL 2024 |
| MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models | ICLR 2025 |

Search

Memory and Planning

Long-Horizon Episodic Memory

| Paper | Year |
| --- | --- |
| PerLTQA: A Persona-based Long-term Memory Benchmark for RAG | 2024 |
| ELITR-Bench: A Meeting Assistant Benchmark for Long-Context LLMs | 2024 |
| Multi-IF: A Benchmark for Multi-turn Instruction Following | 2024 |
| MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs | 2025 |
| TurnBench-MS: A Benchmark for Evaluating Multi-Turn, Multi-Step Reasoning in Large Language Models | 2025 |
| StoryBench: A Dynamic Benchmark for Evaluating Long-Term Memory with Multi Turns | 2025 |
| MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation | 2025 |

Multi-session Recall

| Paper | Year |
| --- | --- |
| Evaluating Very Long-Term Conversational Memory of LLM Agents | 2024 |
| MemSim: A Bayesian Simulator for Evaluating Memory of LLM-based Personal Assistants | 2024 |
| LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory | 2024 |
| REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation | 2025 |
| Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions | 2025 |
| Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents | 2026 |
| Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory | 2025 |

Planning and Feedback

| Paper | Year |
| --- | --- |
| ALFWorld: Aligning Text and Embodied Environments for Interactive Learning | ICLR 2021 |
| PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change | NeurIPS 2022 |
| ACPBench: Reasoning about Action, Change, and Planning | 2024 |
| Text2World: Benchmarking Large Language Models for Symbolic World Model Generation | ACL 2025 |
| REALM-Bench: A Benchmark for Evaluating Multi-Agent Systems on Real-world, Dynamic Planning and Scheduling Tasks | 2025 |
| TravelPlanner: A Benchmark for Real-World Planning with Language Agents | ICML 2024 |
| FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents | 2024 |
| UrbanPlanBench: A Comprehensive Urban Planning Benchmark for Evaluating Large Language Models | 2025 |

Multi-Agent System

Game-based reinforcement learning evaluation

| Paper | Year |
| --- | --- |
| MAgent: A Many-Agent Reinforcement Learning Platform for Artificial Collective Intelligence | AAAI 2018 |
| Pommerman: A Multi-Agent Playground | 2018 |
| The StarCraft Multi-Agent Challenge | NeurIPS 2019 |
| MineLand: Simulating Large-Scale Multi-Agent Interactions with Limited Multimodal Senses and Physical Needs | 2024 |
| TeamCraft: A Benchmark for Multi-Modal Multi-Agent Systems in Minecraft | 2024 |
| Scalable Evaluation of Multi-Agent Reinforcement Learning with Melting Pot | ICML 2021 |
| BenchMARL: Benchmarking Multi-Agent Reinforcement Learning | 2023 |
| Arena: A General Evaluation Platform and Building Toolkit for Multi-Agent Intelligence | AAAI 2020 |

Simulation-centric real-world assessment

| Paper | Year |
| --- | --- |
| SMARTS: Scalable Multi-Agent Reinforcement Learning Training School for Autonomous Driving | CoRL 2020 |
| Nocturne: a scalable driving benchmark for bringing multi-agent learning one step closer to the real world | NeurIPS 2022 |
| A Versatile Multi-Agent Reinforcement Learning Benchmark for Inventory Management | 2023 |
| IMP-MARL: a Suite of Environments for Infrastructure Management Planning with Multi-Agent Reinforcement Learning | NeurIPS 2023 |
| POGEMA: Partially Observable Grid Environment for Multiple Agents | arXiv 2022 |
| IntersectionZoo: Eco-driving for Benchmarking Multi-Agent Contextual Reinforcement Learning | NeurIPS 2024 |
| REALM-Bench: A Benchmark for Evaluating Multi-Agent Systems on Real-world, Dynamic Planning and Scheduling Tasks | 2025 |

Language, Communication, and Social Reasoning

| Paper | Year |
| --- | --- |
| LLM-Coordination: Evaluating and Analyzing Multi-agent Coordination Abilities in Large Language Models | 2023 |
| AvalonBench: Evaluating LLMs Playing the Game of Avalon | 2023 |
| Welfare Diplomacy: Benchmarking Language Model Cooperation | 2023 |
| MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration | EMNLP 2024 |
| BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems | 2024 |
| COMMA: A Benchmark for Inter-Agent Communication in Multi-Agent Systems | 2024 |
| IntellAgent: A Benchmark for Evaluating Conversational Agents in Realistic Scenarios | 2025 |
| MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents | 2025 |

🎯 Applications of Agentic Reasoning

Embodied Agents

| Paper | Year |
| --- | --- |
| Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks | 2025 |
| BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games | NeurIPS 2024 |
| ALFWorld: Aligning Text and Embodied Environments for Interactive Learning | ICLR 2021 |
| Understanding the Weakness of Large Language Model Agents within a Complex Android Environment | 2024 |
| MindAgent: Emergent Gaming Interaction | 2023 |
| Playing repeated games with Large Language Models | 2023 |
| OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments | NeurIPS 2024 |

Scientific Discovery Agents

| Paper | Year |
| --- | --- |
| DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents | NeurIPS 2024 |
| ScienceWorld: Is your Agent Smarter than a 5th Grader? | EMNLP 2022 |
| ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery | NeurIPS 2024 |
| The AI Scientist: Fully Automated Open-Ended Scientific Discovery | 2024 |
| LAB-Bench: Measuring Capabilities of Language Models for Biology Research | 2024 |
| MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation | 2023 |

Autonomous Research Agents

| Paper | Year |
| --- | --- |
| WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? | ICML 2024 |
| WorkArena++: Towards Agents that Act Like Employees | 2024 |
| OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation | 2024 |
| PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change | NeurIPS 2022 |
| FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents | 2024 |
| ACPBench: Reasoning about Action, Change, and Planning | 2024 |
| TRAIL: Trace Reasoning and Agentic Issue Localization | 2025 |
| CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization | NeurIPS 2023 |
| Agent-as-a-Judge: Evaluate Agents with Agents | 2024 |
| InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation | 2025 |

Medical and Clinical Agents

| Paper | Year |
| --- | --- |
| AgentClinic: a multimodal agent benchmark for clinical environments | NeurIPS 2024 |
| MedAgentBench: A Virtual EHR Environment to Benchmark Medical LLM Agents | NEJM AI 2025 |
| EHRAgent: Code Empowers Large Language Models for Complex Tabular Reasoning on Electronic Health Records | 2024 |
| MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning | 2023 |
| GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning | 2024 |

Web Agents

| Paper | Year |
| --- | --- |
| WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents | NeurIPS 2022 |
| WebArena: A Realistic Web Environment for Building Autonomous Agents | ICLR 2024 |
| OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments | NeurIPS 2024 |
| AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents | ACL 2024 |
| WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? | 2024 |
| VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks | NeurIPS 2024 |
| WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models | ACL 2024 |
| Mind2Web: Towards a Generalist Agent for the Web | NeurIPS 2023 |
| Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge | 2025 |
| WebCanvas: Benchmarking Web Agents in Online Canvas | NeurIPS 2024 |
| Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks | 2025 |
| VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding? | 2024 |
| WebLINX: Real-World Website Navigation with Multi-Turn Dialogue | CVPR 2024 |
| LASER: LLM Agent with State-Space Exploration for Web Navigation | NeurIPS 2023 |
| AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Agent for Automated Web Navigation | 2024 |
| OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web | 2024 |
| BEARCUBS: A benchmark for computer-using web agents | 2025 |
| BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents | 2025 |
| BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese | 2025 |
| Video-Browser: Towards Agentic Open-web Video Browsing | 2025 |

General Tool-Use Agents

| Paper | Year |
| --- | --- |
| GTA: A Benchmark for General Tool Agents | NeurIPS 2024 |
| NESTFUL: A Benchmark for Evaluating LLMs on Nested Sequences of API Calls | 2024 |
| Executable Code Actions Elicit Better LLM Agents | ICML 2024 |
| RestGPT: Connecting Large Language Models with Real-World RESTful APIs | 2023 |
| Search-o1: Agentic Search-Enhanced Large Reasoning Models | 2025 |
| Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning | 2025 |
| ActionReasoningBench: Reasoning about Actions with and without Ramification Constraints | 2024 |
| R-Judge: Benchmarking Safety-Critical Decision Making for LLM Agents | 2024 |

License

This repository is licensed under the MIT License.

Star History

![Star History Chart](https://star-history.com/#weitianxin/Awesome-Agentic-Reasoning&Date)

About

A curated and comprehensive collection of research papers, surveys and resources on agentic reasoning — focusing on how large language models and intelligent agents integrate reasoning with action, planning, tool use, search, memory, self-evolution and multi-agent coordination. This repository organizes key works and thematic areas from foundational concepts to real-world applications in AI-driven agentic workflows.

