Awesome Agentic Reasoning Papers
   
This repository organizes research by thematic areas that integrate reasoning with action, including planning, tool use, search, self-evolution through memory and feedback, multi-agent systems, and real-world applications and benchmarks.
Based on the survey: Agentic Reasoning for Large Language Models: A Survey
News
[01/21/26] We have released a comprehensive survey on Agentic Reasoning for Large Language Models! The paper is now available on arXiv and Hugging Face. We welcome contributions from the community to help expand and improve our survey!
Introduction
Agentic reasoning bridges thought and action through autonomous agents that reason, act, and learn via continual interaction with their environments. The goal is to enhance agent capabilities by grounding reasoning in action.
We organize agentic reasoning into three layers, each corresponding to a distinct reasoning paradigm under different environmental dynamics:
🔹 Foundational Reasoning. Core single-agent abilities (planning, tool-use, search) in stable environments
🔹 Self-Evolving Reasoning. Adaptation through feedback, memory, and learning in dynamic settings
🔹 Collective Reasoning. Multi-agent coordination, role specialization, and collaborative intelligence
Across these layers, we further identify complementary reasoning paradigms defined by their optimization settings.
🔸 In-Context Reasoning. Test-time scaling through structured orchestration and adaptive workflows
🔸 Post-Training Reasoning. Behavior optimization via RL and supervised fine-tuning
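The two axes above (three layers, two optimization paradigms) can be read as a grid in which each paper occupies one cell. The following is a minimal sketch of that reading, not part of the survey or this repository's tooling: the names `Layer`, `Paradigm`, `Entry`, and `group_by_cell` are hypothetical, and the four sample placements simply mirror the sections in which those papers are listed below.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    """The three layers named in the introduction."""
    FOUNDATIONAL = "Foundational Reasoning"
    SELF_EVOLVING = "Self-Evolving Reasoning"
    COLLECTIVE = "Collective Reasoning"


class Paradigm(Enum):
    """The two optimization settings named in the introduction."""
    IN_CONTEXT = "In-Context Reasoning"
    POST_TRAINING = "Post-Training Reasoning"


@dataclass(frozen=True)
class Entry:
    """Hypothetical record for one listed paper."""
    title: str
    layer: Layer
    paradigm: Paradigm


def group_by_cell(entries):
    """Group paper titles by their (layer, paradigm) cell, keeping input order."""
    cells = {}
    for entry in entries:
        cells.setdefault((entry.layer, entry.paradigm), []).append(entry.title)
    return cells


# Illustrative placements only, following the section each paper appears in below.
ENTRIES = [
    Entry("ReAct: Synergizing Reasoning and Acting in Language Models",
          Layer.FOUNDATIONAL, Paradigm.IN_CONTEXT),
    Entry("Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning",
          Layer.FOUNDATIONAL, Paradigm.POST_TRAINING),
    Entry("MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent",
          Layer.SELF_EVOLVING, Paradigm.POST_TRAINING),
    Entry("MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework",
          Layer.COLLECTIVE, Paradigm.IN_CONTEXT),
]

if __name__ == "__main__":
    for (layer, paradigm), titles in group_by_cell(ENTRIES).items():
        print(f"{layer.value} / {paradigm.value}: {len(titles)} paper(s)")
```

Keeping the layer and the paradigm as separate enums reflects that the paradigms cut across the layers rather than nesting under any one of them.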
Contributing
This collection is an ongoing effort. We are actively expanding and refining its coverage, and welcome contributions from the community.
We regularly update the repository to include new research.
Citation
If you find this repository or paper useful, please consider citing the survey paper:
@article{wei2026agentic,
  title={Agentic Reasoning for Large Language Models},
  author={Wei, Tianxin and Li, Ting-Wei and Liu, Zhining and Ning, Xuying and Yang, Ze and Zou, Jiaru and Zeng, Zhichen and Qiu, Ruizhong and Lin, Xiao and Fu, Dongqi and others},
  journal={arXiv preprint arXiv:2601.12538},
  year={2026}
}
Foundational Agentic Reasoning
Planning Reasoning
In-context Planning
Workflow Design
| Paper | Year |
| --- | --- |
| LLM+P: Empowering Large Language Models with Optimal Planning Proficiency | 2023 |
| PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change | NeurIPS 2023 DB Track |
| ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models | 2023 |
| LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models | 2024 |
| Least-to-Most Prompting Enables Complex Reasoning in Large Language Models | ICLR 2023 |
| Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models | ACL 2023 |
| Algorithm of Thoughts: Enhancing Exploration of Ideas in Large Language Models | ICML 2024 |
| HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face | 2023 |
| Plan, Eliminate, and Track -- Language Models are Good Teachers for Embodied Agents | 2023 |
| PERIA: Perceive, Reason, Imagine, Act via Holistic Language and Vision Planning for Manipulation | 2024 |
| Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks | 2025 |
| CodePlan: Repository-level Coding using LLMs and Planning | FSE 2024 |
| ReAct: Synergizing Reasoning and Acting in Language Models | ICLR 2023 |
| Mind2Web: Towards a Generalist Agent for the Web | NeurIPS 2023 |
| WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents | 2024 |
| Executable Code Actions Elicit Better LLM Agents | ICML 2024 |
| Gorilla: Large Language Model Connected with Massive APIs | 2023 |
| Reflexion: Language Agents with Verbal Reinforcement Learning | 2023 |
| CodeNav: Beyond Tool-Use to Using Real-World Codebases with LLM Agents | ACL 2024 |
| MARCO: Multi-Agent Code Optimization with Real-Time Knowledge Integration for High-Performance Computing | 2025 |
| Enhancing LLM Reasoning with Multi-Path Collaborative Reactive and Reflection Agents | 2025 |
| Pre-Act: Multi-Step Planning and Reasoning Improves Acting in LLM Agents | 2025 |
| ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent | 2023 |
| Self-Planning Code Generation with Large Language Models | TOSEM 2023 |
| LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action | CoRL 2022 |
Tree Search / Algorithm Simulation
| Paper | Year |
| --- | --- |
| Tree of Thoughts: Deliberate Problem Solving with Large Language Models | NeurIPS 2023 |
| Tree Search for Language Model Agents | 2024 |
| Tree-Planner: Efficient Planning with Large Language Models | ICLR 2024 |
| Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning | 2024 |
| LLM-A*: Large Language Model Enhanced Incremental Heuristic Search on Path Planning | 2024 |
| Multimodal Chain-of-Thought Reasoning in Language Models | 2023 |
| Reasoning with Language Model is Planning with World Model | NeurIPS 2023 |
| Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents | 2024 |
| Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning | 2024 |
| Prompt-Based Monte-Carlo Tree Search for Goal-Oriented Dialogue Policy Planning | 2023 |
| Large Language Models as Tool Makers | ICLR 2024 |
| Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation | 2023 |
| AlphaZero-like Tree-Search can Guide Large Language Model Decoding and Training | 2023 |
| Broaden your SCOPE! Efficient Multi-turn Conversation Planning for LLMs with Semantic Space | 2025 |
| Self-Evaluation Guided Beam Search for Reasoning | NeurIPS 2023 |
| PathFinder: Multimodal Multi-Agent Medical Diagnosis Framework | 2025 |
| Discriminator-Guided Embodied Planning for LLM Agent | ICLR 2025 |
| Stream of Search (SoS): Learning to Search in Language | 2024 |
| System-1.x: Learning to Balance Fast and Slow Planning with Language Models | 2024 |
| Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems | 2024 |
| Intelligent Virtual Assistants with LLM-based Process Automation | 2023 |
| Agent S: An Open Agentic Framework that Uses Computers Like a Human | 2024 |
| HyperTree Planning: Enhancing LLM Reasoning via Hierarchical Thinking | 2025 |
| Tree-of-Code: A Tree-Structured Exploring Framework for End-to-End Code Generation and Execution in Complex Task Handling | ACL 2025 |
| Enhancing LLM-Based Agents via Global Planning and Hierarchical Execution | 2025 |
| Divide and Conquer: Grounding LLMs as Efficient Decision-Making Agents via Offline Hierarchical Reinforcement Learning | 2025 |
| SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement | ICLR 2025 |
| BTGenBot: Behavior Tree Generation for Robotic Tasks with Lightweight LLMs | 2024 |
| Do As I Can, Not As I Say: Grounding Language in Robotic Affordances | CoRL 2022 |
| Inner Monologue: Embodied Reasoning through Planning with Language Models | CoRL 2022 |
Process Formalization
| Paper | Year |
| --- | --- |
| Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-based Task Planning | NeurIPS 2023 |
| Leveraging Environment Interaction for Automated PDDL Translation and Planning with Large Language Models | NeurIPS 2024 |
| Thought of Search: Planning with Language Models Through The Lens of Efficiency | NeurIPS 2024 |
| CodePlan: Repository-level Coding using LLMs and Planning | FSE 2024 |
| Planning Anything with Rigor: General-Purpose Zero-Shot Planning with LLM-based Formalized Programming | 2024 |
| From An LLM Swarm To A PDDL-Empowered HIVE: Planning Self-Executed Instructions In A Multi-Modal Jungle | 2024 |
Decoupling / Decomposition
| Paper | Year |
| --- | --- |
| ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models | NeurIPS 2023 |
| DiffuserLite: Towards Real-time Diffusion Planning | 2024 |
| Goal-Space Planning with Subgoal Models | JMLR 2024 |
| Agent-Oriented Planning in Multi-Agent Systems | 2024 |
| GOPlan: Goal-conditioned Offline Reinforcement Learning by Planning with Learned Models | 2023 |
| RetroInText: A Multimodal Large Language Model Enhanced Framework for Retrosynthetic Planning via In-Context Representation Learning | ICLR 2025 |
| HyperTree Planning: Enhancing LLM Reasoning via Hierarchical Thinking | 2025 |
| VisualPredicator: Learning Abstract World Models with Neuro-Symbolic Predicates for Robot Planning | 2024 |
| Beyond Autoregression: Discrete Diffusion for Complex Reasoning | 2024 |
| PlanAgent: A Multi-modal Large Language Agent for Vehicle Motion Planning | 2024 |
| LLaMAR: Long-Horizon Planning for Multi-Agent Robots in Partially Observable Environments | 2024 |
External Aid / Tool Use
| Paper | Year |
| --- | --- |
| Plan-on-Graph: Self-Correcting Adaptive Planning on Knowledge Graphs | NeurIPS 2024 |
| Hierarchical Planning for Complex Tasks with Knowledge Graph-RAG and Symbolic Verification | 2025 |
| TeLoGraF: Temporal Logic Planning via Graph-encoded Flow Matching | 2025 |
| FlexPlanner: Flexible 3D Floorplanning via Deep Reinforcement Learning in Hybrid Action Space with Multi-Modality Representation | NeurIPS 2024 |
| Exploratory Retrieval-Augmented Planning For Continual Embodied Instruction Following | NeurIPS 2024 |
| Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent | 2024 |
| RAG over Tables: Hierarchical Memory Index, Multi-Stage Retrieval, and Benchmarking | 2025 |
| Reasoning with Language Model is Planning with World Model | NeurIPS 2023 |
| Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-based Task Planning | NeurIPS 2023 |
| Agent Planning with World Knowledge Model | NeurIPS 2024 |
| BehaviorGPT: Smart Agent Simulation for Autonomous Driving with Next-Patch Prediction | NeurIPS 2024 |
| DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning | 2024 |
| FLIP: Flow-Centric Generative Planning as General-Purpose Manipulation World Model | 2024 |
| Continual Reinforcement Learning by Planning with Online World Models | 2025 |
| AdaWM: Adaptive World Model based Planning for Autonomous Driving | 2025 |
| HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face | 2023 |
| Tool-Planner: Task Planning with Clusters across Multiple Tools | 2024 |
| RetroInText: A Multimodal Large Language Model Enhanced Framework for Retrosynthetic Planning via In-Context Representation Learning | ICLR 2025 |
Post-training Planning
| Paper | Year |
| --- | --- |
| Reflexion: Language Agents with Verbal Reinforcement Learning | NeurIPS 2023 |
| Reflect-then-Plan: Offline Model-Based Planning through a Doubly Bayesian Lens | 2025 |
| Rational Decision-Making Agent with Internalized Utility Judgment | 2023 |
| Scaling Autonomous Agents via Automatic Reward Modeling | 2025 |
| Strategic Planning: A Top-Down Approach to Option Generation | 2025 |
| Non-myopic Generation of Language Models for Reasoning and Planning | 2024 |
| Physics-informed Temporal Difference Metric Learning for Robot Motion Planning | 2025 |
| Generalizable Motion Planning via Operator Learning | 2024 |
| ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration | 2025 |
| Latent Diffusion Planning for Imitation Learning | 2025 |
| SafeDiffuser: Safe Planning with Diffusion Probabilistic Models | ICLR 2023 |
| ContraDiff: Planning Towards High Return States via Contrastive Learning | ICLR 2025 |
| Amortized Planning with Large-Scale Transformers: A Case Study on Chess | NeurIPS 2024 |
| GOPlan: Goal-conditioned Offline Reinforcement Learning by Planning with Learned Models | 2023 |
| A Goal Without a Plan Is Just a Wish: Efficient and Effective Global Planner Training for Long-Horizon Agent Tasks | 2025 |
Tool-Use Optimization
In-Context Tool-Integration
Interleaving Reasoning and Tool Use
| Paper | Year |
| --- | --- |
| Chain-of-Thought Prompting Elicits Reasoning in Large Language Models | NeurIPS 2022 |
| ChatCoT: Tool-Augmented Chain-of-Thought Reasoning on Chat-based Large Language Models | EMNLP 2023 |
| MultiTool-CoT: GPT-3 Can Use Multiple External Tools with Chain of Thought Prompting | ACL 2023 |
| Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions | ACL 2023 |
| ReAct: Synergizing Reasoning and Acting in Language Models | ICLR 2023 |
| ART: Automatic Multi-step Reasoning and Tool-use for Large Language Models | 2023 |
Optimizing Context for Tool Interaction
| Paper | Year |
| --- | --- |
| Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models | 2023 |
| EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction | NAACL 2025 |
| GEAR: Augmenting Language Models with Generalizable and Efficient Tool Resolution | EACL 2024 |
| AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning | NeurIPS 2024 |
Post-training Tool-Integration
Bootstrapping of Tool Use via SFT
| Paper | Year |
| --- | --- |
| Toolformer: Language Models Can Teach Themselves to Use Tools | NeurIPS 2023 |
| ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs | ICLR 2024 |
| ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases | 2023 |
| Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | NeurIPS 2023 |
| RestGPT: Connecting Large Language Models with Real-World RESTful APIs | 2023 |
| ADaPT: As-Needed Decomposition and Planning with Language Models | 2023 |
| Agent Lumos: Unified and Modular Training for Open-Source Language Agents | 2023 |
| Learning to Use Tools via Cooperative and Interactive Agents | 2024 |
| Understanding the Effects of RLHF on LLM Generalisation and Diversity | 2023 |
| Preserving Diversity in Supervised Fine-Tuning of Large Language Models | 2024 |
| Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification | EMNLP 2024 |
| Transformer Copilot: Learning from The Mistake Log in LLM Fine-tuning | 2025 |
| iTool: Reinforced Fine-Tuning with Dynamic Deficiency Calibration for Advanced Tool Use | 2025 |
| START: Self-taught Reasoner with Tools | 2025 |
Mastery of Tool Use via RL
| Paper | Year |
| --- | --- |
| SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution | 2025 |
| SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement | 2024 |
| ToolRL: Reward is All Tool Learning Needs | 2025 |
| RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents | 2025 |
| Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning | 2025 |
| AutoTool: Dynamic Tool Selection and Integration for Agentic Reasoning | 2025 |
| ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning | 2025 |
| Agentic Reinforced Policy Optimization | 2025 |
| Agentic Entropy-Balanced Policy Optimization | 2025 |
| Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning | 2025 |
| DeepAgent: A General Reasoning Agent with Scalable Toolsets | 2025 |
| Toward Effective Tool-Integrated Reasoning via Self-Evolved Preference Learning | 2025 |
| Demystifying Reinforcement Learning in Agentic Reasoning | 2025 |
| Reinforcement Pre-Training | 2025 |
| ReTool: Reinforcement Learning for Strategic Tool Use in LLMs | 2025 |
| ZeroSearch: Incentivize the Search Capability of LLMs Without Searching | 2025 |
| Kimi k1.5: Scaling Reinforcement Learning with LLMs | 2025 |
| Gemini 2.5: Pushing the Frontier with Advanced Reasoning and Next Generation Agentic Capabilities | 2025 |
| Kimi k2: Open Agentic Intelligence | 2025 |
| GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models | 2025 |
| Learning How to Use Tools, Not Just When: Pattern-Aware Tool-Integrated Reasoning | 2025 |
| SCRIBE: Structured Mid-Level Supervision for Tool-Using Language Models | 2026 |
| TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning | 2025 |
Orchestration-based Tool-Integration
Agentic Pipelines for Tool Orchestration
| Paper | Year |
| --- | --- |
| ToolPlanner: A Tool Augmented LLM for Multi Granularity Instructions with Path Planning and Feedback | 2025 |
| Advancing Tool-Augmented Large Language Models via Meta-Verification and Reflection Learning | KDD 2025 |
| OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning | 2025 |
| Chain-of-Tools: Utilizing Massive Unseen Tools in the CoT Reasoning of Frozen Language Models | 2025 |
| PyVision: Agentic Vision with Dynamic Tooling | 2025 |
| Learning to Use Tools via Cooperative and Interactive Agents | 2024 |
| El Agente: An Autonomous Agent for Quantum Chemistry | 2025 |
Tool Representations for Orchestration
| Paper | Year |
| --- | --- |
| ToolExpNet: Optimizing Multi-Tool Selection in LLMs with Similarity and Dependency-Aware Experience Networks | ACL (Findings) 2025 |
| T^2Agent: A Tool-augmented Multimodal Misinformation Detection Agent with Monte Carlo Tree Search | 2025 |
| ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search | 2023 |
| ToolRerank: Adaptive and Hierarchy-Aware Reranking for Tool Retrieval | COLING 2024 |
Agentic Search
In-Context Search
Interleaving Reasoning and Search
| Paper | Year |
| --- | --- |
| ReAct: Synergizing Reasoning and Acting in Language Models | ICLR 2023 |
| Measuring and Narrowing the Compositionality Gap in Language Models | 2022 |
| Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions | 2022 |
| Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection | NeurIPS Workshop 2023 |
| Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-Adaptive Planning Agent | 2024 |
| DeepRAG: Thinking to Retrieve Step by Step for Large Language Models | 2025 |
| MC-Search: Benchmarking Multimodal Agentic RAG with Structured Reasoning Chains | NeurIPS Workshop 2025 |
Structure-Enhanced Search
| Paper | Year |
| --- | --- |
| Agent-G: An Agentic Framework for Graph Retrieval Augmented Generation | 2025 |
| MC-Search: Benchmarking Multimodal Agentic RAG with Structured Reasoning Chains | NeurIPS Workshop 2025 |
| GeAR: Graph-Enhanced Agent for Retrieval-Augmented Generation | 2024 |
| Learning to Retrieve and Reason on Knowledge Graph through Active Self-Reflection | 2025 |
Post-Training Search
SFT-Based Agentic Search
| Paper | Year |
| --- | --- |
| Toolformer: Language Models Can Teach Themselves to Use Tools | NeurIPS 2023 |
| INTERS: Unlocking the Power of Large Language Models in Search with Instruction Tuning | 2024 |
| RAG-Studio: Towards In-Domain Adaptation of Retrieval Augmented Generation through Self-Alignment | EMNLP (Findings) 2024 |
| RAFT: Adapting Language Model to Domain Specific RAG | 2024 |
| Search-o1: Agentic search-enhanced large reasoning models | 2025 |
| RA-DIT: Retrieval-Augmented Dual Instruction Tuning | ICLR 2023 |
| SFR-RAG: Towards Contextually Faithful LLMs | 2024 |
RL-Based Agentic Search
| Paper | Year |
| --- | --- |
| WebGPT: Browser-assisted question-answering with human feedback | 2021 |
| RAG-RL: Advancing Retrieval-Augmented Generation via RL and Curriculum Learning | 2025 |
| Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning | 2025 |
| KBQA-R1: Reinforcing Large Language Models for Knowledge Base Question Answering | 2025 |
| DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-World Environments | 2025 |
| ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning | 2025 |
| ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding | 2025 |
Self-evolving Agentic Reasoning
Agentic Feedback Mechanisms
Reflective Feedback
| Paper | Year |
| --- | --- |
| Reflexion: Language Agents with Verbal Reinforcement Learning | NeurIPS 2023 |
| Self-Refine: Iterative Refinement with Self-Feedback | NeurIPS 2023 |
| Enable Language Models to Implicitly Learn Self-Improvement From Data | ICLR 2024 |
| A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve | TMLR 2025 |
| Tree of Thoughts: Deliberate Problem Solving with Large Language Models | NeurIPS 2023 |
| Graph of Thoughts: Solving Elaborate Problems with Large Language Models | AAAI 2024 |
| Zero-Shot Verification-Guided Chain of Thoughts | 2025 |
| ReAct: Synergizing Reasoning and Acting in Language Models | ICLR 2023 |
| WebGPT: Browser-assisted Question-Answering with Human Feedback | 2021 |
| MemGPT: Towards LLMs as Operating Systems | 2023 |
| Voyager: An Open-Ended Embodied Agent with Large Language Models | 2023 |
Parametric Adaptation
| Paper | Year |
| --- | --- |
| AgentTuning: Enabling Generalized Agent Abilities for LLMs | 2023 |
| ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent | 2023 |
| Re-ReST: Reflection-Reinforced Self-Training for Language Agents | 2024 |
| Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes | 2023 |
| Deep Reinforcement Learning from Human Preferences | NeurIPS 2017 |
| Direct Preference Optimization: Your Language Model is Secretly a Reward Model | NeurIPS 2023 |
| Constitutional AI: Harmlessness from AI Feedback | 2022 |
| ReflectEvo: Improving Meta Introspection of Small LLMs by Learning Self-Reflection | ACL (Findings) 2025 |
Validator-Driven Feedback
| Paper | Year |
| --- | --- |
| ReZero: Enhancing LLM search ability by trying one-more-time | 2025 |
| Are Retrials All You Need? Enhancing Large Language Model Reasoning Without Verbalized Feedback | 2025 |
| CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning | 2022 |
| LEVER: Learning to Verify Language-to-Code Generation with Execution | ICML 2023 |
| SWE-bench: Can Language Models Resolve Real-world GitHub Issues? | ICLR 2024 |
| Do As I Can, Not As I Say: Grounding Language in Robotic Affordances | CoRL 2022 |
| PaLM-E: An Embodied Multimodal Language Model | ICML 2023 |
| Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning | 2025 |
Agentic Memory
Agentic Use of Flat Memory
Factual Memory
| Paper | Year |
| --- | --- |
| Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks | NeurIPS 2020 |
| Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection | ICLR 2024 |
| MemoryBank: Enhancing Large Language Models with Long-Term Memory | 2023 |
| LlamaIndex | 2022 |
| MemGPT: Towards LLMs as Operating Systems | 2023 |
| RET-LLM: Towards a General Read-Write Memory for Large Language Models | 2023 |
| SCM: Enhancing Large Language Model with Self-Controlled Memory Framework | 2023 |
| Evaluating Very Long-Term Conversational Memory of LLM Agents | 2024 |
| LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory | 2024 |
| SELFGOAL: Your Language Agents Already Know How to Achieve High-level Goals | NAACL 2025 |
| FinMem: A Performance-Enhanced LLM Trading Agent with Layered Memory and Character Design | 2023 |
| A-mem: Agentic memory for LLM agents | 2025 |
| In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents | 2025 |
| Zep: A Temporal Knowledge Graph Architecture for Agent Memory | 2025 |
| MIRIX: Multi-Agent Memory System for LLM-Based Agents | 2025 |
| MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models | 2025 |
| LightMem: Lightweight and Efficient Memory-Augmented Generation | 2025 |
| Nemori: Self-Organizing Agent Memory Inspired by Cognitive Science | 2025 |
Experience Memory
| Paper | Year |
| --- | --- |
| Agent Workflow Memory | 2024 |
| Sleep-time Compute: Beyond Inference Scaling at Test-time | 2025 |
| Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory | 2025 |
| Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models | 2025 |
| ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory | 2025 |
| Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory | 2025 |
Structured Use of Memory
| Paper | Year |
| --- | --- |
| RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph | 2024 |
| From Local to Global: A Graph RAG Approach to Query-Focused Summarization | 2024 |
| Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory | 2025 |
| Zep: A Temporal Knowledge Graph Architecture for Agent Memory | 2025 |
| From Isolated Conversations to Hierarchical Schemas: Dynamic Tree Memory Representation for LLMs | 2024 |
| AutoFlow: Automated Workflow Generation for Large Language Model Agents | 2024 |
| AFlow: Automating Agentic Workflow Generation | ICLR 2025 |
| FlowMind: Automatic Workflow Generation with LLMs | 2024 |
| Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory (M3-Agent) | 2025 |
| Agent-ScanKit: Unraveling Memory and Reasoning of Multimodal Agents via Sensitivity Perturbations | 2025 |
| Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks | NeurIPS 2024 |
| RAP: Retrieval-Augmented Planning with Contextual Memory for Multimodal LLM Agents | 2024 |
Post-training Memory Control
| Paper | Year |
| --- | --- |
| MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent | 2025 |
| MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents | 2025 |
| Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning | 2025 |
| Mem-alpha: Learning Memory Construction via Reinforcement Learning | 2025 |
| Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks | 2025 |
| Agent Learning via Early Experience | 2025 |
| Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents | 2026 |
| MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory | 2026 |
Evolving Foundational Agentic Capabilities
Self-evolving Planning
| Paper | Year |
| --- | --- |
| Self-challenging language model agents | 2025 |
| Self-rewarding language models | ICML 2024 |
| RLSR: Reinforcement Learning from Self Reward | 2025 |
| Self: Self-evolution with language feedback | 2023 |
| Training language models to self-correct via reinforcement learning | 2024 |
| TextGrad: Differentiable Text Feedback for Language Models | 2024 |
| AutoRule: Reasoning Chain-of-thought Extracted Rule-based Rewards Improve Preference Learning | 2025 |
| AgentGen: Enhancing Planning Abilities for Large Language Model based Agent via Environment and Task Generation | 2024 |
| Reflexion: Language agents with verbal reinforcement learning | NeurIPS 2023 |
| Adaplanner: Adaptive planning from feedback with language models | NeurIPS 2023 |
| Self-refine: Iterative refinement with self-feedback | NeurIPS 2023 |
| A self-improving coding agent | 2025 |
| Ragen: Understanding self-evolution in LLM agents via multi-turn reinforcement learning | 2025 |
| DYSTIL: Dynamic Strategy Induction with Large Language Models for Reinforcement Learning | 2025 |
Self-evolving Tool-use
| Paper | Year |
| --- | --- |
| Large Language Models as Tool Makers | ICLR 2024 |
| CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets | ICLR 2024 |
| CREATOR: Tool Creation for Disentangling Abstract and Concrete Reasoning of Large Language Models | EMNLP 2023 |
| LLM Agents Making Agent Tools | 2025 |
Self-evolving Search for Memory Retrieval
| Paper | Year |
| --- | --- |
| Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks | NeurIPS 2020 |
| Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection | ICLR 2024 |
| MemoryBank: Enhancing Large Language Models with Long-Term Memory | 2023 |
| MemGPT: Towards LLMs as Operating Systems | 2023 |
| Agent Workflow Memory | 2024 |
| Dynamic Cheatsheet: Test-time learning with adaptive memory | 2025 |
| Reflexion: Language agents with verbal reinforcement learning | NeurIPS 2023 |
| ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory | 2025 |
| Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models | 2025 |
| AutoFlow: Automated Workflow Generation for Large Language Model Agents | 2024 |
| AFlow: Automating Agentic Workflow Generation | ICLR 2025 |
| FlowMind: Automatic Workflow Generation with LLMs | 2024 |
| RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph | 2024 |
| From Local to Global: A Graph RAG Approach to Query-Focused Summarization | 2024 |
| Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory | 2025 |
| Zep: A Temporal Knowledge Graph Architecture for Agent Memory | 2025 |
| MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models | 2025 |
| Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks | 2025 |
Collective Multi-agent Reasoning
Collaboration and Division of Labor
In-context Collaboration
Manually Crafted Pipelines
| Paper | Year |
| --- | --- |
| AgentOrchestra: A Hierarchical Multi-Agent Framework for General-Purpose Task Solving | 2025 |
| MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework | ICLR 2024 |
| SurgRAW: Multi-agent workflow with chain-of-thought reasoning for surgical intelligence | 2025 |
| Collab-RAG: Boosting retrieval-augmented generation for complex question answering via white-box and black-box LLM collaboration | 2025 |
| MA-RAG: Multi-Agent Retrieval-Augmented Generation via Collaborative Chain-of-Thought Reasoning | 2025 |
| Chain of Agents: Large Language Models Collaborating on Long-Context Tasks | NeurIPS 2024 |
| AutoAgents: a framework for automatic agent generation | IJCAI 2024 |
| RAG-KG-IL: A Multi-Agent Hybrid Framework for Reducing Hallucinations and Enhancing LLM Reasoning | 2025 |
| SMoA: Improving Multi-agent Large Language Models with Sparse Mixture-of-Agents | 2024 |
| MDocAgent: A multi-modal multi-agent framework for document understanding | 2025 |
LLM-Driven Pipelines
| Paper | Year |
| --- | --- |
| AutoML-Agent: A multi-agent LLM framework for full-pipeline AutoML | 2024 |
| Magentic-One: A generalist multi-agent system for solving complex tasks | 2024 |
| MAS-GPT: Training LLMs to build LLM-based multi-agent systems | 2025 |
| MetaAgent: Automatically Constructing Multi-Agent Systems Based on Finite State Machines | 2025 |
| Agent-oriented planning in multi-agent systems | 2024 |
| AgentRouter: A Knowledge-Graph-Guided LLM Router for Collaborative Multi-Agent Question Answering | 2025 |
| Talk to Right Specialists: Routing and planning in multi-agent system for question answering | 2025 |
Theory-of-Mind-Augmented Collaboration
| Paper | Year |
| --- | --- |
| Theory of mind for multi-agent collaboration via large language models | 2023 |
| Hypothetical Minds: Scaffolding theory of mind for multi-agent tasks with large language models | 2024 |
| MindForge: Empowering Embodied Agents with Theory of Mind for Lifelong Collaborative Learning | 2024 |
| How large language models encode theory-of-mind: a study on sparse parameter patterns | npj Artificial Intelligence 2025 |
| Large Language Models as Theory of Mind Aware Generative Agents with Counterfactual Reflection | 2025 |
| BeliefNest: A Joint Action Simulator for Embodied Agents with Theory of Mind | 2025 |
Post-training Collaboration
Multi-agent Prompt Optimization
| Paper | Year |
| --- | --- |
| AutoAgents: A Framework for Automatic Agent Generation | IJCAI 2024 |
| Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration | NAACL 2024 |
| DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines | 2023 |
| Multi-agent Design: Optimizing Agents with Better Prompts and Topologies | 2025 |
| Automatic Prompt Optimization with "Gradient Descent" and Beam Search | 2023 |
Graph-based Topology Generation
| Paper | Year |
| --- | --- |
| Learning Multi-Agent Communication from Graph Modeling Perspective | 2024 |
| G-Designer: Architecting Multi-agent Communication Topologies via Graph Neural Networks | 2024 |
| Graph Diffusion for Robust Multi-Agent Coordination | ICML 2025 |
| Cut the Crap: An Economical Communication Pipeline for LLM-based Multi-Agent Systems | 2024 |
| Adaptive Graph Pruning for Multi-Agent Communication | 2025 |
| G-Safeguard: A Topology-Guided Security Lens and Treatment on LLM-based Multi-Agent Systems | 2025 |
| AFlow: Automating Agentic Workflow Generation | ICLR 2025 |
| Multi-agent Design: Optimizing Agents with Better Prompts and Topologies | 2025 |
| Multi-Agent Architecture Search via Agentic Supernet | 2025 |
| DynaSwarm: Dynamically Graph Structure Selection for LLM-based Multi-Agent System | 2025 |
| GPTSwarm: Language Agents as Optimizable Graphs | ICML 2024 |
Policy-based Topology Generation
| Paper | Year |
| --- | --- |
| MASRouter: Learning to Route LLMs for Multi-Agent Systems | 2025 |
| RCR-Router: Efficient Role-Aware Context Routing for Multi-Agent LLM Systems with Structured Memory | 2025 |
| xRouter: Training Cost-Aware LLMs Orchestration System via Reinforcement Learning | 2025 |
| Optimal-Agent-Selection: State-Aware Routing Framework for Efficient Multi-Agent Collaboration | 2025 |
| LLM Collaboration with Multi-Agent Reinforcement Learning | 2025 |
| Heterogeneous Group-Based Reinforcement Learning for LLM-based Multi-Agent Systems | 2025 |
| Enhancing Multi-Agent Systems via Reinforcement Learning with LLM-based Planner and Graph-based Policy | 2025 |
| LAMARL: LLM-Aided Multi-Agent Reinforcement Learning for Cooperative Policy Generation | IEEE RA-L 2025 |
| MAPoRL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning | 2025 |
| Reflective Multi-Agent Collaboration Based on Large Language Models | NeurIPS 2024 |
| Sirius: Self-Improving Multi-Agent Systems via Bootstrapped Reasoning | 2025 |
| Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains | 2025 |
| M3HF: Multi-Agent Reinforcement Learning from Multi-Phase Human Feedback of Mixed Quality | 2025 |
| O-MAPL: Offline Multi-Agent Preference Learning | 2025 |
Multi-Agent Memory and Evolution
From Single-Agent Evolution to Multi-Agent Evolution
Intra-test-time Evolution
| Paper | Year |
| --- | --- |
| Reflexion: Language Agents with Verbal Reinforcement Learning | NeurIPS 2023 |
| Self-Refine: Iterative Refinement with Self-Feedback | NeurIPS 2023 |
| AdaPlanner: Adaptive Planning from Feedback with Language Models | NeurIPS 2023 |
| TrustAgent: Towards Safe and Trustworthy LLM-based Agents through Agent Constitution | TiFA 2024 |
| Self-Adapting Language Models | 2025 |
| TTRL: Test-Time Reinforcement Learning | 2025 |
| Ladder: Self-Improving LLMs through Recursive Problem Decomposition | 2025 |
Inter-test-time Evolution
| Paper | Year |
| --- | --- |
| Self: Self-Evolution with Language Feedback | 2023 |
| STaR: Bootstrapping Reasoning with Reasoning | NeurIPS 2022 |
| Reasoning Beyond Limits: Advances and Open Problems for LLMs | 2025 |
| RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning | 2025 |
| DYSTIL: Dynamic Strategy Induction with Large Language Models for Reinforcement Learning | 2025 |
| WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning | 2024 |
| Why do animals need shaping? A theory of task composition and curriculum learning | 2024 |
| SAGE: Self-evolving Agents with Reflective and Memory-augmented Abilities | Neurocomputing 2025 |
| MemInsight: Autonomous Memory Augmentation for LLM Agents | 2025 |
| Agent Workflow Memory | 2024 |
Multi-agent Evolution
| Paper | Year |
| --- | --- |
| Self: Self-Evolution with Language Feedback | 2023 |
| Training Language Models to Self-Correct via Reinforcement Learning | 2024 |
| TextGrad: Automatic "Differentiation" via Text | 2024 |
| REMA: Learning to Meta-Think for LLMs with Multi-Agent Reinforcement Learning | 2025 |
| Group-in-Group Policy Optimization for LLM Agent Training | 2025 |
| Agent Workflow Memory | 2024 |
| MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models | 2025 |
| Multi-agent Design: Optimizing Agents with Better Prompts and Topologies | 2025 |
| AFlow: Automating Agentic Workflow Generation | ICLR 2025 |
| Testing Advanced Driver Assistance Systems Using Multi-Objective Search and Neural Networks | ASE 2016 |
| Latent Collaboration in Multi-Agent Systems | 2025 |
Multi-agent Memory Management for Evolution
| Paper | Year |
| --- | --- |
| G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems | 2025 |
| Intrinsic Memory Agents: Heterogeneous Multi-Agent LLM Systems through Structured Contextual Memory | 2025 |
| LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning | 2025 |
| SEDM: Scalable Self-Evolving Distributed Memory for Agents | 2025 |
| Collaborative Memory: Multi-User Memory Sharing in LLM Agents with Dynamic Access Control | 2025 |
| Memory Sharing for Large Language Model based Agents | 2024 |
| MIRIX: Multi-Agent Memory System for LLM-Based Agents | 2025 |
| LEGOMem: Modular Procedural Memory for Multi-agent LLM Systems for Workflow Automation | 2025 |
| MAPLE: Multi-Agent Adaptive Planning with Long-Term Memory for Table Reasoning | ALTA 2025 |
| Lyfe Agents: Generative agents for low-cost real-time social interactions | 2023 |
| Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving | 2025 |
Training Multi-agent to Evolve
| Paper | Year |
| --- | --- |
| Multi-Agent Evolve: LLM Self-Improve through Co-evolution | 2025 |
| CoMAS: Co-Evolving Multi-Agent Systems via Interaction Rewards | 2025 |
| MARFT: Multi-Agent Reinforcement Fine-Tuning | 2025 |
| Stronger-MAS: Multi-Agent Reinforcement Learning for Collaborative LLMs | 2025 |
| MAPoRL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning | 2025 |
| MALT: Multi-Agent Learning from Trajectories | 2025 |
| MARS: Optimizing Dual-System Deep Research via Multi-Agent Reinforcement Learning | 2025 |
| Preference-Based Multi-Agent Reinforcement Learning: Data Coverage and Algorithmic Techniques | 2024 |
| The Alignment Waltz: Jointly Training Agents to Collaborate for Safety | 2025 |
Applications
Math Exploration & Vibe Coding Agents
Foundational Agentic Reasoning
| Paper | Year |
| --- | --- |
| Advancing mathematics by guiding human intuition with AI | Nature 2021 |
| Solving olympiad geometry without human demonstrations | Nature 2024 |
| Mathematical discoveries from program search with large language models | Nature 2024 |
| Mathematical Exploration and Discovery at Scale | 2025 |
| Advancing geometry with AI: Multi-agent generation of polytopes | 2025 |
| Towards Robust Mathematical Reasoning | EMNLP 2025 |
| CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules | ICLR 2024 |
| Executable Code Actions Elicit Better LLM Agents | ICML 2024 |
| Knowledge-Aware Code Generation with Large Language Models | ICPC 2024 |
| CodePlan: Repository-level Coding using LLMs and Planning | FSE 2024 |
| Multi-stage guided code generation for Large Language Models | Eng. App. AI 2025 |
| CodeTree: Agent-Guided Tree Search for Code Generation with Large Language Models | 2024 |
| DotaMath: Decomposition of Thought with Code Assistance and Self-correction for Mathematical Reasoning | 2024 |
| Tree-of-Code: A Self-Growing Tree Framework for End-to-End Code Generation and Execution in Complex Tasks | ACL 2025 |
| CoRT: Code-integrated Reasoning within Thinking | 2025 |
| DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal | 2025 |
| Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search | NeurIPS 2024 |
| VerilogCoder: Autonomous Verilog Coding Agents with Graph-based Planning | AAAI 2025 |
| Guided Search Strategies in Non-Serializable Environments with Applications to Software Engineering Agents | ICML 2025 |
| An In-Context Learning Agent for Formal Theorem-Proving | COLM 2024 |
| Formal Mathematical Reasoning: A New Frontier in AI | 2024 |
| Generative Modelling for Mathematical Discovery | 2025 |
| Toolformer: Language Models Can Teach Themselves to Use Tools | NeurIPS 2023 |
| ToolCoder: Teach Code Generation Models to use API search tools | 2023 |
| ToolGen: Unified Tool Retrieval and Calling via Generation | ICLR 2025 |
| CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges | ACL 2024 |
| ROCODE: Integrating Backtracking Mechanism and Program Analysis in Large Language Models for Code Generation | ICSE 2025 |
| CodeTool: Enhancing Programmatic Tool Invocation of LLMs via Process Supervision | 2025 |
| RepoHyper: Better Context Retrieval is All You Need for Repository-Level Code Completion | 2024 |
| CodeNav: Beyond Tool-Use to Using Real-World Codebases with LLM Agents | ICLR 2024 |
| Optimizing Code Runtime Performance Through Context-Aware Retrieval-Augmented Generation | ICPC 2025 |
| Knowledge Graph Based Repository-Level Code Generation | LLM4Code 2025 |
| cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree | 2025 |
Self-evolving Agentic Reasoning
| Paper | Year |
| --- | --- |
| Evaluating Language Models for Mathematics through Interactions | PNAS 2024 |
| CLCL: Non-compositional Expression Detection with Contrastive Learning and Curriculum Learning | ACL 2023 |
| Is Self-Repair a Silver Bullet for Code Generation? | 2024 |
| LeDeX: Learning to Debug with Execution Feedback | NeurIPS 2024 |
| Self-Refine: Iterative Refinement with Self-Feedback | NeurIPS 2023 |
| A Self-Iteration Code Generation Method Based on Large Language Models | ICPADS 2023 |
| Teaching Large Language Models to Self-Debug | ICLR 2024 |
| Self-Collaboration Code Generation via ChatGPT | TOSEM 2024 |
| L2MAC: Large Language Model Automatic Computer for Extensive Code Generation | 2023 |
| Cogito, Ergo Sum: A Neurobiologically-Inspired Cognition-Memory-Growth System for Code Generation | 2025 |
Collective Multi-agent Reasoning
| Paper | Year |
| --- | --- |
| AgentCoder: Multi-Agent-Based Code Generation with Iterative Testing and Optimisation | 2023 |
| A Pair Programming Framework for Code Generation via Multi-Plan Exploration and Feedback-Driven Refinement | ASE 2024 |
| SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model Agents | ICSE 2025 |
| Self-Organized Agents: A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization | 2024 |
| MapCoder: Multi-Agent Code Generation for Competitive Problem Solving | 2024 |
| AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing | 2024 |
| QualityFlow: An Agentic Workflow for Program Synthesis Controlled by LLM Quality Checks | 2025 |
| SEW: Self-Evolving Agentic Workflows for Automated Code Generation | 2025 |
| Self-Evolving Multi-Agent Collaboration Networks for Software Development | 2024 |
| Lingma SWE-GPT: An Open Development-Process-Centric Language Model for Automated Software Improvement | 2024 |
| CodeCoR: An LLM-based Self-Reflective Multi-Agent Framework for Code Generation | 2025 |
| SyncMind: Measuring Agent Out-of-Sync Recovery in Collaborative Software Engineering | ICML 2025 |
| Hallucination to Consensus: Multi-Agent LLMs for End-to-End Test Generation | 2025 |
Scientific Discovery Agents
Foundational Agentic Reasoning
| Paper | Year |
| --- | --- |
| ProtAgents: Protein discovery via large language model multi-agent collaborations combining physics and machine learning | Digital Discovery 2024 |
| Agent-based learning of materials datasets from the scientific literature | Digital Discovery 2024 |
| ReAct: Synergizing Reasoning and Acting in Language Models | ICLR 2023 |
| Biomni: A General-Purpose Biomedical AI Agent | bioRxiv 2025 |
| SciAgent: Tool-augmented Language Models for Scientific Reasoning | 2024 |
| ChemCrow: Augmenting large-language models with chemistry tools | 2023 |
| CACTUS: Chemistry Agent Connecting Tool-Usage to Science | ACS Omega 2024 |
| ChemToolAgent: The Impact of Tools on Language Agents for Chemistry Problem Solving | 2024 |
| CheMatAgent: Enhancing LLMs for Chemistry and Materials Science through Tree-Search Based Tool Learning | 2025 |
| TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of Tools | 2025 |
| AgentMD: Empowering language agents for risk prediction with large-scale clinical tool learning | Nature Communications 2025 |
| LLaMP: Large Language Model Made Powerful for High-fidelity Materials Knowledge Retrieval and Distillation | 2024 |
| HoneyComb: A Flexible LLM-Based Agent System for Materials Science | 2024 |
| CRISPR-GPT for Agentic Automation of Gene-editing Experiments | 2024 |
| PharmAgents: Building a Virtual Pharma with Large Language Model Agents | 2025 |
| ORGANA: A robotic assistant for automated chemistry experimentation and characterization | Matter 2025 |
| AtomAgents: Alloy design and discovery through physics-aware multi-modal multi-agent artificial intelligence | 2024 |
| Chemist-X: Large Language Model-empowered Agent for Reaction Condition Recommendation in Chemical Synthesis | 2024 |
| LLM and Simulation as Bilevel Optimizers: A New Paradigm to Advance Physical Scientific Discovery | 2024 |
| CellAgent: LLM-Driven Multi-Agent Framework for Natural Language-Based Single-Cell Analysis | bioRxiv 2024 |
| BioDiscoveryAgent: An AI Agent for Designing Genetic Perturbation Experiments | 2024 |
| DrugAgent: Multi-Agent Large Language Model-Based Reasoning for Drug-Target Interaction Prediction | 2024 |
| Accelerating Scientific Research Through a Multi-LLM Framework | 2025 |
| The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search | 2025 |
| Large Language Models are Zero Shot Hypothesis Proposers | 2023 |
| PaperQA: Retrieval-Augmented Generative Agent for Scientific Research | 2023 |
| Language agents achieve superhuman synthesis of scientific knowledge | 2024 |
| LLaMP: Large Language Model Made Powerful for High-fidelity Materials Knowledge Retrieval and Distillation | 2024 |
Self-evolving Agentic Reasoning
| Paper | Year |
| --- | --- |
| ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning | 2025 |
| Accelerated Inorganic Materials Design with Generative AI Agents | 2025 |
| LLM and Simulation as Bilevel Optimizers: A New Paradigm to Advance Physical Scientific Discovery | 2024 |
| ChemReasoner: Heuristic Search over a Large Language Model's Knowledge Space using Quantum-Chemical Feedback | 2024 |
| LLMatDesign: Autonomous Materials Discovery with Large Language Models | 2024 |
| Hypothesis Generation for Materials Discovery and Design Using Goal-Driven and Constraint-Guided LLM Agents | 2025 |
Collective Multi-agent Reasoning
| Paper | Year |
| --- | --- |
| ProtAgents: Protein discovery via large language model multi-agent collaborations combining physics and machine learning | Digital Discovery 2024 |
| PiFlow: Principle-aware Scientific Discovery with Multi-Agent Collaboration | 2025 |
| AtomAgents: Alloy design and discovery through physics-aware multi-modal multi-agent artificial intelligence | 2024 |
| CellAgent: LLM-Driven Multi-Agent Framework for Natural Language-Based Single-Cell Analysis | bioRxiv 2024 |
| Accelerating Scientific Research Through a Multi-LLM Framework | 2025 |
| Toward a team of AI-made scientists for scientific discovery from gene expression data | 2024 |
| The virtual lab: AI agents design new SARS-CoV-2 nanobodies with experimental validation | bioRxiv 2024 |
Embodied Agents
Foundational Agentic Reasoning
| Paper | Year |
| --- | --- |
| Do As I Can, Not As I Say: Grounding Language in Robotic Affordances | 2022 |
| SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning | 2023 |
| EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | NeurIPS 2023 |
| Context-Aware Planning and Environment-Aware Memory for Instruction Following Embodied Agents | ECCV 2024 |
| Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents | NeurIPS 2023 |
| Robotic Control via Embodied Chain-of-Thought Reasoning | 2024 |
| Fast ECoT: Fast Embodied Chain-of-Thought for Vision-Language-Action Models | 2025 |
| Cosmos-Reason1: Physical Commonsense with Multimodal Chain of Thought Reasoning | 2025 |
| CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models | 2025 |
| Emma-X: An Embodied Multimodal Action Model with Chain of Thought Reasoning | 2024 |
| Robot-R1: Reinforcement Learning Enhanced Large Vision-Language Models for Robotic Manipulation | 2025 |
| ManipLVM-R1: Learning to Reason for Robotic Manipulation via Reinforcement Learning | 2025 |
| Embodied-R: Emergent Spatial Reasoning in Robotics via Multi-Agent Reinforcement Learning | 2025 |
| VIKI-R: A VLM-Based Reinforcement Learning Approach for Heterogeneous Multi-Agent Cooperation | 2025 |
| GSCE: A Prompt Framework for Enhanced Logical Reasoning in LLM-Based Drone Control | 2025 |
| MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge | NeurIPS 2022 |
| Physical AI Agents: Integrating Generative AI, Symbolic AI and Robotics | 2025 |
| Chat with the Environment: Interactive Multimodal Perception using Large Language Models | IROS 2023 |
| An embodied generalist agent in 3D world | ICML 2024 |
| Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models | 2025 |
| Gemini Robotics: Bringing AI to the Physical World | 2025 |
| Octopus: Embodied Vision-Language Programmer from Environmental Feedback | ECCV 2024 |
| CaPo: Cooperative Plan Optimization for Multi-Agent Collaboration | 2024 |
| COHERENT: Collaboration of Heterogeneous Multi-Robot System with Large Language Models | ICRA 2025 |
| MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception | CVPR 2024 |
| LLM-Planner: Few-Shot Grounded High-Level Planning for Embodied Agents with Large Language Models | ICCV 2023 |
| EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | NeurIPS 2023 |
| L3MVN: Leveraging Large Language Models for Visual Target Navigation | 2023 |
| SayNav: Grounding Large Language Models for Dynamic Planning to Navigation in New Environments | ICAPS 2023 |
| SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning | CoRL 2023 |
| ReMEmbR: Building and Reasoning with Long-Horizon Spatio-Temporal Memory for Embodied Agents | 2025 |
| Embodied-RAG: General Non-parametric Embodied Memory for Retrieval-Augmented Generation | NeurIPS Workshop AFM 2024 |
| Retrieval-Augmented Embodied Agents | 2024 |
| MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents | 2024 |
Self-evolving Agentic Reasoning
| Paper | Year |
| --- | --- |
| LLM-Empowered Embodied Agent for Memory-Augmented Task Planning in Household Robotics | 2024 |
| Optimus-1: Hybrid Multimodal Memory Empowered Agents for Long-Horizon Tasks in Minecraft | 2024 |
| Open-Ended Instructable Embodied Agents with Memory-Augmented Large Language Models | EMNLP 2023 |
| Endowing Embodied Agents with Spatial Reasoning Capabilities for Vision-and-Language Navigation | 2025 |
| Context-Aware Planning and Environment-Aware Memory for Instruction Following Embodied Agents | 2024 |
| Ella: Embodied Social Agents with Lifelong Memory | 2025 |
| Chat with the Environment: Interactive Multimodal Perception using Large Language Models | IROS 2023 |
| From Strangers to Assistants: Fast Desire Alignment for Embodied Agent-User Adaptation | 2025 |
| Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners | CoRL 2023 |
| Octopus: Embodied Vision-Language Programmer from Environmental Feedback | ECCV 2024 |
| MindForge: Empowering Embodied Agents with Theory of Mind for Lifelong Collaborative Learning | 2024 |
| Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration | 2024 |
| EMAC+: Embodied Multimodal Agent for Collaborative Planning with VLM+LLM | 2025 |
| Voyager: An Open-Ended Embodied Agent with Large Language Models | 2023 |
Collective Multi-agent Reasoning
| Paper | Year |
| --- | --- |
| Smart-LLM: Smart Multi-Agent Robot Task Planning with Large Language Models | 2024 |
| CaPo: Cooperative Plan Optimization for Multi-Agent Collaboration | 2024 |
| COHERENT: Collaboration of Heterogeneous Multi-Robot System with Large Language Models | ICRA 2025 |
| Theory of mind for multi-agent collaboration via large language models | 2023 |
| How large language models encode theory-of-mind: a study on sparse parameter patterns | npj Artificial Intelligence 2025 |
| Hypothetical Minds: Scaffolding theory of mind for multi-agent tasks with large language models | 2024 |
| MindForge: Empowering Embodied Agents with Theory of Mind for Lifelong Collaborative Learning | 2024 |
| EMAC+: Embodied Multimodal Agent for Collaborative Planning with VLM+LLM | 2025 |
| COMBO: Compositional World Models for Embodied Multi-Agent Cooperation | 2025 |
| VIKI-R: A VLM-Based Reinforcement Learning Approach for Heterogeneous Multi-Agent Cooperation | 2025 |
| RoCo: Dialectic Multi-Robot Collaboration with Large Language Models | 2024 |
Healthcare & Medicine Agents
Foundational Agentic Reasoning
| Paper | Year |
| --- | --- |
| Development and validation of an autonomous artificial intelligence agent for clinical decision-making in oncology | Nature Medicine 2024 |
| EHRAgent: Code Empowers Large Language Models for Complex Tabular Reasoning on Electronic Health Records | 2024 |
| PathFinder: A Multi-Modal Multi-Agent System for Medical Diagnostic Decision-Making Applied to Histopathology | 2025 |
| MedAgent-Pro: Towards Evidence-based Multi-modal Medical Diagnosis via Reasoning Agentic Workflow | 2025 |
| MedOrch: Medical Diagnosis with Tool-Augmented Reasoning Agents for Flexible Extensibility | 2025 |
| ClinicalAgent: Clinical Trial Multi-Agent System with Large Language Model-based Reasoning | 2024 |
| DynamiCare: A Dynamic Multi-Agent Framework for Interactive and Open-Ended Medical Decision-Making | 2025 |
| TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of Tools | 2025 |
| AgentMD: Empowering language agents for risk prediction with large-scale clinical tool learning | Nature Communications 2025 |
| Large language model agents can use tools to perform clinical calculations | npj Digital Medicine 2025 |
| MeNTi: Bridging Medical Calculator and LLM Agent with Nested Tool Calling | 2024 |
| MMedAgent: Learning to Use Medical Tools with Multi-modal Agents | 2024 |
| VoxelPrompt: A Vision Agent for End-to-End Medical Image Analysis | 2024 |
| Enhancing Surgical Robots with Embodied Intelligence for Autonomous Ultrasound Scanning | 2024 |
| Adaptive Reasoning and Acting in Medical Language Agents | 2024 |
| MedRAX: Medical Reasoning Agent for Chest X-ray | 2025 |
| Conversational Health Agents: A Personalized LLM-Powered Agent Framework | 2023 |
| MedAgentGym: A Scalable Agentic Training Environment for Code-Centric Reasoning in Biomedical Data Science | 2025 |
| Simulated patient systems powered by large language model-based AI agents offer potential for transforming medical education | 2024 |
| Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions | MICCAI 2025 |
| RAG-Enhanced Collaborative LLM Agents for Drug Discovery | 2025 |
| MedReason: Eliciting Factual Medical Reasoning Steps in LLMs via Knowledge Graphs | 2025 |
Self-evolving Agentic Reasoning
| Paper | Year |
| --- | --- |
| Epidemic Modeling with Generative Agents | 2023 |
| Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions | MICCAI 2025 |
| EHRAgent: Code Empowers Large Language Models for Complex Tabular Reasoning on Electronic Health Records | 2024 |
| LLMs Can Simulate Standardized Patients via Agent Coevolution | 2024 |
| Simulated patient systems powered by large language model-based AI agents offer potential for transforming medical education | 2024 |
| MedOrch: Medical Diagnosis with Tool-Augmented Reasoning Agents for Flexible Extensibility | 2025 |
| DynamiCare: A Dynamic Multi-Agent Framework for Interactive and Open-Ended Medical Decision-Making | 2025 |
| MedAgentGym: A Scalable Agentic Training Environment for Code-Centric Reasoning in Biomedical Data Science | 2025 |
| EHRAgent: Code Empowers Large Language Models for Complex Tabular Reasoning on Electronic Health Records | 2024 |
| MeNTi: Bridging Medical Calculator and LLM Agent with Nested Tool Calling | 2025 |
| Large language model agents can use tools to perform clinical calculations | npj Digital Medicine 2025 |
Collective Multi-agent Reasoning
| Paper | Year |
| --- | --- |
| MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making | 2024 |
| DoctorAgent-RL: A Multi-Agent Collaborative Reinforcement Learning System for Multi-Turn Clinical Dialogue | 2025 |
| Beyond Direct Diagnosis: LLM-based Multi-Specialist Agent Consultation for Automatic Diagnosis | 2024 |
| ClinicalAgent: Clinical Trial Multi-Agent System with Large Language Model-based Reasoning | 2024 |
| PathFinder: A Multi-Modal Multi-Agent System for Medical Diagnostic Decision-Making Applied to Histopathology | 2025 |
| Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions | MICCAI 2025 |
| LLMs Can Simulate Standardized Patients via Agent Coevolution | 2024 |
| DynamiCare: A Dynamic Multi-Agent Framework for Interactive and Open-Ended Medical Decision-Making | 2025 |
| MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning | 2024 |
| RAG-Enhanced Collaborative LLM Agents for Drug Discovery | 2025 |
| GMAI-VL-R1: Harnessing Reinforcement Learning for Multi-Modal Medical Reasoning | 2025 |
Autonomous Web Exploration & Research Agents
Foundational Agentic Reasoning
| Paper | Year |
| --- | --- |
| Agent Laboratory: Using LLM Agents as Research Assistants | 2025 |
| GPT Researcher | 2023 |
| Accelerating Scientific Research Through a Multi-LLM Framework | 2025 |
| Video-Browser: Towards Agentic Open-web Video Browsing | 2025 |
| InternAgent: When Agent Becomes the Scientist -- Building Closed-Loop System from Hypothesis to Verification | 2025 |
| WebGPT: Browser-assisted question-answering with human feedback | 2021 |
| Language Models are Few-Shot Learners | NeurIPS 2020 |
| GPT-4V(ision) is a Generalist Web Agent, if Grounded | ICML 2024 |
| AutoWebGLM: A Large Language Model-based Web Navigating Agent | 2024 |
| Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents | 2024 |
| WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning | 2024 |
| WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning | 2025 |
| Navigating WebAI: Training Agents to Complete Web Tasks with Large Language Models and Reinforcement Learning | 2024 |
| DeepDiver: Adaptive Search Intensity Scaling via Open-Web Reinforcement Learning | 2025 |
| EvolveSearch: An Iterative Self-Evolving Search Agent | 2025 |
| WebEvolver: Enhancing Web Agent Self-Improvement with Coevolving World Model | 2025 |
| ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL | ICLR 2025 |
| Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents | 2024 |
| WebSeer: Training Deeper Search Agents through Reinforcement Learning with Self-Reflection | 2025 |
| ZeroSearch: Incentivize the Search Capability of LLMs Without Searching | 2025 |
| StepSearch: Igniting LLMs Search Ability via Step-Wise Proximal Policy Optimization | 2025 |
| How to Train Your LLM Web Agent: A Statistical Diagnosis | 2025 |
| Agent S: An Open Agentic Framework that Uses Computers Like a Human | 2024 |
| InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection | 2025 |
| MobA: Multifaceted Memory-Enhanced Adaptive Planning for Efficient Mobile Task Automation | 2024 |
| PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC | 2025 |
| UItron: Foundational GUI Agent with Advanced Perception and Planning | 2025 |
| ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay | 2025 |
| ComputerRL: Scaling End-to-End Online Reinforcement Learning for Computer Use Agents | 2025 |
| UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning | 2025 |
| GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents | 2025 |
| InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners | 2025 |
| UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning | 2025 |
| GUI-Bee: Align GUI Action Grounding to Novel Environments via Autonomous Exploration | EMNLP 2025 |
| Learning GUI Grounding with Spatial Reasoning from Visual Feedback | 2025 |
| GUI-Shift: Enhancing VLM-Based GUI Agents through Self-supervised Reinforcement Learning | 2025 |
| UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding | 2025 |
| ZeroGUI: Automating Online GUI Learning at Zero Human Cost | 2025 |
| AgentCPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning | 2025 |
| AutoGLM: Autonomous Foundation Agents for GUIs | 2024 |
| Mobile-Agent-v3: Fundamental Agents for GUI Automation | 2025 |
| WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models | ACL 2024 |
| BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions | 2025 |
| WALT: Web Agents that Learn Tools | 2025 |
| WebDancer: Towards Autonomous Information Seeking Agency | 2025 |
| WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization | 2025 |
| AutoDroid: LLM-powered Task Automation in Android | MobiCom 2024 |
| MobileExperts: A Dynamic Tool-Enabled Agent Team in Mobile Devices | 2024 |
| AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant | 2024 |
| OS-Copilot: Towards Generalist Computer Agents with Self-Improvement | 2024 |
| OSCAR: Operating System Control via State-Aware Reasoning and Re-Planning | 2024 |
| OS-ATLAS: A Foundation Action Model for Generalist GUI Agents | 2024 |
| SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents | 2024 |
| Agentic Reasoning: A Streamlined Framework for Enhancing LLM Reasoning with Agentic Tools | 2025 |
| Agent Laboratory: Using LLM Agents as Research Assistants | 2025 |
| MLR-Copilot: Autonomous Machine Learning Research based on Large Language Model Agents | 2024 |
| Dolphin: Moving Towards Closed-loop Auto-research through Thinking, Practice, and Feedback | 2025 |
| The AI Scientist: Fully Automated Open-Ended Scientific Discovery | 2024 |
| The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search | 2025 |
| WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents | 2025 |
| WebSailor: Navigating Super-human Reasoning for Web Agent | 2025 |
| RaDA: Retrieval-augmented Web Agent Planning with LLMs | 2024 |
| Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control | ICLR 2024 |
| LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration Benchmark | 2025 |
| Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation | 2023 |
| Retrieval-augmented GUI Agents with Generative Guidelines | 2025 |
| WebThinker: Empowering Large Reasoning Models with Deep Research Capability | 2025 |
| DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments | 2025 |
| PaperQA: Retrieval-Augmented Generative Agent for Scientific Research | 2023 |
| Language agents achieve superhuman synthesis of scientific knowledge | 2024 |
| Chain of Ideas: Revolutionizing Research Via Novel Idea Development with LLM Agents | 2024 |
| Scideator: Human-LLM Scientific Idea Generation Grounded in Research-Paper Facet Recombination | 2024 |
Self-evolving agentic reasoning
| Paper | Year |
| --- | --- |
| Agent Workflow Memory | 2024 |
| VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs of Thought | 2024 |
| BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions | 2025 |
| AutoWebGLM: A Large Language Model-based Web Navigating Agent | 2024 |
| AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents | 2024 |
| LiteWebAgent: The Open-Source Suite for VLM-Based Web-Agent Applications | 2025 |
| WebDancer: Towards Automated Web Information Seeking with Large Language Model Agents | 2025 |
| WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization | 2025 |
| Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation | 2023 |
| MobA: Multifaceted Memory-Enhanced Adaptive Planning for Efficient Mobile Task Automation | 2024 |
| Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks | 2025 |
| Agent Laboratory: Using LLM Agents as Research Assistants | 2025 |
| GPT Researcher | 2023 |
| Chain of Ideas: Revolutionizing Research Via Novel Idea Development with LLM Agents | 2024 |
| The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search | 2025 |
| Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents | 2024 |
| Reflection-Based Memory For Web navigation Agents | 2025 |
| Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems | 2024 |
| Recon-Act: A Self-Evolving Multi-Agent Browser-Use System via Web Reconnaissance, Tool Generation, and Task Execution | 2025 |
| WINELL: Wikipedia Never-Ending Updating with LLM Agents | 2025 |
| WebSeer: Training Deeper Search Agents through Reinforcement Learning with Self-Reflection | 2025 |
| GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior | 2025 |
| History-Aware Reasoning for GUI Agents | 2025 |
| MobileUse: A GUI Agent with Hierarchical Reflection for Autonomous Mobile Operation | 2025 |
| InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection | 2025 |
| Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks | 2025 |
| CycleResearcher: Improving Automated Research via Automated Review | 2024 |
| MLR-Copilot: Autonomous Machine Learning Research based on Large Language Model Agents | 2024 |
| Dolphin: Moving Towards Closed-loop Auto-research through Thinking, Practice, and Feedback | 2025 |
| DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments | 2025 |
Collective multi-agent reasoning
| Paper | Year |
| --- | --- |
| WebPilot: A Versatile and Autonomous Multi-Agent System for Web Task Execution with Strategic Exploration | 2024 |
| WINELL: Wikipedia Never-Ending Updating with LLM Agents | 2025 |
| Recon-Act: A Self-Evolving Multi-Agent Browser-Use System via Web Reconnaissance, Tool Generation, and Task Execution | 2025 |
| Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents | 2024 |
| Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems | 2024 |
| Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks | 2025 |
| Agentic Web: Weaving the Next Web with AI Agents | 2025 |
| CoLA: Collaborative Low-Rank Adaptation | 2025 |
| Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration | ACL 2024 |
| Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks | 2025 |
| Mobile-Agent-V: A Video-Guided Approach for Effortless and Efficient Operational Knowledge Injection in Mobile Automation | 2025 |
| MobileExperts: Orchestrating Tool-Capable Specialists for Mobile Automation | 2024 |
| Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use | 2025 |
| PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC | 2025 |
| AgentRxiv: Towards Collaborative Autonomous Research | 2025 |
| Accelerating Scientific Research Through a Multi-LLM Framework | 2025 |
| Large Language Models are Zero-Shot Reasoners | NeurIPS 2022 |
| Emergent autonomous scientific research capabilities of large language models | Nature 2023 |
| Toward a Team of AI-made Scientists for Scientific Discovery from Gene Expression Data | 2024 |
Benchmarks
⚙️ Core Mechanisms of Agentic Reasoning
Tool Use
Single-Turn Tool Use
| Paper | Year |
| --- | --- |
| ToolQA: A Dataset for LLM Question Answering with External Tools | NeurIPS 2023 |
| Gorilla: Large Language Model Connected with Massive APIs | 2023 |
| ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs | ICLR 2024 |
| MetaTool: A Benchmark for Controlling Special-purpose Large Language Models | ICLR 2024 |
| T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step | ACL 2024 |
| GTA: A Benchmark for General Tool Agents | NeurIPS 2024 |
| Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models | 2025 |
Multi-Turn Tool Use
| Paper | Year |
| --- | --- |
| ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases | 2023 |
| On the Tool Manipulation Capability of Open-source Large Language Models | 2023 |
| API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs | EMNLP 2023 |
| Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios | ACL 2024 |
| MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models | ICLR 2025 |
Search
Memory and Planning
Long-Horizon Episodic Memory
| Paper | Year |
| --- | --- |
| PerLTQA: A Persona-based Long-term Memory Benchmark for RAG | 2024 |
| ELITR-Bench: A Meeting Assistant Benchmark for Long-Context LLMs | 2024 |
| Multi-IF: A Benchmark for Multi-turn Instruction Following | 2024 |
| MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs | 2025 |
| TurnBench-MS: A Benchmark for Evaluating Multi-Turn, Multi-Step Reasoning in Large Language Models | 2025 |
| StoryBench: A Dynamic Benchmark for Evaluating Long-Term Memory with Multi Turns | 2025 |
| MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation | 2025 |
Multi-session Recall
| Paper | Year |
| --- | --- |
| Evaluating Very Long-Term Conversational Memory of LLM Agents | 2024 |
| MemSim: A Bayesian Simulator for Evaluating Memory of LLM-based Personal Assistants | 2024 |
| LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory | 2024 |
| REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation | 2025 |
| Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions | 2025 |
| Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents | 2026 |
| Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory | 2025 |
Planning and Feedback
| Paper | Year |
| --- | --- |
| ALFWorld: Aligning Text and Embodied Environments for Interactive Learning | ICLR 2021 |
| PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change | NeurIPS 2022 |
| ACPBench: Reasoning about Action, Change, and Planning | 2024 |
| Text2World: Benchmarking Large Language Models for Symbolic World Model Generation | ACL 2025 |
| REALM-Bench: A Benchmark for Evaluating Multi-Agent Systems on Real-world, Dynamic Planning and Scheduling Tasks | 2025 |
| TravelPlanner: A Benchmark for Real-World Planning with Language Agents | ICML 2024 |
| FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents | 2024 |
| UrbanPlanBench: A Comprehensive Urban Planning Benchmark for Evaluating Large Language Models | 2025 |
Multi-Agent System
Game-based reinforcement learning evaluation
| Paper | Year |
| --- | --- |
| MAgent: A Many-Agent Reinforcement Learning Platform for Artificial Collective Intelligence | AAAI 2018 |
| Pommerman: A Multi-Agent Playground | 2018 |
| The StarCraft Multi-Agent Challenge | NeurIPS 2019 |
| MineLand: Simulating Large-Scale Multi-Agent Interactions with Limited Multimodal Senses and Physical Needs | 2024 |
| TeamCraft: A Benchmark for Multi-Modal Multi-Agent Systems in Minecraft | 2024 |
| Scalable Evaluation of Multi-Agent Reinforcement Learning with Melting Pot | ICML 2021 |
| BenchMARL: Benchmarking Multi-Agent Reinforcement Learning | 2023 |
| Arena: A General Evaluation Platform and Building Toolkit for Multi-Agent Intelligence | AAAI 2020 |
Simulation-centric real-world assessment
| Paper | Year |
| --- | --- |
| SMARTS: Scalable Multi-Agent Reinforcement Learning Training School for Autonomous Driving | CoRL 2020 |
| Nocturne: a scalable driving benchmark for bringing multi-agent learning one step closer to the real world | NeurIPS 2022 |
| A Versatile Multi-Agent Reinforcement Learning Benchmark for Inventory Management | 2023 |
| IMP-MARL: a Suite of Environments for Infrastructure Management Planning with Multi-Agent Reinforcement Learning | NeurIPS 2023 |
| POGEMA: Partially Observable Grid Environment for Multiple Agents | arXiv 2022 |
| IntersectionZoo: Eco-driving for Benchmarking Multi-Agent Contextual Reinforcement Learning | NeurIPS 2024 |
| REALM-Bench: A Benchmark for Evaluating Multi-Agent Systems on Real-world, Dynamic Planning and Scheduling Tasks | 2025 |
Language, Communication, and Social Reasoning
| Paper | Year |
| --- | --- |
| LLM-Coordination: Evaluating and Analyzing Multi-agent Coordination Abilities in Large Language Models | 2023 |
| AvalonBench: Evaluating LLMs Playing the Game of Avalon | 2023 |
| Welfare Diplomacy: Benchmarking Language Model Cooperation | 2023 |
| MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration | EMNLP 2024 |
| BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems | 2024 |
| COMMA: A Benchmark for Inter-Agent Communication in Multi-Agent Systems | 2024 |
| IntellAgent: A Benchmark for Evaluating Conversational Agents in Realistic Scenarios | 2025 |
| MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents | 2025 |
🎯 Applications of Agentic Reasoning
Embodied Agents
| Paper | Year |
| --- | --- |
| Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks | 2025 |
| BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games | NeurIPS 2024 |
| ALFWorld: Aligning Text and Embodied Environments for Interactive Learning | ICLR 2021 |
| Understanding the Weakness of Large Language Model Agents within a Complex Android Environment | 2024 |
| MindAgent: Emergent Gaming Interaction | 2023 |
| Playing repeated games with Large Language Models | 2023 |
| OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments | NeurIPS 2024 |
Scientific Discovery Agents
| Paper | Year |
| --- | --- |
| DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents | NeurIPS 2024 |
| ScienceWorld: Is your Agent Smarter than a 5th Grader? | EMNLP 2022 |
| ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery | NeurIPS 2024 |
| The AI Scientist: Fully Automated Open-Ended Scientific Discovery | 2024 |
| LAB-Bench: Measuring Capabilities of Language Models for Biology Research | 2024 |
| MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation | 2023 |
Autonomous Research Agents
| Paper | Year |
| --- | --- |
| WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? | ICML 2024 |
| WorkArena++: Towards Agents that Act Like Employees | 2024 |
| OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation | 2024 |
| PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change | NeurIPS 2022 |
| FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents | 2024 |
| ACPBench: Reasoning about Action, Change, and Planning | 2024 |
| TRAIL: Trace Reasoning and Agentic Issue Localization | 2025 |
| CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization | NeurIPS 2023 |
| Agent-as-a-Judge: Evaluate Agents with Agents | 2024 |
| InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation | 2025 |
Medical and Clinical Agents
| Paper | Year |
| --- | --- |
| AgentClinic: a multimodal agent benchmark for clinical environments | NeurIPS 2024 |
| MedAgentBench: A Virtual EHR Environment to Benchmark Medical LLM Agents | NEJM AI 2025 |
| EHRAgent: Code Empowers Large Language Models for Complex Tabular Reasoning on Electronic Health Records | 2024 |
| MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning | 2023 |
| GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning | 2024 |
Web Agents
| Paper | Year |
| --- | --- |
| WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents | NeurIPS 2022 |
| WebArena: A Realistic Web Environment for Building Autonomous Agents | ICLR 2024 |
| OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments | NeurIPS 2024 |
| AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents | ACL 2024 |
| WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? | 2024 |
| VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks | NeurIPS 2024 |
| WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models | ACL 2024 |
| Mind2Web: Towards a Generalist Agent for the Web | NeurIPS 2023 |
| Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge | 2025 |
| WebCanvas: Benchmarking Web Agents in Online Canvas | NeurIPS 2024 |
| Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks | 2025 |
| VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding? | 2024 |
| WebLINX: Real-World Website Navigation with Multi-Turn Dialogue | CVPR 2024 |
| LASER: LLM Agent with State-Space Exploration for Web Navigation | NeurIPS 2023 |
| AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Agent for Automated Web Navigation | 2024 |
| OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web | 2024 |
| BEARCUBS: A benchmark for computer-using web agents | 2025 |
| BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents | 2025 |
| BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese | 2025 |
| Video-Browser: Towards Agentic Open-web Video Browsing | 2025 |
General Tool-Use Agents
| Paper | Year |
| --- | --- |
| GTA: A Benchmark for General Tool Agents | NeurIPS 2024 |
| NESTFUL: A Benchmark for Evaluating LLMs on Nested Sequences of API Calls | 2024 |
| Executable Code Actions Elicit Better LLM Agents | ICML 2024 |
| RestGPT: Connecting Large Language Models with Real-World RESTful APIs | 2023 |
| Search-o1: Agentic Search-Enhanced Large Reasoning Models | 2025 |
| Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning | 2025 |
| ActionReasoningBench: Reasoning about Actions with and without Ramification Constraints | 2024 |
| R-Judge: Benchmarking Safety-Critical Decision Making for LLM Agents | 2024 |
License
This repository is licensed under the MIT License.