ToolGen: Unified Tool Retrieval and Calling via Generation

by Renxi Wang, Xudong Han, Lei Ji, Shu Wang, Timothy Baldwin, Haonan Li

https://github.com/Reason-Wang/ToolGen

https://arxiv.org/abs/2410.03439

Abstract

ToolGen: Integrating Tool Knowledge into Language Models

Background:

  • Large language models (LLMs) have limitations in executing tasks without external tools
  • Traditional methods rely on inputting tool descriptions as context, but are constrained by context length and require separate retrieval mechanisms

Introducing ToolGen:

  • Paradigm shift: integrates tool knowledge directly into the LLM's parameters
  • Each tool represented as a unique token
  • Enables the model to generate tool calls and their arguments through ordinary next-token prediction
  • Seamless blend of language generation and tool invocation

Benefits:

  • No additional retrieval step required
  • Significantly enhances performance and scalability

Experimental Results:

  • Over 47,000 tools tested
  • Superior results in both tool retrieval and autonomous task completion
  • Sets the stage for a new era of AI agents that adapt to tools across diverse domains

Implications:

  • Transforms tool retrieval into a generative process
  • Enables end-to-end tool learning
  • Opens opportunities for integration with advanced techniques like chain-of-thought and reinforcement learning.

1 Introduction

Large Language Models (LLMs)

  • Impressive capabilities as interactive systems
  • Process external inputs, execute actions, and complete tasks autonomously
  • Demonstrated by various methods: Gravitas [2023], Qin et al. [2023], Yao et al. [2023], Shinn et al. [2023], Wu et al. [2024a], Liu et al. [2024]
  • Tool calling via APIs is a common and effective approach for interaction

Challenges with Existing Methods for Tool Retrieval and Execution

  • Struggle to scale efficiently as the number of tools grows
  • Combined method of tool retrieval and execution:
    • Retrieval models rely on small encoders that fail to fully capture semantics
    • Separating retrieval from execution introduces inefficiencies and potential misalignment
  • LLMs have limited intrinsic knowledge of tool-related functionalities, resulting in suboptimal performance

Introducing ToolGen: A Novel Framework for Integrating Real-World Tool Knowledge into LLMs

  • Expands LLM's vocabulary with tool-specific virtual tokens
  • Trains the model to generate these tokens within conversational context
  • Allows LLM to leverage pre-existing knowledge more effectively for retrieving and calling tools
  • Each tool represented as a unique virtual token in the LLM's vocabulary
  • Training stages: tool memorization, retrieval training, agent training
    • Tool memorization: associates each virtual tool token with documentation
    • Retrieval training: model learns to generate relevant tool tokens based on user queries
    • Agent training: model acts as an autonomous agent, generating plans and tools, determining appropriate parameters for task completion
  • Calling tools and receiving feedback from external environments allows efficient handling of user queries

Comparison with Traditional Paradigms

  • ToolGen merges retrieval and generation into a single, cohesive model, setting the stage for advanced techniques like chain-of-thought reasoning and reinforcement learning in real-world applications

Contributions

  • Novel framework, ToolGen, that integrates tool retrieval and execution into the LLM's generative process using virtual tokens
  • Three-stage training process for efficient and scalable tool retrieval and API calling within ToolGen
  • Experimental validation demonstrating comparable performance to current best tool retrieval methods with less cost and higher efficiency, surpassing traditional tool learning paradigms across large-scale tool repositories.

2 Related Work

2.1 Tool Retrieval

Tool Retrieval for LLM Agents

Importance of Tool Retrieval:

  • Essential for LLM agents in real-world task execution
  • Tools are usually represented by their documentation

Traditional Retrieval Methods:

  • Sparse retrieval (e.g., BM25)
  • Dense retrieval (e.g., DPR, ANCE)
  • Rely on large document indices and external modules
  • Lead to inefficiencies and difficulty in optimizing in an end-to-end agent framework

Alternative Retrieval Methods:

  • Chen et al. (2024b): Rewrite queries, extract intent for unsupervised settings
  • Xu et al. (2024): Iteratively refine queries based on tool feedback
  • Generative Retrieval: Models directly generate relevant document identifiers

ToolGen Approach:

  • Represent each tool as a unique token
  • Tool retrieval and calling framed as generation task
  • Simplifies retrieval and integrates naturally with other LLM and agent capabilities (e.g., chain-of-thought reasoning, ReAct)
  • Consolidates retrieval and task execution into a single LLM agent
  • Reduces latency and computational overhead for more efficient and effective task completion

2.2 LLM-Agents with Tool Calling

LLMs and Tools:

  • LLMs have shown potential in mastering various tasks (Chen et al., 2023a; Zeng et al., 2023; Yin et al., 2024; Wang et al., 2024)
  • Limited set of actions in existing works (Schick et al., 2023; Patil et al., 2023; Qin et al., 2023)
  • Tool selection challenge in complex environments (ToolBench, Qin et al., 2023)
  • Pipelined approaches face issues: error propagation and LLMs' inability to fully understand tools (Patil et al., 2023; Qin et al., 2023)
  • Representing actions as tokens and converting them into generative tasks (RT2, Brohan et al., 2023; Self-RAG, Asai et al., 2023)
  • ToolkenGPT: introduces tool-specific tokens to trigger tool usage (Hao et al., 2023)

Comparison with ToolkenGPT:

  • Focus on real-world tools for complex tasks vs. simpler tools (YouTube channel search vs. math functions)
  • Full-parameter fine-tuning vs. few-shot prompting
  • Larger tool set (47,000 tools vs. 13–300)

Other Approaches:

  • ToolPlanner and AutoACT use reinforcement learning or multi-agent systems for enhancing tool learning/task completion (Wu et al., 2024b; Qiao et al., 2024)
  • These are not directly compared with our method because they rely on additional feedback mechanisms; integrating them with ToolGen is left for future work.

3 ToolGen

Introduction: introduces the notation used throughout the paper.

ToolGen Methodology:

  • Tool Virtualization: Map tools into virtual tokens.
  • Tool Memorization: Predict tool tokens based on their documentation.
  • Retrieval Training: Learn to predict tool tokens from queries.
  • End-to-end Agent Tuning: Fine-tune the retriever model using pipeline data (trajectories), resulting in the ToolGen Agent model.

Inference Approach: Detailed later in the paper.

3.1 Preliminaries

Tool Learning with Large Tool Sets

  • Goal: Resolve a user query q using tools from a large tool set D = {d_1, d_2, ..., d_N}
  • Retriever R: retrieves k relevant tools from D, denoted D_R^k = {d_{r_1}, d_{r_2}, ..., d_{r_k}}
  • Prompt: the concatenation of q and D_R^k, denoted Prompt = [q, D_R^k]

Iterative Four-Stage Paradigm for Task Completion

  1. Generate a plan (p_i): reason in natural language about the next step toward resolving the query
  2. Select a tool (d_{s_i}): choose an appropriate tool from the retrieved set D_R^k
  3. Determine tool parameters (c_i): set up the necessary configurations or inputs for the selected tool
  4. Collect feedback (f_i): interact with the tool to obtain information on its output and potential refinements
  5. Repeat steps 1-4 until the task is completed, generating the final answer a
  6. Trajectory of Actions: Traj = [Prompt, (p_1, d_{s_1}, c_1, f_1), ..., (p_t, d_{s_t}, c_t, f_t), a]
  7. Iterative Approach: Allows for dynamic adjustment and refinement at each step, improving task completion performance.

3.2 Tool Virtualization

Atomic Indexing in ToolGen

Virtualization of Tools:

  • Each tool is mapped to a unique new token through atomic indexing
  • Expands the Language Model's (LLM) vocabulary to include the unique tool token
  • Initializes the embedding for each tool token as the average embedding of its corresponding tool name, ensuring a semantically meaningful starting point

Token Set Definition:

  • The token set is defined as T = {Index(d) | ∀ d ∈ D}, where:
    • Index is the function mapping tools to tokens
    • D is the set of all available tools
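The following is a minimal sketch of how tool virtualization with atomic indexing could be implemented using Hugging Face transformers, assuming one new token per tool and mean-of-name-subword embedding initialization as described above; the model identifier, tool names, and variable names are illustrative and not taken from the ToolGen repository.

```python
# Hedged sketch: atomic indexing with Hugging Face transformers (illustrative names).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # base model reported in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Toy tool set; real ToolGen uses ~47k ToolBench tools.
tools = ["<<Youtube Hub&&Get Video Details>>", "<<Weather&&Current Forecast>>"]

# 1. Add one virtual token per tool and grow the embedding matrices accordingly.
tokenizer.add_tokens(tools, special_tokens=False)
model.resize_token_embeddings(len(tokenizer))

# 2. Initialize each tool token's input embedding as the average embedding of the
#    subword tokens of its tool name, giving a semantically meaningful starting point.
emb = model.get_input_embeddings().weight
with torch.no_grad():
    for tool in tools:
        tool_id = tokenizer.convert_tokens_to_ids(tool)
        name_ids = tokenizer(tool.strip("<>"), add_special_tokens=False)["input_ids"]
        emb[tool_id] = emb[name_ids].mean(dim=0)
        # If the model does not tie input/output embeddings, the new LM-head rows
        # may be initialized the same way via model.get_output_embeddings().weight.
```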

Comparison with Other Indexing Methods:

  • Atomic indexing is more efficient and can mitigate hallucination compared to other indexing methods, such as:
    • Semantic mappings
    • Numeric mappings
  • These are discussed in sections 4.3 and 5.4

3.3 Tool Memorization

Tool Memorization in LLMs

  • After assigning tokens to tools, the LLM initially has no knowledge of them.
  • To remedy this, we fine-tune the LLM with tool descriptions as inputs and the corresponding tool tokens as outputs (tool memorization).
  • Loss function: L_tool = ∑_{d∈D} −log p_θ(Index(d) | d_doc), where θ denotes the LLM parameters and d_doc is the tool's documentation.
  • This process equips the LLM with basic knowledge of the tools and their associated actions.
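As a concrete illustration, a tool-memorization training example could be built as below, masking the documentation tokens with -100 so that the loss reduces to −log p_θ(Index(d) | d_doc) for the tool token; the helper and field names are assumptions, not the repository's API.

```python
# Hedged sketch: one tool-memorization training example (doc -> tool token).
import torch

def build_memorization_example(tokenizer, doc: str, tool_token: str):
    prompt_ids = tokenizer(doc, add_special_tokens=False)["input_ids"]
    target_ids = [tokenizer.convert_tokens_to_ids(tool_token)]
    input_ids = prompt_ids + target_ids
    labels = [-100] * len(prompt_ids) + target_ids  # loss only on the tool token
    return {"input_ids": torch.tensor([input_ids]), "labels": torch.tensor([labels])}

# batch = build_memorization_example(tokenizer,
#     "Returns title, views, and likes for a YouTube video.",
#     "<<Youtube Hub&&Get Video Details>>")
# loss = model(**batch).loss  # equals -log p_theta(Index(d) | d_doc) for this sample
```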

3.4 Retrieval Training

Training LLMs to Link User Queries and Tool Tokens:

We fine-tune LLMs by associating user queries with relevant tool tokens:

  • Inputs are user queries
  • Outputs are tool tokens

The retrieval loss used for fine-tuning is: L_retrieval = ∑_{q∈Q} ∑_{d∈D_q} −log p_{θ'}(Index(d) | q)

Where:

  • Q is the set of user queries
  • D_q is the set of tools relevant to query q
  • p_{θ'}(Index(d) | q) is the probability of generating the token for tool d given query q, under the LLM parameters θ' obtained after tool memorization

This process produces ToolGen Retriever, capable of generating appropriate tool tokens for a user's query.
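A hedged sketch of how annotated (query, relevant tools) pairs could be flattened into retrieval training samples, one (query → tool token) target per relevant tool, mirroring the double sum in the retrieval loss; the field names are assumptions about the annotation format.

```python
# Hedged sketch: flatten (query, relevant tools) annotations into training pairs.
def build_retrieval_examples(annotations):
    examples = []
    for item in annotations:
        query = item["query"]                      # user query q
        for tool_token in item["relevant_tools"]:  # Index(d) for each d in D_q
            examples.append({"input": query, "output": tool_token})
    return examples

# build_retrieval_examples([{"query": "Get details for this YouTube video",
#                            "relevant_tools": ["<<Youtube Hub&&Get Video Details>>"]}])
```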

3.5 End-to-End Agent-Tuning

LLM Capabilities and Fine-Tuning Process

During retrieval training, the LLM learns to generate tool tokens from queries. It is subsequently fine-tuned on agent task-completion trajectories. The tuning strategy resembles Agent-Flan (Chen et al., 2024c), but instead of generating Thought, Action, and Arguments together in a single ReAct-style round, our process separates them. The pipeline is iterative (a minimal sketch of the loop follows the list below):

  1. The LLM first generates a Thought.
  2. This Thought triggers the generation of a corresponding Action token.
  3. Using this Action token, the model retrieves tool documentation.
  4. The LLM then generates necessary Arguments based on the documentation.
  5. This iterative process continues until the model produces a "finish" token or the maximum number of turns is reached.
  6. The generated trajectory is represented as Traj = [q, (p_1, Index(d_{s_1}), c_1, f_1), ..., (p_t, Index(d_{s_t}), c_t, f_t), a]. In this structure, the prompt no longer needs to include a set of retrieved relevant tools.
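A minimal sketch of the iterative loop described above, assuming hypothetical generate, get_tool_doc, and call_tool callables and a <<Finish>> action token; this outlines the control flow rather than the repository's implementation.

```python
# Hedged sketch of the ToolGen agent loop: Thought -> Action token -> docs -> Arguments -> feedback.
def run_agent(generate, get_tool_doc, call_tool, query, max_turns=16):
    history = [f"Task: {query}"]
    for _ in range(max_turns):
        thought = generate(history)                                    # 1. Thought
        action = generate(history + [thought], tool_tokens_only=True)  # 2. Action (tool token)
        if action == "<<Finish>>":                                     # assumed finish token
            return generate(history + [thought, action])               # final answer a
        doc = get_tool_doc(action)                                     # 3. fetch tool documentation
        args = generate(history + [thought, action, doc])              # 4. Arguments
        feedback = call_tool(action, args)                             # 5. environment feedback
        history += [thought, action, doc, args, feedback]
    return None  # turn limit reached without a finish action
```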

3.6 Inference

Constrained Beam Search for Action Token Restriction

To prevent the model from generating action tokens outside the predefined tool set during inference, we use constrained beam search, which limits the output tokens to the tool-token set. This constraint is applied both to tool retrieval, where the model selects tools based on queries, and to the end-to-end agent system, reducing hallucination during action generation. Further analysis appears in Section 5.4, and implementation details are given in Appendix C.
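With atomic indexing, every tool is a single token, so the constraint can be expressed with the prefix_allowed_tokens_fn hook of transformers' generate; the setup below is an illustrative sketch, not the paper's exact implementation.

```python
# Hedged sketch: restrict generation to the tool-token set during retrieval.
tool_token_ids = [tokenizer.convert_tokens_to_ids(t) for t in tools]

def allow_only_tool_tokens(batch_id, input_ids):
    # Called at every decoding step; returning only tool-token ids masks all other logits.
    return tool_token_ids

outputs = model.generate(
    **tokenizer("I need the view count of this YouTube video.", return_tensors="pt"),
    max_new_tokens=1,                 # one atomic tool token
    num_beams=5,
    num_return_sequences=5,           # top-5 retrieved tools
    prefix_allowed_tokens_fn=allow_only_tool_tokens,
)
```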


4 Tool Retrieval Evaluation

4.1 Experimental Setup

ToolGen: Unified Tool Retrieval and Calling via Generation

Foundation Model:

  • Pretrained Llama-3-8B (Dubey et al., 2024) with vocabulary size of 128,256

Expansion:

  • Atomic indexing expands the vocabulary by an additional 46,985 tool tokens following the tool virtualization process
  • Final vocabulary size of 175,241

Fine-tuning:

  • Llama-3 chat template with cosine learning rate scheduler
  • Maximum learning rate of 4×10⁻⁵
  • Trained using Deepspeed ZeRO 3 (Rajbhandari et al., 2020) across 4×A100 GPUs
  • 8 epochs for tool memorization, 1 epoch for retrieval training

Dataset:

  • ToolBench: a real-world tool benchmark with 16k tools comprising 47k unique APIs
  • Each API documented with name, description, and parameters for calling the API

Baselines:

  • BM25: classical unsupervised retrieval method based on TF-IDF
  • EmbSim: sentence embeddings generated using OpenAI's sentence embedding model
  • Re-Invoke: unsupervised retrieval method with query rewriting and document expansion
  • IterFeedback: BERT-based retriever with iterative feedback for up to 10 rounds
  • ToolRetriever: BERT-based retriever trained via contrastive learning

Settings:

  • In-Domain Retrieval: retrieval search space restricted to tools within the same domain
  • Multi-Domain Retrieval: search space expanded to include tools from all three domains

Metrics:

  • Normalized Discounted Cumulative Gain (NDCG), a widely used metric for ranking tasks, including tool retrieval (a reference computation sketch follows below).
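For reference, a standard NDCG@k computation looks like the sketch below (this is the generic definition, not code from the paper): DCG discounts graded relevance by the log of the rank and is normalized by the ideal DCG.

```python
# Generic NDCG@k computation for a single query's ranked retrieval results.
import math

def ndcg_at_k(relevances, k):
    """relevances: graded relevance of retrieved tools, in ranked order."""
    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# ndcg_at_k([1, 0, 1, 0, 0], k=5)  # NDCG@5 for one query
```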

4.2 Results

Tool Retrieval Results (Table 1)

  • All trained models significantly outperform untrained baselines (BM25, EmbSim, Re-Invoke) across all metrics
  • Benefit of training on tool retrieval data demonstrated
  • Proposed ToolGen model: consistently achieves best performance in both settings

In-Domain Setting:

  • ToolGen delivers competitive results, comparable to IterFeedback system
  • Outperforms ToolRetriever by significant margin in all metrics
  • Surpasses IterFeedback in several cases: NDCG@5 for domain I1 and NDCG@1,@3,@5 for I2

Multi-Domain Setting:

  • ToolGen remains robust, outperforming ToolRetriever and other baselines
  • Single model capable of handling complex real-world retrieval tasks with less defined domain boundaries.

4.3 Indexing Method Comparison

ToolGen vs Alternative Indexing Approaches

Alternative Indexing Methods:

  • Numeric: Map each tool to a unique number, providing no semantic information but distinct identifiers
  • Hierarchical: Cluster tools into non-overlapping groups and recursively partition, forming a hierarchical structure
  • Semantic: Map each tool to its name, using semantic content of names to guide the Language Model

Analysis of Subtokens:

  • Atomic indexing: Each tool represented by a single token
  • Numeric indexing: Each tool is encoded as a fixed-length sequence of digit tokens, with the length determined by the total number of tools
  • Semantic indexing, Hierarchical indexing: Produce variable number of subtokens, with more outliers in semantic indexing

Evaluation of Indexing Methods:

  • Semantic indexing demonstrates best retrieval performance across various metrics and scenarios
  • Atomic indexing achieves better end-to-end results

Ablation Study:

  • Removing retrieval training, tool memorization, and constrained beam search decreases ToolGen's performance

4.4 Ablation

Ablation Study for ToolGen Performance:

  • Retrieval training: the crucial factor for tool retrieval performance, since it directly aligns query inputs with tokenized tool outputs; removing it causes a large performance drop
  • Tool memorization: only a minor contributor to retrieval performance, but it helps generalization (discussed further in Appendix F)
  • Constrained beam search: not a major contributor to the retrieval task, but it prevents hallucinated tool tokens and is important for end-to-end agent tasks.

5 End-to-End Evaluation

5.1 Experimental Setup

ToolGen Framework Adaptations

  • Remove explicit tool selection from ToolBench data
  • Fine-tune retrieval model using reformatted data

Baselines

  • GPT-3.5: uses GPT-3.5 as the tool-calling agent, following the StableToolBench setup
  • ToolLlama-2: fine-tuned Llama-2 on ToolBench data
  • ToolLlama-3: Llama-3 (the same base model as ToolGen) fine-tuned on the ToolBench dataset; named ToolLlama-3 to distinguish it from ToolLlama-2

Settings

  • Ground truth tools defined based on ChatGPT selections for evaluation purposes
  • Retrieval-based setting used for end-to-end experiments: ToolRetriever retrieves relevant tools for baselines, while ToolGen generates tool tokens directly
  • Both models finetuned using cosine scheduler with maximum learning rate set to 4×10^-5
  • Context length truncated to 6,144 and total batch size set to 512
  • Flash-Attention and Deepspeed ZeRO 3 used for memory savings
  • ToolGen separates Thought, Action, and Parameters steps, while ToolLlama completes them in a single round
  • Maximum of 16 turns allowed for ToolGen, compared to ToolLlama's 6-turn limit
  • Retry mechanism introduced to prevent early termination: models generate responses with higher temperature when "give up" or "I’m sorry" detected

Metrics

  • Solvable Pass Rate (SoPR) and Solvable Win Rate (SoWR) used for end-to-end evaluation
  • Results reported under two settings: Ground Truth Tools (G.T.) and Retriever (R.)

5.2 Results

End-to-End Evaluation Performance (Table 4)

  • ToolGen achieves best average SoPR score of 54.19 and highest SoWR of 49.70 using Ground Truth Tools (G.T.)
  • In Retriever setting, ToolGen maintains lead with an average SoPR of 53.28 and SoWR of 51.51
  • ToolLlama shows competitive performance, surpassing ToolGen on some individual instances

Ablation Study (Appendix G)

  • Table 5 presents end-to-end evaluation for different indexing methods: Numeric, Hierarchical, Semantic, Atomic
  • In Numeric setting: SoPR = 34.76, SoWR = 28.79
  • In Hierarchical setting: SoPR = 50.20, SoWR = 29.51
  • In Semantic setting: SoPR = 58.79, SoWR = 47.88
  • In Atomic setting: SoPR = 58.08, SoWR = 47.58

5.3 Indexing Method Comparison

Tool Retrieval Evaluation: End-to-End

Comparison of Indexing Methods for Agent Task:

  • Atomic method: achieves best performance among four indexing methods in end-to-end agent task (Table 5)
  • Hallucination rates: Atomic has lower hallucinations compared to other methods (Figure 4)
  • Constrained decoding: removed for this analysis, allowing free generation of Thought, Action, and Parameters (see Section 5.4)

Impact of Indexing Methods on Performance:

  • Atomic method: highest performance due to lower hallucination rates
  • Other methods: higher hallucination rates lead to poorer performance (discussed in Section 5.4)

Impact of Constrained Decoding:

  • With constrained decoding, ToolGen does not generate nonexistent tools
  • Without constraint, ToolGen generates 7% non-tool tokens during Action generation stage with atomic indexing
  • Semantic indexing results in even more nonexistent tool tokens for ToolGen
  • ToolLlama and GPT-3.5 still generate hallucinations despite being provided with five ground truth tools in the prompt (Figure 4)
  • Without any tools specified, ToolLlama generates over 50% nonexistent tool names.

5.4 Hallucination

Evaluation of Model Hallucination in Tool Generation

ToolGen Evaluation:

  • The query is fed to ToolGen in its agent paradigm (Thought, Tool, Parameters)
  • We report the proportion of generated tools that do not exist in the dataset, out of all tool-generation actions

Actions Decoding without Beam Search Constraints:

  • Tested on ToolLlama and GPT-3.5 models
  • Input query along with 5 ground truth tools

Hallucination Rates:

  • Figure 4 shows hallucination rates of nonexistent tools for different models
  • ToolGen, with constrained decoding, does not hallucinate at all
  • ToolLlama and GPT-3.5 may still generate nonexistent tool names despite being provided with 5 ground truth tools
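The hallucination rate reported in Figure 4 can be understood as the sketch below: the fraction of generated tool/action names that do not appear in the tool set (variable names are illustrative).

```python
# Hedged sketch: hallucination rate = share of generated tool names outside the tool set.
def hallucination_rate(generated_actions, valid_tool_names):
    valid = set(valid_tool_names)
    misses = sum(1 for name in generated_actions if name not in valid)
    return misses / len(generated_actions) if generated_actions else 0.0
```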

6 Conclusions

ToolGen Framework: Unifies tool retrieval and execution within large language models (LLMs) by embedding tool-specific virtual tokens into the model's vocabulary, transforming tool interaction into a generative task. A three-stage training process enables efficient tool handling in real-world scenarios, setting a new standard for scalable AI agents that work with extensive tool repositories. Future directions include chain-of-thought reasoning, reinforcement learning, and ReAct, further enhancing LLMs' autonomy and versatility in practical applications.

Appendix A. Real Tool Example

Tool Example (Figure 5) showcases a real tool that consists of various APIs.

  • Tool Details: "tool_name" - name, "tool_description" - description of the tool's functionality
  • API Fields: "name" - API name, "description" - API information, "method" - HTTP calling method, "required_parameters" - essential parameters for an API call, "optional_parameters" - optional additional parameters

Figure 5: A real tool example featuring one API with simplified fields.

Appendix B. Tool Virtualization Implementation

ToolGen Indexing Methods

Atomic indexing:

  • Uses a single token to represent a tool (tool name + API name)
  • Does not hallucinate nonexistent tools
  • Example: <<Youtube Hub&&Get Video Details>>

Semantic indexing:

  • Maps each tool to the ToolBench name (combination of tool name and API name)
  • Tokenizes the name into multiple tokens for semantic meaning
  • Example: get_video_details_for_youtube_hub

Numeric indexing:

  • Assigns unique numbers to each tool
  • Represents each number as a spaced sequence of digit tokens
  • Example: 0 0 0 1 2 8

Hierarchical indexing:

  • Maps each tool to a hierarchical, clustered representation
  • Clusters tools into 10 categories per iteration
  • Assigns numbers from root to leaf node in the clustering tree
  • Example: 0 1 2 2 3 3 3
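A hedged sketch of how such a hierarchical index could be produced: recursively cluster tool embeddings into 10 groups per level with k-means and record the cluster id at each level; the embedding source and leaf size are assumptions.

```python
# Hedged sketch: hierarchical indexing by recursive k-means clustering of tool embeddings.
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_index(embeddings, ids, prefix=(), branching=10, leaf_size=10):
    if len(ids) <= leaf_size:
        # Leaf: distinguish the remaining tools by their position within the leaf.
        return {tool_id: prefix + (i,) for i, tool_id in enumerate(ids)}
    labels = KMeans(n_clusters=branching, n_init="auto").fit_predict(embeddings[ids])
    index = {}
    for c in range(branching):
        members = ids[labels == c]
        index.update(hierarchical_index(embeddings, members, prefix + (c,), branching, leaf_size))
    return index

# embeddings: (num_tools, dim) numpy array; the index value for a tool, e.g. (0, 1, 2, 2, 3),
# becomes its spaced digit sequence "0 1 2 2 3".
# index = hierarchical_index(embeddings, np.arange(len(embeddings)))
```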

Appendix C Constrained Beam Search Detail

ToolGen: Unified Tool Retrieval and Calling via Generation

  • During retrieval and end-to-end agent tasks, constrained beam search is used to limit generated actions to valid tool tokens
  • A disjunctive trie is built in which each node represents a token ID
    • The children of a node are all feasible IDs that can follow the current ID
    • The possible next token IDs are therefore determined by the IDs generated so far
  • At each beam search step, the logits of infeasible tokens are masked out, forcing only feasible IDs to be sampled or searched (a minimal trie sketch is given below)

Retrieval:

  • Constraint directly applied during generation

End-to-End Agent Tasks:

  • Decompose inference step into three conversational turns
  • Detect when ToolGen needs to generate action, then apply constraint
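A minimal sketch of the disjunctive trie and the feasible-id lookup described above, using plain Python dicts; this is illustrative, not the repository's implementation.

```python
# Hedged sketch: disjunctive trie over tool token-id sequences for constrained decoding.
def build_trie(tool_token_id_sequences):
    root = {}
    for ids in tool_token_id_sequences:
        node = root
        for token_id in ids:
            node = node.setdefault(token_id, {})
    return root

def feasible_next_ids(trie, generated_ids):
    node = trie
    for token_id in generated_ids:
        node = node.get(token_id)
        if node is None:
            return []           # prefix is not in the tool set
    return list(node.keys())    # only these ids are left unmasked at this step
```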

Dataset Statistics:

  • Table 6 shows dataset statistics for tool memorization and retrieval training
    • I1: 49,936 samples
    • I2: 194,086 samples
    • I3: 222,783 samples
    • All: 72,833 tools used, resulting in slightly more than the number of samples

ToolGen Inference Example:

  • Figure 6 shows an example of ToolGen inference when no candidate tools are provided in advance
  • The system prompt is given first, followed by the user's task query
  • ToolGen generates a thought, which prompts the model to generate an action (a tool token)
  • After the action is generated, the corresponding tool documentation is provided so that the model can generate the tool's input parameters.

Appendix D Retry Mechanism

Techniques for Reproducibility:

  • Fixing random seeds
  • Setting temperatures to zero for decoding

Problems with Adopted Techniques:

  • Models give up early, not trying enough
  • Negative impact on end-to-end evaluation

Mitigating Negative Impact:

  • Retry mechanism: regenerate the turn at a higher temperature whenever the model tries to give up or apologizes (a minimal sketch follows below)
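A minimal sketch of such a retry mechanism, assuming a hypothetical generate callable and an illustrative phrase list.

```python
# Hedged sketch: regenerate a turn at higher temperature when the model gives up.
GIVE_UP_PHRASES = ("give up", "I'm sorry")

def generate_turn_with_retry(generate, history, max_retries=3):
    response = generate(history, temperature=0.0)      # deterministic first attempt
    for _ in range(max_retries):
        if not any(p in response for p in GIVE_UP_PHRASES):
            break
        response = generate(history, temperature=0.7)  # retry with more randomness
    return response
```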

Constrained Beam Search Algorithm:

  1. Inputs: set of tool token id sequences {Ids1, Ids2, ..., Idsn}; initial InputIds; beam width k; language model LM
  2. Output: searched beams
  3. Build a disjunctive trie T from the tool token id sequences
  4. Initialize Beams ← [(InputIds, root of T)]
  5. while Beams is not empty do
    • NewBeams ← []
    • for each (beam, node) in Beams do
      • if beam ends with eos_token_id then
        • Output beam and remove it from Beams
        • Continue
      • end if
      • scores ← LM(beam)
      • feasible_ids ← children of node in T
      • Mask out ids not in feasible_ids from scores
    • end for
    • TopIds, Groups ← top-k ids (and the beams they extend) across all masked scores
    • for each (id, group) in zip(TopIds, Groups) do
      • NewBeam ← Beams[group].beam + [id]
      • NewNode ← child of Beams[group].node for id
      • Append (NewBeam, NewNode) to NewBeams
    • end for
    • Beams ← NewBeams
  6. end while

Appendix E Ablation

Table 7: Ablation Results for ToolGen End-to-End Evaluation

Model Performance on Unseen Instructions:

  • ToolGen Agent shows slightly better performance without tool memorization or retrieval training

Model Performance on Unseen Tools and Categories:

  • Training without the first two stages causes a drop in both SoPR and SoWR
  • This demonstrates that the first two training stages play a role in the generalization capability of ToolGen
  • Retrieval training is more significant compared to tool memorization

Ablation Results for Model Performance:

| Model | SoPR | SoWR | I1-Inst. | I2-Inst. | I3-Inst. | Avg. | I1-Tool. | I1-Cat. | I2 Cat. | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| ToolGen | 54.60 | 52.36 | 43.44 | 51.82 | 50.31 | 51.82 | 54.72 | 26.23 | 47.28 | 42.91 |
| w/o retrieval training | 56.95 | 46.70 | 50.27 | 52.42 | 49.69 | 51.61 | 34.43 | 47.27 | - | - |
| w/o memorization | 56.03 | 47.96 | 57.38 | 53.69 | 49.08 | 53.17 | 34.43 | 49.70 | - | - |
| I1-Tool. | 56.54 | 49.46 | - | - | - | 52.66 | - | - | 51.96 | 50.83 |
| I1-Cat. | 49.47 | 40.31 | - | - | - | 42.84 | - | 39.87 | - | 40.23 |
| I2 Cat. | 58.86 | 46.19 | - | - | - | 51.70 | - | 42.74 | - | 49.46 |
| I1-Inst. (ToolGen) | 50.31 | 54.72 | 50.31 | 54.72 | 43.44 | 51.82 | 39.87 | 38.56 | 42.74 | 43.43 |
| I2-Inst. (ToolGen) | 40.51 | 39.87 | 40.51 | 39.87 | 51.82 | 46.19 | - | 36.29 | 37.90 | 38.28 |
| I3-Inst. (ToolGen) | 37.90 | 39.53 | 37.90 | 39.53 | 47.28 | 39.32 | - | 36.13 | 36.66 | 36.83 |

Appendix F Generalization

ToolGen Agent Performance on Unseen Tools

Generalization Results:

  • Table 8 shows end-to-end evaluation of models on unseen tools
  • ToolGen underperforms ToolLlama, indicating worse generalization capability in completing full tasks
  • This generalization problem is prevalent in generative retrieval and is beyond the scope of this paper

ToolGen Performance:

  • In Table 8, ToolGen:
    • SoPR across the unseen-tool settings: 52.32, 40.46, 39.65, 47.67, 39.24, 38.56, 37.90, 39.30
    • SoWR across the unseen-tool settings: 56.54, 49.46, 51.96, 52.66, 40.51, 39.87, 37.90, 39.53
  • Compared to:
    • GPT-3.5 (SoPR: 58.90, SoWR: 57.59) and (Retrieval: 57.59, 61.76)
    • ToolLlama (SoPR: 57.38, SoWR: 57.70) and (Retrieval: 45.43, 61.76)

Appendix G Adapt ToolBench Data to ToolGen

ToolGen Data Adaptation from ToolBench:

  • Tool documentation is used as input and tool tokens as output for tool memorization training (documents -> tool tokens)
  • Annotated query data is used for retrieval training, with relevant tools converted to their virtual tool tokens as targets
  • Convert interaction trajectories for end-to-end agent-tuning:
    • Remove system prompted tools
    • Replace tool names with virtual tool tokens
    • Decompose each ReAct round into three conversational turns for the model: Thought, Action (tool token), and Tool Parameters, with an action prompt and the tool documentation supplied in between
  • Table 6 in Appendix C shows the number of samples for each dataset.

Data Conversion for ToolGen:

  • Memorization Training: user role as input, assistant role as output
  • Retrieval Training: queries as input, virtual tokens representing tools as output
  • End-to-End Agent-Tuning:
    • Remove system prompted tools from trajectories
    • Replace tool names with virtual tool tokens
    • Decompose ReAct into three conversational turns for model training.
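An illustrative sketch of decomposing one ReAct step into the separate turns described above (Thought, then the Action as a virtual tool token, then the tool parameters); the record field names are assumptions about the ToolBench trajectory format.

```python
# Hedged sketch: convert one ReAct step into ToolGen-style conversational turns.
def react_step_to_turns(step, tool_to_token):
    return [
        {"role": "assistant", "content": step["thought"]},                  # Thought
        {"role": "user", "content": "Generate the action."},                # action prompt
        {"role": "assistant", "content": tool_to_token[step["tool_name"]]}, # Action (tool token)
        {"role": "user", "content": step["tool_documentation"]},            # docs as feedback
        {"role": "assistant", "content": step["arguments"]},                # Tool Parameters
    ]
```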

Adapted Data Examples:

  • Figure 7 shows examples of tool memorization and retrieval training data
  • Figure 8 shows an example of end-to-end agent-tuning data.