How to build great tools for AI agents: A field guide


We discovered through our anonymised tool execution logs that specific Firecrawl actions were failing at an exceptionally high rate. We tested the function extensively and it worked fine in isolation; when an agent used it, however, the numbers told a different story. The agent was correctly specifying the scrape options (which define the desired output formats for the scraped content from each page) as json, but the Firecrawl API requires that the jsonOptions parameter also be passed whenever that format is selected. The agent could never figure this out on the first try.
The solution? A better description for the request schema, which now includes: "If format is json, jsonOptions is required." The result? No more agent failures!
It showed us how much value we can unlock by optimising the descriptions, names, and request schemas of our tools; we could substantially improve tool performance for our users. So we studied everything—documentation from top providers, open-source agents like Copilot, and the system prompts of leading tools. We compiled all the learnings and optimised the tools across our platform.
Here's what we learnt.
TL;DR
Consistent Naming
Stick to one style; snake_case preferred.
Inconsistent names confuse LLMs and lower invocation accuracy.
Narrow Scope (One Concern per Tool)
Each tool = one atomic action.
Split large “do-everything” tools into smaller, precise ones.
Write Crisp Descriptions
Template: “Tool to <do X>. Use when <Y happens>.”
State critical constraints up-front (“after confirming …”).
List impactful limits only (e.g., “max 750 lines”).
Keep it short—under 1024 characters.
Design Parameters for Clarity
Document hidden rules (“At least one of agent_id | user_id | run_id | app_id is required”).
Strong typing everywhere; use enums for finite sets.
Fewer top-level params → fewer model mistakes.
Declare formats ("format": "email", dates, etc.).
Embed tiny examples right inside the description.
Continuous Improvement
Monitor anonymised production errors to identify friction points.
Add automated tests & evals before every change—know if you helped or hurt.
Iterate relentlessly; tooling is never “done.”

Fundamental Principles for Building Great Agent Tools
Building effective agent tools isn't a matter of guesswork or intuition. In our research, which involved studying resources from OpenAI, Google Gemini, and Anthropic Claude and observing agents such as Cursor and Replit, we've distilled a set of clear, foundational principles. Here are two core principles worth internalising deeply.
i. Consistent Formatting Matters
When designing tool collections, a simple yet profoundly impactful practice is using consistent naming conventions. At our company, we've standardised on snake_case for naming all tools, because inconsistency can easily confuse models, leading them to treat one tool as superior or inferior to another simply because of its casing. This can result in unreliable outcomes and increased debugging loops.
ii. Narrow Scope Principle: One Concern Per Tool
Another critical guiding rule we've learned is to rigorously uphold a narrow scope for every agent tool. Tools should ideally perform a single, precise, and atomic operation. Initially, aiming for minimal complexity might feel counterintuitive—after all, a smaller scope can mean more tools overall. However, we have repeatedly observed in the field, and among leading agents like Cursor and Replit, that atomic, single-purpose tools significantly decrease ambiguity, streamline debugging, and enhance long-term maintainability.
Consider an overly broad tool like manage_files, which could copy, move, delete, or rename files based on conditional arguments. Instead of simplifying things, this breadth makes the tool more error-prone and harder for the AI to invoke accurately. Splitting it into clearly defined tools such as copy_file, move_file, and delete_file instead results in precise, low-friction interactions.
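To make the contrast concrete, here is a minimal sketch of what the split might look like as function-calling schemas. The tool names and parameters are illustrative, not taken from any particular platform:

# Illustrative sketch: two narrowly scoped tools instead of one manage_files tool.
# Each definition follows the common JSON-Schema-based function-calling shape.
copy_file_tool = {
    "name": "copy_file",
    "description": "Tool to copy a file to a new location. Use when the user asks to duplicate a file.",
    "parameters": {
        "type": "object",
        "properties": {
            "source_path": {"type": "string", "description": "Path of the file to copy."},
            "destination_path": {"type": "string", "description": "Path to copy the file to."},
        },
        "required": ["source_path", "destination_path"],
    },
}

delete_file_tool = {
    "name": "delete_file",
    "description": "Tool to permanently delete a file. Use only after the user has confirmed the deletion.",
    "parameters": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "Path of the file to delete."},
        },
        "required": ["path"],
    },
}

Each tool now has an unambiguous name, a single action, and a parameter list the model can fill without juggling conditional arguments.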
Moving forward, we'll build a layer where tool composition is meaningful and essential, but it must be carefully researched to ensure the LLM clearly understands each tool and its scope. More details soon! ;)
Writing Perfect Tool Descriptions
Even if your agent tools are solidly built, their effectiveness can be undermined by vague, overly complex, or unclear descriptions.
Your goal with every description is simple yet critical:
Clearly convey what the tool does.
Precisely indicate when the AI should invoke it.
To achieve this balance quickly and effectively, we recommend a straightforward, proven template:
Tool to <what it does>. Use when <specific situation to invoke tool>
This simple structure clarifies two critical dimensions right away: action and context. That improved clarity leads directly to fewer invocation errors from your agent.
A great description is like a well-placed signpost: just enough detail, pointing clearly in the right direction. Too vague, you lose precision. Too complex, you risk confusion.
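As a quick illustration of the template in practice (the tool and wording here are hypothetical):

# Vague: the model cannot tell what the tool does or when to call it.
BAD_DESCRIPTION = "Handles calendar stuff."

# Template applied: action first, invocation context second.
GOOD_DESCRIPTION = (
    "Tool to create a calendar event and invite attendees. "
    "Use when the user asks to schedule a meeting and has confirmed the date, "
    "time, and attendee list."
)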
When and How to State Constraints:
Explicit constraints are critical hints for the AI, guiding it toward optimal tool invocation while preventing incorrect actions. Google's Vertex AI documentation illustrates this neatly; their sample function description explicitly calls out required user inputs as preconditions:
"Book flight tickets after confirming the user's requirements (time, departure, destination, party size, airline, etc.).”
This addition isn't just informative but an implicit directive instructing the AI clearly about when this tool is appropriate, significantly reducing errors.
Other examples of explicit constraints might look like:
Translation: “Translates text from one language to another. Use only if the user specifically asks for translation.”
Accurately Communicate Limitations (but Selectively):
Be honest and transparent about tool limitations, but do so sparingly and intentionally. Cursor’s internal tools, for instance, transparently inform users if there's a specific handling limitation directly within the description:
“Read File – Read the contents of a file (up to 750 lines in Max mode).”
This kind of targeted limitation transparency is crucial because it ensures the agent never exceeds realistic capabilities, minimizing avoidable invocation errors.
But be cautious: don't clutter all your tool descriptions with every conceivable limit or minor constraint. Reserve limitations for genuinely impactful constraints—those that could significantly affect the tool's correctness or reliability, such as hard data input caps, API key requirements, or format restrictions.
Remember, your AI model can't peek into the code behind the tool, and it depends entirely on what your description says. Ensure accuracy, be transparent, and review and update frequently.
Description Length: The Sweet Spot
Precision doesn't require lengthy prose. Shorter, well-crafted descriptions typically perform best. Long, verbose descriptions may dilute critical details and occupy limited prompt context space unnecessarily. Indeed, platforms like OpenAI impose a practical limit—specifically, OpenAI sets a 1024-character cap on function descriptions.
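One lightweight guard-rail is a simple length check in your build or CI pipeline, so an over-long description fails fast instead of being rejected or silently truncated at call time. A minimal sketch, assuming tools are plain dicts with name and description keys (the helper is hypothetical):

OPENAI_DESCRIPTION_LIMIT = 1024  # OpenAI's documented cap on function description length


def check_description(tool: dict) -> None:
    """Hypothetical helper: fail fast if a tool description is missing or too long."""
    description = tool.get("description", "")
    if not description:
        raise ValueError(f"{tool['name']}: description is missing")
    if len(description) > OPENAI_DESCRIPTION_LIMIT:
        raise ValueError(
            f"{tool['name']}: description is {len(description)} characters, "
            f"over the {OPENAI_DESCRIPTION_LIMIT}-character limit"
        )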
The Art of Defining Tool Parameters
Even the most thoughtfully described tool can stumble if its input parameters aren't intuitively designed and clearly specified.
Well-designed parameters lead to fewer mistakes: strong types, explicit constraints, and thoughtful enums. We've tested these approaches extensively, and the improvement is measurable.
Core Guidelines for Great Parameter Design
To help AI agents consistently succeed at invoking tools, strive to follow these essential guidelines in parameter design:
i. Explicitly Document All Parameter Nuances
When defining parameter descriptions, never leave important usage nuances implied or unclear. Consider, for instance, the AddNewMemoryRecords function from mem0's API. In this case, while parameters like agent_id, run_id, user_id, and app_id are individually optional, the function logic requires at least one to be provided. Explicitly stating this nuance in the parameter descriptions saves your AI agent (and its developers!) from unnecessary confusion:
import typing as t

from pydantic import BaseModel, Field

# MessagesRequest is the message schema defined elsewhere in the toolkit.


class AddNewMemoryRecordsRequest(BaseModel):
    """Request schema for `AddNewMemoryRecords`."""

    messages: t.List[MessagesRequest] = Field(
        ...,
        alias="messages",
        description="List of message objects forming the memory's content, representing conversations or multi-part information.",
    )
    agent_id: t.Optional[str] = Field(
        default=None,
        alias="agent_id",
        description="Unique identifier for the agent to associate with this memory. At least one of agent_id, user_id, app_id, or run_id must be provided.",
    )
    user_id: t.Optional[str] = Field(
        default=None,
        alias="user_id",
        description="Unique identifier for the user to associate with this memory. At least one of agent_id, user_id, app_id, or run_id must be provided.",
    )
    app_id: t.Optional[str] = Field(
        default=None,
        alias="app_id",
        description="Unique identifier for the application to associate with this memory. At least one of agent_id, user_id, app_id, or run_id must be provided.",
    )
    run_id: t.Optional[str] = Field(
        default=None,
        alias="run_id",
        description="Unique identifier for the run to associate with this memory. At least one of agent_id, user_id, app_id, or run_id must be provided.",
    )
Doing this creates crystal-clear usage guidelines, making successful tool invocation dramatically easier.
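Documenting the rule is the first step; you can also enforce it at validation time so a bad call fails with a clear error instead of a confusing API rejection. A minimal sketch, assuming pydantic v2 and trimming the model down to the relevant fields (the class name here is ours):

import typing as t

from pydantic import BaseModel, Field, model_validator


class MemoryOwnerFields(BaseModel):
    """Sketch: validation for the 'at least one of' rule documented above."""

    agent_id: t.Optional[str] = Field(default=None)
    user_id: t.Optional[str] = Field(default=None)
    app_id: t.Optional[str] = Field(default=None)
    run_id: t.Optional[str] = Field(default=None)

    @model_validator(mode="after")
    def check_at_least_one_owner(self) -> "MemoryOwnerFields":
        if not any([self.agent_id, self.user_id, self.app_id, self.run_id]):
            raise ValueError(
                "At least one of agent_id, user_id, app_id, or run_id must be provided."
            )
        return self

The description the model reads and the validation the server runs then tell the same story, which is exactly the consistency agents benefit from.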
ii. Always Use Strongly Typed Parameters (Prefer Explicit Types):
Clearly defined and strongly typed parameters help the agent quickly discern the intended data types, minimizing invocation errors. If a parameter accepts a finite set of categorical values, define them explicitly as enums instead of leaving them ambiguous within the description.
Anthropic and OpenAI both support enum in schemas, and the Vertex AI documentation explicitly says to use an enum field for finite sets rather than expecting the model to read options from text. This has two benefits: it constrains the model’s output (reducing the chance of typos or unsupported values), and it teaches the model the expected domain of that parameter in a machine-readable way.
Replace something like this vague definition:
"unit": { "type": "string", "description": "Unit of measurement, e.g. Celsius or Fahrenheit" }
With a strongly defined enum-based schema:
"unit": { "type": "string", "enum": ["celsius", "fahrenheit"], "description": "Temperature unit" }
This change noticeably reduces confusion and incorrect parameter values.
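If you define your schemas through pydantic rather than raw JSON, the same constraint can be expressed with typing.Literal, which pydantic v2 emits as an enum in the generated JSON schema. A small sketch (the request model is hypothetical):

from typing import Literal

from pydantic import BaseModel, Field


class GetWeatherRequest(BaseModel):
    city: str = Field(description="City to fetch the weather for.")
    # The Literal values surface as "enum": ["celsius", "fahrenheit"] in model_json_schema().
    unit: Literal["celsius", "fahrenheit"] = Field(
        default="celsius", description="Temperature unit."
    )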
iii. Keep Parameters to a Minimum Whenever Possible:
Google’s Agent Toolkit emphasizes a critical point—fewer parameters are generally better. Complex parameter lists significantly increase the likelihood of accidental misplacements and mistakes by your agent.
If you feel that a function requires an extensive list of parameters, consider breaking it down into smaller functions. We are experimenting with object-typed parameters instead of primitives, but the general consensus so far is that primitive types like str and int work better than objects.
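For illustration, this is the kind of flattening we mean; the ticket-search models below are hypothetical:

from pydantic import BaseModel, Field


# Harder for the model: a nested object it has to assemble correctly.
class TicketFilters(BaseModel):
    status: str
    assignee: str


class SearchTicketsRequest(BaseModel):
    filters: TicketFilters


# Easier: a couple of strongly typed primitives at the top level.
class SearchTicketsRequestFlat(BaseModel):
    status: str = Field(description="Ticket status to filter by, e.g. 'open'.")
    assignee: str = Field(description="Username of the assignee to filter by.")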
iv. Use Explicit Formatting and Constraints:
When parameters expect specialized string formats—like emails, dates, or IP addresses—always explicitly define these through the format keyword in your JSON schema.
For instance, instead of the vague definition:
"email": {"type": "string"}
Always prefer explicit format annotations:
"email": { "type": "string", "format": "email" }
Note that pydantic models don't natively include the format constraint in the generated JSON schema for plain string fields. For Python setups leveraging pydantic, you can inject it using modern approaches (valid for current pydantic versions):
from typing import Annotated

from pydantic import BaseModel, Field, WithJsonSchema


class UserContactInfo(BaseModel):
    email: Annotated[
        str,
        Field(title="Email"),
        # WithJsonSchema overrides the schema generated for this field, so include
        # "type" alongside "format" to keep the published schema complete.
        WithJsonSchema({"type": "string", "format": "email"}),
    ]
This keeps the expected type and format explicit and transparent to the agent in the generated schema.
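To sanity-check that the override actually lands in the schema the model will see, you can dump it:

import json

# The "email" property should now include "type": "string" and "format": "email".
print(json.dumps(UserContactInfo.model_json_schema(), indent=2))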
v. Add Short, Immediate Examples Within the Parameter Description Itself:
While schema definitions often include a dedicated examples field, reinforce clarity by also adding brief examples directly within the parameter descriptions themselves (when helpful). This especially helps when a parameter has to follow a specific format. For example, the Gmail list threads tool includes examples of the expected query format:
class ListThreadsRequest(BaseModel):
    ...
    query: str = Field(
        default="",
        description="Filter for threads, using Gmail search query syntax (e.g., 'from:user@example.com is:unread').",
        examples=["is:unread", "from:john.doe@example.com", "subject:important"],
    )
Immediate examples further reinforce the agents' understanding, paving the way for improved accuracy.
Continuous Improvement through Tool Monitoring & Refinement
Agent tooling isn't a static, set-and-forget exercise. Tools exist in a dynamic environment: model architectures evolve, user expectations shift, and real-world scenarios frequently diverge subtly but significantly from initial assumptions. To keep your agents reliable and highly accurate, continuous monitoring and iterative refinement aren't just recommended, they're essential.

We saw an almost 10x drop in the number of tool failures after implementing the ideas in this post.
At Composio, we meticulously track anonymized production errors from agent tool calls. Each error gives us invaluable feedback to analyze what went wrong: is it a confusion in invocation logic, unclear parameter documentation, or a hidden constraint not transparently communicated? By understanding these patterns, we can pinpoint areas for improvement systematically.
In parallel, we’re also actively investing in rigorous automated testing and evaluation frameworks for our agent tools. After all, without concrete test results, how else would you know whether your latest changes improved your tools or inadvertently regressed performance? Thoughtful automated testing safeguards against regressions, giving our team confidence as we iterate on improvements to our toolsets.
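As a flavour of what such an eval can look like, here is a deliberately minimal sketch rather than our production setup; call_model stands in for whatever harness sends a prompt plus tool definitions to the model and returns the tool it chose:

# Minimal sketch of a tool-selection eval. `call_model` is a stand-in callable,
# supplied by your agent harness, with the shape (prompt, tools) -> (tool_name, arguments).
EVAL_CASES = [
    {"prompt": "Copy report.pdf into the archive folder", "expected_tool": "copy_file"},
    {"prompt": "Show me unread emails from john.doe@example.com", "expected_tool": "list_threads"},
]


def run_tool_selection_evals(call_model, tools):
    failures = []
    for case in EVAL_CASES:
        chosen_tool, _arguments = call_model(case["prompt"], tools)
        if chosen_tool != case["expected_tool"]:
            failures.append((case["prompt"], chosen_tool, case["expected_tool"]))
    return failures

Running a suite like this before and after every description or schema change tells you immediately whether the edit helped or hurt.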
We'll be covering how we're approaching the challenge of evaluating tool quality more extensively in a future blog post, stay tuned ;)
Conclusion: Elevating AI Agent Tooling
Crafting great tools for AI agents isn’t a casual exercise—it's a deliberate, disciplined engineering practice founded on clear standards, thoughtful decision-making, and a dedication to constant evolution. From consistent formatting and thoughtfully scoped tools to succinct yet informative descriptions and rigorously defined parameters, every detail matters.
Behind every impressive AI agent, whether it’s Cursor enhancing developer productivity or Claude Code, lies invisible craftsmanship, meticulous planning, and precisely structured tooling components. At Composio, we've dedicated countless hours to studying best practices from platforms like Google, OpenAI, and Anthropic, rigorously applying and challenging them in real-world scenarios.
We believe intensely in quality tooling—it's foundational for truly effective AI agents. In upcoming posts, we look forward to sharing more on how we systematically test and measure our tool improvements. We hope the insights and frameworks we've shared empower you in your own journey of creating exceptional AI agent tooling. We will be sharing more as we learn, and exciting things are coming, so keep an eye out ;)