Improving GPT-4 Function Calling Accuracy

by Soham Ganatra · Apr 25, 2024 · 7 min read

Join our Discord Community and check out what we're building!

We just published Part 2 of the blog comparing gpt-4-turbo vs opus vs haiku vs sonnet.

TL;DR: Show me the results

Introduction to GPT Function Calling

Large language models have recently gained the ability to perform function calling. Given the details (function schemas) of a number of functions, an LLM can select the appropriate function and supply the right parameters when the prompt demands it. OpenAI's GPT-4 is one of the best function-calling LLMs available. Besides GPT-4, there are also open-source function-calling LLMs like OpenGorilla, Functionary, NexusRaven, and FireFunction that we will try to compare performance with. Example function calling code can be found in the OpenAI function calling documentation.
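To make the setup concrete, here is a minimal sketch of how a function is exposed to GPT-4 through OpenAI's chat completions API. The get_weather tool is a hypothetical placeholder, not one of the ClickUp functions used later in this post.

```python
import json

# A minimal sketch of exposing one function to the model. The tool below
# (get_weather) is a hypothetical placeholder for illustration only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {
                    "type": "string",
                    "description": "City name, e.g. 'Paris'",
                },
            },
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Paris?"}]

# With the openai SDK installed and OPENAI_API_KEY set, the request would be:
# from openai import OpenAI
# response = OpenAI().chat.completions.create(
#     model="gpt-4-turbo", messages=messages, tools=tools, temperature=0)
# The selected call then arrives in response.choices[0].message.tool_calls.

print(json.dumps(tools, indent=2))
```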

Integration-Focused Agentic Function Calling

We are transitioning towards Agentic applications for more effective use of LLMs in our daily workflow. In this setup, each AI agent is designated a specific role, equipped with distinct functionalities, often collaborating with other agents to perform complex tasks.

To enhance user experience and streamline workflows, these agents must interact with the tools used by users and automate some functionalities. Currently, AI development allows agents to interact with various software tools to a certain extent through proper integration using software APIs or SDKs. While we can integrate these points into AI agents and hope for flawless operation, the question arises:

Is the common design of API endpoints optimised for Agentic Process Automation (APA)? Could we redesign APIs to improve GPT-4 function calling?

Selecting Endpoints for GPT-4 Function Calling

We referenced the docs of ClickUp (a popular task-management app) and curated a selection of endpoints for GPT-4 function calling. We narrowed the selection because expecting the LLM to choose from hundreds of endpoints is impractical given context-length limitations.

We converted them to the corresponding OpenAI function calling schema, which is available here. These were specifically selected as they combine endpoints with both flattened and nested parameters.

Creating Benchmark Dataset

To evaluate our approaches effectively, we need a benchmark dataset that is small and focused specifically on the software-integration aspect of function-calling LLMs.

Despite reviewing various existing function calling datasets, none were found to be ideal for this study.

Consequently, we developed our own dataset, the ClickUp-Space dataset, which replicates real-world GPT-4 function calling scenarios to some extent.

Each prompt requires one of eight selected functions to solve, ranging from simple to complex. Our evaluation is based on how accurately the functions are called with the correct parameters. We also prepared code for assessing performance.

Next, we developed a problem set consisting of 50 pairs of prompts along with their respective function calling solutions.
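One such pair might look like the sketch below; the field names, the prompt wording, and the function and parameter names are illustrative assumptions, not the dataset's exact format.

```python
# An illustrative (prompt, solution) pair in the spirit of the ClickUp-Space
# dataset. All names here are assumptions made for illustration.
sample = {
    "prompt": "Create a new space called 'Marketing' with due dates enabled.",
    "expected_call": {
        "name": "create_space",
        "arguments": {
            "name": "Marketing",
            "features": {"due_dates": {"enabled": True}},
        },
    },
}
print(sample["expected_call"]["name"])  # → create_space
```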

Measuring GPT-4 Function Calling Baseline Performance

Initially, we wanted to assess GPT-4's function calling performance independently, without any system prompts.

We set the temperature to 0 to make the results more predictable. The experiment was repeated three times, resulting in an average accuracy of 30%, which is below the target.
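The evaluation code is linked below; conceptually, it amounts to exact-match scoring of the model's calls against the reference solutions, sketched here with made-up examples in place of real GPT-4 outputs.

```python
# Sketch of the scoring behind the accuracy numbers: a prediction counts as
# correct only when both the function name and every argument match the
# reference exactly. The two toy examples below are made up.
def score(predictions, references):
    correct = sum(
        1 for pred, ref in zip(predictions, references)
        if pred["name"] == ref["name"] and pred["arguments"] == ref["arguments"]
    )
    return correct / len(references)

references = [
    {"name": "create_space", "arguments": {"name": "Marketing"}},
    {"name": "delete_space", "arguments": {"space_id": "123"}},
]
predictions = [
    {"name": "create_space", "arguments": {"name": "Marketing"}},
    {"name": "update_space", "arguments": {"space_id": "123"}},
]
print(score(predictions, references))  # → 0.5
```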

Benchmark without System Prompt - [Code Here]


Flattening the Parameters

Some functions require parameters in a nested structure. An example is shown below:
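A nested schema of this kind, written in the style of ClickUp's create-space endpoint, might look like the trimmed, hypothetical reconstruction below (the exact schemas we used are linked above).

```python
# Hypothetical, trimmed function schema with nested parameters, in the style
# of ClickUp's create-space endpoint.
create_space = {
    "name": "create_space",
    "description": "Create a new space in the workspace.",
    "parameters": {
        "type": "object",
        "properties": {
            "name": {"type": "string", "description": "name of the space"},
            "features": {
                "type": "object",
                "description": "enabled features within the space",
                "properties": {
                    "due_dates": {
                        "type": "object",
                        "description": "due_dates feature settings",
                        "properties": {
                            "enabled": {
                                "type": "boolean",
                                "description": "enabled",
                            },
                        },
                    },
                },
            },
        },
        "required": ["name"],
    },
}
```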

Based on our experience with LLMs, we believe that while GPT-4 has been optimised for structured output, a complex output structure may actually reduce the function calling accuracy of the model.

Therefore, we programmatically flatten the parameters.

The function above, once flattened, looks as follows:
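A hypothetical flattened version, following the __ joining convention we describe next, would look like this:

```python
# Hypothetical flattened schema: nested parameter paths are joined with "__",
# and descriptions are stacked child-first onto the ancestors' descriptions.
create_space_flat = {
    "name": "create_space",
    "description": "Create a new space in the workspace.",
    "parameters": {
        "type": "object",
        "properties": {
            "name": {"type": "string", "description": "name of the space"},
            "features__due_dates__enabled": {
                "type": "boolean",
                "description": (
                    "enabled__due_dates feature settings__"
                    "enabled features within the space__"
                ),
            },
        },
        "required": ["name"],
    },
}
```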

We joined each parameter name to its parent parameters with __ (e.g. features__due_dates__enabled), and appended each parameter's description to its predecessors' descriptions (e.g. enabled__due_dates feature settings__enabled features within the space__).
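This flattening step can be sketched as below, assuming the schemas are plain JSON-schema dicts; this is an illustrative reimplementation of the convention just described, not our production code.

```python
# Flatten nested JSON-schema "properties": parameter paths are joined with
# "__" (parent first), and each leaf's description gets its ancestors'
# descriptions stacked after it, also joined with "__".
def flatten_properties(properties, path=(), ancestor_descs=()):
    flat = {}
    for key, spec in properties.items():
        if spec.get("type") == "object" and "properties" in spec:
            flat.update(flatten_properties(
                spec["properties"],
                path + (key,),
                (spec.get("description", ""),) + ancestor_descs,
            ))
        else:
            name = "__".join(path + (key,))
            desc = spec.get("description", "")
            if ancestor_descs:
                desc = "__".join((desc,) + ancestor_descs) + "__"
            flat[name] = {**spec, "description": desc}
    return flat

nested = {
    "name": {"type": "string", "description": "name of the space"},
    "features": {
        "type": "object",
        "description": "enabled features within the space",
        "properties": {
            "due_dates": {
                "type": "object",
                "description": "due_dates feature settings",
                "properties": {
                    "enabled": {"type": "boolean", "description": "enabled"},
                },
            },
        },
    },
}

flat = flatten_properties(nested)
print(sorted(flat))  # → ['features__due_dates__enabled', 'name']
```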

Benchmark after Flattening Schema [Code Here]


Adding a System Prompt

We didn't have a system prompt before, so the LLM wasn't instructed on its role or on how to interact with the ClickUp APIs.

Let's add a simple system prompt now.
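The exact prompt we benchmarked is in the linked code; an illustrative prompt of this kind might read as follows.

```python
# Illustrative system prompt (the wording is an assumption, not the exact
# prompt used in the benchmark).
SYSTEM_PROMPT = (
    "You are an AI assistant that manages a user's ClickUp workspace. "
    "Given a user request, call the most appropriate ClickUp function "
    "with the correct parameters."
)

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Rename the 'Marketing' space to 'Growth'."},
]
```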


Benchmark with System Prompt - [Code Here]


Improving the System Prompt

Now that we've observed an improvement from adding a system prompt, let's make it more detailed to see whether performance improves further.

Benchmark after Flattened Schema + Improved System Prompt


Seems to work great! [Code Here]

Adding a Schema Summary to the System Prompt

Let's enhance the system prompt further by describing the functions and their purpose, building on the clear instructions already provided for the LLM's role.

Here is a concise summary of the available functions, which we add to the prompt.
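The real summary covers all eight selected functions; a hypothetical excerpt might look like this.

```python
# Hypothetical excerpt of a schema summary appended to the system prompt.
# Function names and one-liners are illustrative assumptions.
SCHEMA_SUMMARY = (
    "Available functions:\n"
    "- create_space: create a new space in the workspace\n"
    "- update_space: change the name or settings of an existing space\n"
    "- delete_space: remove a space from the workspace\n"
)
print(SCHEMA_SUMMARY)
```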

Benchmark after Flattened Schema + Improved System Prompt containing Schema Summary. [Code Here]


Optimising Function Names

Now, let's improve the schemas starting with more descriptive function names.
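For instance, a terse endpoint-style name can become self-describing; the renames below are hypothetical, not the actual names from our schemas.

```python
# Hypothetical renames from terse, endpoint-style names to self-describing
# ones, applied to a (stubbed) schema dict.
renames = {
    "get_space": "get_clickup_space_details",
    "create_space": "create_new_clickup_space_in_workspace",
}

schema = {"name": "get_space", "description": "Get a space.", "parameters": {}}
schema["name"] = renames[schema["name"]]
print(schema["name"])  # → get_clickup_space_details
```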

Benchmark after Flattened Schema + Improved System Prompt containing Schema Summary + Function Names Optimised [Code Here]


Optimising Function Descriptions

Here, we focus on the function descriptions and make them clearer and more focused.

And update the schema with:
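A hypothetical before/after for a single function description illustrates the kind of change we made.

```python
# Hypothetical before/after for one function description: the revised text
# states what the function does and when to use it.
before = "Create Space."
after = (
    "Creates a new space in the user's ClickUp workspace. "
    "Use this when the user asks to add, make, or set up a space."
)
print(after)
```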

Benchmark after Flattened Schema + Improved System Prompt containing Schema Summary + Function Names Optimised + Function Descriptions Optimised [Code Here]


Optimising Function Parameter Descriptions

Earlier, when flattening the schema, we stacked each nested parameter's description onto its parents' descriptions until the schema was fully flattened.

Let's now replace those stacked descriptions with clean, standalone ones, modifying the previous schema accordingly:
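A hypothetical before/after for one flattened parameter shows the idea: the stacked "child__parent__" text becomes a single plain sentence.

```python
# Hypothetical before/after for one flattened parameter description.
before = "enabled__due_dates feature settings__enabled features within the space__"
after = "Whether the due dates feature is enabled for this space (true or false)."
print(after)
```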

Benchmark after Flattened Schema + Improved System Prompt containing Schema Summary + (Function Names + Function Descriptions + Parameter Descriptions) Optimised [Code Here]


Adding Examples of GPT Function Calls

LLMs perform better when response examples are provided. Let's provide some examples and analyse the outcomes.

To start, we can provide examples of each function call along with the corresponding function description in the schema to illustrate this concept.
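One way to do this is to append a worked call to the function's description; the example below is illustrative, not our exact wording.

```python
# Sketch of a function description with a worked call example appended.
# The function name, parameters, and example values are illustrative.
description = (
    "Creates a new space in the user's ClickUp workspace.\n"
    "Example: create_space(name='Marketing', "
    "features__due_dates__enabled=True)"
)
print(description)
```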

And when we run the benchmark,

Benchmark after Flattened Schema + Improved System Prompt containing Schema Summary + (Function Names + Function Descriptions + Parameter Descriptions) Optimised + Function Call Examples Added [Code Here]


Sadly, the score seems to degrade!

Adding Example Parameter Values

Since adding function call examples did not work, let's now try adding sample values to the function parameters to give a clearer idea of the values to input. We will adjust our function parameter descriptions accordingly.

And using these in the function schema, we get:
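Concretely, each parameter description can carry example values, as in the hypothetical snippet below.

```python
# Sketch of parameter descriptions carrying example values (all names and
# examples are illustrative assumptions).
properties = {
    "name": {
        "type": "string",
        "description": "Name of the space. Example values: 'Marketing', 'Q3 Roadmap'",
    },
    "features__due_dates__enabled": {
        "type": "boolean",
        "description": "Whether due dates are enabled. Example value: true",
    },
}
print(properties["name"]["description"])
```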

Flattened Schema + Improved System Prompt containing Schema Summary + (Function Names + Function Descriptions + Parameter Descriptions) Optimised + Function Call Examples Added + Adding Example Parameter Values [Code Here]


Wow! The intuition of adding examples pays off.

Compiling the Results

To summarise all our examples, and their results:

We experimented with strategies to improve the function calling ability of LLMs, specifically for Agentic Software integrations. Starting from a baseline score of 36%, we boosted performance to an average of 78%. The insights shared in this article aim to enhance your applications as well.

Moreover, we discovered a key distinction between general function calling and function calling for software integrations. In general function calling, even with multiple functions available, each call operates independently when executing an action. In software integrations, however, functions must often follow a specific sequence to effectively accomplish an action.


All the code from this article is available here. Thank you!

Further Experiments & Challenges

We have been experimenting with this for a while and plan to write further on:

  • Parallel Function calling accuracy

  • Sequential Function Call Planning Accuracy (RAG + CoT)

  • Comparison with Open Source Function Calling Models (OpenGorilla, Functionary, NexusRaven, and FireFunction)

When dealing with integration-centric function calls, the process can be complex. For instance, the agent may need to gather data from various endpoints like get_spaces_members, get_current_active_members, and get_member_whose_contract_is_over before responding with the update_member_list function.

This means there could be additional data not yet discussed in the conversation that requires the agent to fetch from other endpoints silently to formulate a complete response.

Optimisations like these are a crucial aspect of our efforts at Composio to make Agentic integrations smoother. If you are interested in improving the accuracy of your agents, connect with us at tech@composio.dev.


