diff --git a/assets/img/blog_post_8_data_composition.png b/assets/img/blog_post_8_data_composition.png
index f98ea3efd..b8203edf0 100644
Binary files a/assets/img/blog_post_8_data_composition.png and b/assets/img/blog_post_8_data_composition.png differ
diff --git a/blogs/7_open_functions_v2.html b/blogs/7_open_functions_v2.html
index 7b2889ab5..a79e62f43 100644
--- a/blogs/7_open_functions_v2.html
+++ b/blogs/7_open_functions_v2.html
@@ -86,7 +86,6 @@
With the latest iteration of Gorilla OpenFunctions-v2, we are delighted to mark significant advancements in function calling for LLMs within the open-source community. As a direct substitute for its predecessor, Gorilla OpenFunctions-v2 retains its open-source ethos while introducing exciting enhancements. These include support for multiple programming languages such as Python, Java, and JavaScript as well as REST APIs - a first among both open-source and closed-source models - alongside the ability to handle multiple and parallel function calls and to determine function relevance. This update cements Gorilla OpenFunctions-v2's position at the forefront of function calling capabilities among LLMs. Moreover, the drop-in replacement allows for seamless integration of OpenFunctions into a diverse range of applications, from social media platforms like Instagram to delivery services like DoorDash, as well as utility tools including Google Calendar and Stripe.
-
- Gorilla OpenFunctions-v2 is a 7B parameter model trained further upon on the
- Deepseek-Coder-7B-Instruct-v1.5 model. To trian the model, we collect in total of 65,283
+ Gorilla OpenFunctions-v2 is a 6.91B parameter model trained further on the
+ Deepseek-Coder-7B-Instruct-v1.5 (6.91B) model. To train the model, we collected a total of 65,283
question-function-answer pairs from five different sources: Python packages (19,353), Java
repositories (16,586), Javascript Repositories (4,245), public-API (6,009), and Command Line
Tools (19,090) from various cloud providers. The data composition is shown in the figure below.
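As a quick arithmetic check on the figures quoted in this hunk, the five per-source counts do sum exactly to the stated 65,283 pairs. The short sketch below is not part of the original post; it only verifies the totals.

```python
# Sanity check: the per-source counts quoted above sum to the stated 65,283 pairs.
source_counts = {
    "Python packages": 19_353,
    "Java repositories": 16_586,
    "JavaScript repositories": 4_245,
    "Public APIs": 6_009,
    "Command-line tools": 19_090,
}

total = sum(source_counts.values())
assert total == 65_283, total
print(f"{len(source_counts)} sources, {total:,} question-function-answer pairs")
```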
@@ -413,7 +412,7 @@
+ Quick Links:
+ OpenFunctions Data Composition & Training 🍕 Conclusion
We are happy to release
- gorilla-openfunctions-v2
- , a 7B parameter model trained on
+ gorilla-openfunctions-v2
+ , a 6.91B parameter model trained on
top of the Deepseek-Coder-7B-Instruct-v1.5 LLM.
It takes in the user's prompt along with multiple API calls and returns the functions with the
right arguments. With OpenFunctions we extended native
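To illustrate the interface the paragraph above describes (a user prompt plus one or more function definitions in, a function call with the right arguments out), here is a minimal sketch using an OpenAI-compatible Python client. The endpoint URL, API key, and the weather function are placeholders for illustration, not the official values from the post.

```python
# Hedged sketch: query a hosted function calling model through an OpenAI-compatible API.
# The base_url below is a placeholder; substitute the endpoint you actually use.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder endpoint

functions = [
    {
        "name": "get_current_weather",  # example function documentation, not from the post
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City and state, e.g. Boston, MA"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    }
]

completion = client.chat.completions.create(
    model="gorilla-openfunctions-v2",
    messages=[{"role": "user", "content": "What's the weather like in Boston in celsius?"}],
    functions=functions,
)
# Expected shape of the answer: a call string such as
#   get_current_weather(location="Boston, MA", unit="celsius")
print(completion.choices[0].message)
```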
diff --git a/blogs/8_berkeley_function_calling_leaderboard.html b/blogs/8_berkeley_function_calling_leaderboard.html
index 66581af2b..bc62f6242 100644
--- a/blogs/8_berkeley_function_calling_leaderboard.html
+++ b/blogs/8_berkeley_function_calling_leaderboard.html
@@ -111,7 +111,17 @@
function calls with multiple programming languages and parallel and multiple function calls. We
also provide a specific debugging feature: when the provided function is not suitable for
your task, the model will output an “Error Message”.
-
+
+
+
+
Evaluation Categories 📊
While the previous categories consist of the majority of our evaluations, we include other
- specific categories, namely REST API, SQL, Java, and JavaScript, to evaluate model performance on diverse scenarios and support of multiple programming languages, and are resilient
+ specific categories, namely Chatting Capability, Function Relevance Detection, REST API, SQL, Java, and JavaScript, to evaluate model performance on diverse scenarios and support of multiple programming languages, and are resilient
to irrelevant questions and function documentations.
-
+ Chatting Capability: In Chatting Capability, we design scenarios where no functions are passed in and the user asks generic
+ questions - this is similar to using the model as a general-purpose chatbot. We evaluate whether the model is able to output
+ chat messages and recognize that it does not need to invoke any functions. Note the difference from “Relevance”, where the
+ model is also expected to evaluate whether any of the provided functions are relevant or not. We include this category for
+ internal model evaluation and exclude its statistics from the live leaderboard. We are currently working on a better evaluation
+ of chat ability, to ensure the chat is relevant and coherent with the user's request, and we are open to suggestions and
+ feedback from the community.
Function Relevance Detection: In function relevance detection, we design scenarios where none of the provided functions
are relevant or supposed to be invoked. We expect the model's output to be no function call. This scenario provides
insight into whether a model
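To make the two categories above concrete, the sketch below shows what a Chatting Capability case (no functions passed in) and a Function Relevance Detection case (only irrelevant functions passed in) might look like, together with a toy pass/fail check. The field names and the call-detection heuristic are illustrative assumptions, not the leaderboard's actual schema or evaluation code.

```python
# Illustrative sketch of "no function should be called" style checks.
# Field names below are hypothetical, not the leaderboard's real schema.
relevance_case = {
    "question": "Can you book me a table for two at 7pm tonight?",
    "functions": [  # only irrelevant functions are provided on purpose
        {"name": "get_stock_price", "description": "Get the latest price for a ticker symbol."}
    ],
    "expected": "no_function_call",
}

chatting_case = {
    "question": "How are you doing today?",
    "functions": [],  # chatting capability: no functions are passed in at all
    "expected": "chat_message",
}

def model_passes(model_output: str, expected: str) -> bool:
    """A model passes these cases if it answers in plain text instead of emitting a call."""
    # Crude heuristic for "looks like a function call"; real evaluation is more involved.
    looks_like_call = "(" in model_output and ")" in model_output and "=" in model_output
    if expected in ("no_function_call", "chat_message"):
        return not looks_like_call
    return looks_like_call

print(model_passes("I'm doing well, thanks for asking!", chatting_case["expected"]))  # True
print(model_passes("get_stock_price(symbol='AAPL')", relevance_case["expected"]))     # False
```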
@@ -239,12 +249,11 @@
Java + Javascript: Despite function calling formats being the same across most
- programming languages, each programming language has language specific types. For example, C
- has pointer
- type, Java has HashMap
- type. The goal of this test category is to understand how
- well the function calling model can be extended to not just JSON and Python type but all the
+ programming languages, each programming language has language specific types. For example, Java has HashMap
+ type. The goal of this test category is to understand how
+ well the function calling model can be extended to not just Python type but all the
language specific typings. We included 100 examples for Java AST evaluation and 70 examples for Javascript AST evaluation.
- The three categories outlined above provide insight into the performance of different models across popular API call scenarios, offering valuable perspectives on the potential of function-calling models.
+ The categories outlined above provide insight into the performance of different models across popular API call scenarios, offering valuable perspectives on the potential of function-calling models.
["Manchester United", "Man United", "Man U", "MUFC"]
""
. This notation tells the AST (Abstract Syntax Tree) checker that the function call intentionally omits these parameters.
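Building on the "" convention described in the line above, here is a rough sketch of an AST-style argument check: the model's call string is parsed with Python's ast module and each argument is compared against a list of accepted values, with an empty string marking a parameter that may be omitted. The function name, parameters, and possible-answer dictionary are illustrative; this is not the leaderboard's actual checker.

```python
import ast

# Accepted values per parameter; "" marks a parameter the call may legitimately omit.
possible_answers = {
    "team": ["Manchester United", "Man United", "Man U", "MUFC"],
    "season": ["2023", ""],  # optional: any listed value, or simply left out
}

def ast_check(call_string: str, function_name: str, answers: dict) -> bool:
    """Parse a call such as get_team_stats(team="Man U") and validate its arguments."""
    node = ast.parse(call_string, mode="eval").body
    if (not isinstance(node, ast.Call)
            or not isinstance(node.func, ast.Name)
            or node.func.id != function_name):
        return False
    provided = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    for param, accepted in answers.items():
        if param in provided:
            if provided[param] not in accepted:     # value not among accepted variants
                return False
        elif "" not in accepted:                    # missing, but the parameter was required
            return False
    return True

print(ast_check('get_team_stats(team="Man U")', "get_team_stats", possible_answers))                   # True
print(ast_check('get_team_stats(team="Arsenal", season="2023")', "get_team_stats", possible_answers))  # False
```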