Need some detailed info on the execution #12
Same here. It would be helpful if someone could provide documentation for this.
Best guess so far (text formatted by a bot): each folder starting with an underscore appears to be a benchmark domain. To run a benchmark, navigate to the repository's root directory and execute `python _arc/search.py`. This will create agents, test them, iterate to improve their performance, and output a JSON file with the results. If you want to create your own benchmark: copy the most similar existing benchmark folder, rename it to your desired benchmark name, and adapt its evaluation function.
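For anyone piecing together the flow described above, here is a minimal sketch of what a search loop like this might do. All names here are hypothetical; the real logic lives in `search.py` and each domain's `_utils.py`:

```python
def evaluate(agent, examples):
    # Score an agent by the fraction of example (input, expected_output)
    # pairs it reproduces correctly -- a stand-in for the real fitness.
    correct = sum(1 for x, y in examples if agent(x) == y)
    return correct / len(examples)

def search(seed_agents, examples, iterations=3):
    # Evaluate each seed agent, then iterate. In the real system a meta
    # agent would propose new candidate agents at each step; here we just
    # reuse the current best as a placeholder.
    archive = [{"agent": a, "fitness": evaluate(a, examples)} for a in seed_agents]
    for _ in range(iterations):
        best = max(archive, key=lambda e: e["fitness"])
        candidate = best["agent"]  # stand-in for a newly proposed agent
        archive.append({"agent": candidate, "fitness": evaluate(candidate, examples)})
    return max(archive, key=lambda e: e["fitness"])
```

The best entry would then be what ends up serialized into something like the `_results.json` file mentioned below.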
Thank you for the information! Have you been able to deploy the agents for any custom tasks?
I have tried to transfer this to other custom domains. In my domain the agent is used to design a code system, and it is not easy to evaluate the final result of such an agent. So how can I use ADAS if our domain doesn't have any ground-truth labels? From my understanding, ADAS evaluates against the dataset first and then improves the final result via the whole pipeline, right?
ADAS will copy the 5 default agents from the `_prompt.py` file in the chosen domain, then use each one to run the example data through the base code found in the domain's `_utils.py` file. It scores the new agents based on how well their outputs match the expected outputs, and then, I believe, takes the best-performing one and stores it in the `_results.json` file; if you look in there you'll see an entry for a prompt and its fitness. As your main bot goes through its task, if you have an agent for the ARC benchmark, it'll see it has an "arc transformer" agent in its prompt library and call it, which pulls the `_results.json` entry, runs it through the API call, and returns the result.
Running the `search.py` code multiple times should hopefully improve the bot, since `_results.json` is loaded as the first entry in the agent list that's tested; i.e., if a better bot is made, it replaces the original in `_results.json`. When you say "design a code system", you may want to break it down into smaller steps, as the more general you go with this system, the less applicable it becomes. One good way to test a code generation agent is a common-sequence-algorithm agent: given a list of numbers, it identifies what mathematical patterns appear (primes, Fibonacci, etc.). This would ideally have ADAS focus on something like keeping a list of common rules of thumb in the system prompt, or perhaps you could tie in tool use so the agent it generates defaults to calling Wolfram Alpha. Your fitness score would be the percentage of correctly guessed patterns.
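The sequence-pattern idea above could be scored with a fitness function along these lines. This is only a sketch: `pattern_fitness` and the toy agent are hypothetical names, and real scoring would live in the domain's `_utils.py`:

```python
def pattern_fitness(agent, labelled_sequences):
    """Return the percentage of sequences whose pattern the agent names correctly.

    labelled_sequences: list of (sequence, expected_pattern_name) pairs.
    agent: callable mapping a sequence to a pattern name, e.g. "fibonacci".
    """
    correct = sum(
        1 for seq, expected in labelled_sequences if agent(seq) == expected
    )
    return 100.0 * correct / len(labelled_sequences)
```

A score like this slots directly into the fitness field that `_results.json` records for each prompt.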
@Bilgeworth Thank you very much for your detailed explanation and for the nice flowchart! @wofeichangaiwoai We believe developing ADAS algorithms for domains that don't have ground-truth labels will be an important future research direction. Currently, we anticipate that an LLM evaluator might be used to provide feedback signals for the meta agent.
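A minimal sketch of the LLM-evaluator idea mentioned above, assuming a hypothetical `call_llm` stub rather than any real ADAS or provider API:

```python
def call_llm(prompt):
    # Hypothetical stub: replace with your LLM provider's client call.
    raise NotImplementedError

def llm_judge_fitness(task, candidate_output, rubric, score_fn=call_llm):
    # Ask an LLM evaluator to grade an output on a 0-10 rubric and
    # normalise to [0, 1]; this would replace label-based fitness in
    # domains with no ground truth.
    prompt = (
        f"Task: {task}\n"
        f"Candidate output:\n{candidate_output}\n"
        f"Rubric: {rubric}\n"
        "Reply with a single integer score from 0 to 10."
    )
    raw = score_fn(prompt)
    return max(0, min(10, int(raw))) / 10.0
```

One design caveat: an LLM judge is a noisy, potentially gameable signal, so in practice you would likely average several judgments or pin the rubric down tightly.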
Hello all,
I need some help understanding the flow. How do you actually use the results? How are they modified to perform some task? How do you adapt the system to use custom agents? I would be grateful if someone could explain.