Need some detailed info on the execution #12

Open

kunwar-vikrant opened this issue Aug 30, 2024 · 8 comments
@kunwar-vikrant

Hello all,
I need some help understanding the flow. How do you actually use the results, how are they modified to perform a given task, and how do you adapt the system to use custom agents? I would be grateful if someone could explain it.

@hemangjoshi37a

Same here. It would be great if someone could provide some documentation for this.

@Bilgeworth

Best guess so far, text formatted by a bot:

Each folder starting with an underscore (_) is an LLM benchmark. The search.py script within each benchmark folder defines the setup, execution, and evaluation process for that benchmark.

The search.py script is responsible for creating agents, testing them, and iterating to optimize their performance. The evaluation of these agents is handled by the evaluate_forward_fn function within the script.

To run a benchmark, navigate to the repository's root directory and execute the search.py script for the desired benchmark. For example, to run the ARC benchmark, use the following command:

python _arc/search.py

This will create agents, test them, iterate to improve their performance, and output a JSON file with the results.
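
For reference, here is a minimal sketch of how you might inspect that output file; the path and field names below are assumptions for illustration, since each benchmark's search.py sets its own output name:

import json

# Hypothetical output path; the real name is configured inside the benchmark's search.py.
with open("_arc/results.json") as f:
    archive = json.load(f)

# Each entry describes one discovered agent; "name" and "fitness" are assumed field names.
for agent in archive:
    print(agent.get("name"), agent.get("fitness"))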

If you want to create your own benchmark, follow these steps:

1. Copy an Existing Benchmark: choose the most similar existing benchmark folder, copy it, and rename it to your desired benchmark name.
2. Modify the Benchmark: swap in your benchmark-specific setup, execution, and evaluation logic in the search.py script.
3. Run the Benchmark: execute the search.py script for your new benchmark. Ensure you change the output name to avoid overwriting the results of the default benchmarks.

Evaluation Function
The actual testing of agents is performed by the evaluate_forward_fn function in search.py.
For example, in the _arc benchmark, this function includes a link to the GitHub repository where the evaluation criteria were sourced.
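
As a rough illustration (not the repository's actual code), an evaluate_forward_fn for a custom benchmark generally needs to run the candidate agent on each task example and turn the comparison with the expected answers into a fitness score. A hypothetical skeleton:

def evaluate_forward_fn(agent_forward, examples):
    """Run the candidate agent on every example and return an accuracy-style fitness."""
    correct = 0
    for example in examples:
        prediction = agent_forward(example["input"])   # candidate agent's forward() call
        if prediction == example["expected_output"]:   # benchmark-specific comparison
            correct += 1
    return correct / len(examples)                     # fitness in [0, 1]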

@kunwar-vikrant
Author

Thank you for the information! Have you been able to deploy the agents for any custom tasks?

@Bilgeworth

Haven't had any luck yet; the biggest issue I'm running into is defining a scoring system. In the meantime I've been doing my best to make something easier to edit on a fork. Here's the flowchart from that; some variable and function names have been changed, but the workflow is the same (some issues, like calculating fitness twice without making changes in between, have been removed).
[flowchart image]

@Bilgeworth

Realized the previous one is a bit hard to follow; here it is color-coded and rearranged.
[color-coded flowchart image]

@wofeichangaiwoai

I have tried to transfer this to other custom domains. My domain uses the agent to design a code system, and it is not easy to evaluate the agent's final result. How can I use ADAS if our domain doesn't have any ground-truth labels? From my understanding, ADAS evaluates the dataset first and then improves the final result via the whole pipeline, right?

@Bilgeworth

Bilgeworth commented Sep 12, 2024

ADAS will copy the 5 default agents from the _prompt.py file in the chosen domain's folder, and then use each one to run the example data through the base code found in that domain's _utils.py file.

It'll score the new agents based on how well their outputs match the expected outputs, and then I think it takes the one that does best and puts it in the _results.json file; if you look in there you'll see an entry for a prompt and its fitness.
That agent has a generated name, and ideally, when prompting your main bot, you include a section of context that says "you have these agents available"*

As your main bot goes through its task, if you have an agent for the ARC benchmark, it'll see it has an "arc transformer" agent in its prompt library and call it, which pulls the _results.json entry, runs it through the API call, and returns the result.

  • The agent library function has not been made by anyone in this repo as far as I can tell, so right now the final result you can hope for is a .json with an optimized prompt for a given task.

Running the search.py code multiple times should hopefully improve the bot, since _results.json is loaded as the first entry in the agent list that's tested, i.e. if a better bot is made it replaces the original in _results.json.
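
A rough sketch of that re-seeding behavior (the path, field names, and numeric fitness format here are assumptions for illustration, not the repository's exact schema):

import json
import os

archive_path = "_arc/_results.json"          # hypothetical path
if os.path.exists(archive_path):
    with open(archive_path) as f:
        archive = json.load(f)               # previously discovered agents
else:
    archive = []                             # otherwise start from the default seed agents

# Put the best previous agent first so the new search run builds on it.
archive.sort(key=lambda a: a.get("fitness", 0), reverse=True)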

When you say "design a code system", you may want to break it down into smaller steps, as the more general you go with this system, the less applicable it becomes.

One good way to test a code-generation agent is a common-sequence-algorithm agent: given a list of numbers, it identifies which mathematical patterns appear (primes, Fibonacci, etc.).

This would ideally have ADAS focus on something like keeping a list of common rules of thumb in the system prompt, or perhaps you could tie in tool use so the agent it generates defaults to calling Wolfram Alpha. Your fitness score would be the percentage of correctly guessed patterns.
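
A minimal sketch of that fitness score, assuming the agent returns a pattern name as a string and you have a small labeled set of sequences (both assumptions for illustration):

def pattern_fitness(agent_guess_fn, labeled_sequences):
    """labeled_sequences: list of (numbers, true_pattern_name) pairs."""
    correct = sum(
        1 for numbers, true_pattern in labeled_sequences
        if agent_guess_fn(numbers).strip().lower() == true_pattern.lower()
    )
    return correct / len(labeled_sequences)  # fraction of correctly guessed patterns

Example usage with toy data, where my_agent is your hypothetical guesser function:

pattern_fitness(my_agent, [([2, 3, 5, 7, 11], "primes"), ([1, 1, 2, 3, 5], "fibonacci")])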


@ShengranHu
Owner

@Bilgeworth Thank you very much for your detailed explanation and for the nice flowchart!

@wofeichangaiwoai We believe developing ADAS algorithms for domains that don't have ground-truth labels will be an important future research direction. Currently, we anticipate that an LLM evaluator might be used to provide feedback signals for the meta agent.
