Need some detailed info on the execution #12

Open

kunwar-vikrant opened this issue Aug 30, 2024 · 8 comments
@kunwar-vikrant

Hello all,
I need some help understanding the flow. How do you actually use the results, how are they modified to perform a given task, and how do you adapt the system to use custom agents? I would be grateful if someone could explain it.

@hemangjoshi37a

Same here. It would be great if someone could provide some documentation for this.

@Bilgeworth

Best guess so far, text formatted by a bot:

Each folder starting with an underscore (_) is an LLM benchmark. The search.py script within each benchmark folder defines the setup, execution, and evaluation process for that benchmark.

The search.py script is responsible for creating agents, testing them, and iterating to optimize their performance. The evaluation of these agents is handled by the evaluate_forward_fn function within the script.

To run a benchmark, navigate to the repository's root directory and execute the search.py script for the desired benchmark. For example, to run the ARC benchmark, use the following command:

python _arc/search.py

This will create agents, test them, iterate to improve their performance, and output a JSON file with the results.
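
For reference, here is a minimal sketch of how you might inspect that output file; the path and field names below are assumptions for illustration, since each benchmark's search.py sets its own output name:

import json

# Hypothetical output path; the real name is configured inside the benchmark's search.py.
with open("_arc/results.json") as f:
    archive = json.load(f)

# Each entry describes one discovered agent; "name" and "fitness" are assumed field names.
for agent in archive:
    print(agent.get("name"), agent.get("fitness"))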

If you want to create your own benchmark, follow these steps:

1. Copy an Existing Benchmark: choose the most similar existing benchmark folder, copy it, and rename it to your desired benchmark name.
2. Modify the Benchmark: swap in your benchmark-specific setup, execution, and evaluation logic in the search.py script.
3. Run the Benchmark: execute the search.py script for your new benchmark. Ensure you change the output name to avoid overwriting the results of the default benchmarks.

Evaluation Function
The actual testing of agents is performed by the evaluate_forward_fn function in search.py.
For example, in the _arc benchmark, this function includes a link to the GitHub repository where the evaluation criteria were sourced.
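
As a rough illustration (not the repository's actual code), an evaluate_forward_fn for a custom benchmark generally needs to run the candidate agent on each task example and turn the comparison with the expected answers into a fitness score. A hypothetical skeleton:

def evaluate_forward_fn(agent_forward, examples):
    """Run the candidate agent on every example and return an accuracy-style fitness."""
    correct = 0
    for example in examples:
        prediction = agent_forward(example["input"])   # candidate agent's forward() call
        if prediction == example["expected_output"]:   # benchmark-specific comparison
            correct += 1
    return correct / len(examples)                     # fitness in [0, 1]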

@kunwar-vikrant
Author

Thank you for the information! Have you been able to deploy the agents for any custom tasks?

@Bilgeworth

Haven't had any luck yet; the biggest issue I'm running into is defining a scoring system. In the meantime I've been doing my best to make something easier to edit on a fork. Here's the flowchart from that; some variable and function names have been changed, but the workflow is the same (some issues, like calculating fitness twice without making changes in between, have been removed).
[flowchart image]

@Bilgeworth

Realized the previous one is a bit hard to follow; here it is color-coded and rearranged.
[color-coded flowchart image]

@wofeichangaiwoai

I have tried to transfer this to other custom domains. My domain uses the agent to design a code system, and it is not easy to evaluate the agent's final result. How can I use ADAS if our domain doesn't have any ground-truth labels? From my understanding, ADAS evaluates the dataset first and then improves the final result via the whole pipeline, right?

@Bilgeworth

Bilgeworth commented Sep 12, 2024

ADAS will copy the 5 default agents from the _prompt.py file in the chosen domain's folder, and then use each one to run the example data through the base code found in that domain's _utils.py file.

It'll score the new agents based on how well their outputs match the expected outputs, and then I think it takes the one that does best and puts it in the _results.json file; if you look in there you'll see an entry for a prompt and its fitness.
That agent has a generated name, and ideally, when prompting your main bot, you include a section of context that says "you have these agents available"*

As your main bot goes through its task, if you have an agent for the ARC benchmark, it'll see it has an "arc transformer" agent in its prompt library and call it, which pulls the _results.json entry, runs it through the API call, and returns the result.

  • The agent library function has not been made by anyone in this repo as far as I can tell, so right now the final result you can hope for is a .json with an optimized prompt for a given task.

Running the search.py code multiple times should hopefully improve the bot, since _results.json is loaded as the first entry in the agent list that's tested, i.e. if a better bot is made it replaces the original in _results.json.
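
A rough sketch of that re-seeding behavior (the path, field names, and numeric fitness format here are assumptions for illustration, not the repository's exact schema):

import json
import os

archive_path = "_arc/_results.json"          # hypothetical path
if os.path.exists(archive_path):
    with open(archive_path) as f:
        archive = json.load(f)               # previously discovered agents
else:
    archive = []                             # otherwise start from the default seed agents

# Put the best previous agent first so the new search run builds on it.
archive.sort(key=lambda a: a.get("fitness", 0), reverse=True)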

When you say "design a code system", you may want to break it down into smaller steps, as the more general you go with this system, the less applicable it becomes.

One good way to test a code-generation agent is a common-sequence-algorithm agent: given a list of numbers, it identifies which mathematical patterns appear (primes, Fibonacci, etc.).

This would ideally have ADAS focus on something like keeping a list of common rules of thumb in the system prompt, or perhaps you could tie in tool use so the agent it generates defaults to calling Wolfram Alpha. Your fitness score would be the percentage of correctly guessed patterns.
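
A minimal sketch of that fitness score, assuming the agent returns a pattern name as a string and you have a small labeled set of sequences (both assumptions for illustration):

def pattern_fitness(agent_guess_fn, labeled_sequences):
    """labeled_sequences: list of (numbers, true_pattern_name) pairs."""
    correct = sum(
        1 for numbers, true_pattern in labeled_sequences
        if agent_guess_fn(numbers).strip().lower() == true_pattern.lower()
    )
    return correct / len(labeled_sequences)  # fraction of correctly guessed patterns

Example usage with toy data, where my_agent is your hypothetical guesser function:

pattern_fitness(my_agent, [([2, 3, 5, 7, 11], "primes"), ([1, 1, 2, 3, 5], "fibonacci")])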


@ShengranHu
Owner

@Bilgeworth Thank you very much for your detailed explanation and for the nice flowchart!

@wofeichangaiwoai We believe developing ADAS algorithms for domains that don't have ground-truth labels will be an important future research direction. Currently, we anticipate that an LLM evaluator might be used to provide feedback signals for the meta agent.
