Let's add a capstone project! #93

Open
burtenshaw opened this issue Dec 12, 2024 · 5 comments

@burtenshaw (Collaborator)

Some students have asked for a capstone project that allows them to try out their skills and get collective feedback on their work.

We have another discussion going on here about a capstone using a student leaderboard. Also, the evaluation module already has a project in it, which could be converted to a capstone.

Evaluation Capstone

This could use the existing material in module_4 and ask the students to implement an LLM evaluation task. They could then share their evaluation task and results. We could publish the results in a leaderboard on the Hugging Face hub.
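For illustration, the student-facing part could be as small as a script that runs a model over a task with a single correct answer per example and writes a results file. This is only a sketch; the model, dataset, and file names are placeholders rather than proposals:

```python
# Rough sketch only – model, dataset, and file names are placeholders.
import json

from datasets import load_dataset
from transformers import pipeline

MODEL_ID = "HuggingFaceTB/SmolLM2-135M-Instruct"  # any small model works here
generator = pipeline("text-generation", model=MODEL_ID)

# A tiny GSM8K slice keeps the example cheap to run.
eval_set = load_dataset("openai/gsm8k", "main", split="test[:20]")

correct = 0
for example in eval_set:
    prompt = f"Question: {example['question']}\nAnswer:"
    output = generator(prompt, max_new_tokens=64, do_sample=False)[0]["generated_text"]
    # GSM8K stores the final answer after '#### '; this is a crude containment check.
    gold = example["answer"].split("####")[-1].strip()
    correct += int(gold in output)

results = {"model": MODEL_ID, "task": "gsm8k-subset", "score": correct / len(eval_set)}

# Students would open a PR adding this file; the leaderboard is built from such files.
with open("results.json", "w") as f:
    json.dump(results, f, indent=2)
```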

RAG Capstone

@duydl proposed a capstone project on RAG here.

@burtenshaw (Collaborator, Author)

I have added a draft PR that proposes a setup for this issue: #97

@michaelshekasta

I think the leaderboard is great, but we should focus on specific tasks. In my opinion, we should define (or, better yet, find) datasets that have one correct answer.

Here are some examples:

  1. Math problems (I know this will be difficult, but it's easy to understand what I mean)
  2. Counting elements in an image (VLM)
  3. Specific fields such as law, finance, or medicine
  4. Code-related tasks, such as fixing Python code

What do you think @burtenshaw ?

@burtenshaw (Collaborator, Author)

Thanks @michaelshekasta. This sounds like a valid evaluation setup, and the tasks you suggest are a good starting point. These are my main concerns:

  • If we define a set of tasks for students, we might be repeating the work of the Open LLM Leaderboard, and the set would become difficult to maintain.
  • Any set of tasks may limit students with specific focuses. It would be cool if we could allow people to define their own tasks and add those to a set of core tasks.
  • There's a lack of library support for evaluating LLMs on vision tasks, which would make the implementation more involved.

With these concerns in mind, what do you think about this as a proposal to implement on top of #97?

  1. We take a set of core automated tasks that cover the tasks/domains you highlight (minus vision)
  2. We add these tasks to the capstone setup in [MODULE] Capstone project on evaluation #97
  3. We instruct students to do one (or both) of these things:
    • run the evaluation suite and open a PR with their model's results
    • create a custom evaluation task and add it to the suite (sketched below)
  4. We review and validate PRs for results submissions and add them to the leaderboard

IMO, this workflow would satisfy your comment about core, interpretable evaluation, while also supporting a growing, community-driven evaluation suite.
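For concreteness, a community-contributed task could boil down to a dataset id, a prompt template, and an exact-match metric. The class name, fields, and column conventions below are just illustrative and not tied to the setup in #97:

```python
# Hypothetical task format – names and fields are illustrative only.
from dataclasses import dataclass

import evaluate
from datasets import load_dataset
from transformers import pipeline


@dataclass
class CustomEvalTask:
    name: str
    dataset_id: str           # Hub dataset with one unambiguous answer per row
    split: str
    prompt_template: str      # must contain {question}
    question_column: str = "question"
    answer_column: str = "answer"


def run_task(task: CustomEvalTask, model_id: str) -> dict:
    """Run a model over the task and return a leaderboard-ready record."""
    generator = pipeline("text-generation", model=model_id)
    rows = load_dataset(task.dataset_id, split=task.split)
    exact_match = evaluate.load("exact_match")

    predictions, references = [], []
    for row in rows:
        prompt = task.prompt_template.format(question=row[task.question_column])
        out = generator(prompt, max_new_tokens=32, do_sample=False)[0]["generated_text"]
        # Keep only the first line the model generated after the prompt.
        predictions.append(out[len(prompt):].strip().split("\n")[0])
        references.append(str(row[task.answer_column]).strip())

    score = exact_match.compute(predictions=predictions, references=references)["exact_match"]
    return {"task": task.name, "model": model_id, "metric": "exact_match", "score": score}
```

Student-defined tasks would then just be instances of that dataclass, submitted alongside a short description in a PR.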

@michaelshekasta commented Dec 16, 2024

@burtenshaw Thank you for the quick reply! I will break my thoughts into two parts:

Small LLMs – I understand that we have openLB, but I'm unclear about the benefits of using a small LLM in this context. I would assume that larger models would likely perform better than smaller ones. My suggestion is to choose a specific task, such as math, and optimize the model for it.

VLMs – I believe we can also use the VLM leaderboard and select a specific task to optimize. Alternatively, we could use a dataset (such as one built with Stable Diffusion models) and apply the methods from this work to count the relevant elements.

@burtenshaw (Collaborator, Author)

> Small LLMs – I understand that we have openLB, but I'm unclear about the benefits of using a small LLM in this context. I would assume that larger models would likely perform better than smaller ones.

This is correct, and the Open LLM Leaderboard can still be used to compare smaller models. See this filter.

> My suggestion is to choose a specific task, such as math, and optimize the model for it.

I agree with this; I just want to extend it so that others can also choose specific tasks for the student leaderboard.

> VLMs – I believe we can also use the VLM leaderboard and select a specific task to optimize.

OK. The VLM leaderboard uses the VLMEvalKit library. I'm definitely open to other evaluations, which relates to my comment above, as long as it's something we can maintain.

To summarise, I think your proposal is compatible with the PR. We would just need to add a specific core task.
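On the maintainer side, turning validated result files into a Hub-hosted leaderboard could stay very small. As a sketch only (the results folder layout and the repo id are assumptions, not decisions):

```python
# Sketch of the maintainer side – results folder and repo id are placeholders.
import json
from pathlib import Path

from datasets import Dataset

# Assume each merged PR added one JSON file like the results.json sketched above.
records = [json.loads(path.read_text()) for path in Path("results").glob("*.json")]

leaderboard = Dataset.from_list(records).sort("score", reverse=True)

# Publishing as a Hub dataset lets a simple Space render it as the leaderboard.
leaderboard.push_to_hub("smol-course/student-leaderboard")
```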
