Let's add a capstone project! #93
I have added a draft PR that proposes a setup for this issue: #97
I think the leaderboard is great, but we should focus on specific tasks. In my opinion, we should define (or better yet, find) datasets that have one correct answer. Here are some examples:
What do you think, @burtenshaw?
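For illustration only, here is a minimal sketch of the "one correct answer" idea: generate answers with a small model and score them with exact match. The dataset name, column names, and prompt handling below are placeholder assumptions, not part of the proposal.

```python
# Hedged sketch: exact-match evaluation of a small model on a dataset
# where each question has a single correct answer.
from datasets import load_dataset
from transformers import pipeline
import evaluate

generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-135M-Instruct")
exact_match = evaluate.load("exact_match")

# Hypothetical dataset with "question" and "answer" columns.
dataset = load_dataset("your-org/your-single-answer-dataset", split="test")

predictions, references = [], []
for example in dataset:
    output = generator(example["question"], max_new_tokens=32, return_full_text=False)
    predictions.append(output[0]["generated_text"].strip())
    references.append(example["answer"])

# Returns a dict like {"exact_match": 0.42}.
print(exact_match.compute(predictions=predictions, references=references))
```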
Thanks @michaelshekasta. This sounds like a valid evaluation setup, and the tasks you suggest are a good starting point. These are my main concerns:
With these concerns in mind, what do you think about this as a proposal to implement on top of #97?
IMO, this workflow would satisfy your comment about core interpretable evaluation, whilst also supporting a growing 'community-driven' evaluation suite.
@burtenshaw Thank you for the quick reply! I will break my thoughts into two parts:
- Small LLMs – I understand that we have the open LLM leaderboard, but I'm unclear about the benefits of using a small LLM in this context. I would assume that larger models would likely perform better than smaller ones. My suggestion is to choose a specific task, such as math, and optimize the model for it.
- VLMs – I believe we can also use the VLM leaderboard and select a specific task to optimize. Alternatively, we could use a dataset (such as one with stable diffusion models) and apply the methods from this work to count the relevant data.
This is correct, and the Open LLM Leaderboard can still be used to compare smaller models. See this filter.
I agree with this; I just want to extend it so that others can also choose specific tasks for the student leaderboard.
OK. The VLM leaderboard uses the VLMEvalKit library. I'm definitely open to other evaluations, which relates to my comment above, as long as it's something we can maintain. To summarise, I think your proposal is compatible with the PR; we would just need to add a core specific task.
Some students have asked for a capstone project that allows them to try out their skills and get collective feedback on their work.
We have another discussion going on here about a capstone using a student leaderboard. Also, the evaluation module already has a project in it, which could be converted to a capstone.
Evaluation Capstone
This could use the existing material in module_4 and ask the students to implement an LLM evaluation task. They could then share their evaluation task and results. We could publish the results in a leaderboard on the Hugging Face hub.
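As a rough sketch of the "publish results on the Hub" part (the repository name, column names, and score values below are illustrative assumptions, not a defined submission format), each student could push their scores to a shared results dataset that a leaderboard could then read:

```python
# Hedged sketch: share one student's evaluation results as a Hub dataset row.
from datasets import Dataset

results = Dataset.from_dict({
    "username": ["student-handle"],                        # hypothetical submitter id
    "task": ["exact_match_math"],                          # hypothetical task name
    "model": ["HuggingFaceTB/SmolLM2-135M-Instruct"],
    "score": [0.42],                                       # illustrative score
})

# In practice, a real leaderboard would need a convention for merging
# submissions (e.g. one config or file per student); this simply pushes
# a single submission to a results repo for illustration.
results.push_to_hub("your-org/capstone-leaderboard")
```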
RAG Capstone
@duydl proposed a capstone project on RAG here.