💨 Notus 7B v2 #4
I strongly feel that Notus-7B-v1 is seriously underrated in the open-LLM community. I stumbled upon it around the same time I was looking at spin-offs of Yi-34B, such as SUStech-34B, whose glamorously high benchmarks were being editorialized on its model page. Yet in my testing, SUStech-34B actually produces relatively generic outputs that are exceptionally vague even when instructed not to be vague (it performs worse than Yi-34B, even). Notus-7B-v1, on the other hand, is gorgeously rich in detail (when told to be), and provides as many or more bullet points, and in most cases more detail, than base Yi-34B across a variety of subjects (US history, EU history, politics, architecture, early history, English literature, etc.). Argilla is seriously not getting enough credit here. Benchmarks are just numbers, yet hype seems to be consistently built around them, unfortunately. I'm personally very excited for anything Argilla comes out with in the future. Notus is an exceptional refinement over Zephyr, which is itself a notable refinement over the original Mistral model.
After completing v1 of Notus 7B, we identified some things to improve based on the experiments we ran and the results we obtained. Even though we showed that we could surpass Zephyr 7B Beta by putting more effort into data curation, we wanted to iterate on Notus 7B to build a more robust dataset for DPO fine-tuning.
Find all the relevant information about Notus 7B v1 at https://huggingface.co/collections/argilla/notus-7b-v1-655529d7c73cb6c830e9555a
Notus 7B v2
As mentioned before, we want to dedicate some time and effort to data curation and review, and, if applicable, generate a new curated dataset that is smaller than the original UltraFeedback and of higher quality. The curation will be conducted mainly by human annotators, but may also involve AI Feedback (AIF) data generated with distilabel. In this stage, we'll run some more experiments with the compute we currently have (8 x A100 40GB) and see whether less, higher-quality data has a considerable impact on model performance in both benchmarks and human evaluation.
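To illustrate the kind of curation described above, here is a minimal, hypothetical sketch of how rated completions (UltraFeedback-style) could be filtered into chosen/rejected pairs for DPO, keeping only prompts where the preference signal is strong. The field names, ratings, and the `min_gap` threshold are all assumptions for illustration, not the actual Notus pipeline.

```python
def to_dpo_pairs(records, min_gap=2.0):
    """Turn rated completions into DPO pairs, keeping only prompts whose
    best and worst completions differ by at least `min_gap` in rating.
    This yields a smaller, higher-contrast preference dataset."""
    pairs = []
    for rec in records:
        ranked = sorted(rec["completions"], key=lambda c: c["rating"], reverse=True)
        best, worst = ranked[0], ranked[-1]
        if best["rating"] - worst["rating"] >= min_gap:
            pairs.append({
                "prompt": rec["prompt"],
                "chosen": best["text"],
                "rejected": worst["text"],
            })
    return pairs

# Toy example: the second prompt is a near-tie, so it gets filtered out.
records = [
    {"prompt": "Explain DPO.", "completions": [
        {"text": "Detailed answer", "rating": 9.0},
        {"text": "Vague answer", "rating": 4.0},
    ]},
    {"prompt": "Ambiguous prompt", "completions": [
        {"text": "Answer A", "rating": 7.0},
        {"text": "Answer B", "rating": 6.5},
    ]},
]

pairs = to_dpo_pairs(records)
print(len(pairs))  # -> 1
```

In a real pipeline the ratings would come from human annotators or AIF, and the threshold trades dataset size against label reliability.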