[Review] #27

druce · 2025-01-27T17:19:37Z

Format

What's the book format where you found this issue?
[ ] pdf
[x] web
[ ] ipynb

Chapter

evals

Issue Description

amazeballs, a few small thoughts

GLIDER - what is a "3b evaluator llm"? The paper says 3.8b parameters (it also says "3b evaluator", but I couldn't figure out whether that means something else or is just a typo).

maybe add log loss here: For discriminative tasks, LLM-based applications may produce log-probabilities or discrete predictions, so traditional machine learning metrics like log loss, accuracy, precision, recall, and F1 score can be applied.
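
To make the suggestion concrete, here is a minimal sketch of those metrics for a binary task, in plain Python. The labels, probabilities, 0.5 threshold, and function names are all made up for illustration and are not taken from the book; it only assumes the application can emit a probability for the positive class.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood for binary labels (0 or 1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from discrete 0/1 predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical data: gold labels and the model's probability of class 1.
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.7]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
metrics["log_loss"] = log_loss(y_true, y_prob)
print(metrics)
```

Log loss uses the probabilities directly, so it penalizes confident wrong answers, while the other four only see the thresholded predictions.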

maybe Artificial Analysis is worth a mention in the review of benchmarks, maybe not. I like that it does performance analysis, latency/throughput, a nice dashboard to look up e.g. R1 and see how it stacks up, API providers, meta analysis: https://artificialanalysis.ai/models/deepseek-r1

in the table comparing LangSmith, Promptfoo, and LightEval, it seems noteworthy that LangSmith needs an API key and collects all your prompts and traces, even though in the "got a question" section of their website they say they never look at them: https://www.langchain.com/langsmith

just my midwit observations

souzatharsis · 2025-01-27T20:03:58Z

Great points, Druce! We will incorporate your feedback shortly. Best regards,

…

-- Thársis <http://linkedin.com/in/tharsissouza>

