Hacktoberfest 2024
Thanks for your interest in contributing to Evidently! This page outlines how you can contribute to Evidently during Hacktoberfest (and beyond!).
Evidently is an open-source Python library for ML and LLM evaluation and observability. It helps evaluate, test, and monitor AI-powered systems and data pipelines, from experimentation to production.
- Works with both predictive and generative systems.
- Supports tabular data, text data, and embeddings.
- Includes 100+ built-in metrics, from data drift detection to LLM evaluations.
- Offers an open architecture: easily export data to your existing tools.
If this is your first time using Evidently, you can get started in just a few minutes by following our quickstart guides.
There are many ways to contribute to Evidently, and you can find more details in our Contribution Guide.
We appreciate all contributions, no matter how small. Even fixing a typo in the documentation is a valuable contribution! Don’t hesitate to submit a pull request.
In addition, during Hacktoberfest, we invite you to make a specific type of contribution: adding new LLM evaluation methods.
You can look for open issues in the repository that are labeled hacktoberfest and good-first-issue: Hacktoberfest Issues on GitHub
In this guide, we explain more about adding LLM evals.
Evidently provides various evaluation methods for tabular data, ML models, and LLMs. When working with text data, we use Descriptors to evaluate specific text properties.
A Descriptor is a row-level score that assesses a specific characteristic of text data. It can evaluate data in a single column or compare two columns. For example, a descriptor might:
- Measure text length.
- Analyze sentiment.
- Perform semantic similarity checks between texts in two columns.
Descriptors can be:
- Deterministic: Like using regular expressions to match keywords.
- ML-based: Such as using a pre-trained model for sentiment analysis.
- LLM-based: Where you call an LLM with a templated evaluation prompt.
To run an evaluation, you must provide a table where one or more columns contain text data. Evidently processes the entire table, returning both row-level descriptors and a summary score or test result for the complete dataset.
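For illustration, here is a minimal sketch of such a run. It assumes the Report, TextEvals, and descriptor imports as documented for recent Evidently versions; the column name and example rows are made up:

```python
import pandas as pd

from evidently.report import Report
from evidently.metric_preset import TextEvals
from evidently.descriptors import TextLength, Sentiment

# A toy dataset: each row holds one text to evaluate.
data = pd.DataFrame({
    "response": [
        "Sure, here is a short summary of the document.",
        "I cannot help with that request.",
    ]
})

# TextEvals applies row-level Descriptors to the chosen column and
# summarizes the distribution of their values across the dataset.
report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
        TextLength(),
        Sentiment(),
    ])
])

report.run(reference_data=None, current_data=data)
report.save_html("text_evals_report.html")  # or report.json() for a JSON Report
```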
You can use Descriptors to generate visual Reports or JSON Reports that summarize the distribution of scores across all data points.
You can also create Tests that check specific conditions (e.g., whether all text lengths fall within a specified range) and bundle them in a TestSuite.
Finally, you can use descriptors for continuous monitoring: as you compute multiple Reports or Test Suites, Evidently can parse the values and plot them on a dashboard.
These features are all built into the open-source Evidently library: you can use any Descriptor in a Report or Test Suite, and get a monitoring Dashboard on top of them.
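As a sketch of the testing side, descriptors can be attached to column-level tests via .on(), following the pattern shown in the Evidently docs (the thresholds here are arbitrary and the column name is assumed):

```python
import pandas as pd

from evidently.test_suite import TestSuite
from evidently.tests import TestColumnValueMin, TestColumnValueMean
from evidently.descriptors import TextLength, Sentiment

data = pd.DataFrame({"response": ["A short reply.", "A much longer and more detailed reply."]})

# Each test checks a condition on the row-level descriptor values
# computed for the "response" column.
tests = TestSuite(tests=[
    TestColumnValueMin(column_name=TextLength().on("response"), gte=10),
    TestColumnValueMean(column_name=Sentiment().on("response"), gte=0),
])

tests.run(reference_data=None, current_data=data)
tests.json()  # machine-readable pass/fail results; in a notebook, `tests` renders a summary
```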
You can see the full list of available Descriptors in the Evidently documentation:
Available Descriptors
You can also easily implement custom Descriptors.
While you can always implement your custom Descriptors, many checks are repeatable across use cases. It makes sense to package them into built-in Descriptors, making them easier for others to reuse.
During Hacktoberfest, we invite you to contribute new Descriptors. Here’s how you can get started:
- Pick an existing Issue: We’ve already posted several ideas; see Hacktoberfest Issues.
- Propose your own Idea: If you’ve got an idea for a new Descriptor, open a new GitHub issue and let us know! We’ll review it to make sure it doesn’t overlap with existing contributions and that it fits within the scope of the library.
- Rule-based descriptors: Most of the examples you’ll see are rule-based. You’ll write custom code to handle specific behaviors, often using something like regular expressions to catch patterns in the text.
- LLM-based descriptors: You can also suggest new descriptors that use large language models as evaluators: for example, to evaluate Verbosity (VerbosityLLMEval) or Coherence (CoherenceLLMEval).
But there are a few things to keep in mind:
- We’re already working on several RAG-focused descriptors (like factuality, faithfulness, answer completeness, and context relevance) and conversation-level metrics (like chat success and chat sentiment). We won’t accept contributions for these specific metrics at the moment to avoid overlap.
- The key to contributing an LLM-based descriptor isn’t just coming up with the prompt. To evaluate the quality of an LLM-powered descriptor, you’ll need to provide a diverse, representative dataset of at least 200 examples, including 50 examples of a “negative” class. This dataset can be a mix of synthetic data you created yourself and public data, as long as it’s relevant to the task and available for use (including commercial use, under a permissive license like MIT or Apache 2.0, or released into the public domain).
So if you want to suggest a new LLM-based descriptor, please open an issue first with your proposal. We will only accept contributions of new prompt-based descriptors together with an evaluation dataset.
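For orientation, built-in LLM-based descriptors are defined through a prompt template rather than a raw prompt string. Below is a rough sketch of that pattern using the LLMEval and BinaryClassificationPromptTemplate interfaces described in the Evidently docs; the “conciseness” criteria and category labels are invented for illustration, and exact parameter names may differ between versions:

```python
from evidently.descriptors import LLMEval
from evidently.features.llm_judge import BinaryClassificationPromptTemplate

# A hypothetical "Conciseness" judge: the template is turned into a binary
# classification prompt that an LLM answers for every row of the text column.
conciseness_eval = LLMEval(
    subcolumn="category",  # return the predicted label for each row
    template=BinaryClassificationPromptTemplate(
        criteria="""A CONCISE answer conveys the necessary information without
unnecessary repetition or filler. A VERBOSE answer is padded with redundant detail.""",
        target_category="CONCISE",
        non_target_category="VERBOSE",
        include_reasoning=True,
    ),
    provider="openai",
    model="gpt-4o-mini",
    display_name="Conciseness",
)

# It can then be used like any other descriptor, for example:
# TextEvals(column_name="response", descriptors=[conciseness_eval])
```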
For general instructions, like how to clone the repository, check out our Contribution Guide.
Here’s a step-by-step guide to help you add a new Descriptor.
- Choose what to work on. Start by picking a descriptor to work on from the existing issues, or open a new issue with your idea. If you proposed a new idea, wait for us to confirm or provide feedback on your proposal. In some cases, we’ll discuss the design before you start coding—it’s just as important as the code itself!
- Fork the repository. Fork the Evidently GitHub repository to your account to start working on the implementation.
- Create a Descriptor. Implement the Descriptor you’ve chosen. You can work with deterministic, ML-based, or LLM-based descriptors. Take a look at existing descriptors for inspiration on implementation. Note that we prioritize implementations that do not introduce new dependencies to the Evidently library. If your implementation requires a new dependency, open an issue first for discussion.
- Test your descriptor. Test it locally to make sure it works with different types of text data, and implement software tests that you can include in the contribution so the new functionality is covered (a minimal sketch appears after this list).
- Submit a Pull Request. Once you’ve tested your descriptor and added tests, submit your PR following the contribution guidelines. Then, wait for the review and approval process.
- Update documentation. Once your descriptor is approved, add it to the list of all descriptors under the Text Evals heading on the documentation page.
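As a rough illustration of the “Test your descriptor” step, a minimal test might look like the sketch below. TextLength stands in for your new descriptor, and real tests should follow the conventions of the existing test files in the repository:

```python
import pandas as pd

from evidently.report import Report
from evidently.metric_preset import TextEvals
from evidently.descriptors import TextLength  # replace with your new descriptor


def test_descriptor_runs_on_simple_data():
    data = pd.DataFrame({"response": ["short", "a somewhat longer piece of text"]})

    report = Report(metrics=[
        TextEvals(column_name="response", descriptors=[TextLength()]),
    ])
    report.run(reference_data=None, current_data=data)

    # The report should compute without errors and produce a serializable result.
    result = report.as_dict()
    assert "metrics" in result
```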
If you have any questions or need support, join the Evidently community on Discord. You can ask questions and get help from other contributors and maintainers.
We hosted a Community Call on October 3 to show an example implementation of the new descriptor. You can check the video recording here.
Feel free to share your contributions on Twitter or other social media using the hashtags #Hacktoberfest and #EvidentlyHacktoberfest!