Welcome to our getML Data Science Challenge! This challenge invites you to tackle a series of relational learning benchmark problems using getML FastProp, the fastest open-source tool for automated feature engineering on relational and time-series data.
- How to participate
- Available Challenges
- Get Started with getML FastProp
Ready to elevate your data science skills and make a real impact? Here's what awaits you:
- **Build better models faster:** Learn how to use getML FastProp (short for Fast Propositionalization) for automated feature extraction from relational and time-series data to deliver more accurate prediction models without manual SQL or deep business domain expertise. getML-community is open source (ELv2), so you can keep using it to turbocharge your next predictive analytics project and beyond!
- **Gain recognition and rewards:** Showcase your skills by surpassing benchmark scores with the help of getML, and receive €100 as a token of our gratitude for every accepted submission. Your contributions may also be featured in community spotlights and future publications!
- **Expand your network:** Connect with the getML dev team and other data scientists on our Discord. Share your knowledge, learn from others, and help shape the future of open-source automated feature engineering.
Enjoying what you see? We'd love for you to star our getML-community repository. It's a great way to stay updated and show your support for our work!
Note
This challenge will be open for submissions starting on the 17th of January 2025 at 16:00 CET!
We will periodically add new challenges until we run out of tasks, so there are more to come.
At getML, we've seen firsthand how FastProp simplifies and accelerates feature engineering for relational and time-series data. By launching this challenge, we aim to share these benefits with the broader data science and engineering community.
Our goal is twofold:
- Empower you to tackle relational data problems with ease by adding FastProp to your toolkit, enabling you to focus on insights rather than manual feature extraction.
- Collaborate with the community to gather feedback, learn from real-world use cases, and continue enhancing FastProp based on your experiences.
By participating, you'll not only advance your skills but also contribute to the development of cutting-edge, open-source tools that are shaping the future of learning on relational data.
Relational data, often found in databases with multiple linked tables, poses a significant hurdle for traditional machine learning algorithms that typically expect a single, flat table. Relational learning overcomes this by employing algorithms to engineer features and learn directly from these interconnected data structures. This approach unlocks valuable insights hidden within the relationships between data points, leading to more accurate prediction models.
At the heart of getML-community lies FastProp, our open-source algorithm specifically designed for efficient feature engineering on relational data. FastProp seamlessly transforms complex relational data into a single table format, making it compatible with any machine learning algorithm. This automation not only saves you valuable time and effort but also has the potential to reveal hidden patterns crucial for accurate predictions.
- Unmatched Speed: FastProp is engineered for speed and surpasses many existing methods in benchmarks (see the results).
- Simplicity: FastProp seamlessly integrates with the MLOps ecosystem, making it incredibly easy to incorporate into your workflow.
- Enhanced Productivity: By streamlining the tedious process of feature engineering, getML FastProp allows you to focus on the business-critical aspects of your project instead of writing thousands of lines of SQL.
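To make the idea concrete, here is a hand-rolled pandas sketch of what propositionalization does: collapsing a linked table into aggregates on a flat population table. The table and column names below are invented for illustration only; FastProp generates and evaluates far more candidate aggregations automatically.

```python
import pandas as pd

# Toy relational data: one row per customer (the "population" table)...
customers = pd.DataFrame({"customer_id": [1, 2, 3]})

# ...and many rows per customer in a linked ("peripheral") table.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "amount": [10.0, 20.0, 5.0, 5.0, 15.0, 40.0],
})

# Propositionalization: collapse the linked table into per-customer
# aggregates and attach them to the flat table. Here we hand-roll just
# COUNT, SUM, and AVG; FastProp automates the search over many such
# aggregations and picks the informative ones.
agg = (
    orders.groupby("customer_id")["amount"]
    .agg(order_count="count", amount_sum="sum", amount_avg="mean")
    .reset_index()
)
flat = customers.merge(agg, on="customer_id", how="left")
print(flat)
```

The resulting single table is compatible with any standard machine learning algorithm.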
RelBench, a project from SNAP (Stanford University), provides a standardized set of benchmark datasets and tasks for evaluating relational learning algorithms. It aims to accelerate research and development in this field by offering common ground for comparison. RelBench provides two baselines:
- **Manual feature engineering by an expert:** In RelBench, human data scientists manually engineered features using their domain expertise and knowledge of relational databases. This involved carefully selecting, aggregating, and transforming data from multiple tables to create informative features.
- **Graph-based neural networks:** Relational Deep Learning (RDL), developed at SNAP, is a family of relational learning approaches that mostly leverages graph neural networks to learn features from relational data.
We've been busy putting getML FastProp to the test on RelBench, and the results are quite impressive! We've already surpassed both RDL and human baselines on two tasks, and we believe there's plenty more potential to unlock. Now, it's your turn to explore the power of getML and see what you can achieve.
This challenge invites you to apply your data science skills and creativity to a series of unsolved RelBench tasks. We encourage you to experiment with getML FastProp, combine it with your favorite machine learning models, and see if you can surpass the existing baselines, and maybe even our own scores!
**This is what we are looking for:**
- Effective use of getML: Demonstrate a good understanding of getML data models and how to tune them for optimal performance.
- Performance: Aim to outperform at least one of the existing RelBench baselines.
- Sound and reproducible code: Well-structured, modern, and commented code that others can easily understand and execute.

**What you should bring:**
- Basic Python skills.
- Experience with scikit-learn or similar data science libraries.
- Time to get to know getML FastProp using the example notebooks and the user guide.
Remember, everything is allowed: use any tools available (search engines, LLMs), or ask us on Discord.
Ready to take on the challenge? Choose a task and start building your solution!
- **Pick an Unsolved Task:**
  - Refer to the table in this repo's README or the issues to find tasks without an open Draft Pull Request (PR).
  - You may have only one active PR at a time and can complete a maximum of two tasks in total.
- **Open a Draft PR:**
  - Review the Terms of Service & Licensing section and confirm your acceptance by including the specified details in your submission.
  - Create a Draft Pull Request on the getML relbench GitHub repository.
  - Title both your branch and the PR `[dataset-name]-[taskname]` (e.g., `rel-amazon-user-churn`).
  - This reserves the task for you. No one else can claim it while you are actively working on your open Draft PR.
- **Build a Predictor that Uses FastProp:**
  - Develop your solution using getML FastProp for feature engineering.
  - Document your code, pipeline, and reasoning clearly in the notebook.
  - We encourage you to use a regression or classification model of your choice on top of the getML-generated features (see our examples).
  - Any libraries you use must be published under an OSI-approved license or the Elastic License.
- **Stay Active:**
  - Continuously push updates to your PR to show progress.
  - If there's no commit, discussion, or meaningful interaction to advance the task within 5 days, we may close the Draft PR to make the task available to others.
  - We aim to respond to questions, provide feedback, and suggest improvements promptly. Any delays on our side will not count towards inactivity.
  - We reserve the right to determine whether a PR is being actively worked on.
- **Aim to Beat Existing Scores:**
  - Compare your final metric against the existing RelBench baselines (RDL and Human).
  - Strive to outperform at least one of them. We believe this is achievable for many of the challenges!
  - If you surpass both, that's excellent: we will consider merging your PR immediately.
- **Get Your PR Merged and Address Feedback:**
  - Submit a well-documented Jupyter Notebook that allows for easy reproduction of your results.
  - Once you feel that your notebook is ready for merging, mark your PR as Ready for Review.
  - We'll review your PR and provide feedback where needed.
  - Your PR will be merged into the main branch once it meets the criteria and successfully addresses our feedback.
- **After Merging the PR:**
  - Earn a €100 gift for each successfully merged notebook that outperforms at least one baseline.
  - We will contact you for the reward using the email address associated with your commit. Alternatively, you can securely provide us with a contact email address.
  - With your consent, we will gladly associate your name and social media profiles with the results on our communication channels.
  - If you need assistance, feel free to reach out to us on Discord or send an email to our support team.
We offer two methods to deliver the €100 gift: GitHub Sponsors or PayPal.
- We will contact you using the email address associated with your commit to gather the necessary details.
- If you are using GitHub's privacy feature to keep your email private, or if you prefer to be contacted at a different address, please follow these steps:
  - Use our encryption tool to securely encrypt your email address, so the reward reaches the correct person without requiring you to share your address publicly.
  - Include the encrypted email address in your PR. This lets us maintain your privacy while verifying the recipient.
We've prepared two example notebooks to help you get started:
- `hm-churn.ipynb`: A classification example using the H&M dataset, showcasing how to predict customer churn.
- `hm-item.ipynb`: A regression example using the H&M dataset, demonstrating how to forecast item sales using FastProp together with LightGBM and Optuna.
These notebooks provide a practical introduction to the getML workflow, from data loading and preprocessing to pipeline construction and evaluation.
Further information can be found in the User Guide and the API Reference.
Note
New challenges will be added on the 27th of January at 16:00 CET.
| Dataset | Task | PRs & Submissions | Task + Measure | Score getML | Score RDL | Score Human |
|---|---|---|---|---|---|---|
| rel-amazon | user-churn | | classification (AUROC) | – | 0.704 | 0.676 |
| | item-churn | PR #12 | classification (AUROC) | – | 0.828 | 0.818 |
| | user-ltv | | regression (MAE) | – | 14.313 | 13.928 |
| | item-ltv | | regression (MAE) | – | 50.053 | 41.122 |
| rel-f1 | driver-dnf | | classification (AUROC) | – | 0.726 | 0.698 |
| | driver-top3 | | classification (AUROC) | – | 0.755 | 0.824 |
| | driver-position | | regression (MAE) | – | 4.022 | 3.963 |
| rel-hm | user-churn | hm-churn.ipynb | classification (AUROC) | 0.707 | 0.699 | 0.690 |
| | item-sales | hm-item.ipynb | regression (MAE) | 0.031 | 0.056 | 0.036 |
| rel-trial | study-outcome | | classification (AUROC) | – | 0.686 | 0.720 |
| | study-adverse | | regression (MAE) | – | 44.473 | 40.581 |
| | site-success | | regression (MAE) | – | 0.400 | 0.407 |
We're excited to see your innovative solutions and contributions to the getML community. Good luck, happy coding, and let's push the boundaries of automated feature engineering together!
You can count on our support: Join our Discord for technical guidance, feedback on your approaches, and to interact with the getML dev team. We're committed to helping you succeed in this challenge.
To ease participation in the challenge, we have already prepared a base environment for you. To start hacking, you just need to clone this repository (or your fork of it).
To get started on Linux, just install uv and leverage our curated environment:

```shell
uv run jupyter lab
```
To get started on macOS and Windows, you first need to start the getML Docker service (note that the `-f -` flag, which reads the compose file from stdin, goes before the `up` subcommand):

```shell
curl https://raw.githubusercontent.com/getml/getml-community/1.5.0/runtime/docker-compose.yml | docker-compose -f - up
```
Afterwards, install uv and use the provided environment as above:

```shell
uv run jupyter lab
```
You are always free to use any other project manager of your choice, but we recommend sticking with uv.
For larger tasks and datasets, the size of the data might exceed the memory capacity of your local machine. One solution is to use the memory mapping feature, but keep in mind that this might increase compute time.
We recommend leveraging virtual machines from major cloud providers to handle such workloads. Many providers offer free trials or starter budgets that should cover your needs:
For most datasets, we highly recommend developing on a machine with 64 to 128GB of memory and more than 16 CPU cores to keep getML pipeline runtimes around 30 to 60 minutes. While smaller setups are possible, they will likely require getML's memory mapping to be turned on, resulting in increased runtimes overall.
For complete details regarding the terms of service, please refer to the terms of service (TOS). This document outlines all the guidelines, rules, and conditions associated with participation, including data protection, eligibility, and usage of resources.
The submitted notebook needs to contain the license information below and an acknowledgement of the terms of service (TOS) in a dedicated cell:

**License:**
All code in this notebook is licensed under the [MIT License](https://mit-license.org/).

**Text and Images License:** All non-code content (text, documentation, and images) is licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).

**Terms of service:**
I, [GitHub username], hereby acknowledge that I have read, understood, and agree to the [Terms of Service of the getML Community Data Science Challenge](https://github.com/getml/getml-relbench/blob/main/TOS.md).