docs: synthetic data, labs and pipelines general documentation (#72)

* docs: define Synthetic Data structure
* docs: add new documentation for Pipelines, Synthetic Data and Labs.
* docs: add new documentation regarding Labs
* fix(linting): code formatting
* docs: update labs with academy link
* fix(linting): code formatting

Co-authored-by: Fabiana Clemente <[email protected]>
Co-authored-by: Azory YData Bot <[email protected]>

1 parent 8bb3697 · commit 8489a91

Showing 21 changed files with 545 additions and 4 deletions.

# Fabric coding environment

^^[**YData Fabric Labs**](https://ydata.ai/products/fabric)^^ are on-demand, cloud-based data development environments with automatically provisioned hardware (multiple infrastructure configurations,
including GPUs, are possible) and **full platform integration** via a Python interface (allowing access to Data Sources, Synthesizers,
and the Workspace’s shared files).
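
A quick sketch of what that Python interface enables is shown below. The package, class and method names follow YData's public SDK examples and are assumptions rather than the authoritative in-Lab API; check YData's documentation for the exact interface available in your environment.

```python
# Illustrative sketch only: import paths and class names are assumptions
# based on YData's public SDK examples, not a verified Fabric Lab API.
import pandas as pd
from ydata.sdk.synthesizers import RegularSynthesizer  # assumed import path

train_df = pd.read_csv("my_dataset.csv")  # placeholder dataset path

synth = RegularSynthesizer()
synth.fit(train_df)                    # train a synthesizer on the tabular data
sample = synth.sample(n_samples=1000)  # draw synthetic records
print(sample.head())
```
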
With Labs, you can create environments with support for familiar IDEs like [**Visual Studio Code**](https://code.visualstudio.com/), [**Jupyter Lab**](https://jupyterlab.readthedocs.io/en/stable/)
and [**H2O Flow**](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/flow.html), and both Python and R are supported.
For Python specifically, pre-configured bundles including TensorFlow, PyTorch and/or the most popular data science libraries
are also available, jumpstarting data development. Additional libraries can be easily installed with a simple *!pip install*.
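
As a minimal example (a notebook cell in a Jupyter-based Lab; `catboost` is used purely as a placeholder package), installing and importing an extra library looks like this:

```python
# Notebook cell in a Lab: "%pip" installs into the environment backing the
# current kernel. catboost is only an example package.
%pip install catboost

import catboost
print(catboost.__version__)  # confirm the installation worked
```
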
<p align="center"><img src="assets/labs/welcome_labs_creation.webp" alt="Welcome Labs" width="900"></p>

## Get started with your first lab

🧪 Follow this [step-by-step guided tutorial to create your first Lab](../get-started/create_lab.md).

## Tutorials & recipes

Leverage YData's extensive collection of ^^[tutorials and recipes that you can find in YData Academy](https://github.com/ydataai/academy)^^. Quickstart or accelerate your data developments
with recipes and tutorial use-cases.

# Overview

Labs exist for data practitioners to tackle more complex use cases through a familiar environment supercharged with infrastructure,
integration with other Fabric modules and access to advanced synthesis and profiling technology via a familiar Python interface.

It is the preferred environment for data practitioners to express their domain expertise with all the required tools,
technology and computational power at their fingertips. It is thus the natural continuation of the data understanding work which
started in Data Sources.
## Supported IDEs and images

### IDEs
YData Fabric supports integration with various Integrated Development Environments (IDEs) to enhance productivity and streamline workflows.
The supported IDEs include:

- **Visual Studio Code (VS Code):** A highly versatile and widely used code editor that offers robust support for numerous programming languages
and frameworks. Its integration with Git and extensions like GitLens makes it ideal for version control and collaborative development.
- **Jupyter Lab:** An interactive development environment that allows for notebook-based data science and machine learning workflows.
It supports seamless Git integration through extensions and offers a user-friendly interface for managing code, data, and visualizations.
- **H2O Flow:** A web-based interface specifically designed for machine learning and data analysis with the H2O platform.
It provides a flow-based, interactive environment for building and deploying machine learning models.
### Labs images
In the Labs environment, users have access to the following default images, tailored to different computational needs:

#### Python
All the images below support Python as the programming language. The current Python version is x.
- **YData CPU:** Optimized for general-purpose computing and data analysis tasks that do not require GPU acceleration. This image includes access
to YData Fabric's unique capabilities for data processing (profiling, constraints engine, synthetic data generation, etc.).
- **YData GPU:** Designed for tasks that benefit from GPU acceleration, providing enhanced performance for large-scale data processing and machine learning
operations. Also includes access to YData Fabric's unique capabilities for data processing.
- **YData GPU TensorFlow:** Specifically configured for TensorFlow-based machine learning and deep learning applications, leveraging GPU capabilities
to accelerate training and inference processes.
- **YData GPU Torch:** Specifically configured for Torch-based machine learning and deep learning applications, leveraging GPU capabilities
to accelerate training and inference processes.

These images ensure that users have the necessary resources and configurations to efficiently conduct their data science and machine learning projects within the Labs environment.
#### R
An ^^[image for R](https://www.r-project.org/about.html#:~:text=Introduction%20to%20R,by%20John%20Chambers%20and%20colleagues.)^^ that allows you
to leverage the latest version of the language as well as the most used libraries.
## Existing Labs

Existing Labs appear in the *Labs* pane of the web application. Besides information about their settings and status, three buttons exist:

- **Open:** Opens the Lab’s IDE in a new browser tab.
- **Pause:** Pauses the Lab. When resumed, all data will be available.
- **Delete:** Deletes the Lab. Data not saved in the workspace’s shared folder (see below) will be deleted.
![The details list of a Lab, with the status and its main actions.](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/f6b25172-047e-47bd-8ab2-c9a0a45731ae/Untitled.png)

The details list of a Lab, with the status and its main actions.
The Status column indicates the Lab’s status. A Lab can have 4 statuses:

- 🟢 Lab is running
- 🟡 Lab is being created (hardware is being provisioned), pausing, or starting
- 🔴 Lab was shut down due to an error. A common error is the Lab running out of memory. Additional details are offered in the web application.
- ⚫ Lab is paused
## Git integration
Integrating Git with Jupyter Notebooks and Visual Studio Code (VS Code) streamlines version control and collaborative workflows
for data developers. This integration allows you to track changes, manage project versions, and collaborate effectively within familiar interfaces.
### Jupyter Lab

Inside Labs that use Jupyter Lab as the IDE, you will find the ^^[*jupyterlab-git*](https://github.com/jupyterlab/jupyterlab-git)^^
extension installed in the environment.

To create or clone a repository, perform the following steps:
| Select Jupyter Lab Git extension | Cloning a repository to your local env |
|----------------------------------------------------------------|------------------------------------------------------|
| ![Jupyter Lab git](../assets/labs/jupyterlab_git_extension.webp) | ![Cloning](../assets/labs/cloning_jupyterlab.webp) |

For more complex actions like forking and merging branches, see the gif below:

![Jupyterlab-git extension in action](../assets/labs/jupyterlab-git.gif){: style="width:80%"}
### Visual Studio Code (VS Code)

To clone or create a new Git repository, click *"Clone Git Repository..."* and paste the repository URL into the text box at the top center of the screen,
as depicted in the image below.
| Clone Git repository | Cloning a repository to your local env |
|--------------------------------------------------------------------------------|--------------------------------------------------------------|
| ![Vs code clone repo](../assets/labs/git_integration_vscode.webp) | ![Cloning vs code](../assets/labs/cloning_repo_vscode.webp) |
## Building Pipelines
Building data pipelines and breaking them down into modular components can be challenging.
For instance, a typical machine learning or deep learning pipeline starts with a series of preprocessing steps,
followed by experimentation and optimization, and finally deployment.
Each of these stages presents unique challenges within the development lifecycle.

Fabric Jupyter Labs simplifies this process by incorporating Elyra as the Pipeline Visual Editor.
The visual editor enables users to build data pipelines from notebooks, Python scripts, and R scripts, making it easier to convert multiple notebooks
or script files into batch jobs or workflows.

Currently, these pipelines can be executed either locally in JupyterLab or on Kubeflow Pipelines, offering flexibility and scalability
for various project needs. ^^[Read more about pipelines.](../pipelines/index.md)^^

# Concepts

An example pipeline (as seen in the Pipelines module of the dashboard), where each single-responsibility block corresponds to a step in a typical machine learning workflow.

Each Pipeline is a set of connected blocks. A block is a self-contained set of code, packaged as a container, that performs one step in the Pipeline. Usually, each Pipeline block corresponds to a single-responsibility task in a workflow. In a machine learning workflow, each step would correspond to one block, e.g., data ingestion, data cleaning, pre-processing, ML model training, ML model evaluation.
Each block is parametrized by (see the sketch after this list):

- **code:** the code it executes (for instance, a Jupyter Notebook, a Python file, an R script)
- **runtime:** the container environment it runs in, allowing modularization and inter-step independence of software requirements (for instance, specific Python versions for different blocks)
- **hardware requirements:** depending on the workload, a block may have different needs regarding CPU/GPU/RAM. These requirements are automatically matched with the hardware availability of the cluster the platform is running in. This, combined with the modularity of each block, allows cost and efficiency optimizations by up/downscaling hardware according to the workload.
- **file dependencies:** local files that need to be copied to the container environment
- **environment variables:** useful, for instance, to apply specific settings or inject authentication credentials
- **output files:** files generated during the block’s workload, which will be made available to all subsequent Pipeline steps
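
Since Fabric's Pipelines are based on Kubeflow Pipelines, a minimal sketch of how these parameters map to code with the KFP v1 Python SDK could look as follows. The step functions, container images and resource values are illustrative assumptions, not a prescribed Fabric configuration:

```python
from kfp import dsl
from kfp.components import create_component_from_func
from kubernetes.client.models import V1EnvVar


def ingest(rows: int) -> str:
    """Toy ingestion step: pretends to pull `rows` records."""
    return f"ingested {rows} rows"


def train(message: str) -> None:
    """Toy training step that consumes the previous step's output."""
    print("training after:", message)


# "runtime": each block runs in its own container image.
ingest_op = create_component_from_func(ingest, base_image="python:3.10")
train_op = create_component_from_func(train, base_image="python:3.10")


@dsl.pipeline(name="toy-data-pipeline", description="Illustrative two-block pipeline")
def toy_pipeline(rows: int = 1000):
    step1 = ingest_op(rows)
    # "hardware requirements": per-block CPU/RAM requests.
    step1.set_cpu_request("1")
    step1.set_memory_request("2G")

    # "output files": a block's outputs are available to subsequent steps.
    step2 = train_op(step1.output)
    # "environment variables": e.g. settings or injected credentials.
    step2.add_env_variable(V1EnvVar(name="MY_SETTING", value="demo"))
```
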
The hierarchy of a Pipeline, in ascending order, is as follows:

- **Run:** A single execution of a Pipeline. Usually, Pipelines are run due to changes in the code,
in the data sources or in their parameters (as Pipelines can have runtime parameters)
- **Experiment:** A group of runs of the same Pipeline (which may have different parameters, code or settings, and are
then easily comparable). All Runs must belong to an Experiment. An Experiment can contain Runs from different Pipelines.
- **Pipeline Version:** Pipeline definitions can be versioned (for instance, early iterations on the flow of operations;
different versions for staging and production environments)
- **Pipeline**

📖 ^^[Get started with the concepts and a step-by-step tutorial](../get-started/create_pipeline.md)^^
## Runs & Recurring Runs
A *run* is a single execution of a pipeline. Runs comprise an immutable log of all experiments that you attempt,
and are designed to be self-contained to allow for reproducibility. You can track the progress of a run by looking
at its details page on the pipeline's UI, where you can see the runtime graph, output artifacts, and logs for each step
in the run.

A *recurring run*, or job in the backend APIs, is a repeatable run of a pipeline.
The configuration for a recurring run includes a copy of a pipeline with all parameter values specified
and a run trigger. You can start a recurring run inside any experiment, and it will periodically start a new copy
of the run configuration. You can enable or disable the recurring run from the pipeline's UI. You can also specify
the maximum number of concurrent runs to limit the number of runs launched in parallel.
This can be helpful if the pipeline is expected to run for a long period and is triggered to run frequently.
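
With the underlying Kubeflow Pipelines SDK, a recurring run can be sketched roughly like this. The experiment name, cron expression and compiled package path are placeholders:

```python
import kfp

client = kfp.Client()  # inside a Lab this typically resolves the in-cluster endpoint

experiment = client.create_experiment(name="nightly-data-prep")

# Schedule a compiled pipeline package ("pipeline.yaml" is a placeholder path)
# to run every day at 02:00, with at most one run in flight at a time.
client.create_recurring_run(
    experiment_id=experiment.id,
    job_name="nightly-data-prep-job",
    cron_expression="0 0 2 * * *",  # KFP cron expressions include a seconds field
    max_concurrency=1,
    pipeline_package_path="pipeline.yaml",
    params={"rows": 1000},
    enabled=True,
)
```
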
## Experiment
An experiment is a workspace where you can try different configurations of your pipelines. You can use experiments to organize
your runs into logical groups. Experiments can contain arbitrary runs, including recurring runs.
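
Programmatically, experiments can also be created and inspected through the same SDK; a short sketch (names are placeholders):

```python
import kfp

client = kfp.Client()

# Group related work under one experiment.
experiment = client.create_experiment(name="synthetic-data-experiments")

# Later, list everything that ran under that experiment.
response = client.list_runs(experiment_id=experiment.id, page_size=20)
for run in response.runs or []:
    print(run.name, run.status)
```
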
## Pipeline & Pipeline Version
A pipeline is a description of a workflow, which can include machine learning (ML) tasks, data preparation or even the
generation of synthetic data. The pipeline outlines all the components involved in the workflow and illustrates how these
components interrelate in the form of a graph. The pipeline configuration defines the inputs (parameters) required to run
the pipeline and specifies the inputs and outputs of each component.

When you run a pipeline, the system launches one or more Kubernetes Pods corresponding to the steps (components)
in your workflow. The Pods start Docker containers, and the containers, in turn, start your programs.

Pipelines can be easily versioned for reproducibility of results.
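
In Kubeflow Pipelines terms, versioning amounts to registering new versions of a previously uploaded pipeline; a rough sketch with the KFP v1 client (file names and version labels are placeholders):

```python
import kfp

client = kfp.Client()

# First upload registers the pipeline itself ("pipeline.yaml" is a placeholder
# for a package produced by the KFP compiler).
pipeline = client.upload_pipeline(
    pipeline_package_path="pipeline.yaml",
    pipeline_name="data-quality-pipeline",
)

# Later iterations are registered as new versions of the same pipeline,
# keeping earlier definitions around for reproducibility.
client.upload_pipeline_version(
    pipeline_package_path="pipeline_v2.yaml",
    pipeline_version_name="v2-staging",
    pipeline_id=pipeline.id,
)
```
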
## Artifacts
For each block/step in a Run, **Artifacts** can be generated.
Artifacts are raw output data which is automatically rendered in the Pipeline’s UI in a rich manner - as formatted tables, text, charts, bar graphs/scatter plots/line graphs,
ROC curves, confusion matrices or inline HTML.

Artifacts are useful to attach, to each step/block of a data improvement workflow, relevant visualizations, summary tables, data profiling reports or text analyses.
They are logged by creating a JSON file with a simple, pre-specified format (according to the output artifact type).
Additional types of artifacts are supported (like binary files - models, datasets), yet they will not benefit from rich visualizations in the UI.
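
Because Fabric Pipelines build on Kubeflow Pipelines, one common way to log such artifacts is the KFP v1 convention of writing `/mlpipeline-ui-metadata.json` (rich outputs) and `/mlpipeline-metrics.json` (metrics) from inside a step. The snippet below is a sketch under that assumption, with made-up values:

```python
import json

# Rich UI artifacts: an inline Markdown summary and a small inline CSV table.
ui_metadata = {
    "outputs": [
        {"type": "markdown", "storage": "inline",
         "source": "# Profiling summary\nNo critical issues found."},
        {"type": "table", "storage": "inline", "format": "csv",
         "header": ["column", "missing_values"],
         "source": "age,0\nincome,12\n"},
    ]
}
with open("/mlpipeline-ui-metadata.json", "w") as f:
    json.dump(ui_metadata, f)

# Metrics: simple numeric values the UI can compare across runs.
metrics = {"metrics": [{"name": "rows-dropped", "numberValue": 42, "format": "RAW"}]}
with open("/mlpipeline-metrics.json", "w") as f:
    json.dump(metrics, f)
```
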
!!! tip "Compare side-by-side"
    💡 **Artifacts** and **Metrics** can be compared side-by-side across runs, which makes them a powerful tool when doing iterative experimentation over
    data quality improvement pipelines.

## Pipelines examples in YData Academy
👉 ^^[Use cases on YData’s Academy](https://github.com/ydataai/academy/tree/master/4%20-%20Use%20Cases)^^ contain examples of full use-cases as well as of the Pipelines interface to log metrics and artifacts.

# Pipelines

The Pipelines module of [YData Fabric](https://ydata.ai/products/fabric) is a general-purpose job orchestrator with built-in scalability and modularity
plus reporting and experiment tracking capabilities.
With **automatic hardware provisioning**, **on-demand** or **scheduled execution**, **run fingerprinting**
and a **UI interface for review and configuration**, Pipelines equip the Fabric with
**operational capabilities for interfacing with up/downstream systems**
(for instance to automate data ingestion, synthesis and transfer workflows) and with the ability to
**experiment at scale** (crucial during the iterative development process required to discover the data
improvement pipeline yielding the highest quality datasets).
YData Fabric's Pipelines are based on ^^[Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/)^^
and can be created via an interactive interface in Labs with Jupyter Lab as the IDE **(recommended)** or
via [Kubeflow Pipeline’s Python SDK](https://www.kubeflow.org/docs/components/pipelines/sdk/sdk-overview/).
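
As a brief sketch of the SDK route (assuming the KFP v1 SDK; the pipeline, image and file name below are illustrative):

```python
import kfp
from kfp import dsl


@dsl.pipeline(name="hello-fabric", description="Minimal illustrative pipeline")
def hello_pipeline():
    dsl.ContainerOp(
        name="say-hello",
        image="python:3.10",
        command=["python", "-c", "print('hello from a pipeline step')"],
    )


# Compile to a package that can be uploaded through the Pipelines UI...
kfp.compiler.Compiler().compile(hello_pipeline, "hello_pipeline.yaml")

# ...or submit it directly from a Lab notebook.
client = kfp.Client()
client.create_run_from_pipeline_func(hello_pipeline, arguments={})
```
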
With their full integration with Fabric's scalable architecture and the ability to leverage Fabric’s Python interface,
Pipelines are the recommended tool to **scale up notebook work to experiment at scale** or
**move from experimentation to production**.
## Benefits
Using Pipelines for data preparation offers several benefits, particularly in the context of data engineering,
machine learning, and data science workflows. Here are some key advantages:

- **Modularity:** pipelines allow you to break down data preparation into discrete, reusable steps.
Each step can be independently developed, tested, and maintained, enhancing code modularity and readability.
- **Automation:** they automate the data preparation process, reducing the need for manual intervention
and ensuring that data is consistently processed. This leads to more efficient workflows and saves time.
- **Scalability:** Fabric's distributed infrastructure combined with Kubernetes-based pipelines allows you to handle
large volumes of data efficiently, making them suitable for big data environments.
- **Reproducibility:** by defining a series of steps that transform raw data into a ready-to-use format,
pipelines ensure that the same transformations are applied every time. This reproducibility is crucial for
maintaining data integrity and for validating results.
- **Maintainability:** modular, well-defined steps are easier to maintain, debug, and evolve over time.
- **Versioning:** pipelines support versioning of the data preparation steps. This versioning is crucial
for tracking changes, auditing processes, and rolling back to previous versions if needed.
- **Flexibility:** above all, pipelines can be customized to fit the specific requirements of different projects.
They can be adapted to include various preprocessing techniques, feature engineering steps,
and data validation processes.
## Related Materials
- 📖 ^^[How to create your first Pipeline](../get-started/create_pipeline.md)^^
- :fontawesome-brands-youtube:{ .youtube } <a href="https://www.youtube.com/watch?v=feNoXv34waM"><u>How to build a pipeline with YData Fabric</u></a>