Skip to content

Commit

Permalink
More lib doc changes especially links and ray runtime
Browse files Browse the repository at this point in the history
Signed-off-by: David Wood <[email protected]>
  • Loading branch information
daw3rd committed May 10, 2024
1 parent cf548af commit 54287c4
Show file tree
Hide file tree
Showing 4 changed files with 42 additions and 202 deletions.
10 changes: 10 additions & 0 deletions data-processing-lib/doc/python-runtime.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
## Python Runtime
The python runtime provides a simple mechanism to run a transform on a set of input data to produce
a set of output data, all within the python execution environment.

A `PythonTransformLauncher` class is provided that enables the running of the transform. For example,

```python
launcher = PythonTransformLauncher(YourTransformConfiguration())
launcher.launch()
```
70 changes: 28 additions & 42 deletions data-processing-lib/doc/ray-runtime.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,8 +11,7 @@ class extends transform's base TransformConfiguration implementation to add an o
* [TransformRuntime](../src/data_processing/runtime/ray/transform_runtime.py) -
this provides the ability for the transform implementor to create additional Ray resources
and include them in the configuration used to create a transform
(see, for example, [exact dedup](../../transforms/universal/ededup/src/ededup_transform.py)
or [blocklist](../../transforms/universal/blocklisting/src/blocklist_transform.py)).
(see, for example, [exact dedup](../../transforms/universal/ededup/src/ededup_transform.py)).
This also provide the ability to supplement the statics collected by
[Statistics](../src/data_processing/runtime/ray/transform_statistics.py) (see below).

Expand All @@ -31,13 +30,13 @@ is used for building execution metadata.

![Processing Architecture](processing-architecture.jpg)

## Transform Launcher
The [TransformLauncher](../src/data_processing/runtime/ray/transform_launcher.py) uses the Transform Configuration
## Ray Transform Launcher
The [RayTransformLauncher](../src/data_processing/runtime/ray/transform_launcher.py) uses the Transform Configuration
and provides a single method, `launch()`, that kicks off the Ray environment and transform execution coordinated
by [orchestrator](../src/data_processing/runtime/ray/transform_orchestrator.py).
For example,
```python
launcher = TransformLauncher(MyTransformConfiguration())
launcher = RayTransformLauncher(YourTransformConfiguration())
launcher.launch()
```
Note that the launcher defines some additional CLI parameters that are used to control the operation of the
Expand All @@ -47,30 +46,40 @@ number of workers, worker resources, etc.
Discussion of these options is beyond the scope of this document
(see [Launcher Options](ray-launcher-options) for a list of available options.)

## Transform Configuration
The
[DefaultTableTransformConfiguration](../src/data_processing/runtime/ray/transform_runtime.py)
class is sub-classed and initialized with transform-specific name, and implementation
and runtime classes. In addition, it is responsible for providing transform-specific
methods to define and filter optional command line arguments.
Finally, it creates the Transform Runtime, for which a default
implementation uses the class available in the Transform Configuration.
## Ray Transform Configuration
In general, a transform should be able to run in both the python and Ray runtimes.
As such we first define the python-only transform configuration, which will then
be used by the Ray-runtime-specific transform configuration.
The python transform configuration implements
[TransformConfiguration](../src/data_processing/transform/transform_configuration.py)
and deifnes with transform-specific name, and implementation
and class. In addition, it is responsible for providing transform-specific
methods to define and capture optional command line arguments.
```python

class MyTransformConfiguration(DefaultTableTransformConfiguration):
class MyTransformConfiguration(TransformConfiguration):

def __init__(self):
super().__init__(name="MyTransform", transform_class=MyTransform,
runtime_class=MyTransformRuntime
super().__init__(name="MyTransform", transform_class=MyTransform)
self.params = {}

def add_input_params(self, parser: ArgumentParser) -> None:
...
def apply_input_params(self, args: Namespace) -> bool:
...
def create_transform_runtime(self) -> DefaultTableTransformRuntime:
...
```
Details are covered in the samples below.
Next we define the Ray-runtime specific transform configuration as an exension of
the RayTransformConfiguration and uses the `MyTransformConfiguration` above.
```python

class MyTransformConfiguration(RayTransformConfiguration):
def __init__(self):
super().__init__(MyTransformConfiguration(),
runtime_class=MyTransformRuntime
```
This class provides the ability to create the instance of `MyTransformRuntime` class
as neede by the Ray runtime.
Details are covered in the [advanced transform tutorial](advanced-transform-tutorial.md).

## Transform Runtime
The
Expand Down Expand Up @@ -131,26 +140,3 @@ the transform.
In both cases, the framework will log the exception as an error.


## Tutorial Examples
With these basic concepts in mind, we start with a simple example and
progress to more complex transforms.
Before getting started you may want to consult the
[transform project root readme](../../transforms/README.md) documentation.

* [Simplest transform](simplest-transform-tutorial.md) -
Here we will take a simple example to show the basics of creating a simple transform
that takes a single input Table, and produces a single Table.
* [External resources transform](transform-external-resources.md) - shows how to load additional
resources (models, configuration, etc) for a transform.
* [Advanced transform](advanced-transform-tutorial.md)

Once a transform has been built, testing can be enabled with the testing framework:

* [Transform Testing](transform-standalone-testing) - shows how to test a transform
independent of the Ray framework.
* [End-to-End Testing](testing-e2e-transform.md) - shows how to test the
transform running in the Ray environment.

### Additional transform support

We also started a library of [transform utilities](transformer-utilities.md)
162 changes: 3 additions & 159 deletions data-processing-lib/doc/transform-runtimes.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,162 +4,6 @@ Runtimes provide a mechanism to run a transform over a set of input files to pro
output files. Each runtime is started using a _launcher_.
The available runtimes are as follows:

* [Python runtime](python-runtime.md)


### Ray Runtime
The Ray runtime includes the following set of components:

* [RayTransformLauncher](../src/data_processing/runtime/ray/transform_launcher.py) - this is a
class generally used to implement `main()` that makes use of a `TransformConfiguration` to
start the Ray runtime and execute the transform over the specified set of input files.
The RayTransformLauncher is created using a `RayTransformConfiguration` instance.
* [RayTransformConfiguration](../src/data_processing/runtime/ray/transform_configuration.py) - this
class extends transform's base TransformConfiguration implementation to add an optional
`TranformRuntime` (see next) class to be used by the transform implementation.
* [TransformRuntime](../src/data_processing/runtime/ray/transform_runtime.py) -
this provides the ability for the transform implementor to create additional Ray resources
and include them in the configuration used to create a transform
(see, for example, [exact dedup](../../transforms/universal/ededup/src/ededup_transform.py)
or [blocklist](../../transforms/universal/blocklisting/src/blocklist_transform.py)).
This also provide the ability to supplement the statics collected by
[Statistics](../src/data_processing/runtime/ray/transform_statistics.py) (see below).

Roughly speaking the following steps are completed to establish transforms in the RayWorkers

1. Launcher parses the CLI parameters using an ArgumentParser configured with its own CLI parameters
along with those of the Transform Configuration,
2. Launcher passes the Transform Configuration and CLI parameters to the [RayOrchestrator](../src/data_processing/runtime/ray/transform_orchestrator.py)
3. RayOrchestrator creates the Transform Runtime using the Transform Configuration and its CLI parameter values
4. Transform Runtime creates transform initialization/configuration including the CLI parameters,
and any Ray components need by the transform.
5. [RayWorker](../src/data_processing/runtime/ray/transform_table_processor.py) is started with configuration from the Transform Runtime.
6. RayWorker creates the Transform using the configuration provided by the Transform Runtime.
7. Statistics is used to collect the statistics submitted by the individual transform, that
is used for building execution metadata.

![Processing Architecture](processing-architecture.jpg)

#### Transform Launcher
The [TransformLauncher](../src/data_processing/runtime/ray/transform_launcher.py) uses the Transform Configuration
and provides a single method, `launch()`, that kicks off the Ray environment and transform execution coordinated
by [orchestrator](../src/data_processing/runtime/ray/transform_orchestrator.py).
For example,
```python
launcher = TransformLauncher(MyTransformConfiguration())
launcher.launch()
```
Note that the launcher defines some additional CLI parameters that are used to control the operation of the
[orchestrator and workers](../src/data_processing/runtime/ray/transform_orchestrator_configuration.py) and
[data access](../src/data_processing/data_access/data_access_factory.py). Things such as data access configuration,
number of workers, worker resources, etc.
Discussion of these options is beyond the scope of this document
(see [Launcher Options](ray-launcher-options) for a list of available options.)

#### Transform Configuration
The
[DefaultTableTransformConfiguration](../src/data_processing/runtime/ray/transform_runtime.py)
class is sub-classed and initialized with transform-specific name, and implementation
and runtime classes. In addition, it is responsible for providing transform-specific
methods to define and filter optional command line arguments.
Finally, it creates the Transform Runtime, for which a default
implementation uses the class available in the Transform Configuration.
```python

class MyTransformConfiguration(DefaultTableTransformConfiguration):

def __init__(self):
super().__init__(name="MyTransform", transform_class=MyTransform,
runtime_class=MyTransformRuntime
self.params = {}
def add_input_params(self, parser: ArgumentParser) -> None:
...
def apply_input_params(self, args: Namespace) -> bool:
...
def create_transform_runtime(self) -> DefaultTableTransformRuntime:
...
```
Details are covered in the samples below.

#### Transform Runtime
The
[DefaultTableTransformRuntime](../src/data_processing/runtime/ray/transform_runtime.py)
class is provided and will be
sufficient for many use cases, especially 1:1 table transformation.
However, some transforms will require use of the Ray environment, for example,
to create additional workers, establish a shared memory object, etc.
Of course, these transforms will generally not run outside of Ray environment.

```python
class DefaultTableTransformRuntime:

def __init__(self, params: dict[str, Any]):
...

def get_transform_config(
self, data_access_factory: DataAccessFactory, statistics: ActorHandle, files: list[str]
) -> dict[str, Any]:
...

def compute_execution_stats(self, stats: dict[str, Any]) -> dict[str, Any]:
...
```

The RayOrchestrator initializes the instance with the CLI parameters provided by the Transform Configurations
`get_input_params()` method.

The `get_transform_config()` method is used by the RayOrchestrator to create the parameters
used to initialize the Transform in the RayWorker.
This is where additional Ray components would be added to the environment
and references added to them, as needed, in the returned dictionary of configuration data
that will initialize the transform.
For those transforms that don't need this support, the default implementation
simpy returns the CLI parameters used to initialize the runtime instance.

The `computed_execution_stats()` provides an opportunity to augment the statistics
collected and aggregated by the TransformStatistics actor. It is called by the RayOrchestrator
after all files have been processed.

### Exceptions
A transform may find that it needs to signal error conditions.
For example, if a referenced model could not be loaded or
a given table does not have the expected column.
In general, it should identify such conditions by raising an exception.
With this in mind, there are two types of exceptions:

1. Those that would not allow any tables to be processed (e.g. model loading problem).
2. Those that would not allow a specific table to be processed (e.g. missing column).

In the first situation the transform should throw an exception from the initializer, which
will cause the Ray framework to terminate processing of all tables.
In the second situation (identified in the `transform()` or `flush()` methods), the transform
should throw an exception from the associated method. This will cause only the
error-causing
table to be ignored and not written out, but allow continued processing of tables by
the transform.
In both cases, the framework will log the exception as an error.


### Tutorial Examples
With these basic concepts in mind, we start with a simple example and
progress to more complex transforms.
Before getting started you may want to consult the
[transform project root readme](../../transforms/README.md) documentation.

* [Simplest transform](simplest-transform-tutorial.md) -
Here we will take a simple example to show the basics of creating a simple transform
that takes a single input Table, and produces a single Table.
* [External resources transform](transform-external-resources.md) - shows how to load additional
resources (models, configuration, etc) for a transform.
* [Advanced transform](advanced-transform-tutorial.md)

Once a transform has been built, testing can be enabled with the testing framework:

* [Transform Testing](transform-standalone-testing) - shows how to test a transform
independent of the Ray framework.
* [End-to-End Testing](testing-e2e-transform.md) - shows how to test the
transform running in the Ray environment.

### Additional transform support

We also strted a library of [transform utilities](transformer-utilities.md)
* [Python runtime](python-runtime.md) - provides single process running of a transform.
* [Ray runtime](ray-runtime.md) - provides running transforms across multiple ray workers to
achieve highly scalable processing.
2 changes: 1 addition & 1 deletion data-processing-lib/doc/transform-tutorials.md
Original file line number Diff line number Diff line change
Expand Up @@ -35,7 +35,7 @@ when splitting tables by size or other criteria.
transform might wish to track the number of instances of PII entities detected and might return
this as `{ "entities" : 1234 }`.
* ```flush() -> tuple(list[pyarrow.Table], dict)``` - this is provided for transforms that
make use of buffering (e.g. [coalesce](../../transforms/universal/coalesce/src/coalesce_transform.py)) across calls
make use of buffering (e.g. to resize the tables) across calls
to `transform()` and need to be flushed of all buffered data at the end of processing of input tables.
The return values are handled the same waa as the return values for `transform()`. Since most transforms will likely
not need this feature, a default implementation is provided to return an empty list and empty dictionary.
Expand Down

0 comments on commit 54287c4

Please sign in to comment.