From 0dcfac777f632071df65687a2e7504bb50d21053 Mon Sep 17 00:00:00 2001 From: Danny Meijer <10511979+dannymeijer@users.noreply.github.com> Date: Thu, 30 May 2024 19:10:56 +0100 Subject: [PATCH] Deployed e6bc5b5 to dev with MkDocs 1.6.0 and mike 2.1.1 --- dev/index.html | 2 +- dev/search/search_index.json | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/dev/index.html b/dev/index.html index 80d5d06..fdc5c93 100644 --- a/dev/index.html +++ b/dev/index.html @@ -3744,7 +3744,7 @@

Koheesio Core Components
┌─────────┐        ┌──────────────────┐        ┌──────────┐
 │ Input 1 │───────▶│                  ├───────▶│ Output 1 │
-└─────────┘        │                  │        └────√─────┘
+└─────────┘        │                  │        └──────────┘
                    │                  │
 ┌─────────┐        │                  │        ┌──────────┐
 │ Input 2 │───────▶│       Step       │───────▶│ Output 2 │
diff --git a/dev/search/search_index.json b/dev/search/search_index.json
index c4cd013..007d764 100644
--- a/dev/search/search_index.json
+++ b/dev/search/search_index.json
@@ -1 +1 @@
-{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"index.html","title":"Home","text":""},{"location":"index.html#koheesio","title":"Koheesio","text":"CI/CD Package Meta 

Koheesio, named after the Finnish word for cohesion, is a robust Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.

The framework is versatile, aiming to support multiple implementations and working seamlessly with various data processing libraries or frameworks. This ensures that Koheesio can handle any data processing task, regardless of the underlying technology or data scale.

Koheesio uses Pydantic for strong typing, data validation, and settings management, ensuring a high level of type safety and structured configurations within pipeline components.

Koheesio's goal is to ensure predictable pipeline execution through a solid foundation of well-tested code and a rich set of features, making it an excellent choice for developers and organizations seeking to build robust and adaptable data pipelines.

"},{"location":"index.html#what-sets-koheesio-apart-from-other-libraries","title":"What sets Koheesio apart from other libraries?\"","text":"

Koheesio encapsulates years of data engineering expertise, fostering a collaborative and innovative community. While similar libraries exist, Koheesio's focus on data pipelines, integration with PySpark, and specific design for tasks like data transformation, ETL jobs, data validation, and large-scale data processing set it apart.

Koheesio aims to provide a rich set of features, including readers, writers, and transformations, for any type of data processing. Koheesio is not in competition with other libraries; its aim is to offer wide-ranging support and focus on utility in a multitude of scenarios. Our preference is for integration, not competition.

We invite contributions from all, promoting collaboration and innovation in the data engineering community.

"},{"location":"index.html#koheesio-core-components","title":"Koheesio Core Components","text":"

Here are the key components included in Koheesio:

  • Step: This is the fundamental unit of work in Koheesio. It represents a single operation in a data pipeline, taking in inputs and producing outputs. A minimal sketch of a custom Step follows this list.
    \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 1 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502                  \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Output 1 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2502                  \u2502        \u2514\u2500\u2500\u2500\u2500\u221a\u2500\u2500\u2500\u2500\u2500\u2518\n                   \u2502                  \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u2502                  \u2502        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 2 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502       Step       \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Output 2 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2502                  \u2502        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n                   \u2502                  \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u2502                  \u2502        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 3 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502                  \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Output 3 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n
  • Context: This is a configuration class used to set up the environment for a Task. It can be used to share variables across tasks and adapt the behavior of a Task based on its environment.
  • Logger: This is a class for logging messages at different levels.
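
Below is a minimal sketch of a custom Step. The import path (koheesio.steps.Step), the nested Output model, and the execute method follow the conventional pattern, but they are not documented on this page, so treat the exact names as assumptions and consult the Step API reference for the precise interface.

from koheesio.steps import Step\n\n\nclass MultiplyStep(Step):\n    \"\"\"Multiply an input value by a configurable factor.\"\"\"\n\n    value: int\n    factor: int = 2\n\n    class Output(Step.Output):\n        result: int\n\n    def execute(self):\n        # populate the Step's output model\n        self.output.result = self.value * self.factor\n\n\nstep = MultiplyStep(value=21)\nstep.execute()\nprint(step.output.result)  # 42\n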
"},{"location":"index.html#installation","title":"Installation","text":"

You can install Koheesio using pip, Hatch, or Poetry.

"},{"location":"index.html#using-pip","title":"Using Pip","text":"

To install Koheesio using pip, run the following command in your terminal:

pip install koheesio\n
"},{"location":"index.html#using-hatch","title":"Using Hatch","text":"

If you're using Hatch for package management, you can add Koheesio to your project by adding koheesio to the dependencies in your pyproject.toml.

"},{"location":"index.html#using-poetry","title":"Using Poetry","text":"

If you're using Poetry for package management, you can add Koheesio to your project with the following command:

poetry add koheesio\n

or add the following line to your pyproject.toml (under [tool.poetry.dependencies]), making sure to replace ... with the version you want to have installed:

koheesio = {version = \"...\"}\n
"},{"location":"index.html#extras","title":"Extras","text":"

Koheesio also provides some additional features that can be useful in certain scenarios. These include:

  • Spark Expectations: Available through the koheesio.steps.integration.spark.dq.spark_expectations module; installable through the se extra.

    • Spark Expectations provides data quality checks for Spark DataFrames.
    • For more information, refer to the Spark Expectations docs.
  • Box: Available through the koheesio.steps.integration.box module; installable through the box extra.

    • Box is a cloud content management and file sharing service for businesses.
  • SFTP: Available through the koheesio.steps.integration.spark.sftp module; installable through the sftp extra.

    • SFTP is a network protocol used for secure file transfer over a secure shell.

Note: Some of the steps require extra dependencies. See the Extras section for additional info. Extras can be added to Poetry by adding extras=['name_of_the_extra'] to the toml entry mentioned above.

"},{"location":"index.html#contributing","title":"Contributing","text":""},{"location":"index.html#how-to-contribute","title":"How to Contribute","text":"

We welcome contributions to our project! Here's a brief overview of our development process:

  • Code Standards: We use pylint, black, and mypy to maintain code standards. Please ensure your code passes these checks by running make check. No errors or warnings should be reported by the linter before you submit a pull request.

  • Testing: We use pytest for testing. Run the tests with make test and ensure all tests pass before submitting a pull request.

  • Release Process: We aim for frequent releases. Typically, when we have a new feature or bugfix, a developer with admin rights will create a new release on GitHub and publish the new version to PyPI.

For more detailed information, please refer to our contribution guidelines. We also adhere to Nike's Code of Conduct and Nike's Individual Contributor License Agreement.

"},{"location":"index.html#additional-resources","title":"Additional Resources","text":"
  • General GitHub documentation
  • GitHub pull request documentation
  • Nike OSS
"},{"location":"api_reference/index.html","title":"API Reference","text":""},{"location":"api_reference/index.html#koheesio.ABOUT","title":"koheesio.ABOUT module-attribute","text":"
ABOUT = _about()\n
"},{"location":"api_reference/index.html#koheesio.VERSION","title":"koheesio.VERSION module-attribute","text":"
VERSION = __version__\n
"},{"location":"api_reference/index.html#koheesio.BaseModel","title":"koheesio.BaseModel","text":"

Base model for all models.

Extends pydantic BaseModel with some additional configuration. To be used as a base class for all models in Koheesio instead of pydantic.BaseModel.

Additional methods and properties: Different Modes

This BaseModel class supports lazy mode. This means that validation of the items stored in the class can be called at will instead of being forced to run upfront.

  • Normal mode: you need to know the values ahead of time

    normal_mode = YourOwnModel(a=\"foo\", b=42)\n

  • Lazy mode: being able to defer the validation until later

    lazy_mode = YourOwnModel.lazy()\nlazy_mode.a = \"foo\"\nlazy_mode.b = 42\nlazy_mode.validate_output()\n
    The prime advantage of using lazy mode is that you don't have to know all your outputs up front and can add them as they become available, while still being able to validate that you have collected all your output at the end.

  • With statements: With statements are also allowed. The validate_output method from the earlier example will run upon exit of the with-statement.

    with YourOwnModel.lazy() as with_output:\n    with_output.a = \"foo\"\n    with_output.b = 42\n
    Note: a lazy mode BaseModel object is required to work with a with-statement.

Examples:

from koheesio.models import BaseModel\n\n\nclass Person(BaseModel):\n    name: str\n    age: int\n\n\n# Using the lazy method to create an instance without immediate validation\nperson = Person.lazy()\n\n# Setting attributes\nperson.name = \"John Doe\"\nperson.age = 30\n\n# Now we validate the instance\nperson.validate_output()\n\nprint(person)\n

In this example, the Person instance is created without immediate validation. The attributes name and age are set afterward. The validate_output method is then called to validate the instance.

Koheesio specific configuration:

Koheesio models are configured differently from Pydantic defaults. The following configuration is used (a short sketch illustrating two of these settings follows the list):

  1. extra=\"allow\"

    This setting allows for extra fields that are not specified in the model definition. If a field is present in the data but not in the model, it will not raise an error. Pydantic default is \"ignore\", which means that extra attributes are ignored.

  2. arbitrary_types_allowed=True

    This setting allows for fields in the model to be of any type. This is useful when you want to include fields in your model that are not standard Python types. Pydantic default is False, which means that fields must be of a standard Python type.

  3. populate_by_name=True

    This setting allows an aliased field to be populated by its name as given by the model attribute, as well as the alias. This was known as allow_population_by_field_name in pydantic v1. Pydantic default is False, which means that fields can only be populated by their alias.

  4. validate_assignment=False

    This setting determines whether the model should be revalidated when the data is changed. If set to True, every time a field is assigned a new value, the entire model is validated again.

    Pydantic default is (also) False, which means that the model is not revalidated when the data is changed. The default behavior of Pydantic is to validate the data when the model is created. In case the user changes the data after the model is created, the model is not revalidated.

  5. revalidate_instances=\"subclass-instances\"

    This setting determines whether to revalidate models during validation if the instance is a subclass of the model. This is important as inheritance is used a lot in Koheesio. Pydantic default is never, which means that the model and dataclass instances are not revalidated during validation.

  6. validate_default=True

    This setting determines whether to validate default values during validation. When set to True, default values are checked during the validation process. We opt to set this to True, as we are attempting to make sure that the data is valid prior to running / executing any Step. Pydantic default is False, which means that default values are not validated during validation.

  7. frozen=False

    This setting determines whether the model is immutable. If set to True, once a model is created, its fields cannot be changed. Pydantic default is also False, which means that the model is mutable.

  8. coerce_numbers_to_str=True

    This setting determines whether to convert number fields to strings. When set to True, it enables automatic coercion of any Number type to str. Pydantic doesn't allow number types (int, float, Decimal) to be coerced to type str by default.

  9. use_enum_values=True

    This setting determines whether to use the values of Enum fields. If set to True, the actual value of the Enum is used instead of the reference. Pydantic default is False, which means that the reference to the Enum is used.
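
The short sketch below illustrates two of these settings, extra=\"allow\" and coerce_numbers_to_str=True, using the koheesio BaseModel; the Customer class is hypothetical and only serves as an illustration.

from koheesio.models import BaseModel\n\n\nclass Customer(BaseModel):\n    name: str\n\n\n# extra=\"allow\": undeclared fields are kept instead of raising an error\ncustomer = Customer(name=123, nickname=\"Johnny\")\n\nprint(customer.name)  # '123' - coerce_numbers_to_str=True turned the int into a str\nprint(customer.nickname)  # 'Johnny'\n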

"},{"location":"api_reference/index.html#koheesio.BaseModel--fields","title":"Fields","text":"

Every Koheesio BaseModel has two fields: name and description. These fields are used to provide a name and a description to the model. A short sketch of their default behavior follows the list below.

  • name: This is the name of the Model. If not provided, it defaults to the class name.

  • description: This is the description of the Model. It has several default behaviors:

    • If not provided, it defaults to the docstring of the class.
    • If the docstring is not provided, it defaults to the name of the class.
    • For multi-line descriptions, it has the following behaviors:
      • Only the first non-empty line is used.
      • Empty lines are removed.
      • Only the first 3 lines are considered.
      • Only the first 120 characters are considered.
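
A short sketch of these defaults, based on the rules above (the MyModel class is hypothetical):

from koheesio.models import BaseModel\n\n\nclass MyModel(BaseModel):\n    \"\"\"My example model.\"\"\"\n\n\nmodel = MyModel()\nprint(model.name)  # 'MyModel' - defaults to the class name\nprint(model.description)  # 'My example model.' - defaults to the docstring\n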
"},{"location":"api_reference/index.html#koheesio.BaseModel--validators","title":"Validators","text":"
  • _set_name_and_description: Set the name and description of the Model as per the rules mentioned above.
"},{"location":"api_reference/index.html#koheesio.BaseModel--properties","title":"Properties","text":"
  • log: Returns a logger with the name of the class.
"},{"location":"api_reference/index.html#koheesio.BaseModel--class-methods","title":"Class Methods","text":"
  • from_basemodel: Returns a new BaseModel instance based on the data of another BaseModel.
  • from_context: Creates BaseModel instance from a given Context.
  • from_dict: Creates BaseModel instance from a given dictionary.
  • from_json: Creates BaseModel instance from a given JSON string.
  • from_toml: Creates BaseModel object from a given toml file.
  • from_yaml: Creates BaseModel object from a given yaml file.
  • lazy: Constructs the model without doing validation.
"},{"location":"api_reference/index.html#koheesio.BaseModel--dunder-methods","title":"Dunder Methods","text":"
  • __add__: Allows to add two BaseModel instances together.
  • __enter__: Allows for using the model in a with-statement.
  • __exit__: Allows for using the model in a with-statement.
  • __setitem__: Set Item dunder method for BaseModel.
  • __getitem__: Get Item dunder method for BaseModel.
"},{"location":"api_reference/index.html#koheesio.BaseModel--instance-methods","title":"Instance Methods","text":"
  • hasattr: Check if given key is present in the model.
  • get: Get an attribute of the model, but don't fail if not present.
  • merge: Merge key,value map with self.
  • set: Allows for subscribing / assigning to class[key].
  • to_context: Converts the BaseModel instance to a Context object.
  • to_dict: Converts the BaseModel instance to a dictionary.
  • to_json: Converts the BaseModel instance to a JSON string.
  • to_yaml: Converts the BaseModel instance to a YAML string.
"},{"location":"api_reference/index.html#koheesio.BaseModel.description","title":"description class-attribute instance-attribute","text":"
description: Optional[str] = Field(default=None, description='Description of the Model')\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.log","title":"log property","text":"
log: Logger\n

Returns a logger with the name of the class

"},{"location":"api_reference/index.html#koheesio.BaseModel.model_config","title":"model_config class-attribute instance-attribute","text":"
model_config = ConfigDict(extra='allow', arbitrary_types_allowed=True, populate_by_name=True, validate_assignment=False, revalidate_instances='subclass-instances', validate_default=True, frozen=False, coerce_numbers_to_str=True, use_enum_values=True)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.name","title":"name class-attribute instance-attribute","text":"
name: Optional[str] = Field(default=None, description='Name of the Model')\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_basemodel","title":"from_basemodel classmethod","text":"
from_basemodel(basemodel: BaseModel, **kwargs)\n

Returns a new BaseModel instance based on the data of another BaseModel

Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_basemodel(cls, basemodel: BaseModel, **kwargs):\n    \"\"\"Returns a new BaseModel instance based on the data of another BaseModel\"\"\"\n    kwargs = {**basemodel.model_dump(), **kwargs}\n    return cls(**kwargs)\n
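
A brief usage sketch, reusing the hypothetical Person model from the example further above; keyword arguments override the values copied from the source instance:

person = Person(name=\"John Doe\", age=30)\nolder_person = Person.from_basemodel(person, age=31)\nprint(older_person.age)  # 31\n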
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_context","title":"from_context classmethod","text":"
from_context(context: Context) -> BaseModel\n

Creates BaseModel instance from a given Context

You have to make sure that the Context object has the necessary attributes to create the model.

Examples:

class SomeStep(BaseModel):\n    foo: str\n\n\ncontext = Context(foo=\"bar\")\nsome_step = SomeStep.from_context(context)\nprint(some_step.foo)  # prints 'bar'\n

Parameters:

Name Type Description Default context Context required

Returns:

Type Description BaseModel Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_context(cls, context: Context) -> BaseModel:\n    \"\"\"Creates BaseModel instance from a given Context\n\n    You have to make sure that the Context object has the necessary attributes to create the model.\n\n    Examples\n    --------\n    ```python\n    class SomeStep(BaseModel):\n        foo: str\n\n\n    context = Context(foo=\"bar\")\n    some_step = SomeStep.from_context(context)\n    print(some_step.foo)  # prints 'bar'\n    ```\n\n    Parameters\n    ----------\n    context: Context\n\n    Returns\n    -------\n    BaseModel\n    \"\"\"\n    return cls(**context)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_dict","title":"from_dict classmethod","text":"
from_dict(data: Dict[str, Any]) -> BaseModel\n

Creates BaseModel instance from a given dictionary

Parameters:

Name Type Description Default data Dict[str, Any] required

Returns:

Type Description BaseModel Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_dict(cls, data: Dict[str, Any]) -> BaseModel:\n    \"\"\"Creates BaseModel instance from a given dictionary\n\n    Parameters\n    ----------\n    data: Dict[str, Any]\n\n    Returns\n    -------\n    BaseModel\n    \"\"\"\n    return cls(**data)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_json","title":"from_json classmethod","text":"
from_json(json_file_or_str: Union[str, Path]) -> BaseModel\n

Creates BaseModel instance from a given JSON string

BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored in the BaseModel object, which is not possible with the standard json library.

See Also

Context.from_json : Deserializes a JSON string to a Context object

Parameters:

Name Type Description Default json_file_or_str Union[str, Path]

Pathlike string or Path that points to the json file or string containing json

required

Returns:

Type Description BaseModel Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_json(cls, json_file_or_str: Union[str, Path]) -> BaseModel:\n    \"\"\"Creates BaseModel instance from a given JSON string\n\n    BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses\n    jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored\n    in the BaseModel object, which is not possible with the standard json library.\n\n    See Also\n    --------\n    Context.from_json : Deserializes a JSON string to a Context object\n\n    Parameters\n    ----------\n    json_file_or_str : Union[str, Path]\n        Pathlike string or Path that points to the json file or string containing json\n\n    Returns\n    -------\n    BaseModel\n    \"\"\"\n    _context = Context.from_json(json_file_or_str)\n    return cls.from_context(_context)\n
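
A brief usage sketch with the hypothetical Person model from the earlier example; a plain JSON string is handled the same way as a path to a JSON file:

person = Person.from_json('{\"name\": \"John Doe\", \"age\": 30}')\nprint(person.age)  # 30\n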
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_toml","title":"from_toml classmethod","text":"
from_toml(toml_file_or_str: Union[str, Path]) -> BaseModel\n

Creates BaseModel object from a given toml file

Note: BaseModel offloads the serialization and deserialization of the TOML string to Context class.

Parameters:

Name Type Description Default toml_file_or_str Union[str, Path]

Pathlike string or Path that points to the toml file, or string containing toml

required

Returns:

Type Description BaseModel Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_toml(cls, toml_file_or_str: Union[str, Path]) -> BaseModel:\n    \"\"\"Creates BaseModel object from a given toml file\n\n    Note: BaseModel offloads the serialization and deserialization of the TOML string to Context class.\n\n    Parameters\n    ----------\n    toml_file_or_str: str or Path\n        Pathlike string or Path that points to the toml file, or string containing toml\n\n    Returns\n    -------\n    BaseModel\n    \"\"\"\n    _context = Context.from_toml(toml_file_or_str)\n    return cls.from_context(_context)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_yaml","title":"from_yaml classmethod","text":"
from_yaml(yaml_file_or_str: str) -> BaseModel\n

Creates BaseModel object from a given yaml file

Note: BaseModel offloads the serialization and deserialization of the YAML string to Context class.

Parameters:

Name Type Description Default yaml_file_or_str str

Pathlike string or Path that points to the yaml file, or string containing yaml

required

Returns:

Type Description BaseModel Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_yaml(cls, yaml_file_or_str: str) -> BaseModel:\n    \"\"\"Creates BaseModel object from a given yaml file\n\n    Note: BaseModel offloads the serialization and deserialization of the YAML string to Context class.\n\n    Parameters\n    ----------\n    yaml_file_or_str: str or Path\n        Pathlike string or Path that points to the yaml file, or string containing yaml\n\n    Returns\n    -------\n    BaseModel\n    \"\"\"\n    _context = Context.from_yaml(yaml_file_or_str)\n    return cls.from_context(_context)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.get","title":"get","text":"
get(key: str, default: Optional[Any] = None)\n

Get an attribute of the model, but don't fail if not present

Similar to dict.get()

Examples:

step_output = StepOutput(foo=\"bar\")\nstep_output.get(\"foo\")  # returns 'bar'\nstep_output.get(\"non_existent_key\", \"oops\")  # returns 'oops'\n

Parameters:

Name Type Description Default key str

name of the key to get

required default Optional[Any]

Default value in case the attribute does not exist

None

Returns:

Type Description Any

The value of the attribute

Source code in src/koheesio/models/__init__.py
def get(self, key: str, default: Optional[Any] = None):\n    \"\"\"Get an attribute of the model, but don't fail if not present\n\n    Similar to dict.get()\n\n    Examples\n    --------\n    ```python\n    step_output = StepOutput(foo=\"bar\")\n    step_output.get(\"foo\")  # returns 'bar'\n    step_output.get(\"non_existent_key\", \"oops\")  # returns 'oops'\n    ```\n\n    Parameters\n    ----------\n    key: str\n        name of the key to get\n    default: Optional[Any]\n        Default value in case the attribute does not exist\n\n    Returns\n    -------\n    Any\n        The value of the attribute\n    \"\"\"\n    if self.hasattr(key):\n        return self.__getitem__(key)\n    return default\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.hasattr","title":"hasattr","text":"
hasattr(key: str) -> bool\n

Check if given key is present in the model

Parameters:

Name Type Description Default key str required

Returns:

Type Description bool Source code in src/koheesio/models/__init__.py
def hasattr(self, key: str) -> bool:\n    \"\"\"Check if given key is present in the model\n\n    Parameters\n    ----------\n    key: str\n\n    Returns\n    -------\n    bool\n    \"\"\"\n    return hasattr(self, key)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.lazy","title":"lazy classmethod","text":"
lazy()\n

Constructs the model without doing validation

Essentially an alias to BaseModel.construct()

Source code in src/koheesio/models/__init__.py
@classmethod\ndef lazy(cls):\n    \"\"\"Constructs the model without doing validation\n\n    Essentially an alias to BaseModel.construct()\n    \"\"\"\n    return cls.model_construct()\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.merge","title":"merge","text":"
merge(other: Union[Dict, BaseModel])\n

Merge key,value map with self

Functionally similar to adding two dicts together; like running {**dict_a, **dict_b}.

Examples:

step_output = StepOutput(foo=\"bar\")\nstep_output.merge({\"lorem\": \"ipsum\"})  # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n

Parameters:

Name Type Description Default other Union[Dict, BaseModel]

Dict or another instance of a BaseModel class that will be added to self

required Source code in src/koheesio/models/__init__.py
def merge(self, other: Union[Dict, BaseModel]):\n    \"\"\"Merge key,value map with self\n\n    Functionally similar to adding two dicts together; like running `{**dict_a, **dict_b}`.\n\n    Examples\n    --------\n    ```python\n    step_output = StepOutput(foo=\"bar\")\n    step_output.merge({\"lorem\": \"ipsum\"})  # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n    ```\n\n    Parameters\n    ----------\n    other: Union[Dict, BaseModel]\n        Dict or another instance of a BaseModel class that will be added to self\n    \"\"\"\n    if isinstance(other, BaseModel):\n        other = other.model_dump()  # ensures we really have a dict\n\n    for k, v in other.items():\n        self.set(k, v)\n\n    return self\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.set","title":"set","text":"
set(key: str, value: Any)\n

Allows for subscribing / assigning to class[key].

Examples:

step_output = StepOutput(foo=\"bar\")\nstep_output.set(foo\", \"baz\")  # overwrites 'foo' to be 'baz'\n

Parameters:

Name Type Description Default key str

The key of the attribute to assign to

required value Any

Value that should be assigned to the given key

required Source code in src/koheesio/models/__init__.py
def set(self, key: str, value: Any):\n    \"\"\"Allows for subscribing / assigning to `class[key]`.\n\n    Examples\n    --------\n    ```python\n    step_output = StepOutput(foo=\"bar\")\n    step_output.set(\"foo\", \"baz\")  # overwrites 'foo' to be 'baz'\n    ```\n\n    Parameters\n    ----------\n    key: str\n        The key of the attribute to assign to\n    value: Any\n        Value that should be assigned to the given key\n    \"\"\"\n    self.__setitem__(key, value)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.to_context","title":"to_context","text":"
to_context() -> Context\n

Converts the BaseModel instance to a Context object

Returns:

Type Description Context Source code in src/koheesio/models/__init__.py
def to_context(self) -> Context:\n    \"\"\"Converts the BaseModel instance to a Context object\n\n    Returns\n    -------\n    Context\n    \"\"\"\n    return Context(**self.to_dict())\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.to_dict","title":"to_dict","text":"
to_dict() -> Dict[str, Any]\n

Converts the BaseModel instance to a dictionary

Returns:

Type Description Dict[str, Any] Source code in src/koheesio/models/__init__.py
def to_dict(self) -> Dict[str, Any]:\n    \"\"\"Converts the BaseModel instance to a dictionary\n\n    Returns\n    -------\n    Dict[str, Any]\n    \"\"\"\n    return self.model_dump()\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.to_json","title":"to_json","text":"
to_json(pretty: bool = False)\n

Converts the BaseModel instance to a JSON string

BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored in the BaseModel object, which is not possible with the standard json library.

See Also

Context.to_json : Serializes a Context object to a JSON string

Parameters:

Name Type Description Default pretty bool

Toggles whether to return a pretty json string or not

False

Returns:

Type Description str

containing all parameters of the BaseModel instance

Source code in src/koheesio/models/__init__.py
def to_json(self, pretty: bool = False):\n    \"\"\"Converts the BaseModel instance to a JSON string\n\n    BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses\n    jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored\n    in the BaseModel object, which is not possible with the standard json library.\n\n    See Also\n    --------\n    Context.to_json : Serializes a Context object to a JSON string\n\n    Parameters\n    ----------\n    pretty : bool, optional, default=False\n        Toggles whether to return a pretty json string or not\n\n    Returns\n    -------\n    str\n        containing all parameters of the BaseModel instance\n    \"\"\"\n    _context = self.to_context()\n    return _context.to_json(pretty=pretty)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.to_yaml","title":"to_yaml","text":"
to_yaml(clean: bool = False) -> str\n

Converts the BaseModel instance to a YAML string

BaseModel offloads the serialization and deserialization of the YAML string to Context class.

Parameters:

Name Type Description Default clean bool

Toggles whether to remove !!python/object:... from yaml or not. Default: False

False

Returns:

Type Description str

containing all parameters of the BaseModel instance

Source code in src/koheesio/models/__init__.py
def to_yaml(self, clean: bool = False) -> str:\n    \"\"\"Converts the BaseModel instance to a YAML string\n\n    BaseModel offloads the serialization and deserialization of the YAML string to Context class.\n\n    Parameters\n    ----------\n    clean: bool\n        Toggles whether to remove `!!python/object:...` from yaml or not.\n        Default: False\n\n    Returns\n    -------\n    str\n        containing all parameters of the BaseModel instance\n    \"\"\"\n    _context = self.to_context()\n    return _context.to_yaml(clean=clean)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.validate","title":"validate","text":"
validate() -> BaseModel\n

Validate the BaseModel instance

This method is used to validate the BaseModel instance. It is used in conjunction with the lazy method to validate the instance after all the attributes have been set.

This method is intended to be used with the lazy method. The lazy method is used to create an instance of the BaseModel without immediate validation. The validate method is then used to validate the instance after.

Note: in the Pydantic BaseModel, the validate method throws a deprecated warning. This is because Pydantic recommends using the validate_model method instead. However, we are using the validate method here in a different context and a slightly different way.

Examples:

class FooModel(BaseModel):\n    foo: str\n    lorem: str\n\n\nfoo_model = FooModel.lazy()\nfoo_model.foo = \"bar\"\nfoo_model.lorem = \"ipsum\"\nfoo_model.validate()\n
In this example, the foo_model instance is created without immediate validation. The attributes foo and lorem are set afterward. The validate method is then called to validate the instance.

Returns:

Type Description BaseModel

The BaseModel instance

Source code in src/koheesio/models/__init__.py
def validate(self) -> BaseModel:\n    \"\"\"Validate the BaseModel instance\n\n    This method is used to validate the BaseModel instance. It is used in conjunction with the lazy method to\n    validate the instance after all the attributes have been set.\n\n    This method is intended to be used with the `lazy` method. The `lazy` method is used to create an instance of\n    the BaseModel without immediate validation. The `validate` method is then used to validate the instance after.\n\n    > Note: in the Pydantic BaseModel, the `validate` method throws a deprecated warning. This is because Pydantic\n    recommends using the `validate_model` method instead. However, we are using the `validate` method here in a\n    different context and a slightly different way.\n\n    Examples\n    --------\n    ```python\n    class FooModel(BaseModel):\n        foo: str\n        lorem: str\n\n\n    foo_model = FooModel.lazy()\n    foo_model.foo = \"bar\"\n    foo_model.lorem = \"ipsum\"\n    foo_model.validate()\n    ```\n    In this example, the `foo_model` instance is created without immediate validation. The attributes foo and lorem\n    are set afterward. The `validate` method is then called to validate the instance.\n\n    Returns\n    -------\n    BaseModel\n        The BaseModel instance\n    \"\"\"\n    return self.model_validate(self.model_dump())\n
"},{"location":"api_reference/index.html#koheesio.Context","title":"koheesio.Context","text":"
Context(*args, **kwargs)\n

The Context class is a key component of the Koheesio framework, designed to manage configuration data and shared variables across tasks and steps in your application. It behaves much like a dictionary, but with added functionalities.

Key Features
  • Nested keys: Supports accessing and adding nested keys similar to dictionary keys.
  • Recursive merging: Merges two Contexts together, with the incoming Context having priority.
  • Serialization/Deserialization: Easily created from a yaml, toml, or json file, or a dictionary, and can be converted back to a dictionary.
  • Handling complex Python objects: Uses jsonpickle for serialization and deserialization of complex Python objects to and from JSON.

For a comprehensive guide on the usage, examples, and additional features of the Context class, please refer to the reference/concepts/context section of the Koheesio documentation.
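
A short sketch of typical Context usage, based on the features listed above (the configuration values are made up for illustration):

from koheesio.context import Context\n\ncontext = Context({\"source\": {\"format\": \"csv\", \"path\": \"data/input\"}})\n\nprint(context.get(\"source.format\"))  # 'csv' - nested keys via dotted notation\n\n# the incoming context has priority when merging\ncontext = context.merge(Context({\"source\": {\"format\": \"parquet\"}}), recursive=True)\nprint(context.get(\"source.format\"))  # 'parquet'\n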

Methods:

Name Description add

Add a key/value pair to the context.

get

Get value of a given key.

get_item

Acts just like .get, except that it returns the key also.

contains

Check if the context contains a given key.

merge

Merge this context with the context of another, where the incoming context has priority.

to_dict

Returns all parameters of the context as a dict.

from_dict

Creates Context object from the given dict.

from_yaml

Creates Context object from a given yaml file.

from_json

Creates Context object from a given json file.

Dunder methods
  • __iter__(): Allows for iteration across a Context.
  • __len__(): Returns the length of the Context.
  • __getitem__(item): Makes class subscriptable.
Inherited from Mapping
  • items(): Returns all items of the Context.
  • keys(): Returns all keys of the Context.
  • values(): Returns all values of the Context.
Source code in src/koheesio/context.py
def __init__(self, *args, **kwargs):\n    \"\"\"Initializes the Context object with given arguments.\"\"\"\n    for arg in args:\n        if isinstance(arg, dict):\n            kwargs.update(arg)\n        if isinstance(arg, Context):\n            kwargs = kwargs.update(arg.to_dict())\n\n    for key, value in kwargs.items():\n        self.__dict__[key] = self.process_value(value)\n
"},{"location":"api_reference/index.html#koheesio.Context.add","title":"add","text":"
add(key: str, value: Any) -> Context\n

Add a key/value pair to the context

Source code in src/koheesio/context.py
def add(self, key: str, value: Any) -> Context:\n    \"\"\"Add a key/value pair to the context\"\"\"\n    self.__dict__[key] = value\n    return self\n
"},{"location":"api_reference/index.html#koheesio.Context.contains","title":"contains","text":"
contains(key: str) -> bool\n

Check if the context contains a given key

Parameters:

Name Type Description Default key str required

Returns:

Type Description bool Source code in src/koheesio/context.py
def contains(self, key: str) -> bool:\n    \"\"\"Check if the context contains a given key\n\n    Parameters\n    ----------\n    key: str\n\n    Returns\n    -------\n    bool\n    \"\"\"\n    try:\n        self.get(key, safe=False)\n        return True\n    except KeyError:\n        return False\n
"},{"location":"api_reference/index.html#koheesio.Context.from_dict","title":"from_dict classmethod","text":"
from_dict(kwargs: dict) -> Context\n

Creates Context object from the given dict

Parameters:

Name Type Description Default kwargs dict required

Returns:

Type Description Context Source code in src/koheesio/context.py
@classmethod\ndef from_dict(cls, kwargs: dict) -> Context:\n    \"\"\"Creates Context object from the given dict\n\n    Parameters\n    ----------\n    kwargs: dict\n\n    Returns\n    -------\n    Context\n    \"\"\"\n    return cls(kwargs)\n
"},{"location":"api_reference/index.html#koheesio.Context.from_json","title":"from_json classmethod","text":"
from_json(json_file_or_str: Union[str, Path]) -> Context\n

Creates Context object from a given json file

Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be stored in the Context object, which is not possible with the standard json library.

Why jsonpickle?

(from https://jsonpickle.github.io/)

Data serialized with python\u2019s pickle (or cPickle or dill) is not easily readable outside of python. Using the json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports json.

Security

(from https://jsonpickle.github.io/)

jsonpickle should be treated the same as the Python stdlib pickle module from a security perspective.

Parameters:

Name Type Description Default json_file_or_str Union[str, Path]

Pathlike string or Path that points to the json file or string containing json

required

Returns:

Type Description Context Source code in src/koheesio/context.py
@classmethod\ndef from_json(cls, json_file_or_str: Union[str, Path]) -> Context:\n    \"\"\"Creates Context object from a given json file\n\n    Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be\n    stored in the Context object, which is not possible with the standard json library.\n\n    Why jsonpickle?\n    ---------------\n    (from https://jsonpickle.github.io/)\n\n    > Data serialized with python\u2019s pickle (or cPickle or dill) is not easily readable outside of python. Using the\n    json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex\n    data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports\n    json.\n\n    Security\n    --------\n    (from https://jsonpickle.github.io/)\n\n    > jsonpickle should be treated the same as the Python stdlib pickle module from a security perspective.\n\n    ### ! Warning !\n    > The jsonpickle module is not secure. Only unpickle data you trust.\n    It is possible to construct malicious pickle data which will execute arbitrary code during unpickling.\n    Never unpickle data that could have come from an untrusted source, or that could have been tampered with.\n    Consider signing data with an HMAC if you need to ensure that it has not been tampered with.\n    Safer deserialization approaches, such as reading JSON directly, may be more appropriate if you are processing\n    untrusted data.\n\n    Parameters\n    ----------\n    json_file_or_str : Union[str, Path]\n        Pathlike string or Path that points to the json file or string containing json\n\n    Returns\n    -------\n    Context\n    \"\"\"\n    json_str = json_file_or_str\n\n    # check if json_str is pathlike\n    if (json_file := Path(json_file_or_str)).exists():\n        json_str = json_file.read_text(encoding=\"utf-8\")\n\n    json_dict = jsonpickle.loads(json_str)\n    return cls.from_dict(json_dict)\n
"},{"location":"api_reference/index.html#koheesio.Context.from_json--warning","title":"! Warning !","text":"

The jsonpickle module is not secure. Only unpickle data you trust. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with. Consider signing data with an HMAC if you need to ensure that it has not been tampered with. Safer deserialization approaches, such as reading JSON directly, may be more appropriate if you are processing untrusted data.

"},{"location":"api_reference/index.html#koheesio.Context.from_toml","title":"from_toml classmethod","text":"
from_toml(toml_file_or_str: Union[str, Path]) -> Context\n

Creates Context object from a given toml file

Parameters:

Name Type Description Default toml_file_or_str Union[str, Path]

Pathlike string or Path that points to the toml file or string containing toml

required

Returns:

Type Description Context Source code in src/koheesio/context.py
@classmethod\ndef from_toml(cls, toml_file_or_str: Union[str, Path]) -> Context:\n    \"\"\"Creates Context object from a given toml file\n\n    Parameters\n    ----------\n    toml_file_or_str: Union[str, Path]\n        Pathlike string or Path that points to the toml file or string containing toml\n\n    Returns\n    -------\n    Context\n    \"\"\"\n    toml_str = toml_file_or_str\n\n    # check if toml_str is pathlike\n    if (toml_file := Path(toml_file_or_str)).exists():\n        toml_str = toml_file.read_text(encoding=\"utf-8\")\n\n    toml_dict = tomli.loads(toml_str)\n    return cls.from_dict(toml_dict)\n
"},{"location":"api_reference/index.html#koheesio.Context.from_yaml","title":"from_yaml classmethod","text":"
from_yaml(yaml_file_or_str: str) -> Context\n

Creates Context object from a given yaml file

Parameters:

Name Type Description Default yaml_file_or_str str

Pathlike string or Path that points to the yaml file, or string containing yaml

required

Returns:

Type Description Context Source code in src/koheesio/context.py
@classmethod\ndef from_yaml(cls, yaml_file_or_str: str) -> Context:\n    \"\"\"Creates Context object from a given yaml file\n\n    Parameters\n    ----------\n    yaml_file_or_str: str or Path\n        Pathlike string or Path that points to the yaml file, or string containing yaml\n\n    Returns\n    -------\n    Context\n    \"\"\"\n    yaml_str = yaml_file_or_str\n\n    # check if yaml_str is pathlike\n    if (yaml_file := Path(yaml_file_or_str)).exists():\n        yaml_str = yaml_file.read_text(encoding=\"utf-8\")\n\n    # Bandit: disable yaml.load warning\n    yaml_dict = yaml.load(yaml_str, Loader=yaml.Loader)  # nosec B506: yaml_load\n\n    return cls.from_dict(yaml_dict)\n
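
A brief usage sketch; a YAML string is handled the same way as a path to a YAML file:

context = Context.from_yaml(\"retries: 3\")\nprint(context.get(\"retries\"))  # 3\n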
"},{"location":"api_reference/index.html#koheesio.Context.get","title":"get","text":"
get(key: str, default: Any = None, safe: bool = True) -> Any\n

Get value of a given key

The key can either be an actual key (top level) or the key of a nested value. Behaves a lot like a dict's .get() method otherwise.

Parameters:

Name Type Description Default key str

Can be a real key, or can be a dotted notation of a nested key

required default Any

Default value to return

None safe bool

Toggles whether to fail or not when item cannot be found

True

Returns:

Type Description Any

Value of the requested item

Example

Example of a nested call:

context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\ncontext.get(\"a.b\")\n

Returns c

Source code in src/koheesio/context.py
def get(self, key: str, default: Any = None, safe: bool = True) -> Any:\n    \"\"\"Get value of a given key\n\n    The key can either be an actual key (top level) or the key of a nested value.\n    Behaves a lot like a dict's `.get()` method otherwise.\n\n    Parameters\n    ----------\n    key:\n        Can be a real key, or can be a dotted notation of a nested key\n    default:\n        Default value to return\n    safe:\n        Toggles whether to fail or not when item cannot be found\n\n    Returns\n    -------\n    Any\n        Value of the requested item\n\n    Example\n    -------\n    Example of a nested call:\n\n    ```python\n    context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\n    context.get(\"a.b\")\n    ```\n\n    Returns `c`\n    \"\"\"\n    try:\n        if \".\" not in key:\n            return self.__dict__[key]\n\n        # handle nested keys\n        nested_keys = key.split(\".\")\n        value = self  # parent object\n        for k in nested_keys:\n            value = value[k]  # iterate through nested values\n        return value\n\n    except (AttributeError, KeyError, TypeError) as e:\n        if not safe:\n            raise KeyError(f\"requested key '{key}' does not exist in {self}\") from e\n        return default\n
"},{"location":"api_reference/index.html#koheesio.Context.get_all","title":"get_all","text":"
get_all() -> dict\n

alias to to_dict()

Source code in src/koheesio/context.py
def get_all(self) -> dict:\n    \"\"\"alias to to_dict()\"\"\"\n    return self.to_dict()\n
"},{"location":"api_reference/index.html#koheesio.Context.get_item","title":"get_item","text":"
get_item(key: str, default: Any = None, safe: bool = True) -> Dict[str, Any]\n

Acts just like .get, except that it returns the key also

Returns:

Type Description Dict[str, Any]

key/value-pair of the requested item

Example

Example of a nested call:

context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\ncontext.get_item(\"a.b\")\n

Returns {'a.b': 'c'}

Source code in src/koheesio/context.py
def get_item(self, key: str, default: Any = None, safe: bool = True) -> Dict[str, Any]:\n    \"\"\"Acts just like `.get`, except that it returns the key also\n\n    Returns\n    -------\n    Dict[str, Any]\n        key/value-pair of the requested item\n\n    Example\n    -------\n    Example of a nested call:\n\n    ```python\n    context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\n    context.get_item(\"a.b\")\n    ```\n\n    Returns `{'a.b': 'c'}`\n    \"\"\"\n    value = self.get(key, default, safe)\n    return {key: value}\n
"},{"location":"api_reference/index.html#koheesio.Context.merge","title":"merge","text":"
merge(context: Context, recursive: bool = False) -> Context\n

Merge this context with the context of another, where the incoming context has priority.

Parameters:

Name Type Description Default context Context

Another Context class

required recursive bool

Recursively merge two dictionaries to an arbitrary depth

False

Returns:

Type Description Context

updated context

Source code in src/koheesio/context.py
def merge(self, context: Context, recursive: bool = False) -> Context:\n    \"\"\"Merge this context with the context of another, where the incoming context has priority.\n\n    Parameters\n    ----------\n    context: Context\n        Another Context class\n    recursive: bool\n        Recursively merge two dictionaries to an arbitrary depth\n\n    Returns\n    -------\n    Context\n        updated context\n    \"\"\"\n    if recursive:\n        return Context.from_dict(self._recursive_merge(target_context=self, merge_context=context).to_dict())\n\n    # just merge on the top level keys\n    return Context.from_dict({**self.to_dict(), **context.to_dict()})\n
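
A brief usage sketch of a top-level (non-recursive) merge, where the incoming context wins:

base = Context({\"a\": 1, \"b\": 2})\nupdated = base.merge(Context({\"b\": 3}))\nprint(updated.to_dict())  # {'a': 1, 'b': 3}\n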
"},{"location":"api_reference/index.html#koheesio.Context.process_value","title":"process_value","text":"
process_value(value: Any) -> Any\n

Processes the given value, converting dictionaries to Context objects as needed.

Source code in src/koheesio/context.py
def process_value(self, value: Any) -> Any:\n    \"\"\"Processes the given value, converting dictionaries to Context objects as needed.\"\"\"\n    if isinstance(value, dict):\n        return self.from_dict(value)\n\n    if isinstance(value, (list, set)):\n        return [self.from_dict(v) if isinstance(v, dict) else v for v in value]\n\n    return value\n
"},{"location":"api_reference/index.html#koheesio.Context.to_dict","title":"to_dict","text":"
to_dict() -> Dict[str, Any]\n

Returns all parameters of the context as a dict

Returns:

Type Description dict

containing all parameters of the context

Source code in src/koheesio/context.py
def to_dict(self) -> Dict[str, Any]:\n    \"\"\"Returns all parameters of the context as a dict\n\n    Returns\n    -------\n    dict\n        containing all parameters of the context\n    \"\"\"\n    result = {}\n\n    for key, value in self.__dict__.items():\n        if isinstance(value, Context):\n            result[key] = value.to_dict()\n        elif isinstance(value, list):\n            result[key] = [e.to_dict() if isinstance(e, Context) else e for e in value]\n        else:\n            result[key] = value\n\n    return result\n
"},{"location":"api_reference/index.html#koheesio.Context.to_json","title":"to_json","text":"
to_json(pretty: bool = False) -> str\n

Returns all parameters of the context as a json string

Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be stored in the Context object, which is not possible with the standard json library.

Why jsonpickle?

(from https://jsonpickle.github.io/)

Data serialized with python's pickle (or cPickle or dill) is not easily readable outside of python. Using the json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports json.

Parameters:

Name Type Description Default pretty bool

Toggles whether to return a pretty json string or not

False

Returns:

Type Description str

containing all parameters of the context

Source code in src/koheesio/context.py
def to_json(self, pretty: bool = False) -> str:\n    \"\"\"Returns all parameters of the context as a json string\n\n    Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be\n    stored in the Context object, which is not possible with the standard json library.\n\n    Why jsonpickle?\n    ---------------\n    (from https://jsonpickle.github.io/)\n\n    > Data serialized with python's pickle (or cPickle or dill) is not easily readable outside of python. Using the\n    json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex\n    data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports\n    json.\n\n    Parameters\n    ----------\n    pretty : bool, optional, default=False\n        Toggles whether to return a pretty json string or not\n\n    Returns\n    -------\n    str\n        containing all parameters of the context\n    \"\"\"\n    d = self.to_dict()\n    return jsonpickle.dumps(d, indent=4) if pretty else jsonpickle.dumps(d)\n
"},{"location":"api_reference/index.html#koheesio.Context.to_yaml","title":"to_yaml","text":"
to_yaml(clean: bool = False) -> str\n

Returns all parameters of the context as a yaml string

Parameters:

Name Type Description Default clean bool

Toggles whether to remove !!python/object:... from yaml or not. Default: False

False

Returns:

Type Description str

containing all parameters of the context

Source code in src/koheesio/context.py
def to_yaml(self, clean: bool = False) -> str:\n    \"\"\"Returns all parameters of the context as a yaml string\n\n    Parameters\n    ----------\n    clean: bool\n        Toggles whether to remove `!!python/object:...` from yaml or not.\n        Default: False\n\n    Returns\n    -------\n    str\n        containing all parameters of the context\n    \"\"\"\n    # sort_keys=False to preserve order of keys\n    yaml_str = yaml.dump(self.to_dict(), sort_keys=False)\n\n    # remove `!!python/object:...` from yaml\n    if clean:\n        remove_pattern = re.compile(r\"!!python/object:.*?\\n\")\n        yaml_str = re.sub(remove_pattern, \"\\n\", yaml_str)\n\n    return yaml_str\n
"},{"location":"api_reference/index.html#koheesio.ExtraParamsMixin","title":"koheesio.ExtraParamsMixin","text":"

Mixin class that adds support for arbitrary keyword arguments to Pydantic models.

The keyword arguments are extracted from the model's values and moved to a params dictionary.
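
A short sketch of how this behaves in combination with the koheesio BaseModel, assuming both classes can be imported from the top-level koheesio package as this reference page suggests; the MyApiCall class and its fields are hypothetical:

from koheesio import BaseModel, ExtraParamsMixin\n\n\nclass MyApiCall(ExtraParamsMixin, BaseModel):\n    url: str\n\n\ncall = MyApiCall(url=\"https://example.com\", timeout=30)\nprint(call.extra_params)  # {'timeout': 30}\n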

"},{"location":"api_reference/index.html#koheesio.ExtraParamsMixin.extra_params","title":"extra_params cached property","text":"
extra_params: Dict[str, Any]\n

Extract params (passed as arbitrary kwargs) from values and move them to params dict

"},{"location":"api_reference/index.html#koheesio.ExtraParamsMixin.params","title":"params class-attribute instance-attribute","text":"
params: Dict[str, Any] = Field(default_factory=dict)\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory","title":"koheesio.LoggingFactory","text":"
LoggingFactory(name: Optional[str] = None, env: Optional[str] = None, level: Optional[str] = None, logger_id: Optional[str] = None)\n

Logging factory to be used to generate logger instances.

Parameters:

Name Type Description Default name Optional[str] None env Optional[str] None logger_id Optional[str] None Source code in src/koheesio/logger.py
def __init__(\n    self,\n    name: Optional[str] = None,\n    env: Optional[str] = None,\n    level: Optional[str] = None,\n    logger_id: Optional[str] = None,\n):\n    \"\"\"Logging factory to be used in pipeline.Prepare logger instance.\n\n    Parameters\n    ----------\n    name logger name.\n    env environment (\"local\", \"qa\", \"prod).\n    logger_id unique identifier for the logger.\n    \"\"\"\n\n    LoggingFactory.LOGGER_NAME = name or LoggingFactory.LOGGER_NAME\n    LoggerIDFilter.LOGGER_ID = logger_id or LoggerIDFilter.LOGGER_ID\n    LoggingFactory.LOGGER_FILTER = LoggingFactory.LOGGER_FILTER or LoggerIDFilter()\n    LoggingFactory.ENV = env or LoggingFactory.ENV\n\n    console_handler = logging.StreamHandler(sys.stdout if LoggingFactory.ENV == \"local\" else sys.stderr)\n    console_handler.setFormatter(LoggingFactory.LOGGER_FORMATTER)\n    console_handler.addFilter(LoggingFactory.LOGGER_FILTER)\n    # WARNING is default level for root logger in python\n    logging.basicConfig(level=logging.WARNING, handlers=[console_handler], force=True)\n\n    LoggingFactory.CONSOLE_HANDLER = console_handler\n\n    logger = getLogger(LoggingFactory.LOGGER_NAME)\n    logger.setLevel(level or LoggingFactory.LOGGER_LEVEL)\n    LoggingFactory.LOGGER = logger\n
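
A brief usage sketch based on the signature above; the name and level values are illustrative:

from koheesio.logger import LoggingFactory\n\n# configure the koheesio logger once, early in the application\nLoggingFactory(name=\"my_pipeline\", env=\"local\", level=\"INFO\")\n\n# child loggers can then inherit that configuration\nlogger = LoggingFactory.get_logger(\"ingest\", inherit_from_koheesio=True)\nlogger.info(\"starting ingest\")\n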
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.CONSOLE_HANDLER","title":"CONSOLE_HANDLER class-attribute instance-attribute","text":"
CONSOLE_HANDLER: Optional[Handler] = None\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.ENV","title":"ENV class-attribute instance-attribute","text":"
ENV: Optional[str] = None\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER","title":"LOGGER class-attribute instance-attribute","text":"
LOGGER: Optional[Logger] = None\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_ENV","title":"LOGGER_ENV class-attribute instance-attribute","text":"
LOGGER_ENV: str = 'local'\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_FILTER","title":"LOGGER_FILTER class-attribute instance-attribute","text":"
LOGGER_FILTER: Optional[Filter] = None\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_FORMAT","title":"LOGGER_FORMAT class-attribute instance-attribute","text":"
LOGGER_FORMAT: str = '[%(logger_id)s] [%(asctime)s] [%(levelname)s] [%(name)s] {%(module)s.py:%(funcName)s:%(lineno)d} - %(message)s'\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_FORMATTER","title":"LOGGER_FORMATTER class-attribute instance-attribute","text":"
LOGGER_FORMATTER: Formatter = Formatter(LOGGER_FORMAT)\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_LEVEL","title":"LOGGER_LEVEL class-attribute instance-attribute","text":"
LOGGER_LEVEL: str = get('KOHEESIO_LOGGING_LEVEL', 'WARNING')\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_NAME","title":"LOGGER_NAME class-attribute instance-attribute","text":"
LOGGER_NAME: str = 'koheesio'\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.add_handlers","title":"add_handlers staticmethod","text":"
add_handlers(handlers: List[Tuple[str, Dict]]) -> None\n

Add handlers to existing root logger.

Parameters:

Name Type Description Default handlers List[Tuple[str, Dict]] List of tuples, each holding a handler's module+class path (as a string) and its configuration dict; a 'level' key in the config sets the handler level (defaults to 'WARNING') required Source code in src/koheesio/logger.py
@staticmethod\ndef add_handlers(handlers: List[Tuple[str, Dict]]) -> None:\n    \"\"\"Add handlers to existing root logger.\n\n    Parameters\n    ----------\n    handler_class handler module and class for importing.\n    handlers_config configuration for handler.\n\n    \"\"\"\n    for handler_module_class, handler_conf in handlers:\n        handler_class: logging.Handler = import_class(handler_module_class)\n        handler_level = handler_conf.pop(\"level\") if \"level\" in handler_conf else \"WARNING\"\n        # noinspection PyCallingNonCallable\n        handler = handler_class(**handler_conf)\n        handler.setLevel(handler_level)\n        handler.addFilter(LoggingFactory.LOGGER_FILTER)\n        handler.setFormatter(LoggingFactory.LOGGER_FORMATTER)\n        LoggingFactory.LOGGER.addHandler(handler)\n
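
For example, attaching a file handler (a sketch; the filename is illustrative). Each tuple holds the handler's module+class path and its configuration dict, from which the 'level' key is popped:

LoggingFactory(name=\"my_pipeline\", env=\"local\")  # make sure the koheesio logger exists\nLoggingFactory.add_handlers(\n    [\n        (\"logging.FileHandler\", {\"filename\": \"pipeline.log\", \"level\": \"INFO\"}),\n    ]\n)\n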
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.get_logger","title":"get_logger staticmethod","text":"
get_logger(name: str, inherit_from_koheesio: bool = False) -> Logger\n

Provide logger. If inherit_from_koheesio is set, the returned logger is namespaced under LoggingFactory.LOGGER_NAME.

Parameters:

Name Type Description Default name str required inherit_from_koheesio bool False

Returns:

Name Type Description logger Logger Source code in src/koheesio/logger.py
@staticmethod\ndef get_logger(name: str, inherit_from_koheesio: bool = False) -> Logger:\n    \"\"\"Provide logger. If inherit_from_koheesio then inherit from LoggingFactory.PIPELINE_LOGGER_NAME.\n\n    Parameters\n    ----------\n    name: Name of logger.\n    inherit_from_koheesio: Inherit logger from koheesio\n\n    Returns\n    -------\n    logger: Logger\n\n    \"\"\"\n    if inherit_from_koheesio:\n        LoggingFactory.__check_koheesio_logger_initialized()\n        name = f\"{LoggingFactory.LOGGER_NAME}.{name}\"\n\n    return getLogger(name)\n
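
For example, the returned logger name is prefixed with the koheesio logger name when inherit_from_koheesio is set:

LoggingFactory(name=\"koheesio\", env=\"local\", level=\"DEBUG\")\nlogger = LoggingFactory.get_logger(\"my_step\", inherit_from_koheesio=True)\nprint(logger.name)  # koheesio.my_step\n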
"},{"location":"api_reference/index.html#koheesio.Step","title":"koheesio.Step","text":"

Base class for a step

A custom unit of logic that can be executed.

The Step class is designed to be subclassed. To create a new step, one would subclass Step and implement the def execute(self) method, specifying the expected inputs and outputs.

Note: since the Step class is meta-classed, the execute method is wrapped with the do_execute function making it always return the Step's output. Hence, an explicit return is not needed when implementing execute.

Methods and Attributes

The Step class has several attributes and methods.

Background

A Step is an atomic operation and serves as the building block of data pipelines built with the framework. Tasks typically consist of a series of Steps.

A step can be seen as an operation on a set of inputs that returns a set of outputs. This does not, however, imply that steps are stateless (e.g. data writes)!

The diagram serves to illustrate the concept of a Step:

\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 1 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502                  \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 1 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2502                  \u2502        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n                   \u2502                  \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u2502                  \u2502        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 2 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502       Step       \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 2 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2502                  \u2502        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n                   \u2502                  \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u2502                  \u2502        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 3 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502                  \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 3 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n

Steps are built on top of Pydantic, which is a data validation and settings management using python type annotations. This allows for the automatic validation of the inputs and outputs of a Step.

  • Step inherits from BaseModel, which is a Pydantic class used to define data models. This allows Step to automatically validate data against the defined fields and their types.
  • Step is metaclassed by StepMetaClass, which is a custom metaclass that wraps the execute method of the Step class with the _execute_wrapper function. This ensures that the execute method always returns the output of the Step along with providing logging and validation of the output.
  • Step has an Output class, which is a subclass of StepOutput. This class is used to validate the output of the Step. The Output class is defined as an inner class of the Step class. The Output class can be accessed through the Step.Output attribute.
  • The Output class can be extended to add additional fields to the output of the Step.

Examples:

class MyStep(Step):\n    a: str  # input\n\n    class Output(StepOutput):  # output\n        b: str\n\n    def execute(self) -> MyStep.Output:\n        self.output.b = f\"{self.a}-some-suffix\"\n
"},{"location":"api_reference/index.html#koheesio.Step--input","title":"INPUT","text":"

The following fields are available by default on the Step class:
  • name: Name of the Step. If not set, the name of the class will be used.
  • description: Description of the Step. If not set, the docstring of the class will be used. If the docstring contains multiple lines, only the first line will be used.

When subclassing a Step, any additional pydantic field will be treated as input to the Step. See also the explanation on the .execute() method below.

"},{"location":"api_reference/index.html#koheesio.Step--output","title":"OUTPUT","text":"

Every Step has an Output class, which is a subclass of StepOutput. This class is used to validate the output of the Step. The Output class is defined as an inner class of the Step class. The Output class can be accessed through the Step.Output attribute. The Output class can be extended to add additional fields to the output of the Step. See also the explanation on the .execute().

  • Output: A nested class representing the output of the Step used to validate the output of the Step and based on the StepOutput class.
  • output: Allows you to interact with the Output of the Step lazily (see above and StepOutput)

When subclassing a Step, any additional pydantic field added to the nested Output class will be treated as output of the Step. See also the description of StepOutput for more information.

"},{"location":"api_reference/index.html#koheesio.Step--methods","title":"Methods:","text":"
  • execute: Abstract method to implement for new steps.
    • The Inputs of the step can be accessed, using self.input_name.
    • The output of the step can be accessed, using self.output.output_name.
  • run: Alias to .execute() method. You can use this to run the step, but execute is preferred.
  • to_yaml: YAML dump the step
  • get_description: Get the description of the Step

When subclassing a Step, execute is the only method that needs to be implemented. Any additional method added to the class will be treated as a method of the Step.

Note: since the Step class is meta-classed, the execute method is automatically wrapped with the do_execute function making it always return a StepOutput. See also the explanation on the do_execute function.
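
Putting this together, a small usage sketch building on the MyStep example above:

step = MyStep(a=\"foo\")\nresult = step.execute()  # returns MyStep.Output thanks to the metaclass wrapper\nprint(result.b)  # foo-some-suffix\nprint(step.output.b)  # the same output is also available lazily via step.output\n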

"},{"location":"api_reference/index.html#koheesio.Step--class-methods","title":"class methods:","text":"
  • from_step: Returns a new Step instance based on the data of another Step instance. For example: MyStep.from_step(other_step, a=\"foo\")
  • get_description: Get the description of the Step
"},{"location":"api_reference/index.html#koheesio.Step--dunder-methods","title":"dunder methods:","text":"
  • __getattr__: Allows input to be accessed through self.input_name
  • __repr__ and __str__: String representation of a step
"},{"location":"api_reference/index.html#koheesio.Step.output","title":"output property writable","text":"
output: Output\n

Interact with the output of the Step

"},{"location":"api_reference/index.html#koheesio.Step.Output","title":"Output","text":"

Output class for Step

"},{"location":"api_reference/index.html#koheesio.Step.execute","title":"execute abstractmethod","text":"
execute()\n

Abstract method to implement for new steps.

The Inputs of the step can be accessed, using self.input_name

Note: since the Step class is meta-classed, the execute method is wrapped with the do_execute function making it always return the Steps output

Source code in src/koheesio/steps/__init__.py
@abstractmethod\ndef execute(self):\n    \"\"\"Abstract method to implement for new steps.\n\n    The Inputs of the step can be accessed, using `self.input_name`\n\n    Note: since the Step class is meta-classed, the execute method is wrapped with the `do_execute` function making\n      it always return the Steps output\n    \"\"\"\n    raise NotImplementedError\n
"},{"location":"api_reference/index.html#koheesio.Step.from_step","title":"from_step classmethod","text":"
from_step(step: Step, **kwargs)\n

Returns a new Step instance based on the data of another Step or BaseModel instance

Source code in src/koheesio/steps/__init__.py
@classmethod\ndef from_step(cls, step: Step, **kwargs):\n    \"\"\"Returns a new Step instance based on the data of another Step or BaseModel instance\"\"\"\n    return cls.from_basemodel(step, **kwargs)\n
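
A brief sketch, reusing the MyStep class from the earlier example:

other_step = MyStep(a=\"foo\")\nnew_step = MyStep.from_step(other_step, a=\"bar\")  # copy the data, override 'a'\nprint(new_step.a)  # bar\n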
"},{"location":"api_reference/index.html#koheesio.Step.repr_json","title":"repr_json","text":"
repr_json(simple=False) -> str\n

dump the step to json, meant for representation

Note: use to_json if you want to dump the step to json for serialization. This method is meant for representation purposes only!

Examples:

>>> step = MyStep(a=\"foo\")\n>>> print(step.repr_json())\n{\"input\": {\"a\": \"foo\"}}\n

Parameters:

Name Type Description Default simple

When toggled to True, a briefer output will be produced. This is friendlier for logging purposes

False

Returns:

Type Description str

A string, which is valid json

Source code in src/koheesio/steps/__init__.py
def repr_json(self, simple=False) -> str:\n    \"\"\"dump the step to json, meant for representation\n\n    Note: use to_json if you want to dump the step to json for serialization\n    This method is meant for representation purposes only!\n\n    Examples\n    --------\n    ```python\n    >>> step = MyStep(a=\"foo\")\n    >>> print(step.repr_json())\n    {\"input\": {\"a\": \"foo\"}}\n    ```\n\n    Parameters\n    ----------\n    simple: bool\n        When toggled to True, a briefer output will be produced. This is friendlier for logging purposes\n\n    Returns\n    -------\n    str\n        A string, which is valid json\n    \"\"\"\n    model_dump_options = dict(warnings=\"none\", exclude_unset=True)\n\n    _result = {}\n\n    # extract input\n    _input = self.model_dump(**model_dump_options)\n\n    # remove name and description from input and add to result if simple is not set\n    name = _input.pop(\"name\", None)\n    description = _input.pop(\"description\", None)\n    if not simple:\n        if name:\n            _result[\"name\"] = name\n        if description:\n            _result[\"description\"] = description\n    else:\n        model_dump_options[\"exclude\"] = {\"name\", \"description\"}\n\n    # extract output\n    _output = self.output.model_dump(**model_dump_options)\n\n    # add output to result\n    if _output:\n        _result[\"output\"] = _output\n\n    # add input to result\n    _result[\"input\"] = _input\n\n    class MyEncoder(json.JSONEncoder):\n        \"\"\"Custom JSON Encoder to handle non-serializable types\"\"\"\n\n        def default(self, o: Any) -> Any:\n            try:\n                return super().default(o)\n            except TypeError:\n                return o.__class__.__name__\n\n    # Use MyEncoder when converting the dictionary to a JSON string\n    json_str = json.dumps(_result, cls=MyEncoder)\n\n    return json_str\n
"},{"location":"api_reference/index.html#koheesio.Step.repr_yaml","title":"repr_yaml","text":"
repr_yaml(simple=False) -> str\n

dump the step to yaml, meant for representation

Note: use to_yaml if you want to dump the step to yaml for serialization. This method is meant for representation purposes only!

Examples:

>>> step = MyStep(a=\"foo\")\n>>> print(step.repr_yaml())\ninput:\n  a: foo\n

Parameters:

Name Type Description Default simple

When toggled to True, a briefer output will be produced. This is friendlier for logging purposes

False

Returns:

Type Description str

A string, which is valid yaml

Source code in src/koheesio/steps/__init__.py
def repr_yaml(self, simple=False) -> str:\n    \"\"\"dump the step to yaml, meant for representation\n\n    Note: use to_yaml if you want to dump the step to yaml for serialization\n    This method is meant for representation purposes only!\n\n    Examples\n    --------\n    ```python\n    >>> step = MyStep(a=\"foo\")\n    >>> print(step.repr_yaml())\n    input:\n      a: foo\n    ```\n\n    Parameters\n    ----------\n    simple: bool\n        When toggled to True, a briefer output will be produced. This is friendlier for logging purposes\n\n    Returns\n    -------\n    str\n        A string, which is valid yaml\n    \"\"\"\n    json_str = self.repr_json(simple=simple)\n\n    # Parse the JSON string back into a dictionary\n    _result = json.loads(json_str)\n\n    return yaml.dump(_result)\n
"},{"location":"api_reference/index.html#koheesio.Step.run","title":"run","text":"
run()\n

Alias to .execute()

Source code in src/koheesio/steps/__init__.py
def run(self):\n    \"\"\"Alias to .execute()\"\"\"\n    return self.execute()\n
"},{"location":"api_reference/index.html#koheesio.StepOutput","title":"koheesio.StepOutput","text":"

Class for the StepOutput model

Usage

Setting up the StepOutput class is done like this:

class YourOwnOutput(StepOutput):\n    a: str\n    b: int\n

"},{"location":"api_reference/index.html#koheesio.StepOutput.model_config","title":"model_config class-attribute instance-attribute","text":"
model_config = ConfigDict(validate_default=False, defer_build=True)\n
"},{"location":"api_reference/index.html#koheesio.StepOutput.validate_output","title":"validate_output","text":"
validate_output() -> StepOutput\n

Validate the output of the Step

Essentially, this method is a wrapper around the validate method of the BaseModel class

Source code in src/koheesio/steps/__init__.py
def validate_output(self) -> StepOutput:\n    \"\"\"Validate the output of the Step\n\n    Essentially, this method is a wrapper around the validate method of the BaseModel class\n    \"\"\"\n    validated_model = self.validate()\n    return StepOutput.from_basemodel(validated_model)\n
"},{"location":"api_reference/index.html#koheesio.print_logo","title":"koheesio.print_logo","text":"
print_logo()\n
Source code in src/koheesio/__init__.py
def print_logo():\n    global _logo_printed\n    global _koheesio_print_logo\n\n    if not _logo_printed and _koheesio_print_logo:\n        print(ABOUT)\n        _logo_printed = True\n
"},{"location":"api_reference/context.html","title":"Context","text":"

The Context module is a part of the Koheesio framework and is primarily used for managing the environment configuration where a Task or Step runs. It helps in adapting the behavior of a Task/Step based on the environment it operates in, thereby avoiding the repetition of configuration values across different tasks.

The Context class, which is a key component of this module, functions similarly to a dictionary but with additional features. It supports operations like handling nested keys, recursive merging of contexts, and serialization/deserialization to and from various formats like JSON, YAML, and TOML.

For a comprehensive guide on the usage, examples, and additional features of the Context class, please refer to the reference/concepts/context section of the Koheesio documentation.

"},{"location":"api_reference/context.html#koheesio.context.Context","title":"koheesio.context.Context","text":"
Context(*args, **kwargs)\n

The Context class is a key component of the Koheesio framework, designed to manage configuration data and shared variables across tasks and steps in your application. It behaves much like a dictionary, but with added functionalities.

Key Features
  • Nested keys: Supports accessing and adding nested keys similar to dictionary keys.
  • Recursive merging: Merges two Contexts together, with the incoming Context having priority.
  • Serialization/Deserialization: Easily created from a yaml, toml, or json file, or a dictionary, and can be converted back to a dictionary.
  • Handling complex Python objects: Uses jsonpickle for serialization and deserialization of complex Python objects to and from JSON.

For a comprehensive guide on the usage, examples, and additional features of the Context class, please refer to the reference/concepts/context section of the Koheesio documentation.

Methods:

Name Description add

Add a key/value pair to the context.

get

Get value of a given key.

get_item

Acts just like .get, except that it returns the key also.

contains

Check if the context contains a given key.

merge

Merge this context with the context of another, where the incoming context has priority.

to_dict

Returns all parameters of the context as a dict.

from_dict

Creates Context object from the given dict.

from_yaml

Creates Context object from a given yaml file.

from_json

Creates Context object from a given json file.

Dunder methods
  • __iter__(): Allows for iteration across a Context.
  • __len__(): Returns the length of the Context.
  • __getitem__(item): Makes class subscriptable.
Inherited from Mapping
  • items(): Returns all items of the Context.
  • keys(): Returns all keys of the Context.
  • values(): Returns all values of the Context.
Source code in src/koheesio/context.py
def __init__(self, *args, **kwargs):\n    \"\"\"Initializes the Context object with given arguments.\"\"\"\n    for arg in args:\n        if isinstance(arg, dict):\n            kwargs.update(arg)\n        if isinstance(arg, Context):\n            kwargs = kwargs.update(arg.to_dict())\n\n    for key, value in kwargs.items():\n        self.__dict__[key] = self.process_value(value)\n
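
A short usage sketch covering nested keys and merging:

from koheesio.context import Context\n\ncontext = Context({\"env\": \"dev\", \"spark\": {\"app_name\": \"my_app\"}})\n\nprint(context.get(\"spark.app_name\"))  # my_app -- nested keys use dotted notation\n\ncontext = context.merge(Context({\"spark\": {\"master\": \"local[*]\"}}), recursive=True)\nprint(context.spark.master)  # local[*]\nprint(context.spark.app_name)  # my_app -- recursive merge keeps existing nested keys\n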
"},{"location":"api_reference/context.html#koheesio.context.Context.add","title":"add","text":"
add(key: str, value: Any) -> Context\n

Add a key/value pair to the context

Source code in src/koheesio/context.py
def add(self, key: str, value: Any) -> Context:\n    \"\"\"Add a key/value pair to the context\"\"\"\n    self.__dict__[key] = value\n    return self\n
"},{"location":"api_reference/context.html#koheesio.context.Context.contains","title":"contains","text":"
contains(key: str) -> bool\n

Check if the context contains a given key

Parameters:

Name Type Description Default key str required

Returns:

Type Description bool Source code in src/koheesio/context.py
def contains(self, key: str) -> bool:\n    \"\"\"Check if the context contains a given key\n\n    Parameters\n    ----------\n    key: str\n\n    Returns\n    -------\n    bool\n    \"\"\"\n    try:\n        self.get(key, safe=False)\n        return True\n    except KeyError:\n        return False\n
"},{"location":"api_reference/context.html#koheesio.context.Context.from_dict","title":"from_dict classmethod","text":"
from_dict(kwargs: dict) -> Context\n

Creates Context object from the given dict

Parameters:

Name Type Description Default kwargs dict required

Returns:

Type Description Context Source code in src/koheesio/context.py
@classmethod\ndef from_dict(cls, kwargs: dict) -> Context:\n    \"\"\"Creates Context object from the given dict\n\n    Parameters\n    ----------\n    kwargs: dict\n\n    Returns\n    -------\n    Context\n    \"\"\"\n    return cls(kwargs)\n
"},{"location":"api_reference/context.html#koheesio.context.Context.from_json","title":"from_json classmethod","text":"
from_json(json_file_or_str: Union[str, Path]) -> Context\n

Creates Context object from a given json file

Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be stored in the Context object, which is not possible with the standard json library.

Why jsonpickle?

(from https://jsonpickle.github.io/)

Data serialized with python\u2019s pickle (or cPickle or dill) is not easily readable outside of python. Using the json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports json.

Security

(from https://jsonpickle.github.io/)

jsonpickle should be treated the same as the Python stdlib pickle module from a security perspective.

Parameters:

Name Type Description Default json_file_or_str Union[str, Path]

Pathlike string or Path that points to the json file or string containing json

required

Returns:

Type Description Context Source code in src/koheesio/context.py
@classmethod\ndef from_json(cls, json_file_or_str: Union[str, Path]) -> Context:\n    \"\"\"Creates Context object from a given json file\n\n    Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be\n    stored in the Context object, which is not possible with the standard json library.\n\n    Why jsonpickle?\n    ---------------\n    (from https://jsonpickle.github.io/)\n\n    > Data serialized with python\u2019s pickle (or cPickle or dill) is not easily readable outside of python. Using the\n    json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex\n    data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports\n    json.\n\n    Security\n    --------\n    (from https://jsonpickle.github.io/)\n\n    > jsonpickle should be treated the same as the Python stdlib pickle module from a security perspective.\n\n    ### ! Warning !\n    > The jsonpickle module is not secure. Only unpickle data you trust.\n    It is possible to construct malicious pickle data which will execute arbitrary code during unpickling.\n    Never unpickle data that could have come from an untrusted source, or that could have been tampered with.\n    Consider signing data with an HMAC if you need to ensure that it has not been tampered with.\n    Safer deserialization approaches, such as reading JSON directly, may be more appropriate if you are processing\n    untrusted data.\n\n    Parameters\n    ----------\n    json_file_or_str : Union[str, Path]\n        Pathlike string or Path that points to the json file or string containing json\n\n    Returns\n    -------\n    Context\n    \"\"\"\n    json_str = json_file_or_str\n\n    # check if json_str is pathlike\n    if (json_file := Path(json_file_or_str)).exists():\n        json_str = json_file.read_text(encoding=\"utf-8\")\n\n    json_dict = jsonpickle.loads(json_str)\n    return cls.from_dict(json_dict)\n
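
A minimal round-trip sketch (mind the security warning below when the JSON comes from an untrusted source):

context = Context({\"env\": \"dev\", \"retries\": 3})\njson_str = context.to_json()\nrestored = Context.from_json(json_str)\nprint(restored.get(\"retries\"))  # 3\n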
"},{"location":"api_reference/context.html#koheesio.context.Context.from_json--warning","title":"! Warning !","text":"

The jsonpickle module is not secure. Only unpickle data you trust. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with. Consider signing data with an HMAC if you need to ensure that it has not been tampered with. Safer deserialization approaches, such as reading JSON directly, may be more appropriate if you are processing untrusted data.

"},{"location":"api_reference/context.html#koheesio.context.Context.from_toml","title":"from_toml classmethod","text":"
from_toml(toml_file_or_str: Union[str, Path]) -> Context\n

Creates Context object from a given toml file

Parameters:

Name Type Description Default toml_file_or_str Union[str, Path]

Pathlike string or Path that points to the toml file or string containing toml

required

Returns:

Type Description Context Source code in src/koheesio/context.py
@classmethod\ndef from_toml(cls, toml_file_or_str: Union[str, Path]) -> Context:\n    \"\"\"Creates Context object from a given toml file\n\n    Parameters\n    ----------\n    toml_file_or_str: Union[str, Path]\n        Pathlike string or Path that points to the toml file or string containing toml\n\n    Returns\n    -------\n    Context\n    \"\"\"\n    toml_str = toml_file_or_str\n\n    # check if toml_str is pathlike\n    if (toml_file := Path(toml_file_or_str)).exists():\n        toml_str = toml_file.read_text(encoding=\"utf-8\")\n\n    toml_dict = tomli.loads(toml_str)\n    return cls.from_dict(toml_dict)\n
"},{"location":"api_reference/context.html#koheesio.context.Context.from_yaml","title":"from_yaml classmethod","text":"
from_yaml(yaml_file_or_str: str) -> Context\n

Creates Context object from a given yaml file

Parameters:

Name Type Description Default yaml_file_or_str str

Pathlike string or Path that points to the yaml file, or string containing yaml

required

Returns:

Type Description Context Source code in src/koheesio/context.py
@classmethod\ndef from_yaml(cls, yaml_file_or_str: str) -> Context:\n    \"\"\"Creates Context object from a given yaml file\n\n    Parameters\n    ----------\n    yaml_file_or_str: str or Path\n        Pathlike string or Path that points to the yaml file, or string containing yaml\n\n    Returns\n    -------\n    Context\n    \"\"\"\n    yaml_str = yaml_file_or_str\n\n    # check if yaml_str is pathlike\n    if (yaml_file := Path(yaml_file_or_str)).exists():\n        yaml_str = yaml_file.read_text(encoding=\"utf-8\")\n\n    # Bandit: disable yaml.load warning\n    yaml_dict = yaml.load(yaml_str, Loader=yaml.Loader)  # nosec B506: yaml_load\n\n    return cls.from_dict(yaml_dict)\n
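
For example, loading configuration from a YAML string (a path to a yaml file works the same way):

yaml_source = \"\"\"\nenv: dev\nspark:\n  app_name: my_app\n\"\"\"\n\ncontext = Context.from_yaml(yaml_source)\nprint(context.get(\"spark.app_name\"))  # my_app\n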
"},{"location":"api_reference/context.html#koheesio.context.Context.get","title":"get","text":"
get(key: str, default: Any = None, safe: bool = True) -> Any\n

Get value of a given key

The key can either be an actual key (top level) or the key of a nested value. Behaves a lot like a dict's .get() method otherwise.

Parameters:

Name Type Description Default key str

Can be a real key, or can be a dotted notation of a nested key

required default Any

Default value to return

None safe bool

Toggles whether to fail or not when item cannot be found

True

Returns:

Type Description Any

Value of the requested item

Example

Example of a nested call:

context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\ncontext.get(\"a.b\")\n

Returns c

Source code in src/koheesio/context.py
def get(self, key: str, default: Any = None, safe: bool = True) -> Any:\n    \"\"\"Get value of a given key\n\n    The key can either be an actual key (top level) or the key of a nested value.\n    Behaves a lot like a dict's `.get()` method otherwise.\n\n    Parameters\n    ----------\n    key:\n        Can be a real key, or can be a dotted notation of a nested key\n    default:\n        Default value to return\n    safe:\n        Toggles whether to fail or not when item cannot be found\n\n    Returns\n    -------\n    Any\n        Value of the requested item\n\n    Example\n    -------\n    Example of a nested call:\n\n    ```python\n    context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\n    context.get(\"a.b\")\n    ```\n\n    Returns `c`\n    \"\"\"\n    try:\n        if \".\" not in key:\n            return self.__dict__[key]\n\n        # handle nested keys\n        nested_keys = key.split(\".\")\n        value = self  # parent object\n        for k in nested_keys:\n            value = value[k]  # iterate through nested values\n        return value\n\n    except (AttributeError, KeyError, TypeError) as e:\n        if not safe:\n            raise KeyError(f\"requested key '{key}' does not exist in {self}\") from e\n        return default\n
"},{"location":"api_reference/context.html#koheesio.context.Context.get_all","title":"get_all","text":"
get_all() -> dict\n

alias to to_dict()

Source code in src/koheesio/context.py
def get_all(self) -> dict:\n    \"\"\"alias to to_dict()\"\"\"\n    return self.to_dict()\n
"},{"location":"api_reference/context.html#koheesio.context.Context.get_item","title":"get_item","text":"
get_item(key: str, default: Any = None, safe: bool = True) -> Dict[str, Any]\n

Acts just like .get, except that it returns the key also

Returns:

Type Description Dict[str, Any]

key/value-pair of the requested item

Example

Example of a nested call:

context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\ncontext.get_item(\"a.b\")\n

Returns {'a.b': 'c'}

Source code in src/koheesio/context.py
def get_item(self, key: str, default: Any = None, safe: bool = True) -> Dict[str, Any]:\n    \"\"\"Acts just like `.get`, except that it returns the key also\n\n    Returns\n    -------\n    Dict[str, Any]\n        key/value-pair of the requested item\n\n    Example\n    -------\n    Example of a nested call:\n\n    ```python\n    context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\n    context.get_item(\"a.b\")\n    ```\n\n    Returns `{'a.b': 'c'}`\n    \"\"\"\n    value = self.get(key, default, safe)\n    return {key: value}\n
"},{"location":"api_reference/context.html#koheesio.context.Context.merge","title":"merge","text":"
merge(context: Context, recursive: bool = False) -> Context\n

Merge this context with the context of another, where the incoming context has priority.

Parameters:

Name Type Description Default context Context

Another Context class

required recursive bool

Recursively merge two dictionaries to an arbitrary depth

False

Returns:

Type Description Context

updated context

Source code in src/koheesio/context.py
def merge(self, context: Context, recursive: bool = False) -> Context:\n    \"\"\"Merge this context with the context of another, where the incoming context has priority.\n\n    Parameters\n    ----------\n    context: Context\n        Another Context class\n    recursive: bool\n        Recursively merge two dictionaries to an arbitrary depth\n\n    Returns\n    -------\n    Context\n        updated context\n    \"\"\"\n    if recursive:\n        return Context.from_dict(self._recursive_merge(target_context=self, merge_context=context).to_dict())\n\n    # just merge on the top level keys\n    return Context.from_dict({**self.to_dict(), **context.to_dict()})\n
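
The difference between a plain (top-level) and a recursive merge, sketched:

base = Context({\"db\": {\"host\": \"localhost\", \"port\": 5432}})\noverride = Context({\"db\": {\"host\": \"prod-db\"}})\n\nplain = base.merge(override)  # top-level keys only: 'db' is replaced entirely\nprint(plain.get(\"db.port\"))  # None -- the nested 'port' key is gone\n\ndeep = base.merge(override, recursive=True)\nprint(deep.get(\"db.port\"))  # 5432 -- nested keys are preserved\nprint(deep.get(\"db.host\"))  # prod-db -- the incoming context has priority\n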
"},{"location":"api_reference/context.html#koheesio.context.Context.process_value","title":"process_value","text":"
process_value(value: Any) -> Any\n

Processes the given value, converting dictionaries to Context objects as needed.

Source code in src/koheesio/context.py
def process_value(self, value: Any) -> Any:\n    \"\"\"Processes the given value, converting dictionaries to Context objects as needed.\"\"\"\n    if isinstance(value, dict):\n        return self.from_dict(value)\n\n    if isinstance(value, (list, set)):\n        return [self.from_dict(v) if isinstance(v, dict) else v for v in value]\n\n    return value\n
"},{"location":"api_reference/context.html#koheesio.context.Context.to_dict","title":"to_dict","text":"
to_dict() -> Dict[str, Any]\n

Returns all parameters of the context as a dict

Returns:

Type Description dict

containing all parameters of the context

Source code in src/koheesio/context.py
def to_dict(self) -> Dict[str, Any]:\n    \"\"\"Returns all parameters of the context as a dict\n\n    Returns\n    -------\n    dict\n        containing all parameters of the context\n    \"\"\"\n    result = {}\n\n    for key, value in self.__dict__.items():\n        if isinstance(value, Context):\n            result[key] = value.to_dict()\n        elif isinstance(value, list):\n            result[key] = [e.to_dict() if isinstance(e, Context) else e for e in value]\n        else:\n            result[key] = value\n\n    return result\n
"},{"location":"api_reference/context.html#koheesio.context.Context.to_json","title":"to_json","text":"
to_json(pretty: bool = False) -> str\n

Returns all parameters of the context as a json string

Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be stored in the Context object, which is not possible with the standard json library.

Why jsonpickle?

(from https://jsonpickle.github.io/)

Data serialized with python's pickle (or cPickle or dill) is not easily readable outside of python. Using the json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports json.

Parameters:

Name Type Description Default pretty bool

Toggles whether to return a pretty json string or not

False

Returns:

Type Description str

containing all parameters of the context

Source code in src/koheesio/context.py
def to_json(self, pretty: bool = False) -> str:\n    \"\"\"Returns all parameters of the context as a json string\n\n    Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be\n    stored in the Context object, which is not possible with the standard json library.\n\n    Why jsonpickle?\n    ---------------\n    (from https://jsonpickle.github.io/)\n\n    > Data serialized with python's pickle (or cPickle or dill) is not easily readable outside of python. Using the\n    json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex\n    data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports\n    json.\n\n    Parameters\n    ----------\n    pretty : bool, optional, default=False\n        Toggles whether to return a pretty json string or not\n\n    Returns\n    -------\n    str\n        containing all parameters of the context\n    \"\"\"\n    d = self.to_dict()\n    return jsonpickle.dumps(d, indent=4) if pretty else jsonpickle.dumps(d)\n
"},{"location":"api_reference/context.html#koheesio.context.Context.to_yaml","title":"to_yaml","text":"
to_yaml(clean: bool = False) -> str\n

Returns all parameters of the context as a yaml string

Parameters:

Name Type Description Default clean bool

Toggles whether to remove !!python/object:... from yaml or not. Default: False

False

Returns:

Type Description str

containing all parameters of the context

Source code in src/koheesio/context.py
def to_yaml(self, clean: bool = False) -> str:\n    \"\"\"Returns all parameters of the context as a yaml string\n\n    Parameters\n    ----------\n    clean: bool\n        Toggles whether to remove `!!python/object:...` from yaml or not.\n        Default: False\n\n    Returns\n    -------\n    str\n        containing all parameters of the context\n    \"\"\"\n    # sort_keys=False to preserve order of keys\n    yaml_str = yaml.dump(self.to_dict(), sort_keys=False)\n\n    # remove `!!python/object:...` from yaml\n    if clean:\n        remove_pattern = re.compile(r\"!!python/object:.*?\\n\")\n        yaml_str = re.sub(remove_pattern, \"\\n\", yaml_str)\n\n    return yaml_str\n
"},{"location":"api_reference/intro_api.html","title":"Intro api","text":""},{"location":"api_reference/intro_api.html#api-reference","title":"API Reference","text":"

You can navigate the API by clicking on the modules listed on the left to access the documentation.

"},{"location":"api_reference/logger.html","title":"Logger","text":"

Loggers are used to log messages from your application.

For a comprehensive guide on the usage, examples, and additional features of the logging classes, please refer to the reference/concepts/logging section of the Koheesio documentation.

Classes:

Name Description LoggingFactory

Logging factory to be used to generate logger instances.

Masked

Represents a masked value.

MaskedString

Represents a masked string value.

MaskedInt

Represents a masked integer value.

MaskedFloat

Represents a masked float value.

MaskedDict

Represents a masked dictionary value.

LoggerIDFilter

Filter which injects run_id information into the log.

Functions:

Name Description warn

Issue a warning.

"},{"location":"api_reference/logger.html#koheesio.logger.T","title":"koheesio.logger.T module-attribute","text":"
T = TypeVar('T')\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggerIDFilter","title":"koheesio.logger.LoggerIDFilter","text":"

Filter which injects run_id information into the log.

"},{"location":"api_reference/logger.html#koheesio.logger.LoggerIDFilter.LOGGER_ID","title":"LOGGER_ID class-attribute instance-attribute","text":"
LOGGER_ID: str = str(uuid4())\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggerIDFilter.filter","title":"filter","text":"
filter(record)\n
Source code in src/koheesio/logger.py
def filter(self, record):\n    record.logger_id = LoggerIDFilter.LOGGER_ID\n\n    return True\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory","title":"koheesio.logger.LoggingFactory","text":"
LoggingFactory(name: Optional[str] = None, env: Optional[str] = None, level: Optional[str] = None, logger_id: Optional[str] = None)\n

Logging factory to be used to generate logger instances.

Parameters:

Name Type Description Default name Optional[str] Logger name None env Optional[str] Environment ('local', 'qa', 'prod') None level Optional[str] Logging level None logger_id Optional[str] Unique identifier for the logger None Source code in src/koheesio/logger.py
def __init__(\n    self,\n    name: Optional[str] = None,\n    env: Optional[str] = None,\n    level: Optional[str] = None,\n    logger_id: Optional[str] = None,\n):\n    \"\"\"Logging factory to be used in pipeline.Prepare logger instance.\n\n    Parameters\n    ----------\n    name logger name.\n    env environment (\"local\", \"qa\", \"prod).\n    logger_id unique identifier for the logger.\n    \"\"\"\n\n    LoggingFactory.LOGGER_NAME = name or LoggingFactory.LOGGER_NAME\n    LoggerIDFilter.LOGGER_ID = logger_id or LoggerIDFilter.LOGGER_ID\n    LoggingFactory.LOGGER_FILTER = LoggingFactory.LOGGER_FILTER or LoggerIDFilter()\n    LoggingFactory.ENV = env or LoggingFactory.ENV\n\n    console_handler = logging.StreamHandler(sys.stdout if LoggingFactory.ENV == \"local\" else sys.stderr)\n    console_handler.setFormatter(LoggingFactory.LOGGER_FORMATTER)\n    console_handler.addFilter(LoggingFactory.LOGGER_FILTER)\n    # WARNING is default level for root logger in python\n    logging.basicConfig(level=logging.WARNING, handlers=[console_handler], force=True)\n\n    LoggingFactory.CONSOLE_HANDLER = console_handler\n\n    logger = getLogger(LoggingFactory.LOGGER_NAME)\n    logger.setLevel(level or LoggingFactory.LOGGER_LEVEL)\n    LoggingFactory.LOGGER = logger\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.CONSOLE_HANDLER","title":"CONSOLE_HANDLER class-attribute instance-attribute","text":"
CONSOLE_HANDLER: Optional[Handler] = None\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.ENV","title":"ENV class-attribute instance-attribute","text":"
ENV: Optional[str] = None\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER","title":"LOGGER class-attribute instance-attribute","text":"
LOGGER: Optional[Logger] = None\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_ENV","title":"LOGGER_ENV class-attribute instance-attribute","text":"
LOGGER_ENV: str = 'local'\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_FILTER","title":"LOGGER_FILTER class-attribute instance-attribute","text":"
LOGGER_FILTER: Optional[Filter] = None\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_FORMAT","title":"LOGGER_FORMAT class-attribute instance-attribute","text":"
LOGGER_FORMAT: str = '[%(logger_id)s] [%(asctime)s] [%(levelname)s] [%(name)s] {%(module)s.py:%(funcName)s:%(lineno)d} - %(message)s'\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_FORMATTER","title":"LOGGER_FORMATTER class-attribute instance-attribute","text":"
LOGGER_FORMATTER: Formatter = Formatter(LOGGER_FORMAT)\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_LEVEL","title":"LOGGER_LEVEL class-attribute instance-attribute","text":"
LOGGER_LEVEL: str = get('KOHEESIO_LOGGING_LEVEL', 'WARNING')\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_NAME","title":"LOGGER_NAME class-attribute instance-attribute","text":"
LOGGER_NAME: str = 'koheesio'\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.add_handlers","title":"add_handlers staticmethod","text":"
add_handlers(handlers: List[Tuple[str, Dict]]) -> None\n

Add handlers to existing root logger.

Parameters:

Name Type Description Default handlers List[Tuple[str, Dict]] List of tuples, each holding a handler's module+class path (as a string) and its configuration dict; a 'level' key in the config sets the handler level (defaults to 'WARNING') required Source code in src/koheesio/logger.py
@staticmethod\ndef add_handlers(handlers: List[Tuple[str, Dict]]) -> None:\n    \"\"\"Add handlers to existing root logger.\n\n    Parameters\n    ----------\n    handler_class handler module and class for importing.\n    handlers_config configuration for handler.\n\n    \"\"\"\n    for handler_module_class, handler_conf in handlers:\n        handler_class: logging.Handler = import_class(handler_module_class)\n        handler_level = handler_conf.pop(\"level\") if \"level\" in handler_conf else \"WARNING\"\n        # noinspection PyCallingNonCallable\n        handler = handler_class(**handler_conf)\n        handler.setLevel(handler_level)\n        handler.addFilter(LoggingFactory.LOGGER_FILTER)\n        handler.setFormatter(LoggingFactory.LOGGER_FORMATTER)\n        LoggingFactory.LOGGER.addHandler(handler)\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.get_logger","title":"get_logger staticmethod","text":"
get_logger(name: str, inherit_from_koheesio: bool = False) -> Logger\n

Provide logger. If inherit_from_koheesio is set, the returned logger is namespaced under LoggingFactory.LOGGER_NAME.

Parameters:

Name Type Description Default name str required inherit_from_koheesio bool False

Returns:

Name Type Description logger Logger Source code in src/koheesio/logger.py
@staticmethod\ndef get_logger(name: str, inherit_from_koheesio: bool = False) -> Logger:\n    \"\"\"Provide logger. If inherit_from_koheesio then inherit from LoggingFactory.PIPELINE_LOGGER_NAME.\n\n    Parameters\n    ----------\n    name: Name of logger.\n    inherit_from_koheesio: Inherit logger from koheesio\n\n    Returns\n    -------\n    logger: Logger\n\n    \"\"\"\n    if inherit_from_koheesio:\n        LoggingFactory.__check_koheesio_logger_initialized()\n        name = f\"{LoggingFactory.LOGGER_NAME}.{name}\"\n\n    return getLogger(name)\n
"},{"location":"api_reference/logger.html#koheesio.logger.Masked","title":"koheesio.logger.Masked","text":"
Masked(value: T)\n

Represents a masked value.

Parameters:

Name Type Description Default value T

The value to be masked.

required

Attributes:

Name Type Description _value T

The original value.

Methods:

Name Description __repr__

Returns a string representation of the masked value.

__str__

Returns a string representation of the masked value.

__get_validators__

Returns a generator of validators for the masked value.

validate

Validates the masked value.

Source code in src/koheesio/logger.py
def __init__(self, value: T):\n    self._value = value\n
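
A hedged sketch of the intent: wrapping a sensitive value so that its string representation does not expose the raw content (the exact masked output depends on the implementation):

from koheesio.logger import Masked\n\ntoken = Masked(\"super-secret-token\")  # illustrative secret value\nprint(token)  # prints a masked representation instead of the raw token\n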
"},{"location":"api_reference/logger.html#koheesio.logger.Masked.validate","title":"validate classmethod","text":"
validate(v: Any, _values)\n

Validate the input value and return an instance of the class.

Parameters:

Name Type Description Default v Any

The input value to validate.

required _values Any

Additional values used for validation.

required

Returns:

Name Type Description instance cls

An instance of the class.

Source code in src/koheesio/logger.py
@classmethod\ndef validate(cls, v: Any, _values):\n    \"\"\"\n    Validate the input value and return an instance of the class.\n\n    Parameters\n    ----------\n    v : Any\n        The input value to validate.\n    _values : Any\n        Additional values used for validation.\n\n    Returns\n    -------\n    instance : cls\n        An instance of the class.\n\n    \"\"\"\n    return cls(v)\n
"},{"location":"api_reference/logger.html#koheesio.logger.MaskedDict","title":"koheesio.logger.MaskedDict","text":"
MaskedDict(value: T)\n

Represents a masked dictionary value.

Source code in src/koheesio/logger.py
def __init__(self, value: T):\n    self._value = value\n
"},{"location":"api_reference/logger.html#koheesio.logger.MaskedFloat","title":"koheesio.logger.MaskedFloat","text":"
MaskedFloat(value: T)\n

Represents a masked float value.

Source code in src/koheesio/logger.py
def __init__(self, value: T):\n    self._value = value\n
"},{"location":"api_reference/logger.html#koheesio.logger.MaskedInt","title":"koheesio.logger.MaskedInt","text":"
MaskedInt(value: T)\n

Represents a masked integer value.

Source code in src/koheesio/logger.py
def __init__(self, value: T):\n    self._value = value\n
"},{"location":"api_reference/logger.html#koheesio.logger.MaskedString","title":"koheesio.logger.MaskedString","text":"
MaskedString(value: T)\n

Represents a masked string value.

Source code in src/koheesio/logger.py
def __init__(self, value: T):\n    self._value = value\n
"},{"location":"api_reference/utils.html","title":"Utils","text":"

Utility functions

"},{"location":"api_reference/utils.html#koheesio.utils.convert_str_to_bool","title":"koheesio.utils.convert_str_to_bool","text":"
convert_str_to_bool(value) -> Any\n

Converts a string to a boolean if the string is either 'true' or 'false'

Source code in src/koheesio/utils.py
def convert_str_to_bool(value) -> Any:\n    \"\"\"Converts a string to a boolean if the string is either 'true' or 'false'\"\"\"\n    if isinstance(value, str) and (v := value.lower()) in [\"true\", \"false\"]:\n        value = v == \"true\"\n    return value\n
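
For example:

from koheesio.utils import convert_str_to_bool\n\nconvert_str_to_bool(\"True\")  # True (matching is case-insensitive)\nconvert_str_to_bool(\"false\")  # False\nconvert_str_to_bool(\"yes\")  # 'yes' -- anything else is returned unchanged\n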
"},{"location":"api_reference/utils.html#koheesio.utils.get_args_for_func","title":"koheesio.utils.get_args_for_func","text":"
get_args_for_func(func: Callable, params: Dict) -> Tuple[Callable, Dict[str, Any]]\n

Helper function that matches keyword arguments (params) on a given function

This function uses inspect to extract the signature on the passed Callable, and then uses functools.partial to construct a new Callable (partial) function on which the input was mapped.

Example
input_dict = {\"a\": \"foo\", \"b\": \"bar\"}\n\n\ndef example_func(a: str):\n    return a\n\n\nfunc, kwargs = get_args_for_func(example_func, input_dict)\n

In this example:
  • func would be a callable with the input mapped toward it (i.e. it can be called like any normal function)
  • kwargs would be a dict holding just the keyword arguments needed to be able to run the function (e.g. {\"a\": \"foo\"})

Parameters:

Name Type Description Default func Callable

The function to inspect

required params Dict

Dictionary with keyword values that will be mapped on the 'func'

required

Returns:

Type Description Tuple[Callable, Dict[str, Any]]
  • Callable a partial() func with the found keyword values mapped toward it
  • Dict[str, Any] the keyword args that match the func
Source code in src/koheesio/utils.py
def get_args_for_func(func: Callable, params: Dict) -> Tuple[Callable, Dict[str, Any]]:\n    \"\"\"Helper function that matches keyword arguments (params) on a given function\n\n    This function uses inspect to extract the signature on the passed Callable, and then uses functools.partial to\n     construct a new Callable (partial) function on which the input was mapped.\n\n    Example\n    -------\n    ```python\n    input_dict = {\"a\": \"foo\", \"b\": \"bar\"}\n\n\n    def example_func(a: str):\n        return a\n\n\n    func, kwargs = get_args_for_func(example_func, input_dict)\n    ```\n\n    In this example,\n    - `func` would be a callable with the input mapped toward it (i.e. can be called like any normal function)\n    - `kwargs` would be a dict holding just the output needed to be able to run the function (e.g. {\"a\": \"foo\"})\n\n    Parameters\n    ----------\n    func: Callable\n        The function to inspect\n    params: Dict\n        Dictionary with keyword values that will be mapped on the 'func'\n\n    Returns\n    -------\n    Tuple[Callable, Dict[str, Any]]\n        - Callable\n            a partial() func with the found keyword values mapped toward it\n        - Dict[str, Any]\n            the keyword args that match the func\n    \"\"\"\n    _kwargs = {k: v for k, v in params.items() if k in inspect.getfullargspec(func).args}\n    return (\n        partial(func, **_kwargs),\n        _kwargs,\n    )\n
"},{"location":"api_reference/utils.html#koheesio.utils.get_project_root","title":"koheesio.utils.get_project_root","text":"
get_project_root() -> Path\n

Returns project root path.

Source code in src/koheesio/utils.py
def get_project_root() -> Path:\n    \"\"\"Returns project root path.\"\"\"\n    cmd = Path(__file__)\n    return Path([i for i in cmd.parents if i.as_uri().endswith(\"src\")][0]).parent\n
"},{"location":"api_reference/utils.html#koheesio.utils.get_random_string","title":"koheesio.utils.get_random_string","text":"
get_random_string(length: int = 64, prefix: Optional[str] = None) -> str\n

Generate a random string of specified length

Source code in src/koheesio/utils.py
def get_random_string(length: int = 64, prefix: Optional[str] = None) -> str:\n    \"\"\"Generate a random string of specified length\"\"\"\n    if prefix:\n        return f\"{prefix}_{uuid.uuid4().hex}\"[0:length]\n    return f\"{uuid.uuid4().hex}\"[0:length]\n
"},{"location":"api_reference/utils.html#koheesio.utils.import_class","title":"koheesio.utils.import_class","text":"
import_class(module_class: str) -> Any\n

Import class and module based on provided string.

Parameters:

Name Type Description Default module_class str required

Returns:

Type Description object Class from specified input string. Source code in src/koheesio/utils.py
def import_class(module_class: str) -> Any:\n    \"\"\"Import class and module based on provided string.\n\n    Parameters\n    ----------\n    module_class module+class to be imported.\n\n    Returns\n    -------\n    object  Class from specified input string.\n\n    \"\"\"\n    module_path, class_name = module_class.rsplit(\".\", 1)\n    module = import_module(module_path)\n\n    return getattr(module, class_name)\n
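
For example:

from koheesio.utils import import_class\n\nhandler_class = import_class(\"logging.StreamHandler\")\nprint(handler_class)  # <class 'logging.StreamHandler'>\n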
"},{"location":"api_reference/asyncio/index.html","title":"Asyncio","text":"

This module provides classes for asynchronous steps in the koheesio package.

"},{"location":"api_reference/asyncio/index.html#koheesio.asyncio.AsyncStep","title":"koheesio.asyncio.AsyncStep","text":"

Asynchronous step class that inherits from Step and uses the AsyncStepMetaClass metaclass.

Attributes:

Name Type Description Output AsyncStepOutput

The output class for the asynchronous step.

"},{"location":"api_reference/asyncio/index.html#koheesio.asyncio.AsyncStep.Output","title":"Output","text":"

Output class for asyncio step.

This class represents the output of the asyncio step. It inherits from the AsyncStepOutput class.

"},{"location":"api_reference/asyncio/index.html#koheesio.asyncio.AsyncStepMetaClass","title":"koheesio.asyncio.AsyncStepMetaClass","text":"

Metaclass for asynchronous steps.

This metaclass is used to define asynchronous steps in the Koheesio framework. It inherits from the StepMetaClass and provides additional functionality for executing asynchronous steps.

Attributes: None

Methods: _execute_wrapper: Wrapper method for executing asynchronous steps.

"},{"location":"api_reference/asyncio/index.html#koheesio.asyncio.AsyncStepOutput","title":"koheesio.asyncio.AsyncStepOutput","text":"

Represents the output of an asynchronous step.

This class extends the base Step.Output class and provides additional functionality for merging key-value maps.

Attributes:

Name Type Description ...

Methods:

Name Description merge

Merge key-value map with self.

"},{"location":"api_reference/asyncio/index.html#koheesio.asyncio.AsyncStepOutput.merge","title":"merge","text":"
merge(other: Union[Dict, StepOutput])\n

Merge key,value map with self

Examples:

step_output = StepOutput(foo=\"bar\")\nstep_output.merge({\"lorem\": \"ipsum\"})  # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n

Functionally similar to adding two dicts together; like running {**dict_a, **dict_b}.

Parameters:

Name Type Description Default other Union[Dict, StepOutput]

Dict or another instance of a StepOutputs class that will be added to self

required Source code in src/koheesio/asyncio/__init__.py
def merge(self, other: Union[Dict, StepOutput]):\n    \"\"\"Merge key,value map with self\n\n    Examples\n    --------\n    ```python\n    step_output = StepOutput(foo=\"bar\")\n    step_output.merge({\"lorem\": \"ipsum\"})  # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n    ```\n\n    Functionally similar to adding two dicts together; like running `{**dict_a, **dict_b}`.\n\n    Parameters\n    ----------\n    other: Union[Dict, StepOutput]\n        Dict or another instance of a StepOutputs class that will be added to self\n    \"\"\"\n    if isinstance(other, StepOutput):\n        other = other.model_dump()  # ensures we really have a dict\n\n    if not iscoroutine(other):\n        for k, v in other.items():\n            self.set(k, v)\n\n    return self\n
"},{"location":"api_reference/asyncio/http.html","title":"Http","text":"

This module contains async implementation of HTTP step.

"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpGetStep","title":"koheesio.asyncio.http.AsyncHttpGetStep","text":"

Represents an asynchronous HTTP GET step.

This class inherits from the AsyncHttpStep class and specifies the HTTP method as GET.

Attributes: method (HttpMethod): The HTTP method for the step, set to HttpMethod.GET.

"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpGetStep.method","title":"method class-attribute instance-attribute","text":"
method: HttpMethod = GET\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep","title":"koheesio.asyncio.http.AsyncHttpStep","text":"

Asynchronous HTTP step for making HTTP requests using aiohttp.

Parameters:

Name Type Description Default client_session Optional[ClientSession]

Aiohttp ClientSession.

required url List[URL]

List of yarl.URL.

required retry_options Optional[RetryOptionsBase]

Retry options for the request.

required connector Optional[BaseConnector]

Connector for the aiohttp request.

required headers Optional[Dict[str, Union[str, SecretStr]]]

Request headers.

required Output

responses_urls : Optional[List[Tuple[Dict[str, Any], yarl.URL]]] List of responses from the API and request URL.

Examples:

>>> import asyncio\n>>> from aiohttp import ClientSession\n>>> from aiohttp.connector import TCPConnector\n>>> from aiohttp_retry import ExponentialRetry\n>>> from koheesio.asyncio.http import AsyncHttpStep\n>>> from yarl import URL\n>>> from typing import Dict, Any, Union, List, Tuple\n>>>\n>>> # Initialize the AsyncHttpStep\n>>> async def main():\n>>>     session = ClientSession()\n>>>     urls = [URL('https://example.com/api/1'), URL('https://example.com/api/2')]\n>>>     retry_options = ExponentialRetry()\n>>>     connector = TCPConnector(limit=10)\n>>>     headers = {'Content-Type': 'application/json'}\n>>>     step = AsyncHttpStep(\n>>>         client_session=session,\n>>>         url=urls,\n>>>         retry_options=retry_options,\n>>>         connector=connector,\n>>>         headers=headers\n>>>     )\n>>>\n>>>     # Execute the step\n>>>     responses_urls = await step.get()\n>>>\n>>>     return responses_urls\n>>>\n>>> # Run the main function\n>>> responses_urls = asyncio.run(main())\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.client_session","title":"client_session class-attribute instance-attribute","text":"
client_session: Optional[ClientSession] = Field(default=None, description='Aiohttp ClientSession', exclude=True)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.connector","title":"connector class-attribute instance-attribute","text":"
connector: Optional[BaseConnector] = Field(default=None, description='Connector for the aiohttp request', exclude=True)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.headers","title":"headers class-attribute instance-attribute","text":"
headers: Dict[str, Union[str, SecretStr]] = Field(default_factory=dict, description='Request headers', alias='header', exclude=True)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.method","title":"method class-attribute instance-attribute","text":"
method: Union[str, HttpMethod] = Field(default=GET, description=\"What type of Http call to perform. One of 'get', 'post', 'put', 'delete'. Defaults to 'get'.\")\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.retry_options","title":"retry_options class-attribute instance-attribute","text":"
retry_options: Optional[RetryOptionsBase] = Field(default=None, description='Retry options for the request', exclude=True)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.timeout","title":"timeout class-attribute instance-attribute","text":"
timeout: None = Field(default=None, description='[Optional] Request timeout')\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.url","title":"url class-attribute instance-attribute","text":"
url: List[URL] = Field(default=None, alias='urls', description='Expecting list, as there is no value in executing async request for one value.\\n        yarl.URL is preferable, because params/data can be injected into URL instance', exclude=True)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.Output","title":"Output","text":"

Output class for Step

"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.Output.responses_urls","title":"responses_urls class-attribute instance-attribute","text":"
responses_urls: Optional[List[Tuple[Dict[str, Any], URL]]] = Field(default=None, description='List of responses from the API and request URL', repr=False)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.delete","title":"delete async","text":"
delete() -> List[Tuple[Dict[str, Any], URL]]\n

Make DELETE requests.

Returns:

Type Description List[Tuple[Dict[str, Any], URL]]

A list of response data and corresponding request URLs.

Source code in src/koheesio/asyncio/http.py
async def delete(self) -> List[Tuple[Dict[str, Any], yarl.URL]]:\n    \"\"\"\n    Make DELETE requests.\n\n    Returns\n    -------\n    List[Tuple[Dict[str, Any], yarl.URL]]\n        A list of response data and corresponding request URLs.\n    \"\"\"\n    tasks = self.__tasks_generator(method=HttpMethod.DELETE)\n    responses_urls = await self._execute(tasks=tasks)\n\n    return responses_urls\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.execute","title":"execute","text":"
execute() -> Output\n

Execute the step.

Raises:

Type Description ValueError

If the specified HTTP method is not implemented in AsyncHttpStep.

Source code in src/koheesio/asyncio/http.py
def execute(self) -> AsyncHttpStep.Output:\n    \"\"\"\n    Execute the step.\n\n    Raises\n    ------\n    ValueError\n        If the specified HTTP method is not implemented in AsyncHttpStep.\n    \"\"\"\n    # By design asyncio does not allow its event loop to be nested. This presents a practical problem:\n    #   When in an environment where the event loop is already running\n    #   it\u2019s impossible to run tasks and wait for the result.\n    #   Trying to do so will give the error \u201cRuntimeError: This event loop is already running\u201d.\n    #   The issue pops up in various environments, such as web servers, GUI applications and in\n    #   Jupyter/DataBricks notebooks.\n    nest_asyncio.apply()\n\n    map_method_func = {\n        HttpMethod.GET: self.get,\n        HttpMethod.POST: self.post,\n        HttpMethod.PUT: self.put,\n        HttpMethod.DELETE: self.delete,\n    }\n\n    if self.method not in map_method_func:\n        raise ValueError(f\"Method {self.method} not implemented in AsyncHttpStep.\")\n\n    self.output.responses_urls = asyncio.run(map_method_func[self.method]())\n\n    return self.output\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.get","title":"get async","text":"
get() -> List[Tuple[Dict[str, Any], URL]]\n

Make GET requests.

Returns:

Type Description List[Tuple[Dict[str, Any], URL]]

A list of response data and corresponding request URLs.

Source code in src/koheesio/asyncio/http.py
async def get(self) -> List[Tuple[Dict[str, Any], yarl.URL]]:\n    \"\"\"\n    Make GET requests.\n\n    Returns\n    -------\n    List[Tuple[Dict[str, Any], yarl.URL]]\n        A list of response data and corresponding request URLs.\n    \"\"\"\n    tasks = self.__tasks_generator(method=HttpMethod.GET)\n    responses_urls = await self._execute(tasks=tasks)\n\n    return responses_urls\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.get_headers","title":"get_headers","text":"
get_headers()\n

Get the request headers.

Returns:

Type Description Optional[Dict[str, Union[str, SecretStr]]]

The request headers.

Source code in src/koheesio/asyncio/http.py
def get_headers(self):\n    \"\"\"\n    Get the request headers.\n\n    Returns\n    -------\n    Optional[Dict[str, Union[str, SecretStr]]]\n        The request headers.\n    \"\"\"\n    _headers = None\n\n    if self.headers:\n        _headers = {k: v.get_secret_value() if isinstance(v, SecretStr) else v for k, v in self.headers.items()}\n\n        for k, v in self.headers.items():\n            if isinstance(v, SecretStr):\n                self.headers[k] = v.get_secret_value()\n\n    return _headers or self.headers\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.get_options","title":"get_options","text":"
get_options()\n

Get the options of the step.

Source code in src/koheesio/asyncio/http.py
def get_options(self):\n    \"\"\"\n    Get the options of the step.\n    \"\"\"\n    warnings.warn(\"get_options is not implemented in AsyncHttpStep.\")\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.post","title":"post async","text":"
post() -> List[Tuple[Dict[str, Any], URL]]\n

Make POST requests.

Returns:

Type Description List[Tuple[Dict[str, Any], URL]]

A list of response data and corresponding request URLs.

Source code in src/koheesio/asyncio/http.py
async def post(self) -> List[Tuple[Dict[str, Any], yarl.URL]]:\n    \"\"\"\n    Make POST requests.\n\n    Returns\n    -------\n    List[Tuple[Dict[str, Any], yarl.URL]]\n        A list of response data and corresponding request URLs.\n    \"\"\"\n    tasks = self.__tasks_generator(method=HttpMethod.POST)\n    responses_urls = await self._execute(tasks=tasks)\n\n    return responses_urls\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.put","title":"put async","text":"
put() -> List[Tuple[Dict[str, Any], URL]]\n

Make PUT requests.

Returns:

Type Description List[Tuple[Dict[str, Any], URL]]

A list of response data and corresponding request URLs.

Source code in src/koheesio/asyncio/http.py
async def put(self) -> List[Tuple[Dict[str, Any], yarl.URL]]:\n    \"\"\"\n    Make PUT requests.\n\n    Returns\n    -------\n    List[Tuple[Dict[str, Any], yarl.URL]]\n        A list of response data and corresponding request URLs.\n    \"\"\"\n    tasks = self.__tasks_generator(method=HttpMethod.PUT)\n    responses_urls = await self._execute(tasks=tasks)\n\n    return responses_urls\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.request","title":"request async","text":"
request(method: HttpMethod, url: URL, **kwargs) -> Tuple[Dict[str, Any], URL]\n

Make an HTTP request.

Parameters:

Name Type Description Default method HttpMethod

The HTTP method to use for the request.

required url URL

The URL to make the request to.

required kwargs Any

Additional keyword arguments to pass to the request.

{}

Returns:

Type Description Tuple[Dict[str, Any], URL]

A tuple containing the response data and the request URL.

Source code in src/koheesio/asyncio/http.py
async def request(\n    self,\n    method: HttpMethod,\n    url: yarl.URL,\n    **kwargs,\n) -> Tuple[Dict[str, Any], yarl.URL]:\n    \"\"\"\n    Make an HTTP request.\n\n    Parameters\n    ----------\n    method : HttpMethod\n        The HTTP method to use for the request.\n    url : yarl.URL\n        The URL to make the request to.\n    kwargs : Any\n        Additional keyword arguments to pass to the request.\n\n    Returns\n    -------\n    Tuple[Dict[str, Any], yarl.URL]\n        A tuple containing the response data and the request URL.\n    \"\"\"\n    async with self.__retry_client.request(method=method, url=url, **kwargs) as response:\n        res = await response.json()\n\n    return (res, response.request_info.url)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.set_outputs","title":"set_outputs","text":"
set_outputs(response)\n

Set the outputs of the step.

Parameters:

Name Type Description Default response Any

The response data.

required Source code in src/koheesio/asyncio/http.py
def set_outputs(self, response):\n    \"\"\"\n    Set the outputs of the step.\n\n    Parameters\n    ----------\n    response : Any\n        The response data.\n    \"\"\"\n    warnings.warn(\"set outputs is not implemented in AsyncHttpStep.\")\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.validate_timeout","title":"validate_timeout","text":"
validate_timeout(timeout)\n

Validate the 'timeout' field.

Parameters:

Name Type Description Default timeout Any

The value of the 'timeout' field.

required

Raises:

Type Description ValueError

If 'timeout' is set; timeouts are not allowed in AsyncHttpStep and must be provided through retry_options.

Source code in src/koheesio/asyncio/http.py
@field_validator(\"timeout\")\ndef validate_timeout(cls, timeout):\n    \"\"\"\n    Validate the 'data' field.\n\n    Parameters\n    ----------\n    data : Any\n        The value of the 'timeout' field.\n\n    Raises\n    ------\n    ValueError\n        If 'data' is not allowed in AsyncHttpStep.\n    \"\"\"\n    if timeout:\n        raise ValueError(\"timeout is not allowed in AsyncHttpStep. Provide timeout through retry_options.\")\n
"},{"location":"api_reference/integrations/index.html","title":"Integrations","text":"

Nothing to see here, move along.

"},{"location":"api_reference/integrations/box.html","title":"Box","text":"

Box Module

This module facilitates various interactions with the Box service. The implementation is based on the functionality available in the Box Python SDK: https://github.com/box/box-python-sdk

Prerequisites
  • Box Application is created in the developer portal using the JWT auth method (Developer Portal - My Apps - Create)
  • Application is authorized for the enterprise (Developer Portal - MyApp - Authorization)
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box","title":"koheesio.integrations.box.Box","text":"
Box(**data)\n

Configuration details required for the authentication can be obtained in the Box Developer Portal by generating the Public / Private key pair in \"Application Name -> Configuration -> Add and Manage Public Keys\".

The downloaded JSON file will look like this:

{\n  \"boxAppSettings\": {\n    \"clientID\": \"client_id\",\n    \"clientSecret\": \"client_secret\",\n    \"appAuth\": {\n      \"publicKeyID\": \"public_key_id\",\n      \"privateKey\": \"private_key\",\n      \"passphrase\": \"pass_phrase\"\n    }\n  },\n  \"enterpriseID\": \"123456\"\n}\n
This class is used as a base for the rest of the Box integrations; however, it can also be used on its own to obtain the Box client, which is created at class initialization.

Examples:

b = Box(\n    client_id=\"client_id\",\n    client_secret=\"client_secret\",\n    enterprise_id=\"enterprise_id\",\n    jwt_key_id=\"jwt_key_id\",\n    rsa_private_key_data=\"rsa_private_key_data\",\n    rsa_private_key_passphrase=\"rsa_private_key_passphrase\",\n)\nb.client\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n    super().__init__(**data)\n    self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.auth_options","title":"auth_options property","text":"
auth_options\n

Get a dictionary of authentication options, that can be handily used in the child classes

"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.client","title":"client class-attribute instance-attribute","text":"
client: SkipValidation[Client] = None\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.client_id","title":"client_id class-attribute instance-attribute","text":"
client_id: Union[SecretStr, SecretBytes] = Field(default=..., alias='clientID', description='Client ID from the Box Developer console.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.client_secret","title":"client_secret class-attribute instance-attribute","text":"
client_secret: Union[SecretStr, SecretBytes] = Field(default=..., alias='clientSecret', description='Client Secret from the Box Developer console.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.enterprise_id","title":"enterprise_id class-attribute instance-attribute","text":"
enterprise_id: Union[SecretStr, SecretBytes] = Field(default=..., alias='enterpriseID', description='Enterprise ID from the Box Developer console.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.jwt_key_id","title":"jwt_key_id class-attribute instance-attribute","text":"
jwt_key_id: Union[SecretStr, SecretBytes] = Field(default=..., alias='publicKeyID', description='PublicKeyID for the public/private generated key pair.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.rsa_private_key_data","title":"rsa_private_key_data class-attribute instance-attribute","text":"
rsa_private_key_data: Union[SecretStr, SecretBytes] = Field(default=..., alias='privateKey', description='Private key generated in the app management console.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.rsa_private_key_passphrase","title":"rsa_private_key_passphrase class-attribute instance-attribute","text":"
rsa_private_key_passphrase: Union[SecretStr, SecretBytes] = Field(default=..., alias='passphrase', description='Private key passphrase generated in the app management console.')\n
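Because the credential fields above declare aliases matching the keys in the downloaded JSON settings file, the file can be loaded and mapped onto the constructor directly; a sketch (the settings file path is hypothetical):

import json\n\nwith open(\"box_app_settings.json\") as f:\n    conf = json.load(f)\n\napp = conf[\"boxAppSettings\"]\nb = Box(\n    clientID=app[\"clientID\"],\n    clientSecret=app[\"clientSecret\"],\n    enterpriseID=conf[\"enterpriseID\"],\n    publicKeyID=app[\"appAuth\"][\"publicKeyID\"],\n    privateKey=app[\"appAuth\"][\"privateKey\"],\n    passphrase=app[\"appAuth\"][\"passphrase\"],\n)\nb.client  # the Box client is created at class initialization\n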
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/integrations/box.py
def execute(self):\n    # Plug to be able to unit test ABC\n    pass\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.init_client","title":"init_client","text":"
init_client()\n

Set up the Box client.

Source code in src/koheesio/integrations/box.py
def init_client(self):\n    \"\"\"Set up the Box client.\"\"\"\n    if not self.client:\n        self.client = Client(JWTAuth(**self.auth_options))\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvFileReader","title":"koheesio.integrations.box.BoxCsvFileReader","text":"
BoxCsvFileReader(**data)\n

This class facilitates reading one or multiple CSV files with the same structure directly from Box and producing a Spark DataFrame.

Notes

To manually identify the ID of a file in Box, open the file through the web UI and copy the ID from the page URL, e.g. https://foo.ent.box.com/file/1234567890 , where 1234567890 is the ID.

Examples:

from koheesio.steps.integrations.box import BoxCsvFileReader\nfrom pyspark.sql.types import StructType\n\nschema = StructType(...)\nb = BoxCsvFileReader(\n    client_id=\"\",\n    client_secret=\"\",\n    enterprise_id=\"\",\n    jwt_key_id=\"\",\n    rsa_private_key_data=\"\",\n    rsa_private_key_passphrase=\"\",\n    file=[\"1\", \"2\"],\n    schema=schema,\n).execute()\nb.df.show()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n    super().__init__(**data)\n    self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvFileReader.file","title":"file class-attribute instance-attribute","text":"
file: Union[str, list[str]] = Field(default=..., description='ID or list of IDs for the files to read.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvFileReader.execute","title":"execute","text":"
execute()\n

Loop through the list of provided file identifiers and load data into dataframe. For traceability purposes the following columns will be added to the dataframe: * meta_file_id: the identifier of the file on Box * meta_file_name: name of the file

Returns:

Type Description DataFrame Source code in src/koheesio/integrations/box.py
def execute(self):\n    \"\"\"\n    Loop through the list of provided file identifiers and load data into dataframe.\n    For traceability purposes the following columns will be added to the dataframe:\n        * meta_file_id: the identifier of the file on Box\n        * meta_file_name: name of the file\n\n    Returns\n    -------\n    DataFrame\n    \"\"\"\n    df = None\n    for f in self.file:\n        self.log.debug(f\"Reading contents of file with the ID '{f}' into Spark DataFrame\")\n        file = self.client.file(file_id=f)\n        data = file.content().decode(\"utf-8\").splitlines()\n        rdd = self.spark.sparkContext.parallelize(data)\n        temp_df = self.spark.read.csv(rdd, header=True, schema=self.schema_, **self.params)\n        temp_df = (\n            temp_df\n            # fmt: off\n            .withColumn(\"meta_file_id\", lit(file.object_id))\n            .withColumn(\"meta_file_name\", lit(file.get().name))\n            .withColumn(\"meta_load_timestamp\", expr(\"to_utc_timestamp(current_timestamp(), current_timezone())\"))\n            # fmt: on\n        )\n\n        df = temp_df if not df else df.union(temp_df)\n\n    self.output.df = df\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvPathReader","title":"koheesio.integrations.box.BoxCsvPathReader","text":"
BoxCsvPathReader(**data)\n

Read all CSV files from the specified path into the dataframe. Files can be filtered using the regular expression in the 'filter' parameter. The default behavior is to read all CSV / TXT files from the specified path.

Notes

The class does not contain archival capability as it is presumed that the user wants to make sure that the full pipeline is successful (for example, the source data was transformed and saved) prior to moving the source files. Use BoxToBoxFileMove class instead and provide the list of IDs from 'file_id' output.

Examples:

from koheesio.steps.integrations.box import BoxCsvPathReader\n\nauth_params = {...}\nb = BoxCsvPathReader(**auth_params, path=\"foo/bar/\").execute()\nb.df  # Spark Dataframe\n...  # do something with the dataframe\nfrom koheesio.steps.integrations.box import BoxToBoxFileMove\n\nbm = BoxToBoxFileMove(**auth_params, file=b.file_id, path=\"/foo/bar/archive\")\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n    super().__init__(**data)\n    self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvPathReader.filter","title":"filter class-attribute instance-attribute","text":"
filter: Optional[str] = Field(default='.csv|.txt$', description='[Optional] Regexp to filter folder contents')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvPathReader.path","title":"path class-attribute instance-attribute","text":"
path: str = Field(default=..., description='Box path')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvPathReader.execute","title":"execute","text":"
execute()\n

Identify the list of files from the source Box path that match the desired filter and load them into a DataFrame

Source code in src/koheesio/integrations/box.py
def execute(self):\n    \"\"\"\n    Identify the list of files from the source Box path that match desired filter and load them into Dataframe\n    \"\"\"\n    folder = BoxFolderGet.from_step(self).execute().folder\n\n    # Identify the list of files that should be processed\n    files = [item for item in folder.get_items() if item.type == \"file\" and re.search(self.filter, item.name)]\n\n    if len(files) > 0:\n        self.log.info(\n            f\"A total of {len(files)} files, that match the mask '{self.mask}' has been detected in {self.path}.\"\n            f\" They will be loaded into Spark Dataframe: {files}\"\n        )\n    else:\n        raise BoxPathIsEmptyError(f\"Path '{self.path}' is empty or none of files match the mask '{self.filter}'\")\n\n    file = [file_id.object_id for file_id in files]\n    self.output.df = BoxCsvFileReader.from_step(self, file=file).read()\n    self.output.file = file  # e.g. if files should be archived after pipeline is successful\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase","title":"koheesio.integrations.box.BoxFileBase","text":"
BoxFileBase(**data)\n

Generic class to facilitate interactions with Box folders.

The Box SDK provides a File class with various properties and methods to interact with Box files. The object can be obtained in multiple ways: * provide a Box file identifier to the file parameter (the identifier can be obtained, for example, from the URL) * provide an existing object to the file parameter (boxsdk.object.file.File)

Notes

Refer to BoxFolderBase for more info about the folder and path parameters

See Also

boxsdk.object.file.File

Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n    super().__init__(**data)\n    self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase.files","title":"files class-attribute instance-attribute","text":"
files: conlist(Union[File, str], min_length=1) = Field(default=..., alias='file', description='List of Box file objects or identifiers')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase.folder","title":"folder class-attribute instance-attribute","text":"
folder: Optional[Union[Folder, str]] = Field(default=None, description='Existing folder object or folder identifier')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase.path","title":"path class-attribute instance-attribute","text":"
path: Optional[str] = Field(default=None, description='Path to the Box folder, for example: `folder/sub-folder/lz`')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase.action","title":"action","text":"
action(file: File, folder: Folder)\n

Abstract class for File level actions.

Source code in src/koheesio/integrations/box.py
def action(self, file: File, folder: Folder):\n    \"\"\"\n    Abstract class for File level actions.\n    \"\"\"\n    raise NotImplementedError\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase.execute","title":"execute","text":"
execute()\n

Generic execute method for all BoxToBox interactions. Deals with getting the correct folder and file objects from various parameter inputs

Source code in src/koheesio/integrations/box.py
def execute(self):\n    \"\"\"\n    Generic execute method for all BoxToBox interactions. Deals with getting the correct folder and file objects\n    from various parameter inputs\n    \"\"\"\n    if self.path:\n        _folder = BoxFolderGet.from_step(self).execute().folder\n    else:\n        _folder = self.client.folder(folder_id=self.folder) if isinstance(self.folder, str) else self.folder\n\n    for _file in self.files:\n        _file = self.client.file(file_id=_file) if isinstance(_file, str) else _file\n        self.action(file=_file, folder=_folder)\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter","title":"koheesio.integrations.box.BoxFileWriter","text":"
BoxFileWriter(**data)\n

Write file or a file-like object to Box.

Examples:

from koheesio.steps.integrations.box import BoxFileWriter\n\nauth_params = {...}\nf1 = BoxFileWriter(**auth_params, path=\"/foo/bar\", file=\"path/to/my/file.ext\").execute()\n# or\nimport io\n\nb = io.BytesIO(b\"my-sample-data\")\nf2 = BoxFileWriter(**auth_params, path=\"/foo/bar\", file=b, name=\"file.ext\").execute()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n    super().__init__(**data)\n    self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.description","title":"description class-attribute instance-attribute","text":"
description: Optional[str] = Field(None, description='Optional description to add to the file in Box')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.file","title":"file class-attribute instance-attribute","text":"
file: Union[str, BytesIO] = Field(default=..., description='Path to file or a file-like object')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.file_name","title":"file_name class-attribute instance-attribute","text":"
file_name: Optional[str] = Field(default=None, description=\"When a file path or name is provided to the 'file' parameter, this will override the original name. When a binary stream is provided, the 'name' should be used to set the desired name for the Box file.\")\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.Output","title":"Output","text":"

Output class for BoxFileWriter.

"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.Output.file","title":"file class-attribute instance-attribute","text":"
file: File = Field(default=..., description='File object in Box')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.Output.shared_link","title":"shared_link class-attribute instance-attribute","text":"
shared_link: str = Field(default=..., description='Shared link for the Box file')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.action","title":"action","text":"
action()\n
Source code in src/koheesio/integrations/box.py
def action(self):\n    _file = self.file\n    _name = self.file_name\n\n    if isinstance(_file, str):\n        _name = _name if _name else PurePath(_file).name\n        with open(_file, \"rb\") as f:\n            _file = BytesIO(f.read())\n\n    folder: Folder = BoxFolderGet.from_step(self, create_sub_folders=True).execute().folder\n    folder.preflight_check(size=0, name=_name)\n\n    self.log.info(f\"Uploading file '{_name}' to Box folder '{folder.get().name}'...\")\n    _box_file: File = folder.upload_stream(file_stream=_file, file_name=_name, file_description=self.description)\n\n    self.output.file = _box_file\n    self.output.shared_link = _box_file.get_shared_link()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.execute","title":"execute","text":"
execute() -> Output\n
Source code in src/koheesio/integrations/box.py
def execute(self) -> Output:\n    self.action()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.validate_name_for_binary_data","title":"validate_name_for_binary_data","text":"
validate_name_for_binary_data(values)\n

Validate 'file_name' parameter when providing a binary input for 'file'.

Source code in src/koheesio/integrations/box.py
@model_validator(mode=\"before\")\ndef validate_name_for_binary_data(cls, values):\n    \"\"\"Validate 'file_name' parameter when providing a binary input for 'file'.\"\"\"\n    file, file_name = values.get(\"file\"), values.get(\"file_name\")\n    if not isinstance(file, str) and not file_name:\n        raise AttributeError(\"The parameter 'file_name' is mandatory when providing a binary input for 'file'.\")\n\n    return values\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase","title":"koheesio.integrations.box.BoxFolderBase","text":"
BoxFolderBase(**data)\n

Generic class to facilitate interactions with Box folders.

The Box SDK provides a Folder class with various properties and methods to interact with Box folders. The object can be obtained in multiple ways: * provide a Box folder identifier to the folder parameter (the identifier can be obtained, for example, from the URL) * provide an existing object to the folder parameter (boxsdk.object.folder.Folder) * provide a filesystem-like path to the path parameter

See Also

boxsdk.object.folder.Folder

Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n    super().__init__(**data)\n    self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.folder","title":"folder class-attribute instance-attribute","text":"
folder: Optional[Union[Folder, str]] = Field(default=None, description='Existing folder object or folder identifier')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.path","title":"path class-attribute instance-attribute","text":"
path: Optional[str] = Field(default=None, description='Path to the Box folder, for example: `folder/sub-folder/lz`')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.root","title":"root class-attribute instance-attribute","text":"
root: Optional[Union[Folder, str]] = Field(default='0', description='Folder object or identifier of the folder that should be used as root')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.Output","title":"Output","text":"

Define outputs for the BoxFolderBase class

"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.Output.folder","title":"folder class-attribute instance-attribute","text":"
folder: Optional[Folder] = Field(default=None, description='Box folder object')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.action","title":"action","text":"
action()\n

Placeholder for 'action' method, that should be implemented in the child classes

Returns:

Type Description Folder or None Source code in src/koheesio/integrations/box.py
def action(self):\n    \"\"\"\n    Placeholder for 'action' method, that should be implemented in the child classes\n\n    Returns\n    -------\n        Folder or None\n    \"\"\"\n    raise NotImplementedError\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.execute","title":"execute","text":"
execute() -> Output\n
Source code in src/koheesio/integrations/box.py
def execute(self) -> Output:\n    self.output.folder = self.action()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.validate_folder_or_path","title":"validate_folder_or_path","text":"
validate_folder_or_path()\n

Validations for 'folder' and 'path' parameter usage

Source code in src/koheesio/integrations/box.py
@model_validator(mode=\"after\")\ndef validate_folder_or_path(self):\n    \"\"\"\n    Validations for 'folder' and 'path' parameter usage\n    \"\"\"\n    folder_value = self.folder\n    path_value = self.path\n\n    if folder_value and path_value:\n        raise AttributeError(\"Cannot user 'folder' and 'path' parameter at the same time\")\n\n    if not folder_value and not path_value:\n        raise AttributeError(\"Neither 'folder' nor 'path' parameters are set\")\n\n    return self\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderCreate","title":"koheesio.integrations.box.BoxFolderCreate","text":"
BoxFolderCreate(**data)\n

Explicitly create the new Box folder object and parent directories.

Examples:

from koheesio.steps.integrations.box import BoxFolderCreate\n\nauth_params = {...}\nfolder = BoxFolderCreate(**auth_params, path=\"/foo/bar\").execute()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n    super().__init__(**data)\n    self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderCreate.create_sub_folders","title":"create_sub_folders class-attribute instance-attribute","text":"
create_sub_folders: bool = Field(default=True, description='Create sub-folders recursively if the path does not exist.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderCreate.validate_folder","title":"validate_folder","text":"
validate_folder(folder)\n

Validate 'folder' parameter

Source code in src/koheesio/integrations/box.py
@field_validator(\"folder\")\ndef validate_folder(cls, folder):\n    \"\"\"\n    Validate 'folder' parameter\n    \"\"\"\n    if folder:\n        raise AttributeError(\"Only 'path' parameter is allowed in the context of folder creation.\")\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderDelete","title":"koheesio.integrations.box.BoxFolderDelete","text":"
BoxFolderDelete(**data)\n

Delete existing Box folder based on object, identifier or path.

Examples:

from koheesio.steps.integrations.box import BoxFolderDelete\n\nauth_params = {...}\nBoxFolderDelete(**auth_params, path=\"/foo/bar\").execute()\n# or\nBoxFolderDelete(**auth_params, folder=\"1\").execute()\n# or\nfolder = BoxFolderGet(**auth_params, path=\"/foo/bar\").execute().folder\nBoxFolderDelete(**auth_params, folder=folder).execute()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n    super().__init__(**data)\n    self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderDelete.action","title":"action","text":"
action()\n

Delete folder action

Returns:

Type Description None Source code in src/koheesio/integrations/box.py
def action(self):\n    \"\"\"\n    Delete folder action\n\n    Returns\n    -------\n        None\n    \"\"\"\n    if self.folder:\n        folder = self._obj_from_id\n    else:  # path\n        folder = BoxFolderGet.from_step(self).action()\n\n    self.log.info(f\"Deleting Box folder '{folder}'...\")\n    folder.delete()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderGet","title":"koheesio.integrations.box.BoxFolderGet","text":"
BoxFolderGet(**data)\n

Get the Box folder object for an existing folder or create a new folder and parent directories.

Examples:

from koheesio.steps.integrations.box import BoxFolderGet\n\nauth_params = {...}\nfolder = BoxFolderGet(**auth_params, path=\"/foo/bar\").execute().folder\n# or\nfolder = BoxFolderGet(**auth_params, path=\"1\").execute().folder\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n    super().__init__(**data)\n    self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderGet.create_sub_folders","title":"create_sub_folders class-attribute instance-attribute","text":"
create_sub_folders: Optional[bool] = Field(False, description='Create sub-folders recursively if the path does not exist.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderGet.action","title":"action","text":"
action()\n

Get folder action

Returns:

Name Type Description folder Folder

Box Folder object as specified in Box SDK

Source code in src/koheesio/integrations/box.py
def action(self):\n    \"\"\"\n    Get folder action\n\n    Returns\n    -------\n    folder: Folder\n        Box Folder object as specified in Box SDK\n    \"\"\"\n    current_folder_object = None\n\n    if self.folder:\n        current_folder_object = self._obj_from_id\n\n    if self.path:\n        cleaned_path_parts = [p for p in PurePath(self.path).parts if p.strip() not in [None, \"\", \" \", \"/\"]]\n        current_folder_object = self.client.folder(folder_id=self.root) if isinstance(self.root, str) else self.root\n\n        for next_folder_name in cleaned_path_parts:\n            current_folder_object = self._get_or_create_folder(current_folder_object, next_folder_name)\n\n    self.log.info(f\"Folder identified or created: {current_folder_object}\")\n    return current_folder_object\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderNotFoundError","title":"koheesio.integrations.box.BoxFolderNotFoundError","text":"

Error when a provided Box path does not exist.

"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxPathIsEmptyError","title":"koheesio.integrations.box.BoxPathIsEmptyError","text":"

Exception when provided Box path is empty or no files matched the mask.

"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase","title":"koheesio.integrations.box.BoxReaderBase","text":"
BoxReaderBase(**data)\n

Base class for Box readers.

Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n    super().__init__(**data)\n    self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase.params","title":"params class-attribute instance-attribute","text":"
params: Optional[Dict[str, Any]] = Field(default_factory=dict, description='[Optional] Set of extra parameters that should be passed to the Spark reader.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase.schema_","title":"schema_ class-attribute instance-attribute","text":"
schema_: Optional[StructType] = Field(None, alias='schema', description='[Optional] Schema that will be applied during the creation of Spark DataFrame')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase.Output","title":"Output","text":"

Make default reader output optional to gracefully handle 'no-files / folder' cases.

"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase.Output.df","title":"df class-attribute instance-attribute","text":"
df: Optional[DataFrame] = Field(default=None, description='The Spark DataFrame')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase.execute","title":"execute abstractmethod","text":"
execute() -> Output\n
Source code in src/koheesio/integrations/box.py
@abstractmethod\ndef execute(self) -> Output:\n    raise NotImplementedError\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxToBoxFileCopy","title":"koheesio.integrations.box.BoxToBoxFileCopy","text":"
BoxToBoxFileCopy(**data)\n

Copy one or multiple files to the target Box path.

Examples:

from koheesio.steps.integrations.box import BoxToBoxFileCopy\n\nauth_params = {...}\nBoxToBoxFileCopy(**auth_params, file=[\"1\", \"2\"], path=\"/foo/bar\").execute()\n# or\nBoxToBoxFileCopy(**auth_params, file=[\"1\", \"2\"], folder=\"1\").execute()\n# or\nfolder = BoxFolderGet(**auth_params, path=\"/foo/bar\").execute().folder\nBoxToBoxFileCopy(**auth_params, file=[File(), File()], folder=folder).execute()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n    super().__init__(**data)\n    self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxToBoxFileCopy.action","title":"action","text":"
action(file: File, folder: Folder)\n

Copy file to the desired destination and extend file description with the processing info

Parameters:

Name Type Description Default file File

File object as specified in Box SDK

required folder Folder

Folder object as specified in Box SDK

required Source code in src/koheesio/integrations/box.py
def action(self, file: File, folder: Folder):\n    \"\"\"\n    Copy file to the desired destination and extend file description with the processing info\n\n    Parameters\n    ----------\n    file: File\n        File object as specified in Box SDK\n    folder: Folder\n        Folder object as specified in Box SDK\n    \"\"\"\n    self.log.info(f\"Copying '{file.get()}' to '{folder.get()}'...\")\n    file.copy(parent_folder=folder).update_info(\n        data={\"description\": \"\\n\".join([f\"File processed on {datetime.utcnow()}\", file.get()[\"description\"]])}\n    )\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxToBoxFileMove","title":"koheesio.integrations.box.BoxToBoxFileMove","text":"
BoxToBoxFileMove(**data)\n

Move one or multiple files to the target Box path

Examples:

from koheesio.steps.integrations.box import BoxToBoxFileMove\n\nauth_params = {...}\nBoxToBoxFileMove(**auth_params, file=[\"1\", \"2\"], path=\"/foo/bar\").execute()\n# or\nBoxToBoxFileMove(**auth_params, file=[\"1\", \"2\"], folder=\"1\").execute()\n# or\nfolder = BoxFolderGet(**auth_params, path=\"/foo/bar\").execute().folder\nBoxToBoxFileMove(**auth_params, file=[File(), File()], folder=folder).execute()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n    super().__init__(**data)\n    self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxToBoxFileMove.action","title":"action","text":"
action(file: File, folder: Folder)\n

Move file to the desired destination and extend file description with the processing info

Parameters:

Name Type Description Default file File

File object as specified in Box SDK

required folder Folder

Folder object as specified in Box SDK

required Source code in src/koheesio/integrations/box.py
def action(self, file: File, folder: Folder):\n    \"\"\"\n    Move file to the desired destination and extend file description with the processing info\n\n    Parameters\n    ----------\n    file: File\n        File object as specified in Box SDK\n    folder: Folder\n        Folder object as specified in Box SDK\n    \"\"\"\n    self.log.info(f\"Moving '{file.get()}' to '{folder.get()}'...\")\n    file.move(parent_folder=folder).update_info(\n        data={\"description\": \"\\n\".join([f\"File processed on {datetime.utcnow()}\", file.get()[\"description\"]])}\n    )\n
"},{"location":"api_reference/integrations/spark/index.html","title":"Spark","text":""},{"location":"api_reference/integrations/spark/sftp.html","title":"Sftp","text":"

This module contains the SFTPWriter class and the SFTPWriteMode enum.

The SFTPWriter class is used to write data to a file on an SFTP server. It uses the Paramiko library to establish an SFTP connection and write data to the server. The data to be written is provided by a BufferWriter, which generates the data in a buffer. See the docstring of the SFTPWriter class for more details. Refer to koheesio.spark.writers.buffer for more details on the BufferWriter interface.

The SFTPWriteMode enum defines the different write modes that the SFTPWriter can use. These modes determine how the SFTPWriter behaves when the file it is trying to write to already exists on the server. For more details on each mode, see the docstring of the SFTPWriteMode enum.
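A minimal composition sketch of the two pieces described above, assuming PandasCsvBufferWriter is available from koheesio.spark.writers.buffer and that its CSV options all have sensible defaults:

from koheesio.integrations.spark.sftp import SFTPWriter\nfrom koheesio.spark.writers.buffer import PandasCsvBufferWriter\n\nwriter = SFTPWriter(\n    host=\"sftp.example.com\",\n    port=22,\n    username=\"user\",\n    password=\"password\",\n    path=\"/upload/folder\",\n    file_name=\"data.csv\",\n    buffer_writer=PandasCsvBufferWriter(),  # CSV options (sep, header, ...) can be passed here\n)\n\nwriter.write(df)  # df is an existing Spark DataFrame\n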

"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode","title":"koheesio.integrations.spark.sftp.SFTPWriteMode","text":"

The different write modes for the SFTPWriter.

"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--overwrite","title":"OVERWRITE:","text":"
  • If the file exists, it will be overwritten.
  • If it does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--append","title":"APPEND:","text":"
  • If the file exists, the new data will be appended to it.
  • If it does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--ignore","title":"IGNORE:","text":"
  • If the file exists, the method will return without writing anything.
  • If it does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--exclusive","title":"EXCLUSIVE:","text":"
  • If the file exists, an error will be raised.
  • If it does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--backup","title":"BACKUP:","text":"
  • If the file exists and the new data is different from the existing data, a backup will be created and the file will be overwritten.
  • If it does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--update","title":"UPDATE:","text":"
  • If the file exists and the new data is different from the existing data, the file will be overwritten.
  • If the file exists and the new data is the same as the existing data, the method will return without writing anything.
  • If the file does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.APPEND","title":"APPEND class-attribute instance-attribute","text":"
APPEND = 'append'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.BACKUP","title":"BACKUP class-attribute instance-attribute","text":"
BACKUP = 'backup'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.EXCLUSIVE","title":"EXCLUSIVE class-attribute instance-attribute","text":"
EXCLUSIVE = 'exclusive'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.IGNORE","title":"IGNORE class-attribute instance-attribute","text":"
IGNORE = 'ignore'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.OVERWRITE","title":"OVERWRITE class-attribute instance-attribute","text":"
OVERWRITE = 'overwrite'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.UPDATE","title":"UPDATE class-attribute instance-attribute","text":"
UPDATE = 'update'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.write_mode","title":"write_mode property","text":"
write_mode\n

Return the write mode for the given SFTPWriteMode.

"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.from_string","title":"from_string classmethod","text":"
from_string(mode: str)\n

Return the SFTPWriteMode for the given string.

Source code in src/koheesio/integrations/spark/sftp.py
@classmethod\ndef from_string(cls, mode: str):\n    \"\"\"Return the SFTPWriteMode for the given string.\"\"\"\n    return cls[mode.upper()]\n
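Since from_string simply upper-cases the input and indexes the enum, string modes map directly onto the members; for example:

mode = SFTPWriteMode.from_string(\"backup\")\nassert mode is SFTPWriteMode.BACKUP\n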
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter","title":"koheesio.integrations.spark.sftp.SFTPWriter","text":"

Write a Dataframe to SFTP through a BufferWriter

Concept
  • This class uses Paramiko to connect to an SFTP server and write the contents of a buffer to a file on the server.
  • This implementation takes inspiration from https://github.com/springml/spark-sftp

Parameters:

Name Type Description Default path Union[str, Path]

Path to the folder to write to

required file_name Optional[str]

Name of the file. If not provided, the file name is expected to be part of the path. Make sure to add the desired file extension.

None host str

SFTP Host

required port int

SFTP Port

required username SecretStr

SFTP Server Username

None password SecretStr

SFTP Server Password

None buffer_writer BufferWriter

This is the writer that will generate the body of the file that will be written to the specified file through SFTP. Details on how the DataFrame is written to the buffer should be implemented in the implementation of the BufferWriter class. Any BufferWriter can be used here, as long as it implements the BufferWriter interface.

required mode

Write mode: overwrite, append, ignore, exclusive, backup, or update. See the docstring of SFTPWriteMode for more details.

required"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.buffer_writer","title":"buffer_writer class-attribute instance-attribute","text":"
buffer_writer: InstanceOf[BufferWriter] = Field(default=..., description='This is the writer that will generate the body of the file that will be written to the specified file through SFTP. Details on how the DataFrame is written to the buffer should be implemented in the implementation of the BufferWriter class. Any BufferWriter can be used here, as long as it implements the BufferWriter interface.')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.client","title":"client property","text":"
client: SFTPClient\n

Return the SFTP client. If it doesn't exist, create it.

"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.file_name","title":"file_name class-attribute instance-attribute","text":"
file_name: Optional[str] = Field(default=None, description='Name of the file. If not provided, the file name is expected to be part of the path. Make sure to add the desired file extension!', alias='filename')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.host","title":"host class-attribute instance-attribute","text":"
host: str = Field(default=..., description='SFTP Host')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.mode","title":"mode class-attribute instance-attribute","text":"
mode: SFTPWriteMode = Field(default=OVERWRITE, description='Write mode: overwrite, append, ignore, exclusive, backup, or update.' + __doc__)\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.password","title":"password class-attribute instance-attribute","text":"
password: Optional[SecretStr] = Field(default=None, description='SFTP Server Password')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.path","title":"path class-attribute instance-attribute","text":"
path: Union[str, Path] = Field(default=..., description='Path to the folder to write to', alias='prefix')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.port","title":"port class-attribute instance-attribute","text":"
port: int = Field(default=..., description='SFTP Port')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.transport","title":"transport property","text":"
transport\n

Return the transport for the SFTP connection. If it doesn't exist, create it.

If the username and password are provided, use them to connect to the SFTP server.

"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.username","title":"username class-attribute instance-attribute","text":"
username: Optional[SecretStr] = Field(default=None, description='SFTP Server Username')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.write_mode","title":"write_mode property","text":"
write_mode\n

Return the write mode for the given SFTPWriteMode.

"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.check_file_exists","title":"check_file_exists","text":"
check_file_exists(file_path: str) -> bool\n

Check if a file exists on the SFTP server.

Source code in src/koheesio/integrations/spark/sftp.py
def check_file_exists(self, file_path: str) -> bool:\n    \"\"\"\n    Check if a file exists on the SFTP server.\n    \"\"\"\n    try:\n        self.client.stat(file_path)\n        return True\n    except IOError:\n        return False\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/integrations/spark/sftp.py
def execute(self):\n    buffer_output: InstanceOf[BufferWriter.Output] = self.buffer_writer.write(self.df)\n\n    # write buffer to the SFTP server\n    try:\n        self._handle_write_mode(self.path.as_posix(), buffer_output)\n    finally:\n        self._close_client()\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.validate_path_and_file_name","title":"validate_path_and_file_name","text":"
validate_path_and_file_name(data: dict) -> dict\n

Validate the path, make sure path and file_name are Path objects.

Source code in src/koheesio/integrations/spark/sftp.py
@model_validator(mode=\"before\")\ndef validate_path_and_file_name(cls, data: dict) -> dict:\n    \"\"\"Validate the path, make sure path and file_name are Path objects.\"\"\"\n    path_or_str = data.get(\"path\")\n\n    if isinstance(path_or_str, str):\n        # make sure the path is a Path object\n        path_or_str = Path(path_or_str)\n\n    if not isinstance(path_or_str, Path):\n        raise ValueError(f\"Invalid path: {path_or_str}\")\n\n    if file_name := data.get(\"file_name\", data.get(\"filename\")):\n        path_or_str = path_or_str / file_name\n        try:\n            del data[\"filename\"]\n        except KeyError:\n            pass\n        data[\"file_name\"] = file_name\n\n    data[\"path\"] = path_or_str\n    return data\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.validate_sftp_host","title":"validate_sftp_host","text":"
validate_sftp_host(v) -> str\n

Validate the host

Source code in src/koheesio/integrations/spark/sftp.py
@field_validator(\"host\")\ndef validate_sftp_host(cls, v) -> str:\n    \"\"\"Validate the host\"\"\"\n    # remove the sftp:// prefix if present\n    if v.startswith(\"sftp://\"):\n        v = v.replace(\"sftp://\", \"\")\n\n    # remove the trailing slash if present\n    if v.endswith(\"/\"):\n        v = v[:-1]\n\n    return v\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.write_file","title":"write_file","text":"
write_file(file_path: str, buffer_output: InstanceOf[Output])\n

Using Paramiko, write the data in the buffer to SFTP.

Source code in src/koheesio/integrations/spark/sftp.py
def write_file(self, file_path: str, buffer_output: InstanceOf[BufferWriter.Output]):\n    \"\"\"\n    Using Paramiko, write the data in the buffer to SFTP.\n    \"\"\"\n    with self.client.open(file_path, self.write_mode) as file:\n        self.log.debug(f\"Writing file {file_path} to SFTP...\")\n        file.write(buffer_output.read())\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendCsvToSftp","title":"koheesio.integrations.spark.sftp.SendCsvToSftp","text":"

Write a DataFrame to an SFTP server as a CSV file.

This class uses the PandasCsvBufferWriter to generate the CSV data and the SFTPWriter to write the data to the SFTP server.

Example
from koheesio.spark.writers import SendCsvToSftp\n\nwriter = SendCsvToSftp(\n    # SFTP Parameters\n    host=\"sftp.example.com\",\n    port=22,\n    username=\"user\",\n    password=\"password\",\n    path=\"/path/to/folder\",\n    file_name=\"file.tsv.gz\",\n    # CSV Parameters\n    header=True,\n    sep=\"\\t\",\n    quote='\"',\n    timestampFormat=\"%Y-%m-%d\",\n    lineSep=os.linesep,\n    compression=\"gzip\",\n    index=False,\n)\n\nwriter.write(df)\n

In this example, the DataFrame df is written to the file file.tsv.gz in the folder /path/to/folder on the SFTP server. The file is written as a tab-delimited (TSV) file with double quotes as the quote character and gzip compression.

Parameters:

Name Type Description Default path Union[str, Path]

Path to the folder to write to.

required file_name Optional[str]

Name of the file. If not provided, it's expected to be part of the path.

required host str

SFTP Host.

required port int

SFTP Port.

required username SecretStr

SFTP Server Username.

required password SecretStr

SFTP Server Password.

required mode

Write mode: overwrite, append, ignore, exclusive, backup, or update.

required header

Whether to write column names as the first line. Default is True.

required sep

Field delimiter for the output file. Default is ','.

required quote

Character used to quote fields. Default is '\"'.

required quoteAll

Whether all values should be enclosed in quotes. Default is False.

required escape

Character used to escape sep and quote when needed. Default is '\\'.

required timestampFormat

Date format for datetime objects. Default is '%Y-%m-%dT%H:%M:%S.%f'.

required lineSep

Character used as line separator. Default is os.linesep.

required compression

Compression to use for the output data. Default is None.

required See Also

For more details on the CSV parameters, refer to the PandasCsvBufferWriter class documentation.

"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendCsvToSftp.buffer_writer","title":"buffer_writer class-attribute instance-attribute","text":"
buffer_writer: PandasCsvBufferWriter = Field(default=None, validate_default=False)\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendCsvToSftp.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/integrations/spark/sftp.py
def execute(self):\n    SFTPWriter.execute(self)\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendCsvToSftp.set_up_buffer_writer","title":"set_up_buffer_writer","text":"
set_up_buffer_writer() -> SendCsvToSftp\n

Set up the buffer writer, passing all CSV related options to it.

Source code in src/koheesio/integrations/spark/sftp.py
@model_validator(mode=\"after\")\ndef set_up_buffer_writer(self) -> \"SendCsvToSftp\":\n    \"\"\"Set up the buffer writer, passing all CSV related options to it.\"\"\"\n    self.buffer_writer = PandasCsvBufferWriter(**self.get_options(options_type=\"kohesio_pandas_buffer_writer\"))\n    return self\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendJsonToSftp","title":"koheesio.integrations.spark.sftp.SendJsonToSftp","text":"

Write a DataFrame to an SFTP server as a JSON file.

This class uses the PandasJsonBufferWriter to generate the JSON data and the SFTPWriter to write the data to the SFTP server.

Example
from koheesio.spark.writers import SendJsonToSftp\n\nwriter = SendJsonToSftp(\n    # SFTP Parameters (Inherited from SFTPWriter)\n    host=\"sftp.example.com\",\n    port=22,\n    username=\"user\",\n    password=\"password\",\n    path=\"/path/to/folder\",\n    file_name=\"file.json.gz\",\n    # JSON Parameters (Inherited from PandasJsonBufferWriter)\n    orient=\"records\",\n    date_format=\"iso\",\n    double_precision=2,\n    date_unit=\"ms\",\n    lines=False,\n    compression=\"gzip\",\n    index=False,\n)\n\nwriter.write(df)\n

In this example, the DataFrame df is written to the file file.json.gz in the folder /path/to/folder on the SFTP server. The file is written as a JSON file with gzip compression.

Parameters:

Name Type Description Default path Union[str, Path]

Path to the folder on the SFTP server.

required file_name Optional[str]

Name of the file, including extension. If not provided, expected to be part of the path.

required host str

SFTP Host.

required port int

SFTP Port.

required username SecretStr

SFTP Server Username.

required password SecretStr

SFTP Server Password.

required mode

Write mode: overwrite, append, ignore, exclusive, backup, or update.

required orient

Format of the JSON string. Default is 'records'.

required lines

If True, output is one JSON object per line. Only used when orient='records'. Default is True.

required date_format

Type of date conversion. Default is 'iso'.

required double_precision

Decimal places for encoding floating point values. Default is 10.

required force_ascii

If True, encoded string is ASCII. Default is True.

required compression

Compression to use for output data. Default is None.

required See Also

For more details on the JSON parameters, refer to the PandasJsonBufferWriter class documentation.

"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendJsonToSftp.buffer_writer","title":"buffer_writer class-attribute instance-attribute","text":"
buffer_writer: PandasJsonBufferWriter = Field(default=None, validate_default=False)\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendJsonToSftp.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/integrations/spark/sftp.py
def execute(self):\n    SFTPWriter.execute(self)\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendJsonToSftp.set_up_buffer_writer","title":"set_up_buffer_writer","text":"
set_up_buffer_writer() -> SendJsonToSftp\n

Set up the buffer writer, passing all JSON related options to it.

Source code in src/koheesio/integrations/spark/sftp.py
@model_validator(mode=\"after\")\ndef set_up_buffer_writer(self) -> \"SendJsonToSftp\":\n    \"\"\"Set up the buffer writer, passing all JSON related options to it.\"\"\"\n    self.buffer_writer = PandasJsonBufferWriter(\n        **self.get_options(), compression=self.compression, columns=self.columns\n    )\n    return self\n
"},{"location":"api_reference/integrations/spark/dq/index.html","title":"Dq","text":""},{"location":"api_reference/integrations/spark/dq/spark_expectations.html","title":"Spark expectations","text":"

Koheesio step for running data quality rules with Spark Expectations engine.

"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation","title":"koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation","text":"

Run DQ rules for an input dataframe with Spark Expectations engine.

References

Spark Expectations: https://engineering.nike.com/spark-expectations/1.0.0/
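
Example (a minimal sketch; the product_id, table names, and input DataFrame below are illustrative assumptions, not values from the Koheesio documentation):
from koheesio.integrations.spark.dq.spark_expectations import SparkExpectationsTransformation\n\nse_step = SparkExpectationsTransformation(\n    product_id=\"my_product\",                 # hypothetical Spark Expectations product identifier\n    rules_table=\"catalog.dq.product_rules\",  # hypothetical DQ rules table\n    statistics_table=\"catalog.dq.dq_stats\",  # hypothetical DQ stats table\n    target_table=\"catalog.prod.my_table\",    # hypothetical target table name\n    df=input_df,                             # an existing Spark DataFrame\n)\n\nse_step.execute()\ngood_df = se_step.output.df  # DataFrame returned by the Spark Expectations run\n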

"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.drop_meta_column","title":"drop_meta_column class-attribute instance-attribute","text":"
drop_meta_column: bool = Field(default=False, alias='drop_meta_columns', description='Whether to drop meta columns added by spark expectations on the output df')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.enable_debugger","title":"enable_debugger class-attribute instance-attribute","text":"
enable_debugger: bool = Field(default=False, alias='debugger', description='...')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.error_writer_format","title":"error_writer_format class-attribute instance-attribute","text":"
error_writer_format: Optional[str] = Field(default='delta', alias='dataframe_writer_format', description='The format used to write to the err table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.error_writer_mode","title":"error_writer_mode class-attribute instance-attribute","text":"
error_writer_mode: Optional[Union[str, BatchOutputMode]] = Field(default=APPEND, alias='dataframe_writer_mode', description='The write mode that will be used to write to the err table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.error_writing_options","title":"error_writing_options class-attribute instance-attribute","text":"
error_writing_options: Optional[Dict[str, str]] = Field(default_factory=dict, alias='error_writing_options', description='Options for writing to the error table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.format","title":"format class-attribute instance-attribute","text":"
format: str = Field(default='delta', alias='dataframe_writer_format', description='The format used to write to the stats and err table. Separate output formats can be specified for each table using the error_writer_format and stats_writer_format params')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.mode","title":"mode class-attribute instance-attribute","text":"
mode: Union[str, BatchOutputMode] = Field(default=APPEND, alias='dataframe_writer_mode', description='The write mode that will be used to write to the err and stats table. Separate output modes can be specified for each table using the error_writer_mode and stats_writer_mode params')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.product_id","title":"product_id class-attribute instance-attribute","text":"
product_id: str = Field(default=..., description='Spark Expectations product identifier')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.rules_table","title":"rules_table class-attribute instance-attribute","text":"
rules_table: str = Field(default=..., alias='product_rules_table', description='DQ rules table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.se_user_conf","title":"se_user_conf class-attribute instance-attribute","text":"
se_user_conf: Dict[str, Any] = Field(default={se_notifications_enable_email: False, se_notifications_enable_slack: False}, alias='user_conf', description='SE user provided confs', validate_default=False)\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.statistics_streaming","title":"statistics_streaming class-attribute instance-attribute","text":"
statistics_streaming: Dict[str, Any] = Field(default={se_enable_streaming: False}, alias='stats_streaming_options', description='SE stats streaming options ', validate_default=False)\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.statistics_table","title":"statistics_table class-attribute instance-attribute","text":"
statistics_table: str = Field(default=..., alias='dq_stats_table_name', description='DQ stats table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.stats_writer_format","title":"stats_writer_format class-attribute instance-attribute","text":"
stats_writer_format: Optional[str] = Field(default='delta', alias='stats_writer_format', description='The format used to write to the stats table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.stats_writer_mode","title":"stats_writer_mode class-attribute instance-attribute","text":"
stats_writer_mode: Optional[Union[str, BatchOutputMode]] = Field(default=APPEND, alias='stats_writer_mode', description='The write mode that will be used to write to the stats table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.target_table","title":"target_table class-attribute instance-attribute","text":"
target_table: str = Field(default=..., alias='target_table_name', description=\"The table that will contain good records. Won't write to it, but will write to the err table with same name plus _err suffix\")\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.Output","title":"Output","text":"

Output of the SparkExpectationsTransformation step.

"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.Output.error_table_writer","title":"error_table_writer class-attribute instance-attribute","text":"
error_table_writer: WrappedDataFrameWriter = Field(default=..., description='Spark Expectations error table writer')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.Output.rules_df","title":"rules_df class-attribute instance-attribute","text":"
rules_df: DataFrame = Field(default=..., description='Output dataframe')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.Output.se","title":"se class-attribute instance-attribute","text":"
se: SparkExpectations = Field(default=..., description='Spark Expectations object')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.Output.stats_table_writer","title":"stats_table_writer class-attribute instance-attribute","text":"
stats_table_writer: WrappedDataFrameWriter = Field(default=..., description='Spark Expectations stats table writer')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.execute","title":"execute","text":"
execute() -> Output\n

Apply data quality rules to a dataframe using the out-of-the-box SE decorator

Source code in src/koheesio/integrations/spark/dq/spark_expectations.py
def execute(self) -> Output:\n    \"\"\"\n    Apply data quality rules to a dataframe using the out-of-the-box SE decorator\n    \"\"\"\n    # read rules table\n    rules_df = self.spark.read.table(self.rules_table).cache()\n    self.output.rules_df = rules_df\n\n    @self._se.with_expectations(\n        target_table=self.target_table,\n        user_conf=self.se_user_conf,\n        # Below params are `False` by default, however exposing them here for extra visibility\n        # The writes can be handled by downstream Koheesio steps\n        write_to_table=False,\n        write_to_temp_table=False,\n    )\n    def inner(df: DataFrame) -> DataFrame:\n        \"\"\"Just a wrapper to be able to use Spark Expectations decorator\"\"\"\n        return df\n\n    output_df = inner(self.df)\n\n    if self.drop_meta_column:\n        output_df = output_df.drop(\"meta_dq_run_id\", \"meta_dq_run_datetime\")\n\n    self.output.df = output_df\n
"},{"location":"api_reference/models/index.html","title":"Models","text":"

Models package creates models that can be used to base other classes on.

  • Every model should be at least a pydantic BaseModel, but can also be a Step, or a StepOutput.
  • Every model is expected to be an ABC (Abstract Base Class)
  • Optionally, a model can inherit ExtraParamsMixin, which unpacks arbitrary kwargs into the extra_params dict property, removing the need to create a dict before passing kwargs to a model initializer.

A Model class can be exceptionally handy when you need similar Pydantic models in multiple places, for example across Transformation and Reader classes.

"},{"location":"api_reference/models/index.html#koheesio.models.ListOfColumns","title":"koheesio.models.ListOfColumns module-attribute","text":"
ListOfColumns = Annotated[List[str], BeforeValidator(_list_of_columns_validation)]\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel","title":"koheesio.models.BaseModel","text":"

Base model for all models.

Extends pydantic BaseModel with some additional configuration. To be used as a base class for all models in Koheesio instead of pydantic.BaseModel.

Additional methods and properties: Different Modes

This BaseModel class supports lazy mode. This means that validation of the items stored in the class can be called at will instead of being forced to run it upfront.

  • Normal mode: you need to know the values ahead of time

    normal_mode = YourOwnModel(a=\"foo\", b=42)\n

  • Lazy mode: being able to defer the validation until later

    lazy_mode = YourOwnModel.lazy()\nlazy_mode.a = \"foo\"\nlazy_mode.b = 42\nlazy_mode.validate_output()\n
    The prime advantage of lazy mode is that you don't have to know all your outputs up front and can add them as they become available, while still being able to validate at the end that you have collected all of your output.

  • With statements: With statements are also allowed. The validate_output method from the earlier example will run upon exit of the with-statement.

    with YourOwnModel.lazy() as with_output:\n    with_output.a = \"foo\"\n    with_output.b = 42\n
    Note: a lazy-mode BaseModel object is required to work with a with-statement.

Examples:

from koheesio.models import BaseModel\n\n\nclass Person(BaseModel):\n    name: str\n    age: int\n\n\n# Using the lazy method to create an instance without immediate validation\nperson = Person.lazy()\n\n# Setting attributes\nperson.name = \"John Doe\"\nperson.age = 30\n\n# Now we validate the instance\nperson.validate_output()\n\nprint(person)\n

In this example, the Person instance is created without immediate validation. The attributes name and age are set afterward. The validate_output method is then called to validate the instance.

Koheesio specific configuration:

Koheesio models are configured differently from Pydantic defaults. The following configuration is used:

  1. extra=\"allow\"

    This setting allows for extra fields that are not specified in the model definition. If a field is present in the data but not in the model, it will not raise an error. Pydantic default is \"ignore\", which means that extra attributes are ignored.

  2. arbitrary_types_allowed=True

    This setting allows for fields in the model to be of any type. This is useful when you want to include fields in your model that are not standard Python types. Pydantic default is False, which means that fields must be of a standard Python type.

  3. populate_by_name=True

    This setting allows an aliased field to be populated by its name as given by the model attribute, as well as the alias. This was known as allow_population_by_field_name in pydantic v1. Pydantic default is False, which means that fields can only be populated by their alias.

  4. validate_assignment=False

    This setting determines whether the model should be revalidated when the data is changed. If set to True, every time a field is assigned a new value, the entire model is validated again.

    Pydantic default is (also) False, which means that the model is not revalidated when the data is changed. The default behavior of Pydantic is to validate the data when the model is created. In case the user changes the data after the model is created, the model is not revalidated.

  5. revalidate_instances=\"subclass-instances\"

    This setting determines whether to revalidate models during validation if the instance is a subclass of the model. This is important as inheritance is used a lot in Koheesio. Pydantic default is never, which means that the model and dataclass instances are not revalidated during validation.

  6. validate_default=True

    This setting determines whether to validate default values during validation. When set to True, default values are checked during the validation process. We opt to set this to True, as we are attempting to make sure that the data is valid prior to running / executing any Step. Pydantic default is False, which means that default values are not validated during validation.

  7. frozen=False

    This setting determines whether the model is immutable. If set to True, once a model is created, its fields cannot be changed. Pydantic default is also False, which means that the model is mutable.

  8. coerce_numbers_to_str=True

    This setting determines whether to convert number fields to strings. When set to True, enables automatic coercion of any Number type to str. Pydantic doesn't allow number types (int, float, Decimal) to be coerced as type str by default.

  9. use_enum_values=True

    This setting determines whether to use the values of Enum fields. If set to True, the actual value of the Enum is used instead of the reference. Pydantic default is False, which means that the reference to the Enum is used.

"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--fields","title":"Fields","text":"

Every Koheesio BaseModel has two fields: name and description. These fields are used to provide a name and a description to the model.

  • name: This is the name of the Model. If not provided, it defaults to the class name.

  • description: This is the description of the Model. It has several default behaviors:

    • If not provided, it defaults to the docstring of the class.
    • If the docstring is not provided, it defaults to the name of the class.
    • For multi-line descriptions, it has the following behaviors:
      • Only the first non-empty line is used.
      • Empty lines are removed.
      • Only the first 3 lines are considered.
      • Only the first 120 characters are considered.
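
The sketch below illustrates the name and description defaults described above (the class name and docstring are illustrative):
from koheesio.models import BaseModel\n\n\nclass MyModel(BaseModel):\n    \"\"\"This is MyModel\"\"\"\n\n\nm = MyModel()\nprint(m.name)  # 'MyModel' - defaults to the class name\nprint(m.description)  # 'This is MyModel' - defaults to the docstring\n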
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--validators","title":"Validators","text":"
  • _set_name_and_description: Set the name and description of the Model as per the rules mentioned above.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--properties","title":"Properties","text":"
  • log: Returns a logger with the name of the class.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--class-methods","title":"Class Methods","text":"
  • from_basemodel: Returns a new BaseModel instance based on the data of another BaseModel.
  • from_context: Creates BaseModel instance from a given Context.
  • from_dict: Creates BaseModel instance from a given dictionary.
  • from_json: Creates BaseModel instance from a given JSON string.
  • from_toml: Creates BaseModel object from a given toml file.
  • from_yaml: Creates BaseModel object from a given yaml file.
  • lazy: Constructs the model without doing validation.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--dunder-methods","title":"Dunder Methods","text":"
  • __add__: Allows to add two BaseModel instances together.
  • __enter__: Allows for using the model in a with-statement.
  • __exit__: Allows for using the model in a with-statement.
  • __setitem__: Set Item dunder method for BaseModel.
  • __getitem__: Get Item dunder method for BaseModel.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--instance-methods","title":"Instance Methods","text":"
  • hasattr: Check if given key is present in the model.
  • get: Get an attribute of the model, but don't fail if not present.
  • merge: Merge key,value map with self.
  • set: Allows for subscribing / assigning to class[key].
  • to_context: Converts the BaseModel instance to a Context object.
  • to_dict: Converts the BaseModel instance to a dictionary.
  • to_json: Converts the BaseModel instance to a JSON string.
  • to_yaml: Converts the BaseModel instance to a YAML string.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.description","title":"description class-attribute instance-attribute","text":"
description: Optional[str] = Field(default=None, description='Description of the Model')\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.log","title":"log property","text":"
log: Logger\n

Returns a logger with the name of the class

"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.model_config","title":"model_config class-attribute instance-attribute","text":"
model_config = ConfigDict(extra='allow', arbitrary_types_allowed=True, populate_by_name=True, validate_assignment=False, revalidate_instances='subclass-instances', validate_default=True, frozen=False, coerce_numbers_to_str=True, use_enum_values=True)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.name","title":"name class-attribute instance-attribute","text":"
name: Optional[str] = Field(default=None, description='Name of the Model')\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_basemodel","title":"from_basemodel classmethod","text":"
from_basemodel(basemodel: BaseModel, **kwargs)\n

Returns a new BaseModel instance based on the data of another BaseModel

Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_basemodel(cls, basemodel: BaseModel, **kwargs):\n    \"\"\"Returns a new BaseModel instance based on the data of another BaseModel\"\"\"\n    kwargs = {**basemodel.model_dump(), **kwargs}\n    return cls(**kwargs)\n
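
A short usage sketch, reusing the Person model from the earlier example (the overridden age value is illustrative):
person_copy = Person.from_basemodel(person, age=31)  # copies 'name' from person, overrides 'age'\n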
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_context","title":"from_context classmethod","text":"
from_context(context: Context) -> BaseModel\n

Creates BaseModel instance from a given Context

You have to make sure that the Context object has the necessary attributes to create the model.

Examples:

class SomeStep(BaseModel):\n    foo: str\n\n\ncontext = Context(foo=\"bar\")\nsome_step = SomeStep.from_context(context)\nprint(some_step.foo)  # prints 'bar'\n

Parameters:

Name Type Description Default context Context required

Returns:

Type Description BaseModel Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_context(cls, context: Context) -> BaseModel:\n    \"\"\"Creates BaseModel instance from a given Context\n\n    You have to make sure that the Context object has the necessary attributes to create the model.\n\n    Examples\n    --------\n    ```python\n    class SomeStep(BaseModel):\n        foo: str\n\n\n    context = Context(foo=\"bar\")\n    some_step = SomeStep.from_context(context)\n    print(some_step.foo)  # prints 'bar'\n    ```\n\n    Parameters\n    ----------\n    context: Context\n\n    Returns\n    -------\n    BaseModel\n    \"\"\"\n    return cls(**context)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_dict","title":"from_dict classmethod","text":"
from_dict(data: Dict[str, Any]) -> BaseModel\n

Creates BaseModel instance from a given dictionary

Parameters:

Name Type Description Default data Dict[str, Any] required

Returns:

Type Description BaseModel Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_dict(cls, data: Dict[str, Any]) -> BaseModel:\n    \"\"\"Creates BaseModel instance from a given dictionary\n\n    Parameters\n    ----------\n    data: Dict[str, Any]\n\n    Returns\n    -------\n    BaseModel\n    \"\"\"\n    return cls(**data)\n
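
A short usage sketch, reusing the Person model from the earlier example:
person = Person.from_dict({\"name\": \"John Doe\", \"age\": 30})\nprint(person.name)  # 'John Doe'\n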
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_json","title":"from_json classmethod","text":"
from_json(json_file_or_str: Union[str, Path]) -> BaseModel\n

Creates BaseModel instance from a given JSON string

BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored in the BaseModel object, which is not possible with the standard json library.

See Also

Context.from_json : Deserializes a JSON string to a Context object

Parameters:

Name Type Description Default json_file_or_str Union[str, Path]

Pathlike string or Path that points to the json file or string containing json

required

Returns:

Type Description BaseModel Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_json(cls, json_file_or_str: Union[str, Path]) -> BaseModel:\n    \"\"\"Creates BaseModel instance from a given JSON string\n\n    BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses\n    jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored\n    in the BaseModel object, which is not possible with the standard json library.\n\n    See Also\n    --------\n    Context.from_json : Deserializes a JSON string to a Context object\n\n    Parameters\n    ----------\n    json_file_or_str : Union[str, Path]\n        Pathlike string or Path that points to the json file or string containing json\n\n    Returns\n    -------\n    BaseModel\n    \"\"\"\n    _context = Context.from_json(json_file_or_str)\n    return cls.from_context(_context)\n
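
A hedged usage sketch, reusing the Person model from the earlier example (the inline JSON string is illustrative; a path to a .json file would work as well):
person = Person.from_json('{\"name\": \"John Doe\", \"age\": 30}')\n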
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_toml","title":"from_toml classmethod","text":"
from_toml(toml_file_or_str: Union[str, Path]) -> BaseModel\n

Creates BaseModel object from a given toml file

Note: BaseModel offloads the serialization and deserialization of the TOML string to Context class.

Parameters:

Name Type Description Default toml_file_or_str Union[str, Path]

Pathlike string or Path that points to the toml file, or string containing toml

required

Returns:

Type Description BaseModel Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_toml(cls, toml_file_or_str: Union[str, Path]) -> BaseModel:\n    \"\"\"Creates BaseModel object from a given toml file\n\n    Note: BaseModel offloads the serialization and deserialization of the TOML string to Context class.\n\n    Parameters\n    ----------\n    toml_file_or_str: str or Path\n        Pathlike string or Path that points to the toml file, or string containing toml\n\n    Returns\n    -------\n    BaseModel\n    \"\"\"\n    _context = Context.from_toml(toml_file_or_str)\n    return cls.from_context(_context)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_yaml","title":"from_yaml classmethod","text":"
from_yaml(yaml_file_or_str: str) -> BaseModel\n

Creates BaseModel object from a given yaml file

Note: BaseModel offloads the serialization and deserialization of the YAML string to Context class.

Parameters:

Name Type Description Default yaml_file_or_str str

Pathlike string or Path that points to the yaml file, or string containing yaml

required

Returns:

Type Description BaseModel Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_yaml(cls, yaml_file_or_str: str) -> BaseModel:\n    \"\"\"Creates BaseModel object from a given yaml file\n\n    Note: BaseModel offloads the serialization and deserialization of the YAML string to Context class.\n\n    Parameters\n    ----------\n    yaml_file_or_str: str or Path\n        Pathlike string or Path that points to the yaml file, or string containing yaml\n\n    Returns\n    -------\n    BaseModel\n    \"\"\"\n    _context = Context.from_yaml(yaml_file_or_str)\n    return cls.from_context(_context)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.get","title":"get","text":"
get(key: str, default: Optional[Any] = None)\n

Get an attribute of the model, but don't fail if not present

Similar to dict.get()

Examples:

step_output = StepOutput(foo=\"bar\")\nstep_output.get(\"foo\")  # returns 'bar'\nstep_output.get(\"non_existent_key\", \"oops\")  # returns 'oops'\n

Parameters:

Name Type Description Default key str

name of the key to get

required default Optional[Any]

Default value in case the attribute does not exist

None

Returns:

Type Description Any

The value of the attribute

Source code in src/koheesio/models/__init__.py
def get(self, key: str, default: Optional[Any] = None):\n    \"\"\"Get an attribute of the model, but don't fail if not present\n\n    Similar to dict.get()\n\n    Examples\n    --------\n    ```python\n    step_output = StepOutput(foo=\"bar\")\n    step_output.get(\"foo\")  # returns 'bar'\n    step_output.get(\"non_existent_key\", \"oops\")  # returns 'oops'\n    ```\n\n    Parameters\n    ----------\n    key: str\n        name of the key to get\n    default: Optional[Any]\n        Default value in case the attribute does not exist\n\n    Returns\n    -------\n    Any\n        The value of the attribute\n    \"\"\"\n    if self.hasattr(key):\n        return self.__getitem__(key)\n    return default\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.hasattr","title":"hasattr","text":"
hasattr(key: str) -> bool\n

Check if given key is present in the model

Parameters:

Name Type Description Default key str required

Returns:

Type Description bool Source code in src/koheesio/models/__init__.py
def hasattr(self, key: str) -> bool:\n    \"\"\"Check if given key is present in the model\n\n    Parameters\n    ----------\n    key: str\n\n    Returns\n    -------\n    bool\n    \"\"\"\n    return hasattr(self, key)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.lazy","title":"lazy classmethod","text":"
lazy()\n

Constructs the model without doing validation

Essentially an alias to BaseModel.construct()

Source code in src/koheesio/models/__init__.py
@classmethod\ndef lazy(cls):\n    \"\"\"Constructs the model without doing validation\n\n    Essentially an alias to BaseModel.construct()\n    \"\"\"\n    return cls.model_construct()\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.merge","title":"merge","text":"
merge(other: Union[Dict, BaseModel])\n

Merge key,value map with self

Functionally similar to adding two dicts together; like running {**dict_a, **dict_b}.

Examples:

step_output = StepOutput(foo=\"bar\")\nstep_output.merge({\"lorem\": \"ipsum\"})  # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n

Parameters:

Name Type Description Default other Union[Dict, BaseModel]

Dict or another instance of a BaseModel class that will be added to self

required Source code in src/koheesio/models/__init__.py
def merge(self, other: Union[Dict, BaseModel]):\n    \"\"\"Merge key,value map with self\n\n    Functionally similar to adding two dicts together; like running `{**dict_a, **dict_b}`.\n\n    Examples\n    --------\n    ```python\n    step_output = StepOutput(foo=\"bar\")\n    step_output.merge({\"lorem\": \"ipsum\"})  # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n    ```\n\n    Parameters\n    ----------\n    other: Union[Dict, BaseModel]\n        Dict or another instance of a BaseModel class that will be added to self\n    \"\"\"\n    if isinstance(other, BaseModel):\n        other = other.model_dump()  # ensures we really have a dict\n\n    for k, v in other.items():\n        self.set(k, v)\n\n    return self\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.set","title":"set","text":"
set(key: str, value: Any)\n

Allows for subscribing / assigning to class[key].

Examples:

step_output = StepOutput(foo=\"bar\")\nstep_output.set(\"foo\", \"baz\")  # overwrites 'foo' to be 'baz'\n

Parameters:

Name Type Description Default key str

The key of the attribute to assign to

required value Any

Value that should be assigned to the given key

required Source code in src/koheesio/models/__init__.py
def set(self, key: str, value: Any):\n    \"\"\"Allows for subscribing / assigning to `class[key]`.\n\n    Examples\n    --------\n    ```python\n    step_output = StepOutput(foo=\"bar\")\n    step_output.set(\"foo\", \"baz\")  # overwrites 'foo' to be 'baz'\n    ```\n\n    Parameters\n    ----------\n    key: str\n        The key of the attribute to assign to\n    value: Any\n        Value that should be assigned to the given key\n    \"\"\"\n    self.__setitem__(key, value)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.to_context","title":"to_context","text":"
to_context() -> Context\n

Converts the BaseModel instance to a Context object

Returns:

Type Description Context Source code in src/koheesio/models/__init__.py
def to_context(self) -> Context:\n    \"\"\"Converts the BaseModel instance to a Context object\n\n    Returns\n    -------\n    Context\n    \"\"\"\n    return Context(**self.to_dict())\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.to_dict","title":"to_dict","text":"
to_dict() -> Dict[str, Any]\n

Converts the BaseModel instance to a dictionary

Returns:

Type Description Dict[str, Any] Source code in src/koheesio/models/__init__.py
def to_dict(self) -> Dict[str, Any]:\n    \"\"\"Converts the BaseModel instance to a dictionary\n\n    Returns\n    -------\n    Dict[str, Any]\n    \"\"\"\n    return self.model_dump()\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.to_json","title":"to_json","text":"
to_json(pretty: bool = False)\n

Converts the BaseModel instance to a JSON string

BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored in the BaseModel object, which is not possible with the standard json library.

See Also

Context.to_json : Serializes a Context object to a JSON string

Parameters:

Name Type Description Default pretty bool

Toggles whether to return a pretty json string or not

False

Returns:

Type Description str

containing all parameters of the BaseModel instance

Source code in src/koheesio/models/__init__.py
def to_json(self, pretty: bool = False):\n    \"\"\"Converts the BaseModel instance to a JSON string\n\n    BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses\n    jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored\n    in the BaseModel object, which is not possible with the standard json library.\n\n    See Also\n    --------\n    Context.to_json : Serializes a Context object to a JSON string\n\n    Parameters\n    ----------\n    pretty : bool, optional, default=False\n        Toggles whether to return a pretty json string or not\n\n    Returns\n    -------\n    str\n        containing all parameters of the BaseModel instance\n    \"\"\"\n    _context = self.to_context()\n    return _context.to_json(pretty=pretty)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.to_yaml","title":"to_yaml","text":"
to_yaml(clean: bool = False) -> str\n

Converts the BaseModel instance to a YAML string

BaseModel offloads the serialization and deserialization of the YAML string to Context class.

Parameters:

Name Type Description Default clean bool

Toggles whether to remove !!python/object:... from yaml or not. Default: False

False

Returns:

Type Description str

containing all parameters of the BaseModel instance

Source code in src/koheesio/models/__init__.py
def to_yaml(self, clean: bool = False) -> str:\n    \"\"\"Converts the BaseModel instance to a YAML string\n\n    BaseModel offloads the serialization and deserialization of the YAML string to Context class.\n\n    Parameters\n    ----------\n    clean: bool\n        Toggles whether to remove `!!python/object:...` from yaml or not.\n        Default: False\n\n    Returns\n    -------\n    str\n        containing all parameters of the BaseModel instance\n    \"\"\"\n    _context = self.to_context()\n    return _context.to_yaml(clean=clean)\n
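
A short usage sketch, reusing the Person model from the earlier example:
yaml_str = person.to_yaml(clean=True)  # clean=True removes the !!python/object:... tags from the output\n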
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.validate","title":"validate","text":"
validate() -> BaseModel\n

Validate the BaseModel instance

This method is used to validate the BaseModel instance. It is used in conjunction with the lazy method to validate the instance after all the attributes have been set.

This method is intended to be used with the lazy method. The lazy method is used to create an instance of the BaseModel without immediate validation. The validate method is then used to validate the instance afterward.

Note: in the Pydantic BaseModel, the validate method throws a deprecated warning. This is because Pydantic recommends using the validate_model method instead. However, we are using the validate method here in a different context and a slightly different way.

Examples:

class FooModel(BaseModel):\n    foo: str\n    lorem: str\n\n\nfoo_model = FooModel.lazy()\nfoo_model.foo = \"bar\"\nfoo_model.lorem = \"ipsum\"\nfoo_model.validate()\n
In this example, the foo_model instance is created without immediate validation. The attributes foo and lorem are set afterward. The validate method is then called to validate the instance.

Returns:

Type Description BaseModel

The BaseModel instance

Source code in src/koheesio/models/__init__.py
def validate(self) -> BaseModel:\n    \"\"\"Validate the BaseModel instance\n\n    This method is used to validate the BaseModel instance. It is used in conjunction with the lazy method to\n    validate the instance after all the attributes have been set.\n\n    This method is intended to be used with the `lazy` method. The `lazy` method is used to create an instance of\n    the BaseModel without immediate validation. The `validate` method is then used to validate the instance after.\n\n    > Note: in the Pydantic BaseModel, the `validate` method throws a deprecated warning. This is because Pydantic\n    recommends using the `validate_model` method instead. However, we are using the `validate` method here in a\n    different context and a slightly different way.\n\n    Examples\n    --------\n    ```python\n    class FooModel(BaseModel):\n        foo: str\n        lorem: str\n\n\n    foo_model = FooModel.lazy()\n    foo_model.foo = \"bar\"\n    foo_model.lorem = \"ipsum\"\n    foo_model.validate()\n    ```\n    In this example, the `foo_model` instance is created without immediate validation. The attributes foo and lorem\n    are set afterward. The `validate` method is then called to validate the instance.\n\n    Returns\n    -------\n    BaseModel\n        The BaseModel instance\n    \"\"\"\n    return self.model_validate(self.model_dump())\n
"},{"location":"api_reference/models/index.html#koheesio.models.ExtraParamsMixin","title":"koheesio.models.ExtraParamsMixin","text":"

Mixin class that adds support for arbitrary keyword arguments to Pydantic models.

The keyword arguments are extracted from the model's values and moved to a params dictionary.
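
A minimal sketch of the mixin in use (the MyStep class and its fields are illustrative):
from koheesio.models import BaseModel, ExtraParamsMixin\n\n\nclass MyStep(BaseModel, ExtraParamsMixin):\n    foo: str\n\n\n# 'bar' is not a declared field, so it is collected as an extra parameter\nstep = MyStep(foo=\"a\", bar=\"b\")\nprint(step.extra_params)  # expected to contain {'bar': 'b'}\n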

"},{"location":"api_reference/models/index.html#koheesio.models.ExtraParamsMixin.extra_params","title":"extra_params cached property","text":"
extra_params: Dict[str, Any]\n

Extract params (passed as arbitrary kwargs) from values and move them to params dict

"},{"location":"api_reference/models/index.html#koheesio.models.ExtraParamsMixin.params","title":"params class-attribute instance-attribute","text":"
params: Dict[str, Any] = Field(default_factory=dict)\n
"},{"location":"api_reference/models/sql.html","title":"Sql","text":"

This module contains the base class for SQL steps.

"},{"location":"api_reference/models/sql.html#koheesio.models.sql.SqlBaseStep","title":"koheesio.models.sql.SqlBaseStep","text":"

Base class for SQL steps

params are used as placeholders for templating. These are identified with ${placeholder} in the SQL script.

Parameters:

Name Type Description Default sql_path

Path to a SQL file

required sql

SQL script to apply

required params

Placeholders (parameters) for templating. These are identified with ${placeholder} in the SQL script.

Note: any arbitrary kwargs passed to the class will be added to params.

required"},{"location":"api_reference/models/sql.html#koheesio.models.sql.SqlBaseStep.params","title":"params class-attribute instance-attribute","text":"
params: Dict[str, Any] = Field(default_factory=dict, description='Placeholders (parameters) for templating. These are identified with ${placeholder} in the SQL script. Note: any arbitrary kwargs passed to the class will be added to params.')\n
"},{"location":"api_reference/models/sql.html#koheesio.models.sql.SqlBaseStep.query","title":"query property","text":"
query\n

Returns the query while performing params replacement
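
A hedged sketch of the templating behavior, assuming a concrete subclass of SqlBaseStep (here called MySqlStep, which is hypothetical):
step = MySqlStep(\n    sql=\"SELECT * FROM ${table_name} WHERE load_date = '${load_date}'\",\n    table_name=\"my_schema.my_table\",  # arbitrary kwargs are added to params\n    load_date=\"2024-01-01\",\n)\nprint(step.query)  # the ${...} placeholders are replaced with the values from params\n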

"},{"location":"api_reference/models/sql.html#koheesio.models.sql.SqlBaseStep.sql","title":"sql class-attribute instance-attribute","text":"
sql: Optional[str] = Field(default=None, description='SQL script to apply')\n
"},{"location":"api_reference/models/sql.html#koheesio.models.sql.SqlBaseStep.sql_path","title":"sql_path class-attribute instance-attribute","text":"
sql_path: Optional[Union[Path, str]] = Field(default=None, description='Path to a SQL file')\n
"},{"location":"api_reference/notifications/index.html","title":"Notifications","text":"

Notification module for sending messages to notification services (e.g. Slack, Email, etc.)

"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity","title":"koheesio.notifications.NotificationSeverity","text":"

Enumeration of allowed message severities

"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity.ERROR","title":"ERROR class-attribute instance-attribute","text":"
ERROR = 'error'\n
"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity.INFO","title":"INFO class-attribute instance-attribute","text":"
INFO = 'info'\n
"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity.SUCCESS","title":"SUCCESS class-attribute instance-attribute","text":"
SUCCESS = 'success'\n
"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity.WARN","title":"WARN class-attribute instance-attribute","text":"
WARN = 'warn'\n
"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity.alert_icon","title":"alert_icon property","text":"
alert_icon: str\n

Return a colored circle in slack markup

"},{"location":"api_reference/notifications/slack.html","title":"Slack","text":"

Classes to ease interaction with Slack

"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification","title":"koheesio.notifications.slack.SlackNotification","text":"

Generic Slack notification class via the Blocks API

NOTE: the channel parameter is used only with the Slack Web API: https://api.slack.com/messaging/sending. If a webhook is used, the channel specification is not required.

Example:

s = SlackNotification(\n    url=\"slack-webhook-url\",\n    channel=\"channel\",\n    message=\"Some *markdown* compatible text\",\n)\ns.execute()\n

"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification.channel","title":"channel class-attribute instance-attribute","text":"
channel: Optional[str] = Field(default=None, description='Slack channel id')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification.headers","title":"headers class-attribute instance-attribute","text":"
headers: Optional[Dict[str, Any]] = {'Content-type': 'application/json'}\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification.message","title":"message class-attribute instance-attribute","text":"
message: str = Field(default=..., description='The message that gets posted to Slack')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification.execute","title":"execute","text":"
execute()\n

Generate payload and send post request

Source code in src/koheesio/notifications/slack.py
def execute(self):\n    \"\"\"\n    Generate payload and send post request\n    \"\"\"\n    self.data = self.get_payload()\n    HttpPostStep.execute(self)\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification.get_payload","title":"get_payload","text":"
get_payload()\n

Generate payload with Block Kit. More details: https://api.slack.com/block-kit

Source code in src/koheesio/notifications/slack.py
def get_payload(self):\n    \"\"\"\n    Generate payload with `Block Kit`.\n    More details: https://api.slack.com/block-kit\n    \"\"\"\n    payload = {\n        \"attachments\": [\n            {\n                \"blocks\": [\n                    {\n                        \"type\": \"section\",\n                        \"text\": {\n                            \"type\": \"mrkdwn\",\n                            \"text\": self.message,\n                        },\n                    }\n                ],\n            }\n        ]\n    }\n\n    if self.channel:\n        payload[\"channel\"] = self.channel\n\n    return json.dumps(payload)\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity","title":"koheesio.notifications.slack.SlackNotificationWithSeverity","text":"

Slack notification class via the Blocks API with extra severity information and predefined extra fields

Example: from koheesio.steps.integrations.notifications import NotificationSeverity

s = SlackNotificationWithSeverity(\n    url=\"slack-webhook-url\",\n    channel=\"channel\",\n    message=\"Some *markdown* compatible text\",\n    severity=NotificationSeverity.ERROR,\n    title=\"Title\",\n    environment=\"dev\",\n    application=\"Application\"\n)\ns.execute()\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.application","title":"application class-attribute instance-attribute","text":"
application: str = Field(default=..., description='Pipeline or application name')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.environment","title":"environment class-attribute instance-attribute","text":"
environment: str = Field(default=..., description='Environment description, e.g. dev / qa / prod')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.model_config","title":"model_config class-attribute instance-attribute","text":"
model_config = ConfigDict(use_enum_values=False)\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.severity","title":"severity class-attribute instance-attribute","text":"
severity: NotificationSeverity = Field(default=..., description='Severity of the message')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.timestamp","title":"timestamp class-attribute instance-attribute","text":"
timestamp: datetime = Field(default=utcnow(), alias='execution_timestamp', description='Pipeline or application execution timestamp')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.title","title":"title class-attribute instance-attribute","text":"
title: str = Field(default=..., description='Title of your message')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.execute","title":"execute","text":"
execute()\n

Generate payload and send post request

Source code in src/koheesio/notifications/slack.py
def execute(self):\n    \"\"\"\n    Generate payload and send post request\n    \"\"\"\n    self.message = self.get_payload_message()\n    self.data = self.get_payload()\n    HttpPostStep.execute(self)\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.get_payload_message","title":"get_payload_message","text":"
get_payload_message()\n

Generate payload message based on the predefined set of parameters

Source code in src/koheesio/notifications/slack.py
def get_payload_message(self):\n    \"\"\"\n    Generate payload message based on the predefined set of parameters\n    \"\"\"\n    return dedent(\n        f\"\"\"\n            {self.severity.alert_icon}   *{self.severity.name}:*  {self.title}\n            *Environment:* {self.environment}\n            *Application:* {self.application}\n            *Message:* {self.message}\n            *Timestamp:* {self.timestamp}\n        \"\"\"\n    )\n
"},{"location":"api_reference/secrets/index.html","title":"Secrets","text":"

Module for secret integrations.

Contains the abstract class for various secret integrations, also known as SecretContext.

"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret","title":"koheesio.secrets.Secret","text":"

Abstract class for various secret integrations. All secrets are wrapped into the Context class for easy access. Either an existing context can be provided, or a new context will be created and returned at runtime.

Secrets are wrapped into the pydantic.SecretStr.

"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.context","title":"context class-attribute instance-attribute","text":"
context: Optional[Context] = Field(Context({}), description='Existing `Context` instance can be used for secrets, otherwise new empty context will be created.')\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.parent","title":"parent class-attribute instance-attribute","text":"
parent: Optional[str] = Field(default=..., description='Group secrets from one secure path under this friendly name', pattern='^[a-zA-Z0-9_]+$')\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.root","title":"root class-attribute instance-attribute","text":"
root: Optional[str] = Field(default='secrets', description='All secrets will be grouped under this root.')\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.Output","title":"Output","text":"

Output class for Secret.

"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.Output.context","title":"context class-attribute instance-attribute","text":"
context: Context = Field(default=..., description='Koheesio context')\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.encode_secret_values","title":"encode_secret_values classmethod","text":"
encode_secret_values(data: dict)\n

Encode secret values in the dictionary.

Ensures that all values in the dictionary are wrapped in SecretStr.

Source code in src/koheesio/secrets/__init__.py
@classmethod\ndef encode_secret_values(cls, data: dict):\n    \"\"\"Encode secret values in the dictionary.\n\n    Ensures that all values in the dictionary are wrapped in SecretStr.\n    \"\"\"\n    encoded_dict = {}\n    for key, value in data.items():\n        if isinstance(value, dict):\n            encoded_dict[key] = cls.encode_secret_values(value)\n        else:\n            encoded_dict[key] = SecretStr(value)\n    return encoded_dict\n
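For illustration, a small sketch of what this classmethod does with a nested dictionary (the keys and values are made up):
from pydantic import SecretStr
from koheesio.secrets import Secret

raw = {"my_app": {"webhook": "https://hooks.example.com", "token": "abc123"}}
encoded = Secret.encode_secret_values(raw)

# every leaf value is now a SecretStr, so it is masked when printed or logged
assert isinstance(encoded["my_app"]["token"], SecretStr)
print(encoded["my_app"]["token"])                     # **********
print(encoded["my_app"]["token"].get_secret_value())  # abc123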
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.execute","title":"execute","text":"
execute()\n

Main method to handle secrets protection and context creation with \"root-parent-secrets\" structure.

Source code in src/koheesio/secrets/__init__.py
def execute(self):\n    \"\"\"\n    Main method to handle secrets protection and context creation with \"root-parent-secrets\" structure.\n    \"\"\"\n    context = Context(self.encode_secret_values(data={self.root: {self.parent: self._get_secrets()}}))\n    self.output.context = self.context.merge(context=context)\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.get","title":"get","text":"
get() -> Context\n

Convenience method to return context with secrets.

Source code in src/koheesio/secrets/__init__.py
def get(self) -> Context:\n    \"\"\"\n    Convenience method to return context with secrets.\n    \"\"\"\n    self.execute()\n    return self.output.context\n
"},{"location":"api_reference/secrets/cerberus.html","title":"Cerberus","text":"

Module for retrieving secrets from Cerberus.

Secrets are stored as SecretContext and can be accessed accordingly.

See CerberusSecret for more information.

"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret","title":"koheesio.secrets.cerberus.CerberusSecret","text":"

Retrieve secrets from Cerberus and wrap them in the Context class for easy access. All secrets are stored under the \"secrets\" root and a \"parent\" key. The \"parent\" is either derived from the secure data path by replacing \"/\" and \"-\", or provided manually by the user. Secret values are wrapped in pydantic.SecretStr.

Example:

context = {\n    \"secrets\": {\n        \"parent\": {\n            \"webhook\": SecretStr(\"**********\"),\n            \"description\": SecretStr(\"**********\"),\n        }\n    }\n}\n

Values can be decoded like this:

context.secrets.parent.webhook.get_secret_value()\n
or, if working with a dictionary is preferable:
for key, value in context.get_all().items():\n    value.get_secret_value()\n

"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret.aws_session","title":"aws_session class-attribute instance-attribute","text":"
aws_session: Optional[Session] = Field(default=None, description='AWS Session to pass to Cerberus client, can be used for local execution.')\n
"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret.path","title":"path class-attribute instance-attribute","text":"
path: str = Field(default=..., description=\"Secure data path, eg. 'app/my-sdb/my-secrets'\")\n
"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret.token","title":"token class-attribute instance-attribute","text":"
token: Optional[SecretStr] = Field(default=get('CERBERUS_TOKEN', None), description='Cerberus token, can be used for local development without AWS auth mechanism. Note: Token has priority over AWS session.')\n
"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret.url","title":"url class-attribute instance-attribute","text":"
url: str = Field(default=..., description='Cerberus URL, eg. https://cerberus.domain.com')\n
"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret.verbose","title":"verbose class-attribute instance-attribute","text":"
verbose: bool = Field(default=False, description='Enable verbose for Cerberus client')\n
"},{"location":"api_reference/spark/index.html","title":"Spark","text":"

Spark step module

"},{"location":"api_reference/spark/index.html#koheesio.spark.AnalysisException","title":"koheesio.spark.AnalysisException module-attribute","text":"
AnalysisException = AnalysisException\n
"},{"location":"api_reference/spark/index.html#koheesio.spark.DataFrame","title":"koheesio.spark.DataFrame module-attribute","text":"
DataFrame = DataFrame\n
"},{"location":"api_reference/spark/index.html#koheesio.spark.SparkSession","title":"koheesio.spark.SparkSession module-attribute","text":"
SparkSession = SparkSession\n
"},{"location":"api_reference/spark/index.html#koheesio.spark.SparkStep","title":"koheesio.spark.SparkStep","text":"

Base class for a Spark step

Extends the Step class with SparkSession support. The following applies: - Spark steps are expected to return a Spark DataFrame as output. - The spark property is available to access the active SparkSession instance.

"},{"location":"api_reference/spark/index.html#koheesio.spark.SparkStep.spark","title":"spark property","text":"
spark: Optional[SparkSession]\n

Get active SparkSession instance

"},{"location":"api_reference/spark/index.html#koheesio.spark.SparkStep.Output","title":"Output","text":"

Output class for SparkStep

"},{"location":"api_reference/spark/index.html#koheesio.spark.SparkStep.Output.df","title":"df class-attribute instance-attribute","text":"
df: Optional[DataFrame] = Field(default=None, description='The Spark DataFrame')\n
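As an illustration (not taken from the codebase), a custom SparkStep typically reads from self.spark and stores its result on output.df:
from koheesio.spark import SparkStep

class ShowTables(SparkStep):
    """Illustrative SparkStep that lists the tables of a database."""
    database: str = "default"  # hypothetical parameter

    def execute(self):
        # self.spark returns the active SparkSession
        self.output.df = self.spark.sql(f"SHOW TABLES IN {self.database}")

step = ShowTables(database="default")
step.execute()
step.output.df.show()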
"},{"location":"api_reference/spark/index.html#koheesio.spark.current_timestamp_utc","title":"koheesio.spark.current_timestamp_utc","text":"
current_timestamp_utc(spark: SparkSession) -> Column\n

Get the current timestamp in UTC

Source code in src/koheesio/spark/__init__.py
def current_timestamp_utc(spark: SparkSession) -> Column:\n    \"\"\"Get the current timestamp in UTC\"\"\"\n    return F.to_utc_timestamp(F.current_timestamp(), spark.conf.get(\"spark.sql.session.timeZone\"))\n
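For example (the DataFrame and column names are illustrative):
from pyspark.sql import SparkSession
from koheesio.spark import current_timestamp_utc

spark = SparkSession.builder.getOrCreate()
df = spark.range(3)

# add a UTC timestamp column, independent of the configured session time zone
df = df.withColumn("loaded_at_utc", current_timestamp_utc(spark))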
"},{"location":"api_reference/spark/delta.html","title":"Delta","text":"

Module for creating and managing Delta tables.

"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep","title":"koheesio.spark.delta.DeltaTableStep","text":"

Class for creating and managing Delta tables.

DeltaTableStep aims to provide a simple interface to create and manage Delta tables. It is a wrapper around the Spark SQL API for Delta tables.

Example
from koheesio.steps import DeltaTableStep\n\nDeltaTableStep(\n    table=\"my_table\",\n    database=\"my_database\",\n    catalog=\"my_catalog\",\n    create_if_not_exists=True,\n    default_create_properties={\n        \"delta.randomizeFilePrefixes\": \"true\",\n        \"delta.checkpoint.writeStatsAsStruct\": \"true\",\n        \"delta.minReaderVersion\": \"2\",\n        \"delta.minWriterVersion\": \"5\",\n    },\n)\n

Methods:

Name Description get_persisted_properties

Get persisted properties of table.

add_property

Alter table and set table property.

add_properties

Alter table and add properties.

execute

Nothing to execute on a Table.

max_version_ts_of_last_execution

Max version timestamp of last execution. If no timestamp is found, returns 1900-01-01 00:00:00. Note: will raise an error if column VERSION_TIMESTAMP does not exist.

Properties
  • name -> str Deprecated. Use .table_name instead.
  • table_name -> str Table name.
  • dataframe -> DataFrame Returns a DataFrame to be able to interact with this table.
  • columns -> Optional[List[str]] Returns all column names as a list.
  • has_change_type -> bool Checks if a column named _change_type is present in the table.
  • exists -> bool Check if table exists.

Parameters:

Name Type Description Default table str

Table name.

required database str

Database or Schema name.

None catalog str

Catalog name.

None create_if_not_exists bool

Force table creation if it doesn't exist. Note: Default properties will be applied to the table during CREATION.

False default_create_properties Dict[str, str]

Default table properties to be applied during CREATION if force_creation is True.

{\"delta.randomizeFilePrefixes\": \"true\", \"delta.checkpoint.writeStatsAsStruct\": \"true\", \"delta.minReaderVersion\": \"2\", \"delta.minWriterVersion\": \"5\"}"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.catalog","title":"catalog class-attribute instance-attribute","text":"
catalog: Optional[str] = Field(default=None, description='Catalog name. Note: Can be ignored if using a SparkCatalog that does not support catalog notation (e.g. Hive)')\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.columns","title":"columns property","text":"
columns: Optional[List[str]]\n

Returns all column names as a list.

Example

DeltaTableStep(...).columns\n
Would for example return ['age', 'name'] if the table has columns age and name.

"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.create_if_not_exists","title":"create_if_not_exists class-attribute instance-attribute","text":"
create_if_not_exists: bool = Field(default=False, alias='force_creation', description=\"Force table creation if it doesn't exist. Note: Default properties will be applied to the table during CREATION.\")\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.database","title":"database class-attribute instance-attribute","text":"
database: Optional[str] = Field(default=None, description='Database or Schema name.')\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.dataframe","title":"dataframe property","text":"
dataframe: DataFrame\n

Returns a DataFrame to be able to interact with this table

"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.default_create_properties","title":"default_create_properties class-attribute instance-attribute","text":"
default_create_properties: Dict[str, Union[str, bool, int]] = Field(default={'delta.randomizeFilePrefixes': 'true', 'delta.checkpoint.writeStatsAsStruct': 'true', 'delta.minReaderVersion': '2', 'delta.minWriterVersion': '5'}, description='Default table properties to be applied during CREATION if `create_if_not_exists` True')\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.exists","title":"exists property","text":"
exists: bool\n

Check if table exists

"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.has_change_type","title":"has_change_type property","text":"
has_change_type: bool\n

Checks if a column named _change_type is present in the table

"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.is_cdf_active","title":"is_cdf_active property","text":"
is_cdf_active: bool\n

Check if CDF property is set and activated

Returns:

Type Description bool

delta.enableChangeDataFeed property is set to 'true'

"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.table","title":"table instance-attribute","text":"
table: str\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.table_name","title":"table_name property","text":"
table_name: str\n

Fully qualified table name in the form of catalog.database.table

"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.add_properties","title":"add_properties","text":"
add_properties(properties: Dict[str, Union[str, bool, int]], override: bool = False)\n

Alter table and add properties.

Parameters:

Name Type Description Default properties Dict[str, Union[str, int, bool]]

Properties to be added to table.

required override bool

Enable override of existing value for property in table.

False Source code in src/koheesio/spark/delta.py
def add_properties(self, properties: Dict[str, Union[str, bool, int]], override: bool = False):\n    \"\"\"Alter table and add properties.\n\n    Parameters\n    ----------\n    properties : Dict[str, Union[str, int, bool]]\n        Properties to be added to table.\n    override : bool, optional, default=False\n        Enable override of existing value for property in table.\n\n    \"\"\"\n    for k, v in properties.items():\n        v_str = str(v) if not isinstance(v, bool) else str(v).lower()\n        self.add_property(key=k, value=v_str, override=override)\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.add_property","title":"add_property","text":"
add_property(key: str, value: Union[str, int, bool], override: bool = False)\n

Alter table and set table property.

Parameters:

Name Type Description Default key str

Property key(name).

required value Union[str, int, bool]

Property value.

required override bool

Enable override of existing value for property in table.

False Source code in src/koheesio/spark/delta.py
def add_property(self, key: str, value: Union[str, int, bool], override: bool = False):\n    \"\"\"Alter table and set table property.\n\n    Parameters\n    ----------\n    key: str\n        Property key(name).\n    value: Union[str, int, bool]\n        Property value.\n    override: bool\n        Enable override of existing value for property in table.\n\n    \"\"\"\n    persisted_properties = self.get_persisted_properties()\n    v_str = str(value) if not isinstance(value, bool) else str(value).lower()\n\n    def _alter_table() -> None:\n        property_pair = f\"'{key}'='{v_str}'\"\n\n        try:\n            # noinspection SqlNoDataSourceInspection\n            self.spark.sql(f\"ALTER TABLE {self.table_name} SET TBLPROPERTIES ({property_pair})\")\n            self.log.debug(f\"Table `{self.table_name}` has been altered. Property `{property_pair}` added.\")\n        except Py4JJavaError as e:\n            msg = f\"Property `{key}` can not be applied to table `{self.table_name}`. Exception: {e}\"\n            self.log.warning(msg)\n            warnings.warn(msg)\n\n    if self.exists:\n        if key in persisted_properties and persisted_properties[key] != v_str:\n            if override:\n                self.log.debug(\n                    f\"Property `{key}` presents in `{self.table_name}` and has value `{persisted_properties[key]}`.\"\n                    f\"Override is enabled.The value will be changed to `{v_str}`.\"\n                )\n                _alter_table()\n            else:\n                self.log.debug(\n                    f\"Skipping adding property `{key}`, because it is already set \"\n                    f\"for table `{self.table_name}` to `{v_str}`. To override it, provide override=True\"\n                )\n        else:\n            _alter_table()\n    else:\n        self.default_create_properties[key] = v_str\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.execute","title":"execute","text":"
execute()\n

Nothing to execute on a Table

Source code in src/koheesio/spark/delta.py
def execute(self):\n    \"\"\"Nothing to execute on a Table\"\"\"\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.get_column_type","title":"get_column_type","text":"
get_column_type(column: str) -> Optional[DataType]\n

Get the type of a column in the table.

Parameters:

Name Type Description Default column str

Column name.

required

Returns:

Type Description Optional[DataType]

Column type.

Source code in src/koheesio/spark/delta.py
def get_column_type(self, column: str) -> Optional[DataType]:\n    \"\"\"Get the type of a column in the table.\n\n    Parameters\n    ----------\n    column : str\n        Column name.\n\n    Returns\n    -------\n    Optional[DataType]\n        Column type.\n    \"\"\"\n    return self.dataframe.schema[column].dataType if self.columns and column in self.columns else None\n
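For example (assuming the table has an age column):
from koheesio.spark.delta import DeltaTableStep

age_type = DeltaTableStep(table="my_table", database="my_database").get_column_type("age")
# returns the Spark DataType of the column, e.g. IntegerType(), or None if the column is missing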
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.get_persisted_properties","title":"get_persisted_properties","text":"
get_persisted_properties() -> Dict[str, str]\n

Get persisted properties of table.

Returns:

Type Description Dict[str, str]

Persisted properties as a dictionary.

Source code in src/koheesio/spark/delta.py
def get_persisted_properties(self) -> Dict[str, str]:\n    \"\"\"Get persisted properties of table.\n\n    Returns\n    -------\n    Dict[str, str]\n        Persisted properties as a dictionary.\n    \"\"\"\n    persisted_properties = {}\n    raw_options = self.spark.sql(f\"SHOW TBLPROPERTIES {self.table_name}\").collect()\n\n    for ro in raw_options:\n        key, value = ro.asDict().values()\n        persisted_properties[key] = value\n\n    return persisted_properties\n
"},{"location":"api_reference/spark/etl_task.html","title":"Etl task","text":"

ETL Task

Extract -> Transform -> Load

"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask","title":"koheesio.spark.etl_task.EtlTask","text":"

ETL Task

Etl stands for: Extract -> Transform -> Load

This task is a composition of a Reader (extract), a series of Transformations (transform) and a Writer (load). In other words, it reads data from a source, applies a series of transformations, and writes the result to a target.

Parameters:

Name Type Description Default name str

Name of the task

required description str

Description of the task

required source Reader

Source to read from [extract]

required transformations list[Transformation]

Series of transformations [transform]. The order of the transformations is important!

required target Writer

Target to write to [load]

required Example
from koheesio.tasks import EtlTask\n\nfrom koheesio.steps.readers import CsvReader\nfrom koheesio.steps.transformations.repartition import Repartition\nfrom koheesio.steps.writers import CsvWriter\n\netl_task = EtlTask(\n    name=\"My ETL Task\",\n    description=\"This is an example ETL task\",\n    source=CsvReader(path=\"path/to/source.csv\"),\n    transformations=[Repartition(num_partitions=2)],\n    target=DummyWriter(),\n)\n\netl_task.execute()\n

This code will read from a CSV file, repartition the DataFrame to 2 partitions, and write the result to the console.

Extending the EtlTask

The EtlTask is designed to be a simple and flexible way to define ETL processes. It is not designed to be a one-size-fits-all solution, but rather a starting point for building more complex ETL processes. If you need more complex functionality, you can extend the EtlTask class and override the extract, transform and load methods. You can also implement your own execute method to define the entire ETL process from scratch should you need more flexibility.
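As an illustrative sketch, a subclass could override transform to add bookkeeping around the configured transformations:
from koheesio.spark.etl_task import EtlTask

class CountingEtlTask(EtlTask):
    """Illustrative EtlTask that logs the row count after the transform phase."""

    def transform(self, df):
        df = super().transform(df)  # run the configured transformations first
        self.log.info(f"row count after transform: {df.count()}")
        return df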

Advantages of using the EtlTask
  • It is a simple way to define ETL processes
  • It is easy to understand and extend
  • It is easy to test and debug
  • It is easy to maintain and refactor
  • It is easy to integrate with other tools and libraries
  • It is easy to use in a production environment
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.etl_date","title":"etl_date class-attribute instance-attribute","text":"
etl_date: datetime = Field(default=utcnow(), description=\"Date time when this object was created as iso format. Example: '2023-01-24T09:39:23.632374'\")\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.source","title":"source class-attribute instance-attribute","text":"
source: InstanceOf[Reader] = Field(default=..., description='Source to read from [extract]')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.target","title":"target class-attribute instance-attribute","text":"
target: InstanceOf[Writer] = Field(default=..., description='Target to write to [load]')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.transformations","title":"transformations class-attribute instance-attribute","text":"
transformations: conlist(min_length=0, item_type=InstanceOf[Transformation]) = Field(default_factory=list, description='Series of transformations', alias='transforms')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.Output","title":"Output","text":"

Output class for EtlTask

"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.Output.source_df","title":"source_df class-attribute instance-attribute","text":"
source_df: DataFrame = Field(default=..., description='The Spark DataFrame produced by .extract() method')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.Output.target_df","title":"target_df class-attribute instance-attribute","text":"
target_df: DataFrame = Field(default=..., description='The Spark DataFrame used by .load() method')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.Output.transform_df","title":"transform_df class-attribute instance-attribute","text":"
transform_df: DataFrame = Field(default=..., description='The Spark DataFrame produced by .transform() method')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.execute","title":"execute","text":"
execute()\n

Run the ETL process

Source code in src/koheesio/spark/etl_task.py
def execute(self):\n    \"\"\"Run the ETL process\"\"\"\n    self.log.info(f\"Task started at {self.etl_date}\")\n\n    # extract from source\n    self.output.source_df = self.extract()\n\n    # transform\n    self.output.transform_df = self.transform(self.output.source_df)\n\n    # load to target\n    self.output.target_df = self.load(self.output.transform_df)\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.extract","title":"extract","text":"
extract() -> DataFrame\n

Read from Source

logging is handled by the Reader.execute()-method's @do_execute decorator

Source code in src/koheesio/spark/etl_task.py
def extract(self) -> DataFrame:\n    \"\"\"Read from Source\n\n    logging is handled by the Reader.execute()-method's @do_execute decorator\n    \"\"\"\n    reader: Reader = self.source\n    return reader.read()\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.load","title":"load","text":"
load(df: DataFrame) -> DataFrame\n

Write to Target

logging is handled by the Writer.execute()-method's @do_execute decorator

Source code in src/koheesio/spark/etl_task.py
def load(self, df: DataFrame) -> DataFrame:\n    \"\"\"Write to Target\n\n    logging is handled by the Writer.execute()-method's @do_execute decorator\n    \"\"\"\n    writer: Writer = self.target\n    writer.write(df)\n    return df\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.run","title":"run","text":"
run()\n

alias of execute

Source code in src/koheesio/spark/etl_task.py
def run(self):\n    \"\"\"alias of execute\"\"\"\n    return self.execute()\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.transform","title":"transform","text":"
transform(df: DataFrame) -> DataFrame\n

Transform recursively

logging is handled by the Transformation.execute()-method's @do_execute decorator

Source code in src/koheesio/spark/etl_task.py
def transform(self, df: DataFrame) -> DataFrame:\n    \"\"\"Transform recursively\n\n    logging is handled by the Transformation.execute()-method's @do_execute decorator\n    \"\"\"\n    for t in self.transformations:\n        df = t.transform(df)\n    return df\n
"},{"location":"api_reference/spark/snowflake.html","title":"Snowflake","text":"

Snowflake steps and tasks for Koheesio

Every class in this module is a subclass of Step or Task and is used to perform operations on Snowflake.

Notes

Every Step in this module is based on SnowflakeBaseModel. The following parameters are available for every Step.

Parameters:

Name Type Description Default url str

Hostname for the Snowflake account, e.g. <account>.snowflakecomputing.com. Alias for sfURL. required user str

Login name for the Snowflake user. Alias for sfUser.

required password SecretStr

Password for the Snowflake user. Alias for sfPassword.

required database str

The database to use for the session after connecting. Alias for sfDatabase.

required sfSchema str

The schema to use for the session after connecting. Alias for schema (\"schema\" is a reserved name in Pydantic, so we use sfSchema as main name instead).

required role str

The default security role to use for the session after connecting. Alias for sfRole.

required warehouse str

The default virtual warehouse to use for the session after connecting. Alias for sfWarehouse.

required authenticator Optional[str]

Authenticator for the Snowflake user. Example: \"okta.com\".

None options Optional[Dict[str, Any]]

Extra options to pass to the Snowflake connector.

{\"sfCompress\": \"on\", \"continue_on_error\": \"off\"} format str

The default snowflake format can be used natively in Databricks; use net.snowflake.spark.snowflake in other environments and make sure to install the required JARs.

\"snowflake\""},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn","title":"koheesio.spark.snowflake.AddColumn","text":"

Add an empty column to a Snowflake table with given name and DataType

Example
AddColumn(\n    database=\"MY_DB\",\n    schema_=\"MY_SCHEMA\",\n    warehouse=\"MY_WH\",\n    user=\"gid.account@nike.com\",\n    password=Secret(\"super-secret-password\"),\n    role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n    table=\"MY_TABLE\",\n    col=\"MY_COL\",\n    dataType=StringType(),\n).execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.column","title":"column class-attribute instance-attribute","text":"
column: str = Field(default=..., description='The name of the new column')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.table","title":"table class-attribute instance-attribute","text":"
table: str = Field(default=..., description='The name of the Snowflake table')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.type","title":"type class-attribute instance-attribute","text":"
type: DataType = Field(default=..., description='The DataType represented as a Spark DataType')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.Output","title":"Output","text":"

Output class for AddColumn

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.Output.query","title":"query class-attribute instance-attribute","text":"
query: str = Field(default=..., description='Query that was executed to add the column')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n    query = f\"ALTER TABLE {self.table} ADD COLUMN {self.column} {map_spark_type(self.type)}\".upper()\n    self.output.query = query\n    RunQuery(**self.get_options(), query=query).execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame","title":"koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame","text":"

Create (or Replace) a Snowflake table which has the same schema as a Spark DataFrame

Can be used as any Transformation. The DataFrame is however left unchanged, and only used for determining the schema of the Snowflake Table that is to be created (or replaced).

Example
CreateOrReplaceTableFromDataFrame(\n    database=\"MY_DB\",\n    schema=\"MY_SCHEMA\",\n    warehouse=\"MY_WH\",\n    user=\"gid.account@nike.com\",\n    password=\"super-secret-password\",\n    role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n    table=\"MY_TABLE\",\n    df=df,\n).execute()\n

Or, as a Transformation:

CreateOrReplaceTableFromDataFrame(\n    ...\n    table=\"MY_TABLE\",\n).transform(df)\n

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.table","title":"table class-attribute instance-attribute","text":"
table: str = Field(default=..., alias='table_name', description='The name of the (new) table')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.Output","title":"Output","text":"

Output class for CreateOrReplaceTableFromDataFrame

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.Output.input_schema","title":"input_schema class-attribute instance-attribute","text":"
input_schema: StructType = Field(default=..., description='The original schema from the input DataFrame')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.Output.query","title":"query class-attribute instance-attribute","text":"
query: str = Field(default=..., description='Query that was executed to create the table')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.Output.snowflake_schema","title":"snowflake_schema class-attribute instance-attribute","text":"
snowflake_schema: str = Field(default=..., description='Derived Snowflake table schema based on the input DataFrame')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n    self.output.df = self.df\n\n    input_schema = self.df.schema\n    self.output.input_schema = input_schema\n\n    snowflake_schema = \", \".join([f\"{c.name} {map_spark_type(c.dataType)}\" for c in input_schema])\n    self.output.snowflake_schema = snowflake_schema\n\n    table_name = f\"{self.database}.{self.sfSchema}.{self.table}\"\n    query = f\"CREATE OR REPLACE TABLE {table_name} ({snowflake_schema})\"\n    self.output.query = query\n\n    RunQuery(**self.get_options(), query=query).execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.DbTableQuery","title":"koheesio.spark.snowflake.DbTableQuery","text":"

Read table from Snowflake using the dbtable option instead of query

Example
DbTableQuery(\n    database=\"MY_DB\",\n    schema_=\"MY_SCHEMA\",\n    warehouse=\"MY_WH\",\n    user=\"user\",\n    password=Secret(\"super-secret-password\"),\n    role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n    table=\"db.schema.table\",\n).execute().df\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.DbTableQuery.dbtable","title":"dbtable class-attribute instance-attribute","text":"
dbtable: str = Field(default=..., alias='table', description='The name of the table')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GetTableSchema","title":"koheesio.spark.snowflake.GetTableSchema","text":"

Get the schema from a Snowflake table as a Spark Schema

Notes
  • This Step will execute a SELECT * FROM <table> LIMIT 1 query to get the schema of the table.
  • The schema will be stored in the table_schema attribute of the output.
  • table_schema is used as the attribute name to avoid conflicts with the schema attribute of Pydantic's BaseModel.
Example
schema = (\n    GetTableSchema(\n        database=\"MY_DB\",\n        schema_=\"MY_SCHEMA\",\n        warehouse=\"MY_WH\",\n        user=\"gid.account@nike.com\",\n        password=\"super-secret-password\",\n        role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n        table=\"MY_TABLE\",\n    )\n    .execute()\n    .table_schema\n)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GetTableSchema.table","title":"table class-attribute instance-attribute","text":"
table: str = Field(default=..., description='The Snowflake table name')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GetTableSchema.Output","title":"Output","text":"

Output class for GetTableSchema

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GetTableSchema.Output.table_schema","title":"table_schema class-attribute instance-attribute","text":"
table_schema: StructType = Field(default=..., serialization_alias='schema', description='The Spark Schema')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GetTableSchema.execute","title":"execute","text":"
execute() -> Output\n
Source code in src/koheesio/spark/snowflake.py
def execute(self) -> Output:\n    query = f\"SELECT * FROM {self.table} LIMIT 1\"  # nosec B608: hardcoded_sql_expressions\n    df = Query(**self.get_options(), query=query).execute().df\n    self.output.table_schema = df.schema\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnFullyQualifiedObject","title":"koheesio.spark.snowflake.GrantPrivilegesOnFullyQualifiedObject","text":"

Grant Snowflake privileges to a set of roles on a fully qualified object, i.e. database.schema.object_name

This class is a subclass of GrantPrivilegesOnObject and is used to grant privileges on a fully qualified object. The advantage of using this class is that it sets the object name to be fully qualified, i.e. database.schema.object_name.

This means you can set the database, schema and object separately, and the object name will automatically be set to the fully qualified database.schema.object_name.

Example
GrantPrivilegesOnFullyQualifiedObject(\n    database=\"MY_DB\",\n    schema=\"MY_SCHEMA\",\n    warehouse=\"MY_WH\",\n    ...\n    object=\"MY_TABLE\",\n    type=\"TABLE\",\n    ...\n)\n

In this example, the object name will be set to be fully qualified, i.e. MY_DB.MY_SCHEMA.MY_TABLE. If you were to use GrantPrivilegesOnObject instead, you would have to set the object name to be fully qualified yourself.

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnFullyQualifiedObject.set_object_name","title":"set_object_name","text":"
set_object_name()\n

Set the object name to be fully qualified, i.e. database.schema.object_name

Source code in src/koheesio/spark/snowflake.py
@model_validator(mode=\"after\")\ndef set_object_name(self):\n    \"\"\"Set the object name to be fully qualified, i.e. database.schema.object_name\"\"\"\n    # database, schema, obj_name\n    db = self.database\n    schema = self.model_dump()[\"sfSchema\"]  # since \"schema\" is a reserved name\n    obj_name = self.object\n\n    self.object = f\"{db}.{schema}.{obj_name}\"\n\n    return self\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject","title":"koheesio.spark.snowflake.GrantPrivilegesOnObject","text":"

A wrapper on Snowflake GRANT privileges

With this Step, you can grant Snowflake privileges to a set of roles on a table, a view, or an object

See Also

https://docs.snowflake.com/en/sql-reference/sql/grant-privilege.html

Parameters:

Name Type Description Default warehouse str

The name of the warehouse. Alias for sfWarehouse

required user str

The username. Alias for sfUser

required password SecretStr

The password. Alias for sfPassword

required role str

The role name

required object str

The name of the object to grant privileges on

required type str

The type of object to grant privileges on, e.g. TABLE, VIEW

required privileges Union[conlist(str, min_length=1), str]

The Privilege/Permission or list of Privileges/Permissions to grant on the given object.

required roles Union[conlist(str, min_length=1), str]

The Role or list of Roles to grant the privileges to

required Example
GrantPrivilegesOnTable(\n    object=\"MY_TABLE\",\n    type=\"TABLE\",\n    warehouse=\"MY_WH\",\n    user=\"gid.account@nike.com\",\n    password=Secret(\"super-secret-password\"),\n    role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n    permissions=[\"SELECT\", \"INSERT\"],\n).execute()\n

In this example, the APPLICATION.SNOWFLAKE.ADMIN role will be granted SELECT and INSERT privileges on the MY_TABLE table using the MY_WH warehouse.

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.object","title":"object class-attribute instance-attribute","text":"
object: str = Field(default=..., description='The name of the object to grant privileges on')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.privileges","title":"privileges class-attribute instance-attribute","text":"
privileges: Union[conlist(str, min_length=1), str] = Field(default=..., alias='permissions', description='The Privilege/Permission or list of Privileges/Permissions to grant on the given object. See https://docs.snowflake.com/en/sql-reference/sql/grant-privilege.html')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.roles","title":"roles class-attribute instance-attribute","text":"
roles: Union[conlist(str, min_length=1), str] = Field(default=..., alias='role', validation_alias='roles', description='The Role or list of Roles to grant the privileges to')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.type","title":"type class-attribute instance-attribute","text":"
type: str = Field(default=..., description='The type of object to grant privileges on, e.g. TABLE, VIEW')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.Output","title":"Output","text":"

Output class for GrantPrivilegesOnObject

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.Output.query","title":"query class-attribute instance-attribute","text":"
query: conlist(str, min_length=1) = Field(default=..., description='Query that was executed to grant privileges', validate_default=False)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n    self.output.query = []\n    roles = self.roles\n\n    for role in roles:\n        query = self.get_query(role)\n        self.output.query.append(query)\n        RunQuery(**self.get_options(), query=query).execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.get_query","title":"get_query","text":"
get_query(role: str)\n

Build the GRANT query

Parameters:

Name Type Description Default role str

The role name

required

Returns:

Name Type Description query str

The Query that performs the grant

Source code in src/koheesio/spark/snowflake.py
def get_query(self, role: str):\n    \"\"\"Build the GRANT query\n\n    Parameters\n    ----------\n    role: str\n        The role name\n\n    Returns\n    -------\n    query : str\n        The Query that performs the grant\n    \"\"\"\n    query = f\"GRANT {','.join(self.privileges)} ON {self.type} {self.object} TO ROLE {role}\".upper()\n    return query\n
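A worked example of the string it builds (values are made up); the method joins the privileges with a comma and uppercases the whole statement:
# given privileges=['SELECT', 'INSERT'], type='TABLE', object='MY_TABLE' and role 'my_role':
query = f"GRANT {','.join(['SELECT', 'INSERT'])} ON TABLE MY_TABLE TO ROLE my_role".upper()
print(query)  # GRANT SELECT,INSERT ON TABLE MY_TABLE TO ROLE MY_ROLE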
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.set_roles_privileges","title":"set_roles_privileges","text":"
set_roles_privileges(values)\n

Coerce roles and privileges to be lists if they are not already.

Source code in src/koheesio/spark/snowflake.py
@model_validator(mode=\"before\")\ndef set_roles_privileges(cls, values):\n    \"\"\"Coerce roles and privileges to be lists if they are not already.\"\"\"\n    roles_value = values.get(\"roles\") or values.get(\"role\")\n    privileges_value = values.get(\"privileges\")\n\n    if not (roles_value and privileges_value):\n        raise ValueError(\"You have to specify roles AND privileges when using 'GrantPrivilegesOnObject'.\")\n\n    # coerce values to be lists\n    values[\"roles\"] = [roles_value] if isinstance(roles_value, str) else roles_value\n    values[\"role\"] = values[\"roles\"][0]  # hack to keep the validator happy\n    values[\"privileges\"] = [privileges_value] if isinstance(privileges_value, str) else privileges_value\n\n    return values\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.validate_object_and_object_type","title":"validate_object_and_object_type","text":"
validate_object_and_object_type()\n

Validate that the object and type are set.

Source code in src/koheesio/spark/snowflake.py
@model_validator(mode=\"after\")\ndef validate_object_and_object_type(self):\n    \"\"\"Validate that the object and type are set.\"\"\"\n    object_value = self.object\n    if not object_value:\n        raise ValueError(\"You must provide an `object`, this should be the name of the object. \")\n\n    object_type = self.type\n    if not object_type:\n        raise ValueError(\n            \"You must provide a `type`, e.g. TABLE, VIEW, DATABASE. \"\n            \"See https://docs.snowflake.com/en/sql-reference/sql/grant-privilege.html\"\n        )\n\n    return self\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnTable","title":"koheesio.spark.snowflake.GrantPrivilegesOnTable","text":"

Grant Snowflake privileges to a set of roles on a table

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnTable.object","title":"object class-attribute instance-attribute","text":"
object: str = Field(default=..., alias='table', description='The name of the Table to grant Privileges on. This should be just the name of the table; so without Database and Schema, use sfDatabase/database and sfSchema/schema to set those instead.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnTable.type","title":"type class-attribute instance-attribute","text":"
type: str = 'TABLE'\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnView","title":"koheesio.spark.snowflake.GrantPrivilegesOnView","text":"

Grant Snowflake privileges to a set of roles on a view

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnView.object","title":"object class-attribute instance-attribute","text":"
object: str = Field(default=..., alias='view', description='The name of the View to grant Privileges on. This should be just the name of the view; so without Database and Schema, use sfDatabase/database and sfSchema/schema to set those instead.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnView.type","title":"type class-attribute instance-attribute","text":"
type: str = 'VIEW'\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.Query","title":"koheesio.spark.snowflake.Query","text":"

Query data from Snowflake and return the result as a DataFrame

Example
Query(\n    database=\"MY_DB\",\n    schema_=\"MY_SCHEMA\",\n    warehouse=\"MY_WH\",\n    user=\"gid.account@nike.com\",\n    password=Secret(\"super-secret-password\"),\n    role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n    query=\"SELECT * FROM MY_TABLE\",\n).execute().df\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.Query.query","title":"query class-attribute instance-attribute","text":"
query: str = Field(default=..., description='The query to run')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.Query.get_options","title":"get_options","text":"
get_options()\n

add query to options

Source code in src/koheesio/spark/snowflake.py
def get_options(self):\n    \"\"\"add query to options\"\"\"\n    options = super().get_options()\n    options[\"query\"] = self.query\n    return options\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.Query.validate_query","title":"validate_query","text":"
validate_query(query)\n

Replace escape characters

Source code in src/koheesio/spark/snowflake.py
@field_validator(\"query\")\ndef validate_query(cls, query):\n    \"\"\"Replace escape characters\"\"\"\n    query = query.replace(\"\\\\n\", \"\\n\").replace(\"\\\\t\", \"\\t\").strip()\n    return query\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.RunQuery","title":"koheesio.spark.snowflake.RunQuery","text":"

Run a query on Snowflake that does not return a result, e.g. create table statement

This is a wrapper around 'net.snowflake.spark.snowflake.Utils.runQuery' on the JVM

Example
RunQuery(\n    database=\"MY_DB\",\n    schema=\"MY_SCHEMA\",\n    warehouse=\"MY_WH\",\n    user=\"account\",\n    password=\"***\",\n    role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n    query=\"CREATE TABLE test (col1 string)\",\n).execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.RunQuery.query","title":"query class-attribute instance-attribute","text":"
query: str = Field(default=..., description='The query to run', alias='sql')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.RunQuery.execute","title":"execute","text":"
execute() -> None\n
Source code in src/koheesio/spark/snowflake.py
def execute(self) -> None:\n    if not self.query:\n        self.log.warning(\"Empty string given as query input, skipping execution\")\n        return\n    # noinspection PyProtectedMember\n    self.spark._jvm.net.snowflake.spark.snowflake.Utils.runQuery(self.get_options(), self.query)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.RunQuery.get_options","title":"get_options","text":"
get_options()\n
Source code in src/koheesio/spark/snowflake.py
def get_options(self):\n    # Executing the RunQuery without `host` option in Databricks throws:\n    # An error occurred while calling z:net.snowflake.spark.snowflake.Utils.runQuery.\n    # : java.util.NoSuchElementException: key not found: host\n    options = super().get_options()\n    options[\"host\"] = options[\"sfURL\"]\n    return options\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.RunQuery.validate_query","title":"validate_query","text":"
validate_query(query)\n

Replace escape characters

Source code in src/koheesio/spark/snowflake.py
@field_validator(\"query\")\ndef validate_query(cls, query):\n    \"\"\"Replace escape characters\"\"\"\n    return query.replace(\"\\\\n\", \"\\n\").replace(\"\\\\t\", \"\\t\").strip()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel","title":"koheesio.spark.snowflake.SnowflakeBaseModel","text":"

BaseModel for setting up Snowflake Driver options.

Notes
  • Snowflake is supported natively in Databricks 4.2 and newer: https://docs.snowflake.com/en/user-guide/spark-connector-databricks
  • Refer to Snowflake docs for the installation instructions for non-Databricks environments: https://docs.snowflake.com/en/user-guide/spark-connector-install
  • Refer to Snowflake docs for connection options: https://docs.snowflake.com/en/user-guide/spark-connector-use#setting-configuration-options-for-the-connector

Parameters:

Name Type Description Default url str

Hostname for the Snowflake account, e.g. <account>.snowflakecomputing.com. Alias for sfURL. required user str

Login name for the Snowflake user. Alias for sfUser.

required password SecretStr

Password for the Snowflake user. Alias for sfPassword.

required database str

The database to use for the session after connecting. Alias for sfDatabase.

required sfSchema str

The schema to use for the session after connecting. Alias for schema (\"schema\" is a reserved name in Pydantic, so we use sfSchema as main name instead).

required role str

The default security role to use for the session after connecting. Alias for sfRole.

required warehouse str

The default virtual warehouse to use for the session after connecting. Alias for sfWarehouse.

required authenticator Optional[str]

Authenticator for the Snowflake user. Example: \"okta.com\".

None options Optional[Dict[str, Any]]

Extra options to pass to the Snowflake connector.

{\"sfCompress\": \"on\", \"continue_on_error\": \"off\"} format str

The default snowflake format can be used natively in Databricks; use net.snowflake.spark.snowflake in other environments and make sure to install the required JARs.

\"snowflake\""},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.authenticator","title":"authenticator class-attribute instance-attribute","text":"
authenticator: Optional[str] = Field(default=None, description='Authenticator for the Snowflake user', examples=['okta.com'])\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.database","title":"database class-attribute instance-attribute","text":"
database: str = Field(default=..., alias='sfDatabase', description='The database to use for the session after connecting')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.format","title":"format class-attribute instance-attribute","text":"
format: str = Field(default='snowflake', description='The default `snowflake` format can be used natively in Databricks, use `net.snowflake.spark.snowflake` in other environments and make sure to install required JARs.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.options","title":"options class-attribute instance-attribute","text":"
options: Optional[Dict[str, Any]] = Field(default={'sfCompress': 'on', 'continue_on_error': 'off'}, description='Extra options to pass to the Snowflake connector')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.password","title":"password class-attribute instance-attribute","text":"
password: SecretStr = Field(default=..., alias='sfPassword', description='Password for the Snowflake user')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.role","title":"role class-attribute instance-attribute","text":"
role: str = Field(default=..., alias='sfRole', description='The default security role to use for the session after connecting')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.sfSchema","title":"sfSchema class-attribute instance-attribute","text":"
sfSchema: str = Field(default=..., alias='schema', description='The schema to use for the session after connecting')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.url","title":"url class-attribute instance-attribute","text":"
url: str = Field(default=..., alias='sfURL', description='Hostname for the Snowflake account, e.g. <account>.snowflakecomputing.com', examples=['example.snowflakecomputing.com'])\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.user","title":"user class-attribute instance-attribute","text":"
user: str = Field(default=..., alias='sfUser', description='Login name for the Snowflake user')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.warehouse","title":"warehouse class-attribute instance-attribute","text":"
warehouse: str = Field(default=..., alias='sfWarehouse', description='The default virtual warehouse to use for the session after connecting')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.get_options","title":"get_options","text":"
get_options()\n

Get the sfOptions as a dictionary.

Source code in src/koheesio/spark/snowflake.py
def get_options(self):\n    \"\"\"Get the sfOptions as a dictionary.\"\"\"\n    return {\n        key: value\n        for key, value in {\n            \"sfURL\": self.url,\n            \"sfUser\": self.user,\n            \"sfPassword\": self.password.get_secret_value(),\n            \"authenticator\": self.authenticator,\n            \"sfDatabase\": self.database,\n            \"sfSchema\": self.sfSchema,\n            \"sfRole\": self.role,\n            \"sfWarehouse\": self.warehouse,\n            **self.options,\n        }.items()\n        if value is not None\n    }\n
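For illustration, the resulting dictionary has roughly this shape (values are placeholders; None values are dropped and the extra options are merged in):
{
    "sfURL": "example.snowflakecomputing.com",
    "sfUser": "YOUR_USERNAME",
    "sfPassword": "***",
    "sfDatabase": "db",
    "sfSchema": "schema",
    "sfRole": "MY_ROLE",
    "sfWarehouse": "MY_WH",
    "sfCompress": "on",
    "continue_on_error": "off",
}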
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeReader","title":"koheesio.spark.snowflake.SnowflakeReader","text":"

Wrapper around JdbcReader for Snowflake.

Example
sr = SnowflakeReader(\n    url=\"foo.snowflakecomputing.com\",\n    user=\"YOUR_USERNAME\",\n    password=\"***\",\n    database=\"db\",\n    schema=\"schema\",\n)\ndf = sr.read()\n
Notes
  • Snowflake is supported natively in Databricks 4.2 and newer: https://docs.snowflake.com/en/user-guide/spark-connector-databricks
  • Refer to Snowflake docs for the installation instructions for non-Databricks environments: https://docs.snowflake.com/en/user-guide/spark-connector-install
  • Refer to Snowflake docs for connection options: https://docs.snowflake.com/en/user-guide/spark-connector-use#setting-configuration-options-for-the-connector
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeReader.driver","title":"driver class-attribute instance-attribute","text":"
driver: Optional[str] = None\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeStep","title":"koheesio.spark.snowflake.SnowflakeStep","text":"

Expands the SnowflakeBaseModel so that it can be used as a Step

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeTableStep","title":"koheesio.spark.snowflake.SnowflakeTableStep","text":"

Expands the SnowflakeStep, adding a 'table' parameter

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeTableStep.table","title":"table class-attribute instance-attribute","text":"
table: str = Field(default=..., description='The name of the table')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeTableStep.get_options","title":"get_options","text":"
get_options()\n
Source code in src/koheesio/spark/snowflake.py
def get_options(self):\n    options = super().get_options()\n    options[\"table\"] = self.table\n    return options\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeTransformation","title":"koheesio.spark.snowflake.SnowflakeTransformation","text":"

Adds Snowflake parameters to the Transformation class

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeWriter","title":"koheesio.spark.snowflake.SnowflakeWriter","text":"

Class for writing to Snowflake

See Also
  • koheesio.steps.writers.Writer
  • koheesio.steps.writers.BatchOutputMode
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeWriter.insert_type","title":"insert_type class-attribute instance-attribute","text":"
insert_type: Optional[BatchOutputMode] = Field(APPEND, alias='mode', description='The insertion type, append or overwrite')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeWriter.table","title":"table class-attribute instance-attribute","text":"
table: str = Field(default=..., description='Target table name')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeWriter.execute","title":"execute","text":"
execute()\n

Write to Snowflake

Source code in src/koheesio/spark/snowflake.py
def execute(self):\n    \"\"\"Write to Snowflake\"\"\"\n    self.log.debug(f\"writing to {self.table} with mode {self.insert_type}\")\n    self.df.write.format(self.format).options(**self.get_options()).option(\"dbtable\", self.table).mode(\n        self.insert_type\n    ).save()\n
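A usage sketch (connection values are placeholders; df is assumed to be an existing Spark DataFrame):
from koheesio.spark.snowflake import SnowflakeWriter

SnowflakeWriter(
    url="example.snowflakecomputing.com",
    user="YOUR_USERNAME",
    password="***",
    database="MY_DB",
    schema="MY_SCHEMA",   # alias for sfSchema
    role="MY_ROLE",
    warehouse="MY_WH",
    table="MY_TABLE",
    df=df,                # the DataFrame to write; insert_type defaults to append
).execute()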
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema","title":"koheesio.spark.snowflake.SyncTableAndDataFrameSchema","text":"

Sync the schemas of a Snowflake table and a DataFrame. This will add NULL columns for the columns that are not present in both and perform type casts where needed.

The Snowflake table will take priority in case of type conflicts.
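For illustration, a possible invocation looks like this (connection values are placeholders; df is assumed to exist already):
from koheesio.spark.snowflake import SyncTableAndDataFrameSchema

synced_df = SyncTableAndDataFrameSchema(
    url="example.snowflakecomputing.com",
    user="YOUR_USERNAME",
    password="***",
    database="MY_DB",
    schema="MY_SCHEMA",
    role="MY_ROLE",
    warehouse="MY_WH",
    table="MY_TABLE",
    df=df,          # existing Spark DataFrame
    dry_run=False,  # True only reports differences without altering anything
).execute().df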

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.df","title":"df class-attribute instance-attribute","text":"
df: DataFrame = Field(default=..., description='The Spark DataFrame')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.dry_run","title":"dry_run class-attribute instance-attribute","text":"
dry_run: Optional[bool] = Field(default=False, description='Only show schema differences, do not apply changes')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.table","title":"table class-attribute instance-attribute","text":"
table: str = Field(default=..., description='The table name')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output","title":"Output","text":"

Output class for SyncTableAndDataFrameSchema

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output.new_df_schema","title":"new_df_schema class-attribute instance-attribute","text":"
new_df_schema: StructType = Field(default=..., description='New DataFrame schema')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output.new_sf_schema","title":"new_sf_schema class-attribute instance-attribute","text":"
new_sf_schema: StructType = Field(default=..., description='New Snowflake schema')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output.original_df_schema","title":"original_df_schema class-attribute instance-attribute","text":"
original_df_schema: StructType = Field(default=..., description='Original DataFrame schema')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output.original_sf_schema","title":"original_sf_schema class-attribute instance-attribute","text":"
original_sf_schema: StructType = Field(default=..., description='Original Snowflake schema')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output.sf_table_altered","title":"sf_table_altered class-attribute instance-attribute","text":"
sf_table_altered: bool = Field(default=False, description='Flag to indicate whether Snowflake schema has been altered')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n    self.log.warning(\"Snowflake table will always take a priority in case of data type conflicts!\")\n\n    # spark side\n    df_schema = self.df.schema\n    self.output.original_df_schema = deepcopy(df_schema)  # using deepcopy to avoid storing in place changes\n    df_cols = [c.name.lower() for c in df_schema]\n\n    # snowflake side\n    sf_schema = GetTableSchema(**self.get_options(), table=self.table).execute().table_schema\n    self.output.original_sf_schema = sf_schema\n    sf_cols = [c.name.lower() for c in sf_schema]\n\n    if self.dry_run:\n        # Display differences between Spark DataFrame and Snowflake schemas\n        # and provide dummy values that are expected as class outputs.\n        self.log.warning(f\"Columns to be added to Snowflake table: {set(df_cols) - set(sf_cols)}\")\n        self.log.warning(f\"Columns to be added to Spark DataFrame: {set(sf_cols) - set(df_cols)}\")\n\n        self.output.new_df_schema = t.StructType()\n        self.output.new_sf_schema = t.StructType()\n        self.output.df = self.df\n        self.output.sf_table_altered = False\n\n    else:\n        # Add columns to SnowFlake table that exist in DataFrame\n        for df_column in df_schema:\n            if df_column.name.lower() not in sf_cols:\n                AddColumn(\n                    **self.get_options(),\n                    table=self.table,\n                    column=df_column.name,\n                    type=df_column.dataType,\n                ).execute()\n                self.output.sf_table_altered = True\n\n        if self.output.sf_table_altered:\n            sf_schema = GetTableSchema(**self.get_options(), table=self.table).execute().table_schema\n            sf_cols = [c.name.lower() for c in sf_schema]\n\n        self.output.new_sf_schema = sf_schema\n\n        # Add NULL columns to the DataFrame if they exist in SnowFlake but not in the df\n        df = self.df\n        for sf_col in self.output.original_sf_schema:\n            sf_col_name = sf_col.name.lower()\n            if sf_col_name not in df_cols:\n                sf_col_type = sf_col.dataType\n                df = df.withColumn(sf_col_name, f.lit(None).cast(sf_col_type))\n\n        # Put DataFrame columns in the same order as the Snowflake table\n        df = df.select(*sf_cols)\n\n        self.output.df = df\n        self.output.new_df_schema = df.schema\n
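
A minimal usage sketch (not from the source): the connection values are placeholders in the style of the other examples on this page, and df is assumed to be an existing Spark DataFrame. execute() returns the Output described above.

synced = SyncTableAndDataFrameSchema(\n    url=\"acme.snowflakecomputing.com\",\n    user=\"admin\",\n    password=\"***\",\n    role=\"ADMIN\",\n    warehouse=\"SF_WAREHOUSE\",\n    database=\"SF_DATABASE\",\n    schema=\"SF_SCHEMA\",\n    table=\"my_sf_table\",\n    df=df,\n    dry_run=True,  # only log schema differences, do not alter anything\n).execute()\nsynced.df  # DataFrame aligned with the Snowflake table schema\n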
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask","title":"koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask","text":"

Synchronize a Delta table to a Snowflake table

  • Overwrite - only in batch mode
  • Append - supports batch and streaming mode
  • Merge - only in streaming mode
Example
SynchronizeDeltaToSnowflakeTask(\n    url=\"acme.snowflakecomputing.com\",\n    user=\"admin\",\n    role=\"ADMIN\",\n    warehouse=\"SF_WAREHOUSE\",\n    database=\"SF_DATABASE\",\n    schema=\"SF_SCHEMA\",\n    source_table=DeltaTableStep(...),\n    target_table=\"my_sf_table\",\n    key_columns=[\n        \"id\",\n    ],\n    streaming=False,\n).run()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.checkpoint_location","title":"checkpoint_location class-attribute instance-attribute","text":"
checkpoint_location: Optional[str] = Field(default=None, description='Checkpoint location to use')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.enable_deletion","title":"enable_deletion class-attribute instance-attribute","text":"
enable_deletion: Optional[bool] = Field(default=False, description='In case of merge synchronisation_mode add deletion statement in merge query.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.key_columns","title":"key_columns class-attribute instance-attribute","text":"
key_columns: Optional[List[str]] = Field(default_factory=list, description='Key columns on which the MERGE statement will be applied.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.non_key_columns","title":"non_key_columns property","text":"
non_key_columns: List[str]\n

Columns of the source table that aren't part of the (composite) primary key

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.persist_staging","title":"persist_staging class-attribute instance-attribute","text":"
persist_staging: Optional[bool] = Field(default=False, description='In case of debugging, set `persist_staging` to True to retain the staging table for inspection after synchronization.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.reader","title":"reader property","text":"
reader\n

DeltaTable reader

Returns:
DeltaTableReader that will yield the source delta table\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.schema_tracking_location","title":"schema_tracking_location class-attribute instance-attribute","text":"
schema_tracking_location: Optional[str] = Field(default=None, description='Schema tracking location to use. Info: https://docs.delta.io/latest/delta-streaming.html#-schema-tracking')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.source_table","title":"source_table class-attribute instance-attribute","text":"
source_table: DeltaTableStep = Field(default=..., description='Source delta table to synchronize')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.staging_table","title":"staging_table property","text":"
staging_table\n

Intermediate table on snowflake where staging results are stored

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.staging_table_name","title":"staging_table_name class-attribute instance-attribute","text":"
staging_table_name: Optional[str] = Field(default=None, alias='staging_table', description='Optional snowflake staging name', validate_default=False)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.streaming","title":"streaming class-attribute instance-attribute","text":"
streaming: Optional[bool] = Field(default=False, description=\"Should synchronisation happen in streaming or in batch mode. Streaming is supported in 'APPEND' and 'MERGE' mode. Batch is supported in 'OVERWRITE' and 'APPEND' mode.\")\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.synchronisation_mode","title":"synchronisation_mode class-attribute instance-attribute","text":"
synchronisation_mode: BatchOutputMode = Field(default=MERGE, description=\"Determines if synchronisation will 'overwrite' any existing table, 'append' new rows or 'merge' with existing rows.\")\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.target_table","title":"target_table class-attribute instance-attribute","text":"
target_table: str = Field(default=..., description='Target table in snowflake to synchronize to')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.writer","title":"writer property","text":"
writer: Union[ForEachBatchStreamWriter, SnowflakeWriter]\n

Writer to persist to snowflake

Depending on the configured options, this returns a SnowflakeWriter or a ForEachBatchStreamWriter: OVERWRITE/APPEND mode yields a SnowflakeWriter, MERGE mode yields a ForEachBatchStreamWriter.

Returns:

Type Description Union[ForEachBatchStreamWriter, SnowflakeWriter]"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.writer_","title":"writer_ class-attribute instance-attribute","text":"
writer_: Optional[Union[ForEachBatchStreamWriter, SnowflakeWriter]] = None\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.drop_table","title":"drop_table","text":"
drop_table(snowflake_table)\n

Drop a given snowflake table

Source code in src/koheesio/spark/snowflake.py
def drop_table(self, snowflake_table):\n    \"\"\"Drop a given snowflake table\"\"\"\n    self.log.warning(f\"Dropping table {snowflake_table} from snowflake\")\n    drop_table_query = f\"\"\"DROP TABLE IF EXISTS {snowflake_table}\"\"\"\n    query_executor = RunQuery(**self.get_options(), query=drop_table_query)\n    query_executor.execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.execute","title":"execute","text":"
execute() -> None\n
Source code in src/koheesio/spark/snowflake.py
def execute(self) -> None:\n    # extract\n    df = self.extract()\n    self.output.source_df = df\n\n    # synchronize\n    self.output.target_df = df\n    self.load(df)\n    if not self.persist_staging:\n        # If it's a streaming job, await for termination before dropping staging table\n        if self.streaming:\n            self.writer.await_termination()\n        self.drop_table(self.staging_table)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.extract","title":"extract","text":"
extract() -> DataFrame\n

Extract source table

Source code in src/koheesio/spark/snowflake.py
def extract(self) -> DataFrame:\n    \"\"\"\n    Extract source table\n    \"\"\"\n    if self.synchronisation_mode == BatchOutputMode.MERGE:\n        if not self.source_table.is_cdf_active:\n            raise RuntimeError(\n                f\"Source table {self.source_table.table_name} does not have CDF enabled. \"\n                f\"Set TBLPROPERTIES ('delta.enableChangeDataFeed' = true) to enable. \"\n                f\"Current properties = {self.source_table_properties}\"\n            )\n\n    df = self.reader.read()\n    self.output.source_df = df\n    return df\n
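
If MERGE mode is selected and CDF is not enabled on the source table, the RuntimeError above is raised. A sketch of enabling CDF, assuming an active SparkSession named spark and a placeholder table name:

spark.sql(\"ALTER TABLE my_schema.my_table SET TBLPROPERTIES ('delta.enableChangeDataFeed' = true)\")\n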
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.load","title":"load","text":"
load(df) -> DataFrame\n

Load source table into snowflake

Source code in src/koheesio/spark/snowflake.py
def load(self, df) -> DataFrame:\n    \"\"\"Load source table into snowflake\"\"\"\n    if self.synchronisation_mode == BatchOutputMode.MERGE:\n        self.log.info(f\"Truncating staging table {self.staging_table}\")\n        self.truncate_table(self.staging_table)\n    self.writer.write(df)\n    self.output.target_df = df\n    return df\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.run","title":"run","text":"
run()\n

alias of execute

Source code in src/koheesio/spark/snowflake.py
def run(self):\n    \"\"\"alias of execute\"\"\"\n    return self.execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.truncate_table","title":"truncate_table","text":"
truncate_table(snowflake_table)\n

Truncate a given snowflake table

Source code in src/koheesio/spark/snowflake.py
def truncate_table(self, snowflake_table):\n    \"\"\"Truncate a given snowflake table\"\"\"\n    truncate_query = f\"\"\"TRUNCATE TABLE IF EXISTS {snowflake_table}\"\"\"\n    query_executor = RunQuery(\n        **self.get_options(),\n        query=truncate_query,\n    )\n    query_executor.execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TableExists","title":"koheesio.spark.snowflake.TableExists","text":"

Check if the table exists in Snowflake by using INFORMATION_SCHEMA.

Example
k = TableExists(\n    url=\"foo.snowflakecomputing.com\",\n    user=\"YOUR_USERNAME\",\n    password=\"***\",\n    database=\"db\",\n    schema=\"schema\",\n    table=\"table\",\n)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TableExists.Output","title":"Output","text":"

Output class for TableExists

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TableExists.Output.exists","title":"exists class-attribute instance-attribute","text":"
exists: bool = Field(default=..., description='Whether or not the table exists')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TableExists.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n    query = (\n        dedent(\n            # Force upper case, due to case-sensitivity of where clause\n            f\"\"\"\n        SELECT *\n        FROM INFORMATION_SCHEMA.TABLES\n        WHERE TABLE_CATALOG     = '{self.database}'\n          AND TABLE_SCHEMA      = '{self.sfSchema}'\n          AND TABLE_TYPE        = 'BASE TABLE'\n          AND upper(TABLE_NAME) = '{self.table.upper()}'\n        \"\"\"  # nosec B608: hardcoded_sql_expressions\n        )\n        .upper()\n        .strip()\n    )\n\n    self.log.debug(f\"Query that was executed to check if the table exists:\\n{query}\")\n\n    df = Query(**self.get_options(), query=query).read()\n\n    exists = df.count() > 0\n    self.log.info(f\"Table {self.table} {'exists' if exists else 'does not exist'}\")\n    self.output.exists = exists\n
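
Following the .execute().&lt;output-field&gt; pattern used elsewhere on this page, the boolean result for the example above can then be read as (a sketch):

table_exists: bool = k.execute().exists\n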
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TagSnowflakeQuery","title":"koheesio.spark.snowflake.TagSnowflakeQuery","text":"

Provides a Snowflake query tag pre-action that can be used to easily find queries through SF history search and further group them for debugging and cost-tracking purposes.

Takes in query tag attributes as kwargs, plus an additional Snowflake options dict that can optionally contain another set of pre-actions to be applied to a query. In that case the existing pre-actions aren't dropped; the query tag pre-action is added to them.

The passed Snowflake options dictionary is not modified in place; instead, a new dictionary containing the updated pre-actions is returned.

Notes

See this article for explanation: https://select.dev/posts/snowflake-query-tags

Arbitrary tags can be applied, such as team, dataset names, business capability, etc.

Example
query_tag = AddQueryTag(\n    options={\"preactions\": ...},\n    task_name=\"cleanse_task\",\n    pipeline_name=\"ingestion-pipeline\",\n    etl_date=\"2022-01-01\",\n    pipeline_execution_time=\"2022-01-01T00:00:00\",\n    task_execution_time=\"2022-01-01T01:00:00\",\n    environment=\"dev\",\n    trace_id=\"e0fdec43-a045-46e5-9705-acd4f3f96045\",\n    span_id=\"cb89abea-1c12-471f-8b12-546d2d66f6cb\",\n).execute().options\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TagSnowflakeQuery.options","title":"options class-attribute instance-attribute","text":"
options: Dict = Field(default_factory=dict, description='Additional Snowflake options, optionally containing additional preactions')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TagSnowflakeQuery.Output","title":"Output","text":"

Output class for AddQueryTag

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TagSnowflakeQuery.Output.options","title":"options class-attribute instance-attribute","text":"
options: Dict = Field(default=..., description='Copy of provided SF options, with added query tag preaction')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TagSnowflakeQuery.execute","title":"execute","text":"
execute()\n

Add query tag preaction to Snowflake options

Source code in src/koheesio/spark/snowflake.py
def execute(self):\n    \"\"\"Add query tag preaction to Snowflake options\"\"\"\n    tag_json = json.dumps(self.extra_params, indent=4, sort_keys=True)\n    tag_preaction = f\"ALTER SESSION SET QUERY_TAG = '{tag_json}';\"\n    preactions = self.options.get(\"preactions\", \"\")\n    preactions = f\"{preactions}\\n{tag_preaction}\".strip()\n    updated_options = dict(self.options)\n    updated_options[\"preactions\"] = preactions\n    self.output.options = updated_options\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.map_spark_type","title":"koheesio.spark.snowflake.map_spark_type","text":"
map_spark_type(spark_type: DataType)\n

Translates a Spark DataFrame schema type to a Snowflake type

Basic Types Snowflake Type StringType STRING NullType STRING BooleanType BOOLEAN Numeric Types Snowflake Type LongType BIGINT IntegerType INT ShortType SMALLINT DoubleType DOUBLE FloatType FLOAT NumericType FLOAT ByteType BINARY Date / Time Types Snowflake Type DateType DATE TimestampType TIMESTAMP Advanced Types Snowflake Type DecimalType DECIMAL MapType VARIANT ArrayType VARIANT StructType VARIANT References
  • Spark SQL DataTypes: https://spark.apache.org/docs/latest/sql-ref-datatypes.html
  • Snowflake DataTypes: https://docs.snowflake.com/en/sql-reference/data-types.html

Parameters:

Name Type Description Default spark_type DataType

DataType taken out of the StructField

required

Returns:

Type Description str

The Snowflake data type

Source code in src/koheesio/spark/snowflake.py
def map_spark_type(spark_type: t.DataType):\n    \"\"\"\n    Translates Spark DataFrame Schema type to SnowFlake type\n\n    | Basic Types       | Snowflake Type |\n    |-------------------|----------------|\n    | StringType        | STRING         |\n    | NullType          | STRING         |\n    | BooleanType       | BOOLEAN        |\n\n    | Numeric Types     | Snowflake Type |\n    |-------------------|----------------|\n    | LongType          | BIGINT         |\n    | IntegerType       | INT            |\n    | ShortType         | SMALLINT       |\n    | DoubleType        | DOUBLE         |\n    | FloatType         | FLOAT          |\n    | NumericType       | FLOAT          |\n    | ByteType          | BINARY         |\n\n    | Date / Time Types | Snowflake Type |\n    |-------------------|----------------|\n    | DateType          | DATE           |\n    | TimestampType     | TIMESTAMP      |\n\n    | Advanced Types    | Snowflake Type |\n    |-------------------|----------------|\n    | DecimalType       | DECIMAL        |\n    | MapType           | VARIANT        |\n    | ArrayType         | VARIANT        |\n    | StructType        | VARIANT        |\n\n    References\n    ----------\n    - Spark SQL DataTypes: https://spark.apache.org/docs/latest/sql-ref-datatypes.html\n    - Snowflake DataTypes: https://docs.snowflake.com/en/sql-reference/data-types.html\n\n    Parameters\n    ----------\n    spark_type : pyspark.sql.types.DataType\n        DataType taken out of the StructField\n\n    Returns\n    -------\n    str\n        The Snowflake data type\n    \"\"\"\n    # StructField means that the entire Field was passed, we need to extract just the dataType before continuing\n    if isinstance(spark_type, t.StructField):\n        spark_type = spark_type.dataType\n\n    # Check if the type is DayTimeIntervalType\n    if isinstance(spark_type, t.DayTimeIntervalType):\n        warn(\n            \"DayTimeIntervalType is being converted to STRING. \"\n            \"Consider converting to a more supported date/time/timestamp type in Snowflake.\"\n        )\n\n    # fmt: off\n    # noinspection PyUnresolvedReferences\n    data_type_map = {\n        # Basic Types\n        t.StringType: \"STRING\",\n        t.NullType: \"STRING\",\n        t.BooleanType: \"BOOLEAN\",\n\n        # Numeric Types\n        t.LongType: \"BIGINT\",\n        t.IntegerType: \"INT\",\n        t.ShortType: \"SMALLINT\",\n        t.DoubleType: \"DOUBLE\",\n        t.FloatType: \"FLOAT\",\n        t.NumericType: \"FLOAT\",\n        t.ByteType: \"BINARY\",\n        t.BinaryType: \"VARBINARY\",\n\n        # Date / Time Types\n        t.DateType: \"DATE\",\n        t.TimestampType: \"TIMESTAMP\",\n        t.DayTimeIntervalType: \"STRING\",\n\n        # Advanced Types\n        t.DecimalType:\n            f\"DECIMAL({spark_type.precision},{spark_type.scale})\"  # pylint: disable=no-member\n            if isinstance(spark_type, t.DecimalType) else \"DECIMAL(38,0)\",\n        t.MapType: \"VARIANT\",\n        t.ArrayType: \"VARIANT\",\n        t.StructType: \"VARIANT\",\n    }\n    return data_type_map.get(type(spark_type), 'STRING')\n
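
A few illustrative calls, following the mapping table and source above (with t being pyspark.sql.types, as in the source):

map_spark_type(t.StringType())  # \"STRING\"\nmap_spark_type(t.LongType())  # \"BIGINT\"\nmap_spark_type(t.DecimalType(38, 2))  # \"DECIMAL(38,2)\"\nmap_spark_type(t.ArrayType(t.StringType()))  # \"VARIANT\"\n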
"},{"location":"api_reference/spark/utils.html","title":"Utils","text":"

Spark Utility functions

"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.spark_minor_version","title":"koheesio.spark.utils.spark_minor_version module-attribute","text":"
spark_minor_version: float = get_spark_minor_version()\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype","title":"koheesio.spark.utils.SparkDatatype","text":"

Allowed spark datatypes

The following table lists the data types that are supported by Spark SQL.

Data type SQL name ByteType BYTE, TINYINT ShortType SHORT, SMALLINT IntegerType INT, INTEGER LongType LONG, BIGINT FloatType FLOAT, REAL DoubleType DOUBLE DecimalType DECIMAL, DEC, NUMERIC StringType STRING BinaryType BINARY BooleanType BOOLEAN TimestampType TIMESTAMP, TIMESTAMP_LTZ DateType DATE ArrayType ARRAY MapType MAP NullType VOID Not supported yet
  • TimestampNTZType TIMESTAMP_NTZ
  • YearMonthIntervalType INTERVAL YEAR, INTERVAL YEAR TO MONTH, INTERVAL MONTH
  • DayTimeIntervalType INTERVAL DAY, INTERVAL DAY TO HOUR, INTERVAL DAY TO MINUTE, INTERVAL DAY TO SECOND, INTERVAL HOUR, INTERVAL HOUR TO MINUTE, INTERVAL HOUR TO SECOND, INTERVAL MINUTE, INTERVAL MINUTE TO SECOND, INTERVAL SECOND
See Also

https://spark.apache.org/docs/latest/sql-ref-datatypes.html#supported-data-types

"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.ARRAY","title":"ARRAY class-attribute instance-attribute","text":"
ARRAY = 'array'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.BIGINT","title":"BIGINT class-attribute instance-attribute","text":"
BIGINT = 'long'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.BINARY","title":"BINARY class-attribute instance-attribute","text":"
BINARY = 'binary'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.BOOLEAN","title":"BOOLEAN class-attribute instance-attribute","text":"
BOOLEAN = 'boolean'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.BYTE","title":"BYTE class-attribute instance-attribute","text":"
BYTE = 'byte'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.DATE","title":"DATE class-attribute instance-attribute","text":"
DATE = 'date'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.DEC","title":"DEC class-attribute instance-attribute","text":"
DEC = 'decimal'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.DECIMAL","title":"DECIMAL class-attribute instance-attribute","text":"
DECIMAL = 'decimal'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.DOUBLE","title":"DOUBLE class-attribute instance-attribute","text":"
DOUBLE = 'double'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.FLOAT","title":"FLOAT class-attribute instance-attribute","text":"
FLOAT = 'float'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.INT","title":"INT class-attribute instance-attribute","text":"
INT = 'integer'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.INTEGER","title":"INTEGER class-attribute instance-attribute","text":"
INTEGER = 'integer'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.LONG","title":"LONG class-attribute instance-attribute","text":"
LONG = 'long'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.MAP","title":"MAP class-attribute instance-attribute","text":"
MAP = 'map'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.NUMERIC","title":"NUMERIC class-attribute instance-attribute","text":"
NUMERIC = 'decimal'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.REAL","title":"REAL class-attribute instance-attribute","text":"
REAL = 'float'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.SHORT","title":"SHORT class-attribute instance-attribute","text":"
SHORT = 'short'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.SMALLINT","title":"SMALLINT class-attribute instance-attribute","text":"
SMALLINT = 'short'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.STRING","title":"STRING class-attribute instance-attribute","text":"
STRING = 'string'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.TIMESTAMP","title":"TIMESTAMP class-attribute instance-attribute","text":"
TIMESTAMP = 'timestamp'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.TIMESTAMP_LTZ","title":"TIMESTAMP_LTZ class-attribute instance-attribute","text":"
TIMESTAMP_LTZ = 'timestamp'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.TINYINT","title":"TINYINT class-attribute instance-attribute","text":"
TINYINT = 'byte'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.VOID","title":"VOID class-attribute instance-attribute","text":"
VOID = 'void'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.spark_type","title":"spark_type property","text":"
spark_type: DataType\n

Returns the spark type for the given enum value

"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.from_string","title":"from_string classmethod","text":"
from_string(value: str) -> SparkDatatype\n

Allows for getting the right Enum value by simply passing a string value. This method is not case-sensitive.

Source code in src/koheesio/spark/utils.py
@classmethod\ndef from_string(cls, value: str) -> \"SparkDatatype\":\n    \"\"\"Allows for getting the right Enum value by simply passing a string value\n    This method is not case-sensitive\n    \"\"\"\n    return getattr(cls, value.upper())\n
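
For example (the DataType returned by spark_type is described by the spark_type property above):

SparkDatatype.from_string(\"bigint\")  # SparkDatatype.BIGINT\nSparkDatatype.from_string(\"BigInt\").value  # 'long' (lookup is case-insensitive)\nSparkDatatype.from_string(\"string\").spark_type  # the matching pyspark DataType\n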
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.get_spark_minor_version","title":"koheesio.spark.utils.get_spark_minor_version","text":"
get_spark_minor_version() -> float\n

Returns the minor version of the spark instance.

For example, if the spark version is 3.3.2, this function would return 3.3

Source code in src/koheesio/spark/utils.py
def get_spark_minor_version() -> float:\n    \"\"\"Returns the minor version of the spark instance.\n\n    For example, if the spark version is 3.3.2, this function would return 3.3\n    \"\"\"\n    return float(\".\".join(spark_version.split(\".\")[:2]))\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.on_databricks","title":"koheesio.spark.utils.on_databricks","text":"
on_databricks() -> bool\n

Determine whether we're running on Databricks or elsewhere

Source code in src/koheesio/spark/utils.py
def on_databricks() -> bool:\n    \"\"\"Retrieve if we're running on databricks or elsewhere\"\"\"\n    dbr_version = os.getenv(\"DATABRICKS_RUNTIME_VERSION\", None)\n    return dbr_version is not None and dbr_version != \"\"\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.schema_struct_to_schema_str","title":"koheesio.spark.utils.schema_struct_to_schema_str","text":"
schema_struct_to_schema_str(schema: StructType) -> str\n

Converts a StructType to a schema str

Source code in src/koheesio/spark/utils.py
def schema_struct_to_schema_str(schema: StructType) -> str:\n    \"\"\"Converts a StructType to a schema str\"\"\"\n    if not schema:\n        return \"\"\n    return \",\\n\".join([f\"{field.name} {field.dataType.typeName().upper()}\" for field in schema.fields])\n
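
For example (field names are illustrative):

from pyspark.sql.types import LongType, StringType, StructField, StructType\n\nschema = StructType([StructField(\"id\", LongType()), StructField(\"name\", StringType())])\nschema_struct_to_schema_str(schema)  # returns 'id LONG,\\nname STRING'\n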
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.spark_data_type_is_array","title":"koheesio.spark.utils.spark_data_type_is_array","text":"
spark_data_type_is_array(data_type: DataType) -> bool\n

Check if the column's dataType is of type ArrayType

Source code in src/koheesio/spark/utils.py
def spark_data_type_is_array(data_type: DataType) -> bool:\n    \"\"\"Check if the column's dataType is of type ArrayType\"\"\"\n    return isinstance(data_type, ArrayType)\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.spark_data_type_is_numeric","title":"koheesio.spark.utils.spark_data_type_is_numeric","text":"
spark_data_type_is_numeric(data_type: DataType) -> bool\n

Check if the column's dataType is a numeric type (IntegerType, LongType, FloatType, DoubleType or DecimalType)

Source code in src/koheesio/spark/utils.py
def spark_data_type_is_numeric(data_type: DataType) -> bool:\n    \"\"\"Check if the column's dataType is a numeric type\"\"\"\n    return isinstance(data_type, (IntegerType, LongType, FloatType, DoubleType, DecimalType))\n
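
For example:

from pyspark.sql.types import ArrayType, IntegerType, StringType\n\nspark_data_type_is_array(ArrayType(StringType()))  # True\nspark_data_type_is_numeric(IntegerType())  # True\nspark_data_type_is_numeric(StringType())  # False\n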
"},{"location":"api_reference/spark/readers/index.html","title":"Readers","text":"

Readers are a type of Step that read data from a source based on the input parameters and store the result in self.output.df.

For a comprehensive guide on the usage, examples, and additional features of Reader classes, please refer to the reference/concepts/steps/readers section of the Koheesio documentation.

"},{"location":"api_reference/spark/readers/index.html#koheesio.spark.readers.Reader","title":"koheesio.spark.readers.Reader","text":"

Base class for all Readers

A Reader is a Step that reads data from a source based on the input parameters and stores the result in self.output.df (DataFrame).

When implementing a Reader, the execute() method should be implemented. The execute() method should read from the source and store the result in self.output.df.

The Reader class implements a standard read() method that calls the execute() method and returns the result. This method can be used to read data from a Reader without having to call the execute() method directly. The read() method does not need to be implemented in the child class.

Every Reader has a SparkSession available as self.spark. This is the currently active SparkSession.

The Reader class also implements a shorthand for accessing the output DataFrame through the df property. If the output.df is None, .execute() will be run first.
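
To illustrate this contract, a minimal, hypothetical Reader could look like the sketch below (the class and its path field are made up for the example; the file readers further down this page are the real implementations):

from koheesio.spark.readers import Reader\n\n\nclass MyCsvReader(Reader):\n    \"\"\"Hypothetical Reader that loads a CSV file into self.output.df\"\"\"\n\n    path: str\n\n    def execute(self):\n        # read from the source and store the result in self.output.df\n        self.output.df = self.spark.read.csv(self.path, header=True)\n\n\ndf = MyCsvReader(path=\"path/to/file.csv\").read()\n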

"},{"location":"api_reference/spark/readers/index.html#koheesio.spark.readers.Reader.df","title":"df property","text":"
df: Optional[DataFrame]\n

Shorthand for accessing self.output.df. If the output.df is None, .execute() will be run first.

"},{"location":"api_reference/spark/readers/index.html#koheesio.spark.readers.Reader.execute","title":"execute abstractmethod","text":"
execute()\n

Execute on a Reader should, at a minimum, handle self.output.df (output): read from whichever source and store the result in self.output.df.

Source code in src/koheesio/spark/readers/__init__.py
@abstractmethod\ndef execute(self):\n    \"\"\"Execute on a Reader should handle self.output.df (output) as a minimum\n    Read from whichever source -> store result in self.output.df\n    \"\"\"\n    # self.output.df  # output dataframe\n    ...\n
"},{"location":"api_reference/spark/readers/index.html#koheesio.spark.readers.Reader.read","title":"read","text":"
read() -> Optional[DataFrame]\n

Read from a Reader without having to call the execute() method directly

Source code in src/koheesio/spark/readers/__init__.py
def read(self) -> Optional[DataFrame]:\n    \"\"\"Read from a Reader without having to call the execute() method directly\"\"\"\n    self.execute()\n    return self.output.df\n
"},{"location":"api_reference/spark/readers/delta.html","title":"Delta","text":"

Read data from a Delta table and return a DataFrame or DataStream

Classes:

Name Description DeltaTableReader

Reads data from a Delta table and returns a DataFrame

DeltaTableStreamReader

Reads data from a Delta table and returns a DataStream

"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.STREAMING_ONLY_OPTIONS","title":"koheesio.spark.readers.delta.STREAMING_ONLY_OPTIONS module-attribute","text":"
STREAMING_ONLY_OPTIONS = ['ignore_deletes', 'ignore_changes', 'starting_version', 'starting_timestamp', 'schema_tracking_location']\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.STREAMING_SCHEMA_WARNING","title":"koheesio.spark.readers.delta.STREAMING_SCHEMA_WARNING module-attribute","text":"
STREAMING_SCHEMA_WARNING = '\\nImportant!\\nAlthough you can start the streaming source from a specified version or timestamp, the schema of the streaming source is always the latest schema of the Delta table. You must ensure there is no incompatible schema change to the Delta table after the specified version or timestamp. Otherwise, the streaming source may return incorrect results when reading the data with an incorrect schema.'\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader","title":"koheesio.spark.readers.delta.DeltaTableReader","text":"

Reads data from a Delta table and returns a DataFrame. The Delta table can be read in batch or streaming mode. It also supports reading the change data feed (CDF) in both batch and streaming mode.

Parameters:

Name Type Description Default table Union[DeltaTableStep, str]

The table to read

required filter_cond Optional[Union[Column, str]]

Filter condition to apply to the dataframe. Filters can be provided by using Column or string expressions. For example: f.col('state') == 'Ohio', state = 'Ohio' or (col('col1') > 3) & (col('col2') < 9)

required columns

Columns to select from the table. One or many columns can be provided as strings. For example: ['col1', 'col2'], ['col1'] or 'col1'

required streaming Optional[bool]

Whether to read the table as a Stream or not

required read_change_feed bool

readChangeFeed: Change Data Feed (CDF) feature allows Delta tables to track row-level changes between versions of a Delta table. When enabled on a Delta table, the runtime records 'change events' for all the data written into the table. This includes the row data along with metadata indicating whether the specified row was inserted, deleted, or updated. See: https://docs.databricks.com/delta/delta-change-data-feed.html

required starting_version str

startingVersion: The Delta Lake version to start from. All table changes starting from this version (inclusive) will be read by the streaming source. You can obtain the commit versions from the version column of the DESCRIBE HISTORY command output.

required starting_timestamp str

startingTimestamp: The timestamp to start from. All table changes committed at or after the timestamp (inclusive) will be read by the streaming source. Either provide a timestamp string (e.g. 2019-01-01T00:00:00.000Z) or a date string (e.g. 2019-01-01)

required ignore_deletes bool

ignoreDeletes: Ignore transactions that delete data at partition boundaries. Note: Only supported for streaming tables. For more info see https://docs.databricks.com/structured-streaming/delta-lake.html#ignore-updates-and-deletes

required ignore_changes bool

ignoreChanges: re-process updates if files had to be rewritten in the source table due to a data changing operation such as UPDATE, MERGE INTO, DELETE (within partitions), or OVERWRITE. Unchanged rows may still be emitted, therefore your downstream consumers should be able to handle duplicates. Deletes are not propagated downstream. ignoreChanges subsumes ignoreDeletes. Therefore if you use ignoreChanges, your stream will not be disrupted by either deletions or updates to the source table.

required"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.columns","title":"columns class-attribute instance-attribute","text":"
columns: Optional[ListOfColumns] = Field(default=None, description=\"Columns to select from the table. One or many columns can be provided as strings. For example: `['col1', 'col2']`, `['col1']` or `'col1'` \")\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.filter_cond","title":"filter_cond class-attribute instance-attribute","text":"
filter_cond: Optional[Union[Column, str]] = Field(default=None, alias='filterCondition', description=\"Filter condition to apply to the dataframe. Filters can be provided by using Column or string expressions For example: `f.col('state') == 'Ohio'`, `state = 'Ohio'` or  `(col('col1') > 3) & (col('col2') < 9)`\")\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.ignore_changes","title":"ignore_changes class-attribute instance-attribute","text":"
ignore_changes: bool = Field(default=False, alias='ignoreChanges', description='ignoreChanges: re-process updates if files had to be rewritten in the source table due to a data changing operation such as UPDATE, MERGE INTO, DELETE (within partitions), or OVERWRITE. Unchanged rows may still be emitted, therefore your downstream consumers should be able to handle duplicates. Deletes are not propagated downstream. ignoreChanges subsumes ignoreDeletes. Therefore if you use ignoreChanges, your stream will not be disrupted by either deletions or updates to the source table.')\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.ignore_deletes","title":"ignore_deletes class-attribute instance-attribute","text":"
ignore_deletes: bool = Field(default=False, alias='ignoreDeletes', description='ignoreDeletes: Ignore transactions that delete data at partition boundaries. Note: Only supported for streaming tables. For more info see https://docs.databricks.com/structured-streaming/delta-lake.html#ignore-updates-and-deletes')\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.read_change_feed","title":"read_change_feed class-attribute instance-attribute","text":"
read_change_feed: bool = Field(default=False, alias='readChangeFeed', description=\"Change Data Feed (CDF) feature allows Delta tables to track row-level changes between versions of a Delta table. When enabled on a Delta table, the runtime records 'change events' for all the data written into the table. This includes the row data along with metadata indicating whether the specified row was inserted, deleted, or updated. See: https://docs.databricks.com/delta/delta-change-data-feed.html\")\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.reader","title":"reader property","text":"
reader: Union[DataStreamReader, DataFrameReader]\n

Return the reader for the DeltaTableReader based on the streaming attribute

"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.schema_tracking_location","title":"schema_tracking_location class-attribute instance-attribute","text":"
schema_tracking_location: Optional[str] = Field(default=None, alias='schemaTrackingLocation', description='schemaTrackingLocation: Track the location of source schema. Note: Recommend to enable Delta reader version: 3 and writer version: 7 for this option. For more info see https://docs.delta.io/latest/delta-column-mapping.html' + STREAMING_SCHEMA_WARNING)\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.skip_change_commits","title":"skip_change_commits class-attribute instance-attribute","text":"
skip_change_commits: bool = Field(default=False, alias='skipChangeCommits', description='skipChangeCommits: Skip processing of change commits. Note: Only supported for streaming tables. (not supported in Open Source Delta Implementation). Prefer using skipChangeCommits over ignoreDeletes and ignoreChanges starting DBR12.1 and above. For more info see https://docs.databricks.com/structured-streaming/delta-lake.html#skip-change-commits')\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.starting_timestamp","title":"starting_timestamp class-attribute instance-attribute","text":"
starting_timestamp: Optional[str] = Field(default=None, alias='startingTimestamp', description='startingTimestamp: The timestamp to start from. All table changes committed at or after the timestamp (inclusive) will be read by the streaming source. Either provide a timestamp string (e.g. 2019-01-01T00:00:00.000Z) or a date string (e.g. 2019-01-01)' + STREAMING_SCHEMA_WARNING)\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.starting_version","title":"starting_version class-attribute instance-attribute","text":"
starting_version: Optional[str] = Field(default=None, alias='startingVersion', description='startingVersion: The Delta Lake version to start from. All table changes starting from this version (inclusive) will be read by the streaming source. You can obtain the commit versions from the version column of the DESCRIBE HISTORY command output.' + STREAMING_SCHEMA_WARNING)\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.streaming","title":"streaming class-attribute instance-attribute","text":"
streaming: Optional[bool] = Field(default=False, description='Whether to read the table as a Stream or not')\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.table","title":"table class-attribute instance-attribute","text":"
table: Union[DeltaTableStep, str] = Field(default=..., description='The table to read')\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.temp_view_name","title":"temp_view_name property","text":"
temp_view_name\n

Get the temporary view name for the dataframe for SQL queries

"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.view","title":"view property","text":"
view\n

Create a temporary view of the dataframe for SQL queries

"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/readers/delta.py
def execute(self):\n    df = self.reader.table(self.table.table_name)\n    if self.filter_cond is not None:\n        df = df.filter(f.expr(self.filter_cond) if isinstance(self.filter_cond, str) else self.filter_cond)\n    if self.columns is not None:\n        df = df.select(*self.columns)\n    self.output.df = df\n
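
A minimal usage sketch (table name, filter and columns are placeholders):

from pyspark.sql import functions as f\nfrom koheesio.spark.readers.delta import DeltaTableReader\n\ndf = DeltaTableReader(\n    table=\"my_schema.my_table\",\n    filter_cond=f.col(\"state\") == \"Ohio\",  # optional: Column or SQL string expression\n    columns=[\"id\", \"state\"],  # optional: subset of columns to select\n).read()\n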
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.get_options","title":"get_options","text":"
get_options() -> Dict[str, Any]\n

Get the options for the DeltaTableReader based on the streaming attribute

Source code in src/koheesio/spark/readers/delta.py
def get_options(self) -> Dict[str, Any]:\n    \"\"\"Get the options for the DeltaTableReader based on the `streaming` attribute\"\"\"\n    options = {\n        # Enable Change Data Feed (CDF) feature\n        \"readChangeFeed\": self.read_change_feed,\n        # Initial position, one of:\n        \"startingVersion\": self.starting_version,\n        \"startingTimestamp\": self.starting_timestamp,\n    }\n\n    # Streaming only options\n    if self.streaming:\n        options = {\n            **options,\n            # Ignore updates and deletes, one of:\n            \"ignoreDeletes\": self.ignore_deletes,\n            \"ignoreChanges\": self.ignore_changes,\n            \"skipChangeCommits\": self.skip_change_commits,\n            \"schemaTrackingLocation\": self.schema_tracking_location,\n        }\n    # Batch only options\n    else:\n        pass  # there are none... for now :)\n\n    def normalize(v: Union[str, bool]):\n        \"\"\"normalize values\"\"\"\n        # True becomes \"true\", False becomes \"false\"\n        v = str(v).lower() if isinstance(v, bool) else v\n        return v\n\n    # Any options with `value == None` are filtered out\n    return {k: normalize(v) for k, v in options.items() if v is not None}\n
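
For instance, a batch change-feed read starting from a specific version would, per the source above, produce options along these lines (a sketch; the table name and version are placeholders):

DeltaTableReader(\n    table=\"my_schema.my_table\",\n    read_change_feed=True,\n    starting_version=\"1\",\n).get_options()\n# {'readChangeFeed': 'true', 'startingVersion': '1'}\n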
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.set_temp_view_name","title":"set_temp_view_name","text":"
set_temp_view_name()\n

Set a temporary view name for the dataframe for SQL queries

Source code in src/koheesio/spark/readers/delta.py
@model_validator(mode=\"after\")\ndef set_temp_view_name(self):\n    \"\"\"Set a temporary view name for the dataframe for SQL queries\"\"\"\n    table_name = self.table.table\n    vw_name = get_random_string(prefix=f\"tmp_{table_name}\")\n    self.__temp_view_name__ = vw_name\n    return self\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableStreamReader","title":"koheesio.spark.readers.delta.DeltaTableStreamReader","text":"

Reads data from a Delta table and returns a DataStream

"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableStreamReader.streaming","title":"streaming class-attribute instance-attribute","text":"
streaming: bool = True\n
"},{"location":"api_reference/spark/readers/dummy.html","title":"Dummy","text":"

A simple DummyReader that returns a DataFrame with an id-column of the given range

"},{"location":"api_reference/spark/readers/dummy.html#koheesio.spark.readers.dummy.DummyReader","title":"koheesio.spark.readers.dummy.DummyReader","text":"

A simple DummyReader that returns a DataFrame with an id-column of the given range

Can be used in place of any Reader without having to read from a real source.

Wraps SparkSession.range(). Output DataFrame will have a single column named \"id\" of type Long and length of the given range.

Parameters:

Name Type Description Default range int

How large to make the Dataframe

required Example
from koheesio.spark.readers.dummy import DummyReader\n\noutput_df = DummyReader(range=100).read()\n

output_df: Output DataFrame will have a single column named \"id\" of type Long containing 100 rows (0-99).

id 0 1 ... 99"},{"location":"api_reference/spark/readers/dummy.html#koheesio.spark.readers.dummy.DummyReader.range","title":"range class-attribute instance-attribute","text":"
range: int = Field(default=100, description='How large to make the Dataframe')\n
"},{"location":"api_reference/spark/readers/dummy.html#koheesio.spark.readers.dummy.DummyReader.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/readers/dummy.py
def execute(self):\n    self.output.df = self.spark.range(self.range)\n
"},{"location":"api_reference/spark/readers/file_loader.html","title":"File loader","text":"

Generic file Readers for different file formats.

Supported file formats: - CSV - Parquet - Avro - JSON - ORC - Text

Examples:

from koheesio.spark.readers import (\n    CsvReader,\n    ParquetReader,\n    AvroReader,\n    JsonReader,\n    OrcReader,\n)\n\ncsv_reader = CsvReader(path=\"path/to/file.csv\", header=True)\nparquet_reader = ParquetReader(path=\"path/to/file.parquet\")\navro_reader = AvroReader(path=\"path/to/file.avro\")\njson_reader = JsonReader(path=\"path/to/file.json\")\norc_reader = OrcReader(path=\"path/to/file.orc\")\n

For more information about the available options, see Spark's official documentation.

"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.AvroReader","title":"koheesio.spark.readers.file_loader.AvroReader","text":"

Reads an Avro file.

This class is a convenience class that sets the format field to FileFormat.avro.

Extra parameters can be passed to the reader using the extra_params attribute or as keyword arguments.

Example:

reader = AvroReader(path=\"path/to/file.avro\", mergeSchema=True)\n

Make sure to have the spark-avro package installed in your environment.

For more information about the available options, see the official documentation.

"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.AvroReader.format","title":"format class-attribute instance-attribute","text":"
format: FileFormat = avro\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.CsvReader","title":"koheesio.spark.readers.file_loader.CsvReader","text":"

Reads a CSV file.

This class is a convenience class that sets the format field to FileFormat.csv.

Extra parameters can be passed to the reader using the extra_params attribute or as keyword arguments.

Example:

reader = CsvReader(path=\"path/to/file.csv\", header=True)\n

For more information about the available options, see the official pyspark documentation and read about CSV data source.

Also see the data sources generic options.

"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.CsvReader.format","title":"format class-attribute instance-attribute","text":"
format: FileFormat = csv\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat","title":"koheesio.spark.readers.file_loader.FileFormat","text":"

Supported file formats.

This enum represents the supported file formats that can be used with the FileLoader class. The available file formats are: - csv: Comma-separated values format - parquet: Apache Parquet format - avro: Apache Avro format - json: JavaScript Object Notation format - orc: Apache ORC format - text: Plain text format

"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.avro","title":"avro class-attribute instance-attribute","text":"
avro = 'avro'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.csv","title":"csv class-attribute instance-attribute","text":"
csv = 'csv'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.json","title":"json class-attribute instance-attribute","text":"
json = 'json'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.orc","title":"orc class-attribute instance-attribute","text":"
orc = 'orc'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.parquet","title":"parquet class-attribute instance-attribute","text":"
parquet = 'parquet'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.text","title":"text class-attribute instance-attribute","text":"
text = 'text'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader","title":"koheesio.spark.readers.file_loader.FileLoader","text":"

Generic file reader.

Available file formats: - CSV - Parquet - Avro - JSON - ORC - Text (default)

Extra parameters can be passed to the reader using the extra_params attribute or as keyword arguments.

Example:

reader = FileLoader(path=\"path/to/textfile.txt\", format=\"text\", header=True, lineSep=\"\\n\")\n

For more information about the available options, see Spark's official pyspark documentation and read about text data source.

Also see the data sources generic options.

"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader.format","title":"format class-attribute instance-attribute","text":"
format: FileFormat = Field(default=text, description='File format to read')\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader.path","title":"path class-attribute instance-attribute","text":"
path: Union[Path, str] = Field(default=..., description='Path to the file to read')\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader.schema_","title":"schema_ class-attribute instance-attribute","text":"
schema_: Optional[Union[StructType, str]] = Field(default=None, description='Schema to use when reading the file', validate_default=False, alias='schema')\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader.ensure_path_is_str","title":"ensure_path_is_str","text":"
ensure_path_is_str(v)\n

Ensure that the path is a string as required by Spark.

Source code in src/koheesio/spark/readers/file_loader.py
@field_validator(\"path\")\ndef ensure_path_is_str(cls, v):\n    \"\"\"Ensure that the path is a string as required by Spark.\"\"\"\n    if isinstance(v, Path):\n        return str(v.absolute().as_posix())\n    return v\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader.execute","title":"execute","text":"
execute()\n

Reads the file using the specified format and schema, while applying any extra parameters.

Source code in src/koheesio/spark/readers/file_loader.py
def execute(self):\n    \"\"\"Reads the file using the specified format, schema, while applying any extra parameters.\"\"\"\n    reader = self.spark.read.format(self.format)\n\n    if self.schema_:\n        reader.schema(self.schema_)\n\n    if self.extra_params:\n        reader = reader.options(**self.extra_params)\n\n    self.output.df = reader.load(self.path)\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.JsonReader","title":"koheesio.spark.readers.file_loader.JsonReader","text":"

Reads a JSON file.

This class is a convenience class that sets the format field to FileFormat.json.

Extra parameters can be passed to the reader using the extra_params attribute or as keyword arguments.

Example:

reader = JsonReader(path=\"path/to/file.json\", allowComments=True)\n

For more information about the available options, see the official pyspark documentation and read about JSON data source.

Also see the data sources generic options.

"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.JsonReader.format","title":"format class-attribute instance-attribute","text":"
format: FileFormat = json\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.OrcReader","title":"koheesio.spark.readers.file_loader.OrcReader","text":"

Reads an ORC file.

This class is a convenience class that sets the format field to FileFormat.orc.

Extra parameters can be passed to the reader using the extra_params attribute or as keyword arguments.

Example:

reader = OrcReader(path=\"path/to/file.orc\", mergeSchema=True)\n

For more information about the available options, see the official documentation and read about ORC data source.

Also see the data sources generic options.

"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.OrcReader.format","title":"format class-attribute instance-attribute","text":"
format: FileFormat = orc\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.ParquetReader","title":"koheesio.spark.readers.file_loader.ParquetReader","text":"

Reads a Parquet file.

This class is a convenience class that sets the format field to FileFormat.parquet.

Extra parameters can be passed to the reader using the extra_params attribute or as keyword arguments.

Example:

reader = ParquetReader(path=\"path/to/file.parquet\", mergeSchema=True)\n

For more information about the available options, see the official pyspark documentation and read about Parquet data source.

Also see the data sources generic options.

"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.ParquetReader.format","title":"format class-attribute instance-attribute","text":"
format: FileFormat = parquet\n
"},{"location":"api_reference/spark/readers/hana.html","title":"Hana","text":"

HANA reader.

"},{"location":"api_reference/spark/readers/hana.html#koheesio.spark.readers.hana.HanaReader","title":"koheesio.spark.readers.hana.HanaReader","text":"

Wrapper around JdbcReader for SAP HANA

Notes
  • Refer to JdbcReader for the list of all available parameters.
  • Refer to SAP HANA Client Interface Programming Reference docs for the list of all available connection string parameters: https://help.sap.com/docs/SAP_HANA_CLIENT/f1b440ded6144a54ada97ff95dac7adf/109397c2206a4ab2a5386d494f4cf75e.html
Example

Note: jars should be added to the Spark session manually. This class does not take care of that.

This example depends on the SAP HANA ngdbc JAR, e.g. ngdbc-2.5.49.

from koheesio.spark.readers.hana import HanaReader\njdbc_hana = HanaReader(\n    url=\"jdbc:sap://<domain_or_ip>:<port>/?<options>\",\n    user=\"YOUR_USERNAME\",\n    password=\"***\",\n    dbtable=\"schemaname.tablename\"\n)\ndf = jdbc_hana.read()\n

Parameters:

Name Type Description Default url str

JDBC connection string. Refer to SAP HANA docs for the list of all available connection string parameters. Example: jdbc:sap://<domain_or_ip>:<port>[/?<options>] required user str required password SecretStr required dbtable str

Database table name, also include schema name

required options Optional[Dict[str, Any]]

Extra options to pass to the SAP HANA JDBC driver. Refer to SAP HANA docs for the list of all available connection string parameters. Example: {\"fetchsize\": 2000, \"numPartitions\": 10}

required query Optional[str]

Query

required format str

The type of format to load. Defaults to 'jdbc'. Should not be changed.

required driver str

Driver name. Be aware that the driver jar needs to be passed to the task. Should not be changed.

required"},{"location":"api_reference/spark/readers/hana.html#koheesio.spark.readers.hana.HanaReader.driver","title":"driver class-attribute instance-attribute","text":"
driver: str = Field(default='com.sap.db.jdbc.Driver', description='Make sure that the necessary JARs are available in the cluster: ngdbc-2-x.x.x.x')\n
"},{"location":"api_reference/spark/readers/hana.html#koheesio.spark.readers.hana.HanaReader.options","title":"options class-attribute instance-attribute","text":"
options: Optional[Dict[str, Any]] = Field(default={'fetchsize': 2000, 'numPartitions': 10}, description='Extra options to pass to the SAP HANA JDBC driver')\n
"},{"location":"api_reference/spark/readers/jdbc.html","title":"Jdbc","text":"

Module for reading data from JDBC sources.

Classes:

Name Description JdbcReader

Reader for JDBC tables.

"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader","title":"koheesio.spark.readers.jdbc.JdbcReader","text":"

Reader for JDBC tables.

Wrapper around Spark's jdbc read format

Notes
  • Query has precedence over dbtable. If query and dbtable both are filled in, dbtable will be ignored!
  • Extra options to the spark reader can be passed through the options input. Refer to Spark documentation for details: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
  • Consider using fetchsize as one of the options, as it greatly increases the performance of the reader
  • Consider using numPartitions, partitionColumn, lowerBound and upperBound together with a real or synthetic partitioning column, as this will improve the reader's performance

When implementing a JDBC reader, the get_options() method should be implemented. The method should return a dict of options required for the specific JDBC driver. The get_options() method can be overridden in the child class. Additionally, the driver parameter should be set to the name of the JDBC driver. Be aware that the driver jar needs to be included in the Spark session; this class does not (and can not) take care of that!

Example

Note: jars should be added to the Spark session manually. This class does not take care of that.

This example depends on the jar for MS SQL: https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/9.2.1.jre8/mssql-jdbc-9.2.1.jre8.jar

from koheesio.spark.readers.jdbc import JdbcReader\n\njdbc_mssql = JdbcReader(\n    driver=\"com.microsoft.sqlserver.jdbc.SQLServerDriver\",\n    url=\"jdbc:sqlserver://10.xxx.xxx.xxx:1433;databaseName=YOUR_DATABASE\",\n    user=\"YOUR_USERNAME\",\n    password=\"***\",\n    dbtable=\"schemaname.tablename\",\n    options={\"fetchsize\": 100},\n)\ndf = jdbc_mssql.read()\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.dbtable","title":"dbtable class-attribute instance-attribute","text":"
dbtable: Optional[str] = Field(default=None, description='Database table name, also include schema name')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.driver","title":"driver class-attribute instance-attribute","text":"
driver: str = Field(default=..., description='Driver name. Be aware that the driver jar needs to be passed to the task')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.format","title":"format class-attribute instance-attribute","text":"
format: str = Field(default='jdbc', description=\"The type of format to load. Defaults to 'jdbc'.\")\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.options","title":"options class-attribute instance-attribute","text":"
options: Optional[Dict[str, Any]] = Field(default_factory=dict, description='Extra options to pass to spark reader')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.password","title":"password class-attribute instance-attribute","text":"
password: SecretStr = Field(default=..., description='Password belonging to the username')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.query","title":"query class-attribute instance-attribute","text":"
query: Optional[str] = Field(default=None, description='Query')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.url","title":"url class-attribute instance-attribute","text":"
url: str = Field(default=..., description='URL for the JDBC driver. Note, in some environments you need to use the IP Address instead of the hostname of the server.')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.user","title":"user class-attribute instance-attribute","text":"
user: str = Field(default=..., description='User to authenticate to the server')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.execute","title":"execute","text":"
execute()\n

Wrapper around Spark's jdbc read format

Source code in src/koheesio/spark/readers/jdbc.py
def execute(self):\n    \"\"\"Wrapper around Spark's jdbc read format\"\"\"\n\n    # Can't have both dbtable and query empty\n    if not self.dbtable and not self.query:\n        raise ValueError(\"Please do not leave dbtable and query both empty!\")\n\n    if self.query and self.dbtable:\n        self.log.info(\"Both 'query' and 'dbtable' are filled in, 'dbtable' will be ignored!\")\n\n    options = self.get_options()\n\n    if pw := self.password:\n        options[\"password\"] = pw.get_secret_value()\n\n    if query := self.query:\n        options[\"query\"] = query\n        self.log.info(f\"Executing query: {self.query}\")\n    else:\n        options[\"dbtable\"] = self.dbtable\n\n    self.output.df = self.spark.read.format(self.format).options(**options).load()\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.get_options","title":"get_options","text":"
get_options()\n

Dictionary of options required for the specific JDBC driver.

Note: override this method if driver requires custom names, e.g. Snowflake: sfUrl, sfUser, etc.

Source code in src/koheesio/spark/readers/jdbc.py
def get_options(self):\n    \"\"\"\n    Dictionary of options required for the specific JDBC driver.\n\n    Note: override this method if driver requires custom names, e.g. Snowflake: `sfUrl`, `sfUser`, etc.\n    \"\"\"\n    return {\n        \"driver\": self.driver,\n        \"url\": self.url,\n        \"user\": self.user,\n        \"password\": self.password,\n        **self.options,\n    }\n
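A minimal, hedged sketch of such an override; the `ExampleDbReader` class, its driver class name, and the custom option keys are purely illustrative and not part of Koheesio:

```python
from koheesio.spark.readers.jdbc import JdbcReader


class ExampleDbReader(JdbcReader):
    """Illustrative subclass for a driver that expects non-standard option names."""

    driver: str = "com.example.jdbc.Driver"  # hypothetical driver class

    def get_options(self):
        # Map the generic JdbcReader fields onto driver-specific option names
        return {
            "driver": self.driver,
            "exampleUrl": self.url,                               # hypothetical key names
            "exampleUser": self.user,
            "examplePassword": self.password.get_secret_value(),  # SecretStr -> plain string
            **self.options,
        }
```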
"},{"location":"api_reference/spark/readers/kafka.html","title":"Kafka","text":"

Module for KafkaReader and KafkaStreamReader.

"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader","title":"koheesio.spark.readers.kafka.KafkaReader","text":"

Reader for Kafka topics.

Wrapper around Spark's kafka read format. Supports both batch and streaming reads.

Parameters:

Name Type Description Default read_broker str

Kafka brokers to read from. Should be passed as a single string with multiple brokers passed in a comma separated list

required topic str

Kafka topic to consume.

required streaming Optional[bool]

Whether to read the kafka topic as a stream or not.

required params Optional[Dict[str, str]]

Arbitrary options to be applied when creating NSP Reader. If a user provides values for subscribe or kafka.bootstrap.servers, they will be ignored in favor of configuration passed through topic and read_broker respectively. Defaults to an empty dictionary.

required Notes
  • The read_broker and topic parameters are required.
  • The streaming parameter defaults to False.
  • The params parameter defaults to an empty dictionary. This parameter is also aliased as kafka_options.
  • Any extra kafka options can also be passed as key-word arguments; these will be merged with the params parameter
Example
from koheesio.spark.readers.kafka import KafkaReader\n\nkafka_reader = KafkaReader(\n    read_broker=\"kafka-broker-1:9092,kafka-broker-2:9092\",\n    topic=\"my-topic\",\n    streaming=True,\n    # extra kafka options can be passed as key-word arguments\n    startingOffsets=\"earliest\",\n)\n

In the example above, the KafkaReader will read from the my-topic Kafka topic, using the brokers kafka-broker-1:9092 and kafka-broker-2:9092. The reader will read the topic as a stream and will start reading from the earliest available offset.

The stream can be started by calling the read or execute method on the kafka_reader object.

Note: The KafkaStreamReader could be used in the example above to achieve the same result. streaming would default to True in that case and could be omitted from the parameters.

See Also
  • Official Spark Documentation: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.batch_reader","title":"batch_reader property","text":"
batch_reader\n

Returns the Spark read object for batch processing.

"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.logged_option_keys","title":"logged_option_keys property","text":"
logged_option_keys\n

Keys that are allowed to be logged for the options.

"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.options","title":"options property","text":"
options\n

Merge fixed parameters with arbitrary options provided by user.

"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.params","title":"params class-attribute instance-attribute","text":"
params: Optional[Dict[str, str]] = Field(default_factory=dict, alias='kafka_options', description=\"Arbitrary options to be applied when creating NSP Reader. If a user provides values for 'subscribe' or 'kafka.bootstrap.servers', they will be ignored in favor of configuration passed through 'topic' and 'read_broker' respectively.\")\n
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.read_broker","title":"read_broker class-attribute instance-attribute","text":"
read_broker: str = Field(..., description='Kafka brokers to read from, should be passed as a single string with multiple brokers passed in a comma separated list')\n
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.reader","title":"reader property","text":"
reader\n

Returns the appropriate reader based on the streaming flag.

"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.stream_reader","title":"stream_reader property","text":"
stream_reader\n

Returns the Spark readStream object.

"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.streaming","title":"streaming class-attribute instance-attribute","text":"
streaming: Optional[bool] = Field(default=False, description='Whether to read the kafka topic as a stream or not. Defaults to False.')\n
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.topic","title":"topic class-attribute instance-attribute","text":"
topic: str = Field(default=..., description='Kafka topic to consume.')\n
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/readers/kafka.py
def execute(self):\n    applied_options = {k: v for k, v in self.options.items() if k in self.logged_option_keys}\n    self.log.debug(f\"Applying options {applied_options}\")\n\n    self.output.df = self.reader.format(\"kafka\").options(**self.options).load()\n
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaStreamReader","title":"koheesio.spark.readers.kafka.KafkaStreamReader","text":"

KafkaStreamReader is a KafkaReader that reads data as a stream

This class is identical to KafkaReader, with the streaming parameter defaulting to True.
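A minimal hedged sketch, mirroring the KafkaReader example above (broker addresses and topic name are placeholders):

```python
from koheesio.spark.readers.kafka import KafkaStreamReader

stream_reader = KafkaStreamReader(
    read_broker="kafka-broker-1:9092,kafka-broker-2:9092",
    topic="my-topic",
    # extra Kafka options can still be passed as keyword arguments
    startingOffsets="earliest",
)
stream_reader.execute()
streaming_df = stream_reader.output.df
```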

"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaStreamReader.streaming","title":"streaming class-attribute instance-attribute","text":"
streaming: bool = True\n
"},{"location":"api_reference/spark/readers/memory.html","title":"Memory","text":"

Create Spark DataFrame directly from the data stored in a Python variable

"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.DataFormat","title":"koheesio.spark.readers.memory.DataFormat","text":"

Data formats supported by the InMemoryDataReader

"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.DataFormat.CSV","title":"CSV class-attribute instance-attribute","text":"
CSV = 'csv'\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.DataFormat.JSON","title":"JSON class-attribute instance-attribute","text":"
JSON = 'json'\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader","title":"koheesio.spark.readers.memory.InMemoryDataReader","text":"

Directly read data from a Python variable and convert it to a Spark DataFrame.

Read data that is stored in one of the supported formats (see DataFormat) directly from a Python variable and convert it to a Spark DataFrame. Use cases include converting the JSON output of an API into a DataFrame, or reading CSV data received through an API (e.g. the Box API).

The advantage of using this reader is that it allows you to read the data directly from a Python variable, without the need to store it on disk. This can be useful when the data is small and does not need to be stored permanently.

Parameters:

Name Type Description Default data Union[str, list, dict, bytes]

Source data

required format DataFormat

File / data format

required schema_ Optional[StructType]

Schema that will be applied during the creation of Spark DataFrame

None params Optional[Dict[str, Any]]

Set of extra parameters that should be passed to the appropriate reader (csv / json). Optionally, the user can pass the parameters that are specific to the reader (e.g. multiLine for JSON reader) as key-word arguments. These will be merged with the params parameter.

dict Example
# Read CSV data from a string\ndf1 = InMemoryDataReader(format=DataFormat.CSV, data='foo,bar\\nA,1\\nB,2').read()\n\n# Read JSON data from a string\ndf2 = InMemoryDataReader(format=DataFormat.JSON, data='{\"foo\": \"A\", \"bar\": 1}').read()\n\n# Read JSON data from a list of strings\ndf3 = InMemoryDataReader(format=DataFormat.JSON, data=['{\"foo\": \"A\", \"bar\": 1}', '{\"foo\": \"B\", \"bar\": 2}']).read()\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader.data","title":"data class-attribute instance-attribute","text":"
data: Union[str, list, dict, bytes] = Field(default=..., description='Source data')\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader.format","title":"format class-attribute instance-attribute","text":"
format: DataFormat = Field(default=..., description='File / data format')\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader.params","title":"params class-attribute instance-attribute","text":"
params: Optional[Dict[str, Any]] = Field(default_factory=dict, description='[Optional] Set of extra parameters that should be passed to the appropriate reader (csv / json)')\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader.schema_","title":"schema_ class-attribute instance-attribute","text":"
schema_: Optional[StructType] = Field(default=None, alias='schema', description='[Optional] Schema that will be applied during the creation of Spark DataFrame')\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader.execute","title":"execute","text":"
execute()\n

Execute method appropriate to the specific data format

Source code in src/koheesio/spark/readers/memory.py
def execute(self):\n    \"\"\"\n    Execute method appropriate to the specific data format\n    \"\"\"\n    _func = getattr(InMemoryDataReader, f\"_{self.format}\")\n    _df = partial(_func, self, self._rdd)()\n    self.output.df = _df\n
"},{"location":"api_reference/spark/readers/metastore.html","title":"Metastore","text":"

Create Spark DataFrame from table in Metastore

"},{"location":"api_reference/spark/readers/metastore.html#koheesio.spark.readers.metastore.MetastoreReader","title":"koheesio.spark.readers.metastore.MetastoreReader","text":"

Reader for tables/views from Spark Metastore

Parameters:

Name Type Description Default table str

Table name in spark metastore

required"},{"location":"api_reference/spark/readers/metastore.html#koheesio.spark.readers.metastore.MetastoreReader.table","title":"table class-attribute instance-attribute","text":"
table: str = Field(default=..., description='Table name in spark metastore')\n
"},{"location":"api_reference/spark/readers/metastore.html#koheesio.spark.readers.metastore.MetastoreReader.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/readers/metastore.py
def execute(self):\n    self.output.df = self.spark.table(self.table)\n
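A minimal hedged sketch of reading a metastore table (the table name is a placeholder):

```python
from koheesio.spark.readers.metastore import MetastoreReader

# Reads an existing table or view registered in the Spark metastore
df = MetastoreReader(table="my_schema.my_table").read()
```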
"},{"location":"api_reference/spark/readers/rest_api.html","title":"Rest api","text":"

This module provides the RestApiReader class for interacting with RESTful APIs.

The RestApiReader class is designed to fetch data from RESTful APIs and store the response in a DataFrame. It supports different transports, e.g. paginated HTTP or async HTTP. The main entry point is the execute method, which performs the transport.execute() call and provides the data from the API calls.

For more details on how to use this class and its methods, refer to the class docstring.

"},{"location":"api_reference/spark/readers/rest_api.html#koheesio.spark.readers.rest_api.RestApiReader","title":"koheesio.spark.readers.rest_api.RestApiReader","text":"

A reader class that executes an API call and stores the response in a DataFrame.

Parameters:

Name Type Description Default transport Union[InstanceOf[AsyncHttpGetStep], InstanceOf[HttpGetStep]]

The HTTP transport step.

required spark_schema Union[str, StructType, List[str], Tuple[str, ...], AtomicType]

The pyspark schema of the response.

required

Attributes:

Name Type Description transport Union[InstanceOf[AsyncHttpGetStep], InstanceOf[HttpGetStep]]

The HTTP transport step.

spark_schema Union[str, StructType, List[str], Tuple[str, ...], AtomicType]

The pyspark schema of the response.

Returns:

Type Description Output

The output of the reader, which includes the DataFrame.

Examples:

Here are some examples of how to use this class:

Example 1: Paginated Transport

import requests\nfrom requests.adapters import HTTPAdapter\nfrom urllib3 import Retry\n\nfrom koheesio.steps.http import PaginatedHtppGetStep\nfrom koheesio.spark.readers.rest_api import RestApiReader\n\nmax_retries = 3\nsession = requests.Session()\nretry_logic = Retry(total=max_retries, status_forcelist=[503])\nsession.mount(\"https://\", HTTPAdapter(max_retries=retry_logic))\nsession.mount(\"http://\", HTTPAdapter(max_retries=retry_logic))\n\ntransport = PaginatedHtppGetStep(\n    url=\"https://api.example.com/data?page={page}\",\n    paginate=True,\n    pages=3,\n    session=session,\n)\ntask = RestApiReader(transport=transport, spark_schema=\"id: int, page: int, value: string\")\ntask.execute()\nall_data = [row.asDict() for row in task.output.df.collect()]\n

Example 2: Async Transport

from aiohttp import ClientSession, TCPConnector\nfrom aiohttp_retry import ExponentialRetry\nfrom yarl import URL\n\nfrom koheesio.steps.asyncio.http import AsyncHttpGetStep\nfrom koheesio.spark.readers.rest_api import RestApiReader\n\nsession = ClientSession()\nurls = [URL(\"http://httpbin.org/get\"), URL(\"http://httpbin.org/get\")]\nretry_options = ExponentialRetry()\nconnector = TCPConnector(limit=10)\ntransport = AsyncHttpGetStep(\n    client_session=session,\n    url=urls,\n    retry_options=retry_options,\n    connector=connector,\n)\n\ntask = RestApiReader(transport=transport, spark_schema=\"id: int, page:int, value: string\")\ntask.execute()\nall_data = [row.asDict() for row in task.output.df.collect()]\n

"},{"location":"api_reference/spark/readers/rest_api.html#koheesio.spark.readers.rest_api.RestApiReader.spark_schema","title":"spark_schema class-attribute instance-attribute","text":"
spark_schema: Union[str, StructType, List[str], Tuple[str, ...], AtomicType] = Field(..., description='The pyspark schema of the response')\n
"},{"location":"api_reference/spark/readers/rest_api.html#koheesio.spark.readers.rest_api.RestApiReader.transport","title":"transport class-attribute instance-attribute","text":"
transport: Union[InstanceOf[AsyncHttpGetStep], InstanceOf[HttpGetStep]] = Field(..., description='HTTP transport step', exclude=True)\n
"},{"location":"api_reference/spark/readers/rest_api.html#koheesio.spark.readers.rest_api.RestApiReader.execute","title":"execute","text":"
execute() -> Output\n

Executes the API call and stores the response in a DataFrame.

Returns:

Type Description Output

The output of the reader, which includes the DataFrame.

Source code in src/koheesio/spark/readers/rest_api.py
def execute(self) -> Reader.Output:\n    \"\"\"\n    Executes the API call and stores the response in a DataFrame.\n\n    Returns\n    -------\n    Reader.Output\n        The output of the reader, which includes the DataFrame.\n    \"\"\"\n    raw_data = self.transport.execute()\n\n    if isinstance(raw_data, HttpGetStep.Output):\n        data = raw_data.response_json\n    elif isinstance(raw_data, AsyncHttpGetStep.Output):\n        data = [d for d, _ in raw_data.responses_urls]  # type: ignore\n\n    if data:\n        self.output.df = self.spark.createDataFrame(data=data, schema=self.spark_schema)  # type: ignore\n
"},{"location":"api_reference/spark/readers/snowflake.html","title":"Snowflake","text":"

Module containing Snowflake reader classes.

This module contains classes for reading data from Snowflake. The classes are used to create a Spark DataFrame from a Snowflake table or a query.

Classes:

Name Description SnowflakeReader

Reader for Snowflake tables.

Query

Reader for Snowflake queries.

DbTableQuery

Reader for Snowflake queries that return a single row.

Notes

The classes are defined in the koheesio.steps.integrations.snowflake module; this module simply inherits from the classes defined there.

See Also
  • koheesio.spark.readers.Reader Base class for all Readers.
  • koheesio.steps.integrations.snowflake Module containing Snowflake classes.

More detailed class descriptions can be found in the class docstrings.

"},{"location":"api_reference/spark/readers/spark_sql_reader.html","title":"Spark sql reader","text":"

This module contains the SparkSqlReader class, which executes a SparkSQL-compliant query and returns the resulting DataFrame.

"},{"location":"api_reference/spark/readers/spark_sql_reader.html#koheesio.spark.readers.spark_sql_reader.SparkSqlReader","title":"koheesio.spark.readers.spark_sql_reader.SparkSqlReader","text":"

SparkSqlReader executes a SparkSQL-compliant query and returns the resulting DataFrame.

This SQL can originate from a string or a file and may contain placeholders (parameters) for templating. - Placeholders are identified with ${placeholder}. - Placeholders can be passed as explicit params (params) or as implicit params (kwargs).

Example

SQL script (example.sql):

SELECT id, id + 1 AS incremented_id, ${dynamic_column} AS extra_column\nFROM ${table_name}\n

Python code:

from koheesio.spark.readers import SparkSqlReader\n\nreader = SparkSqlReader(\n    sql_path=\"example.sql\",\n    # params can also be passed as kwargs\n    dynamic_column=\"name\",\n    table_name=\"my_table\",\n)\nreader.execute()\n

In this example, the SQL script is read from a file and the placeholders are replaced with the given params. The resulting SQL query is:

SELECT id, id + 1 AS incremented_id, name AS extra_column\nFROM my_table\n

The query is then executed and the resulting DataFrame is stored in the output.df attribute.
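The same query can also be supplied inline through the `sql` parameter together with explicit `params`; a minimal hedged sketch:

```python
from koheesio.spark.readers import SparkSqlReader

reader = SparkSqlReader(
    sql="SELECT id, ${dynamic_column} AS extra_column FROM ${table_name}",
    params={"dynamic_column": "name", "table_name": "my_table"},
)
reader.execute()
df = reader.output.df
```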

Parameters:

Name Type Description Default sql_path str or Path

Path to a SQL file

required sql str

SQL query to execute

required params dict

Placeholders (parameters) for templating. These are identified with ${placeholder} in the SQL script.

required Notes

Any arbitrary kwargs passed to the class will be added to params.

"},{"location":"api_reference/spark/readers/spark_sql_reader.html#koheesio.spark.readers.spark_sql_reader.SparkSqlReader.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/readers/spark_sql_reader.py
def execute(self):\n    self.output.df = self.spark.sql(self.query)\n
"},{"location":"api_reference/spark/readers/teradata.html","title":"Teradata","text":"

Teradata reader.

"},{"location":"api_reference/spark/readers/teradata.html#koheesio.spark.readers.teradata.TeradataReader","title":"koheesio.spark.readers.teradata.TeradataReader","text":"

Wrapper around JdbcReader for Teradata.

Notes
  • Consider using synthetic partitioning column when using partitioned read: MOD(HASHBUCKET(HASHROW(<TABLE>.<COLUMN>)), <NUM_PARTITIONS>)
  • Relevant jars should be added to the Spark session manually. This class does not take care of that.
See Also
  • Refer to JdbcReader for the list of all available parameters.
  • Refer to Teradata docs for the list of all available connection string parameters: https://teradata-docs.s3.amazonaws.com/doc/connectivity/jdbc/reference/current/jdbcug_chapter_2.html#BABJIHBJ
Example

This example depends on the Teradata terajdbc4 JAR, e.g. terajdbc4-17.20.00.15. Keep in mind that older versions of the terajdbc4 driver also require the tdgssconfig JAR.

from koheesio.spark.readers.teradata import TeradataReader\n\ntd = TeradataReader(\n    url=\"jdbc:teradata://<domain_or_ip>/logmech=ldap,charset=utf8,database=<db>,type=fastexport, maybenull=on\",\n    user=\"YOUR_USERNAME\",\n    password=\"***\",\n    dbtable=\"schemaname.tablename\",\n)\n

Parameters:

Name Type Description Default url str

JDBC connection string. Refer to Teradata docs for the list of all available connection string parameters. Example: jdbc:teradata://<domain_or_ip>/logmech=ldap,charset=utf8,database=<db>,type=fastexport, maybenull=on

required user str

Username

required password SecretStr

Password

required dbtable str

Database table name, also include schema name

required options Optional[Dict[str, Any]]

Extra options to pass to the Teradata JDBC driver. Refer to Teradata docs for the list of all available connection string parameters.

{\"fetchsize\": 2000, \"numPartitions\": 10} query Optional[str]

Query

None format str

The type of format to load. Defaults to 'jdbc'. Should not be changed.

required driver str

Driver name. Be aware that the driver jar needs to be passed to the task. Should not be changed.

required"},{"location":"api_reference/spark/readers/teradata.html#koheesio.spark.readers.teradata.TeradataReader.driver","title":"driver class-attribute instance-attribute","text":"
driver: str = Field('com.teradata.jdbc.TeraDriver', description='Make sure that the necessary JARs are available in the cluster: terajdbc4-x.x.x.x')\n
"},{"location":"api_reference/spark/readers/teradata.html#koheesio.spark.readers.teradata.TeradataReader.options","title":"options class-attribute instance-attribute","text":"
options: Optional[Dict[str, Any]] = Field({'fetchsize': 2000, 'numPartitions': 10}, description='Extra options to pass to the Teradata JDBC driver')\n
"},{"location":"api_reference/spark/readers/databricks/index.html","title":"Databricks","text":""},{"location":"api_reference/spark/readers/databricks/autoloader.html","title":"Autoloader","text":"

Read from a location using Databricks' autoloader

Autoloader can ingest JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE file formats.

"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader","title":"koheesio.spark.readers.databricks.autoloader.AutoLoader","text":"

Read from a location using Databricks' autoloader

Autoloader can ingest JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE file formats.

Notes

autoloader is a Spark Structured Streaming function!

Although most transformations are compatible with Spark Structured Streaming, not all of them are. As a result, be mindful with your downstream transformations.

Parameters:

Name Type Description Default format Union[str, AutoLoaderFormat]

The file format, used in cloudFiles.format. Autoloader supports JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE file formats.

required location str

The location where the files are located, used in cloudFiles.location

required schema_location str

The location for storing inferred schema and supporting schema evolution, used in cloudFiles.schemaLocation.

required options Optional[Dict[str, str]]

Extra inputs to provide to the autoloader. For a full list of inputs, see https://docs.databricks.com/ingestion/auto-loader/options.html

{} Example
from koheesio.spark.readers.databricks import AutoLoader, AutoLoaderFormat\n\nresult_df = AutoLoader(\n    format=AutoLoaderFormat.JSON,\n    location=\"some_s3_path\",\n    schema_location=\"other_s3_path\",\n    options={\"multiLine\": \"true\"},\n).read()\n
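Because Autoloader is built on Spark Structured Streaming, the DataFrame returned above is a streaming DataFrame and has to be consumed with Spark's streaming write API. A minimal hedged sketch using plain PySpark (the target and checkpoint paths are placeholders):

```python
# `result_df` is the streaming DataFrame produced by the AutoLoader example above
query = (
    result_df.writeStream.format("delta")
    .option("checkpointLocation", "some_s3_checkpoint_path")  # placeholder
    .trigger(availableNow=True)  # process the files available now, then stop
    .start("target_s3_path")  # placeholder
)
query.awaitTermination()
```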
See Also

Some other useful documentation:

  • autoloader: https://docs.databricks.com/ingestion/auto-loader/index.html
  • Spark Structured Streaming: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.format","title":"format class-attribute instance-attribute","text":"
format: Union[str, AutoLoaderFormat] = Field(default=..., description=__doc__)\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.location","title":"location class-attribute instance-attribute","text":"
location: str = Field(default=..., description='The location where the files are located, used in `cloudFiles.location`')\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.options","title":"options class-attribute instance-attribute","text":"
options: Optional[Dict[str, str]] = Field(default_factory=dict, description='Extra inputs to provide to the autoloader. For a full list of inputs, see https://docs.databricks.com/ingestion/auto-loader/options.html')\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.schema_location","title":"schema_location class-attribute instance-attribute","text":"
schema_location: str = Field(default=..., alias='schemaLocation', description='The location for storing inferred schema and supporting schema evolution, used in `cloudFiles.schemaLocation`.')\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.execute","title":"execute","text":"
execute()\n

Reads from the given location with the given options using Autoloader

Source code in src/koheesio/spark/readers/databricks/autoloader.py
def execute(self):\n    \"\"\"Reads from the given location with the given options using Autoloader\"\"\"\n    self.output.df = self.reader().load(self.location)\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.get_options","title":"get_options","text":"
get_options()\n

Get the options for the autoloader

Source code in src/koheesio/spark/readers/databricks/autoloader.py
def get_options(self):\n    \"\"\"Get the options for the autoloader\"\"\"\n    self.options.update(\n        {\n            \"cloudFiles.format\": self.format,\n            \"cloudFiles.schemaLocation\": self.schema_location,\n        }\n    )\n    return self.options\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.reader","title":"reader","text":"
reader()\n

Return the reader for the autoloader

Source code in src/koheesio/spark/readers/databricks/autoloader.py
def reader(self):\n    \"\"\"Return the reader for the autoloader\"\"\"\n    return self.spark.readStream.format(\"cloudFiles\").options(**self.get_options())\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.validate_format","title":"validate_format","text":"
validate_format(format_specified)\n

Validate format value

Source code in src/koheesio/spark/readers/databricks/autoloader.py
@field_validator(\"format\")\ndef validate_format(cls, format_specified):\n    \"\"\"Validate `format` value\"\"\"\n    if isinstance(format_specified, str):\n        if format_specified.upper() in [f.value.upper() for f in AutoLoaderFormat]:\n            format_specified = getattr(AutoLoaderFormat, format_specified.upper())\n    return str(format_specified.value)\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat","title":"koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat","text":"

The file format, used in cloudFiles.format. Autoloader supports JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE file formats.

"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.AVRO","title":"AVRO class-attribute instance-attribute","text":"
AVRO = 'avro'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.BINARYFILE","title":"BINARYFILE class-attribute instance-attribute","text":"
BINARYFILE = 'binaryfile'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.CSV","title":"CSV class-attribute instance-attribute","text":"
CSV = 'csv'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.JSON","title":"JSON class-attribute instance-attribute","text":"
JSON = 'json'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.ORC","title":"ORC class-attribute instance-attribute","text":"
ORC = 'orc'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.PARQUET","title":"PARQUET class-attribute instance-attribute","text":"
PARQUET = 'parquet'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.TEXT","title":"TEXT class-attribute instance-attribute","text":"
TEXT = 'text'\n
"},{"location":"api_reference/spark/transformations/index.html","title":"Transformations","text":"

This module contains the base classes for all transformations.

See class docstrings for more information.

References

For a comprehensive guide on the usage, examples, and additional features of Transformation classes, please refer to the reference/concepts/steps/transformations section of the Koheesio documentation.

Classes:

Name Description Transformation

Base class for all transformations

ColumnsTransformation

Extended Transformation class with a preset validator for handling column(s) data

ColumnsTransformationWithTarget

Extended ColumnsTransformation class with an additional target_column field

"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation","title":"koheesio.spark.transformations.ColumnsTransformation","text":"

Extended Transformation class with a preset validator for handling column(s) data with a standardized input for a single column or multiple columns.

Concept

A ColumnsTransformation is a Transformation with a standardized input for column or columns. The columns are stored as a list. Either a single string, or a list of strings can be passed to enter the columns. column and columns are aliases to one another - internally the name columns should be used though.

  • columns are stored as a list
  • either a single string, or a list of strings can be passed to enter the columns
  • column and columns are aliases to one another - internally the name columns should be used though.

If more than one column is passed, the behavior of the Class changes this way: - the transformation will be run in a loop against all the given columns

Configuring the ColumnsTransformation

The ColumnsTransformation class has a ColumnConfig class that can be used to configure the behavior of the class. This class has the following fields: - run_for_all_data_type allows running the transformation for all columns of a given type.

  • limit_data_type allows limiting the transformation to a specific data type.

  • data_type_strict_mode Toggles strict mode for data type validation. Will only work if limit_data_type is set.

Note that Data types need to be specified as a SparkDatatype enum.

See the docstrings of the ColumnConfig class for more information. See the SparkDatatype enum for a list of available data types.

Users should not have to interact with the ColumnConfig class directly.
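A hedged sketch of what such a configuration can look like in a subclass; the `LowercaseStrings` class is illustrative, and the import path of `SparkDatatype` (and its `STRING` member name) is assumed rather than taken from these docs:

```python
from pyspark.sql import functions as f

from koheesio.spark.utils import SparkDatatype  # assumed import path
from koheesio.steps.transformations import ColumnsTransformation


class LowercaseStrings(ColumnsTransformation):
    """Illustrative: runs against every string column when no columns are given."""

    class ColumnConfig(ColumnsTransformation.ColumnConfig):
        run_for_all_data_type = [SparkDatatype.STRING]
        limit_data_type = [SparkDatatype.STRING]

    def execute(self):
        df = self.df
        for column in self.get_columns():
            df = df.withColumn(column, f.lower(f.col(column)))
        self.output.df = df
```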

Parameters:

Name Type Description Default columns

The column (or list of columns) to apply the transformation to. Alias: column

required Example
from pyspark.sql import functions as f\nfrom koheesio.steps.transformations import ColumnsTransformation\n\n\nclass AddOne(ColumnsTransformation):\n    def execute(self):\n        for column in self.get_columns():\n            self.output.df = self.df.withColumn(column, f.col(column) + 1)\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.columns","title":"columns class-attribute instance-attribute","text":"
columns: ListOfColumns = Field(default='', alias='column', description='The column (or list of columns) to apply the transformation to. Alias: column')\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.data_type_strict_mode_is_set","title":"data_type_strict_mode_is_set property","text":"
data_type_strict_mode_is_set: bool\n

Returns True if data_type_strict_mode is set

"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.limit_data_type_is_set","title":"limit_data_type_is_set property","text":"
limit_data_type_is_set: bool\n

Returns True if limit_data_type is set

"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.run_for_all_is_set","title":"run_for_all_is_set property","text":"
run_for_all_is_set: bool\n

Returns True if the transformation should be run for all columns of a given type

"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.ColumnConfig","title":"ColumnConfig","text":"

Koheesio ColumnsTransformation specific Config

Parameters:

Name Type Description Default run_for_all_data_type

allows running the transformation for all columns of a given type. A user can trigger this behavior by either omitting the columns parameter or by passing a single * as a column name. In both cases, run_for_all_data_type will be used to determine the data type. The value should be passed as a SparkDatatype enum. (default: [None])

required limit_data_type

allows limiting the transformation to a specific data type. The value should be passed as a SparkDatatype enum. (default: [None])

required data_type_strict_mode

Toggles strict mode for data type validation. Will only work if limit_data_type is set. - when True, a ValueError will be raised if any column does not adhere to the limit_data_type - when False, a warning will be thrown and the column will be skipped instead (default: False)

required"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.ColumnConfig.data_type_strict_mode","title":"data_type_strict_mode class-attribute instance-attribute","text":"
data_type_strict_mode: bool = False\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type: Optional[List[SparkDatatype]] = [None]\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type: Optional[List[SparkDatatype]] = [None]\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.column_type_of_col","title":"column_type_of_col","text":"
column_type_of_col(col: Union[str, Column], df: Optional[DataFrame] = None, simple_return_mode: bool = True) -> Union[DataType, str]\n

Returns the dataType of a Column object as a string.

The Column object does not have a type attribute, so we have to ask the DataFrame its schema and find the type based on the column name. We retrieve the name of the column from the Column object by calling toString() from the JVM.

Examples:

input_df: | str_column | int_column | |------------|------------| | hello | 1 | | world | 2 |

# using the AddOne transformation from the example above\nadd_one = AddOne(\n    columns=[\"str_column\", \"int_column\"],\n    df=input_df,\n)\nadd_one.column_type_of_col(\"str_column\")  # returns \"string\"\nadd_one.column_type_of_col(\"int_column\")  # returns \"integer\"\n# returns IntegerType\nadd_one.column_type_of_col(\"int_column\", simple_return_mode=False)\n

Parameters:

Name Type Description Default col Union[str, Column]

The column to check the type of

required df Optional[DataFrame]

The DataFrame belonging to the column. If not provided, the DataFrame passed to the constructor will be used.

None simple_return_mode bool

If True, the return value will be a simple string. If False, the return value will be a SparkDatatype enum.

True

Returns:

Name Type Description datatype str

The type of the column as a string

Source code in src/koheesio/spark/transformations/__init__.py
def column_type_of_col(\n    self, col: Union[str, Column], df: Optional[DataFrame] = None, simple_return_mode: bool = True\n) -> Union[DataType, str]:\n    \"\"\"\n    Returns the dataType of a Column object as a string.\n\n    The Column object does not have a type attribute, so we have to ask the DataFrame its schema and find the type\n    based on the column name. We retrieve the name of the column from the Column object by calling toString() from\n    the JVM.\n\n    Examples\n    --------\n    __input_df:__\n    | str_column | int_column |\n    |------------|------------|\n    | hello      | 1          |\n    | world      | 2          |\n\n    ```python\n    # using the AddOne transformation from the example above\n    add_one = AddOne(\n        columns=[\"str_column\", \"int_column\"],\n        df=input_df,\n    )\n    add_one.column_type_of_col(\"str_column\")  # returns \"string\"\n    add_one.column_type_of_col(\"int_column\")  # returns \"integer\"\n    # returns IntegerType\n    add_one.column_type_of_col(\"int_column\", simple_return_mode=False)\n    ```\n\n    Parameters\n    ----------\n    col: Union[str, Column]\n        The column to check the type of\n\n    df: Optional[DataFrame]\n        The DataFrame belonging to the column. If not provided, the DataFrame passed to the constructor\n        will be used.\n\n    simple_return_mode: bool\n        If True, the return value will be a simple string. If False, the return value will be a SparkDatatype enum.\n\n    Returns\n    -------\n    datatype: str\n        The type of the column as a string\n    \"\"\"\n    df = df or self.df\n    if not df:\n        raise RuntimeError(\"No valid Dataframe was passed\")\n\n    if not isinstance(col, Column):\n        col = f.col(col)\n\n    # ask the JVM for the name of the column\n    # noinspection PyProtectedMember\n    col_name = col._jc.toString()\n\n    # In order to check the datatype of the column, we have to ask the DataFrame its schema\n    df_col = [c for c in df.schema if c.name == col_name][0]\n\n    if simple_return_mode:\n        return SparkDatatype(df_col.dataType.typeName()).value\n\n    return df_col.dataType\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.get_all_columns_of_specific_type","title":"get_all_columns_of_specific_type","text":"
get_all_columns_of_specific_type(data_type: Union[str, SparkDatatype]) -> List[str]\n

Get all columns from the dataframe of a given type

A DataFrame needs to be available in order to get the columns. If no DataFrame is available, a ValueError will be raised.

Note: only one data type can be passed to this method. If you want to get columns of multiple data types, you have to call this method multiple times.

Parameters:

Name Type Description Default data_type Union[str, SparkDatatype]

The data type to get the columns for

required

Returns:

Type Description List[str]

A list of column names of the given data type

Source code in src/koheesio/spark/transformations/__init__.py
def get_all_columns_of_specific_type(self, data_type: Union[str, SparkDatatype]) -> List[str]:\n    \"\"\"Get all columns from the dataframe of a given type\n\n    A DataFrame needs to be available in order to get the columns. If no DataFrame is available, a ValueError will\n    be raised.\n\n    Note: only one data type can be passed to this method. If you want to get columns of multiple data types, you\n    have to call this method multiple times.\n\n    Parameters\n    ----------\n    data_type: Union[str, SparkDatatype]\n        The data type to get the columns for\n\n    Returns\n    -------\n    List[str]\n        A list of column names of the given data type\n    \"\"\"\n    if not self.df:\n        raise ValueError(\"No dataframe available - cannot get columns\")\n\n    expected_data_type = (SparkDatatype.from_string(data_type) if isinstance(data_type, str) else data_type).value\n\n    columns_of_given_type: List[str] = [\n        col for col in self.df.columns if self.df.schema[col].dataType.typeName() == expected_data_type\n    ]\n    return columns_of_given_type\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.get_columns","title":"get_columns","text":"
get_columns() -> iter\n

Return an iterator of the columns

Source code in src/koheesio/spark/transformations/__init__.py
def get_columns(self) -> iter:\n    \"\"\"Return an iterator of the columns\"\"\"\n    # If `run_for_all_is_set` is True, we want to run the transformation for all columns of a given type\n    if self.run_for_all_is_set:\n        columns = []\n        for data_type in self.ColumnConfig.run_for_all_data_type:\n            columns += self.get_all_columns_of_specific_type(data_type)\n    else:\n        columns = self.columns\n\n    for column in columns:\n        if self.is_column_type_correct(column):\n            yield column\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.get_limit_data_types","title":"get_limit_data_types","text":"
get_limit_data_types()\n

Get the limit_data_type as a list of strings

Source code in src/koheesio/spark/transformations/__init__.py
def get_limit_data_types(self):\n    \"\"\"Get the limit_data_type as a list of strings\"\"\"\n    return [dt.value for dt in self.ColumnConfig.limit_data_type]\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.is_column_type_correct","title":"is_column_type_correct","text":"
is_column_type_correct(column)\n

Check if column type is correct and handle it if not, when limit_data_type is set

Source code in src/koheesio/spark/transformations/__init__.py
def is_column_type_correct(self, column):\n    \"\"\"Check if column type is correct and handle it if not, when limit_data_type is set\"\"\"\n    if not self.limit_data_type_is_set:\n        return True\n\n    if self.column_type_of_col(column) in (limit_data_types := self.get_limit_data_types()):\n        return True\n\n    # Raises a ValueError if the Column object is not of a given type and data_type_strict_mode is set\n    if self.data_type_strict_mode_is_set:\n        raise ValueError(\n            f\"Critical error: {column} is not of type {limit_data_types}. Exception is raised because \"\n            f\"`data_type_strict_mode` is set to True for {self.name}.\"\n        )\n\n    # Otherwise, throws a warning that the Column object is not of a given type\n    self.log.warning(f\"Column `{column}` is not of type `{limit_data_types}` and will be skipped.\")\n    return False\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.set_columns","title":"set_columns","text":"
set_columns(columns_value)\n

Validate columns through the columns configuration provided

Source code in src/koheesio/spark/transformations/__init__.py
@field_validator(\"columns\", mode=\"before\")\ndef set_columns(cls, columns_value):\n    \"\"\"Validate columns through the columns configuration provided\"\"\"\n    columns = columns_value\n    run_for_all_data_type = cls.ColumnConfig.run_for_all_data_type\n\n    if run_for_all_data_type and len(columns) == 0:\n        columns = [\"*\"]\n\n    if columns[0] == \"*\" and not run_for_all_data_type:\n        raise ValueError(\"Cannot use '*' as a column name when no run_for_all_data_type is set\")\n\n    return columns\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformationWithTarget","title":"koheesio.spark.transformations.ColumnsTransformationWithTarget","text":"

Extended ColumnsTransformation class with an additional target_column field

Using this class makes implementing Transformations significantly easier.

Concept

A ColumnsTransformationWithTarget is a ColumnsTransformation with an additional target_column field. This field can be used to store the result of the transformation in a new column.

If the target_column is not provided, the result will be stored in the source column.

If more than one column is passed, the behavior of the Class changes this way:

  • the transformation will be run in a loop against all the given columns
  • automatically handles the renaming of the columns when more than one column is passed
  • the target_column will be used as a suffix. Leaving this blank will result in the original columns being renamed

The func method should be implemented in the child class. This method should return the transformation that will be applied to the column(s). The execute method (already preset) will use the get_columns_with_target method to loop over all the columns and apply this function to transform the DataFrame.

Parameters:

Name Type Description Default columns ListOfColumns

The column (or list of columns) to apply the transformation to. Alias: column. If not provided, the run_for_all_data_type will be used to determine the data type. If run_for_all_data_type is not set, the transformation will be run for all columns of a given type.

* target_column Optional[str]

The name of the column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this input will be used as a suffix instead.

None Example

Writing your own transformation using the ColumnsTransformationWithTarget class:

from pyspark.sql import Column\nfrom koheesio.steps.transformations import ColumnsTransformationWithTarget\n\n\nclass AddOneWithTarget(ColumnsTransformationWithTarget):\n    def func(self, col: Column):\n        return col + 1\n

In the above example, the func method is implemented to add 1 to the values of a given column.

In order to use this transformation, we can call the transform method:

from pyspark.sql import SparkSession\n\n# create a DataFrame with 3 rows\ndf = SparkSession.builder.getOrCreate().range(3)\n\noutput_df = AddOneWithTarget(column=\"id\", target_column=\"new_id\").transform(df)\n

The output_df will now contain the original DataFrame with an additional column called new_id with the values of id + 1.

output_df:

id new_id 0 1 1 2 2 3

Note: The target_column will be used as a suffix when more than one column is given as source. Leaving this blank will result in the original columns being renamed.
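A hedged sketch of the multi-column case, reusing the AddOneWithTarget class from the example above (the exact names of the derived columns follow the library's suffix convention, which is not spelled out here):

```python
# Illustrative only: with multiple source columns, `target_column` acts as a suffix,
# so each source column gets its own derived output column.
output_df = AddOneWithTarget(
    columns=["id", "other_id"],  # assumes `df` contains these columns
    target_column="plus_one",
).transform(df)
```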

"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformationWithTarget.target_column","title":"target_column class-attribute instance-attribute","text":"
target_column: Optional[str] = Field(default=None, alias='target_suffix', description='The column to store the result in. If not provided, the result will be stored in the sourcecolumn. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix')\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformationWithTarget.execute","title":"execute","text":"
execute()\n

Execute on a ColumnsTransformationWithTarget handles self.df (input) and sets self.output.df (output). This can be left unchanged, and hence should not be implemented in the child class.

Source code in src/koheesio/spark/transformations/__init__.py
def execute(self):\n    \"\"\"Execute on a ColumnsTransformationWithTarget handles self.df (input) and set self.output.df (output)\n    This can be left unchanged, and hence should not be implemented in the child class.\n    \"\"\"\n    df = self.df\n\n    for target_column, column in self.get_columns_with_target():\n        func = self.func  # select the applicable function\n        df = df.withColumn(\n            target_column,\n            func(f.col(column)),\n        )\n\n    self.output.df = df\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformationWithTarget.func","title":"func abstractmethod","text":"
func(column: Column) -> Column\n

The function that will be run on a single Column of the DataFrame

The func method should be implemented in the child class. This method should return the transformation that will be applied to the column(s). The execute method (already preset) will use the get_columns_with_target method to loop over all the columns and apply this function to transform the DataFrame.

Parameters:

Name Type Description Default column Column

The column to apply the transformation to

required

Returns:

Type Description Column

The transformed column

Source code in src/koheesio/spark/transformations/__init__.py
@abstractmethod\ndef func(self, column: Column) -> Column:\n    \"\"\"The function that will be run on a single Column of the DataFrame\n\n    The `func` method should be implemented in the child class. This method should return the transformation that\n    will be applied to the column(s). The execute method (already preset) will use the `get_columns_with_target`\n    method to loop over all the columns and apply this function to transform the DataFrame.\n\n    Parameters\n    ----------\n    column: Column\n        The column to apply the transformation to\n\n    Returns\n    -------\n    Column\n        The transformed column\n    \"\"\"\n    raise NotImplementedError\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformationWithTarget.get_columns_with_target","title":"get_columns_with_target","text":"
get_columns_with_target() -> iter\n

Return an iterator of the columns

Works just like get_columns from the ColumnsTransformation class, except that it also handles the target_column.

If more than one column is passed, the behavior of the Class changes this way: - the transformation will be run in a loop against all the given columns - the target_column will be used as a suffix. Leaving this blank will result in the original columns being renamed.

Returns:

Type Description iter

An iterator of tuples containing the target column name and the original column name

Source code in src/koheesio/spark/transformations/__init__.py
def get_columns_with_target(self) -> iter:\n    \"\"\"Return an iterator of the columns\n\n    Works just like in get_columns from the  ColumnsTransformation class except that it handles the `target_column`\n    as well.\n\n    If more than one column is passed, the behavior of the Class changes this way:\n    - the transformation will be run in a loop against all the given columns\n    - the target_column will be used as a suffix. Leaving this blank will result in the original columns being\n        renamed.\n\n    Returns\n    -------\n    iter\n        An iterator of tuples containing the target column name and the original column name\n    \"\"\"\n    columns = [*self.get_columns()]\n\n    for column in columns:\n        # ensures that we at least use the original column name\n        target_column = self.target_column or column\n\n        if len(columns) > 1:  # target_column becomes a suffix when more than 1 column is given\n            # dict.fromkeys is used to avoid duplicates in the name while maintaining order\n            _cols = [column, target_column]\n            target_column = \"_\".join(list(dict.fromkeys(_cols)))\n\n        yield target_column, column\n
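
To illustrate the naming rule used in the loop above (column and suffix names invented for the example):

# for columns [\"a\", \"b\"] and target_column \"plus_one\", the loop yields:\n\"_\".join(dict.fromkeys([\"a\", \"plus_one\"]))  # -> \"a_plus_one\"\n\"_\".join(dict.fromkeys([\"b\", \"plus_one\"]))  # -> \"b_plus_one\"\n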
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.Transformation","title":"koheesio.spark.transformations.Transformation","text":"

Base class for all transformations

Concept

A Transformation is a Step that takes a DataFrame as input and returns a DataFrame as output. The DataFrame is transformed based on the logic implemented in the execute method. Any additional parameters that are needed for the transformation can be passed to the constructor.

Parameters:

Name Type Description Default df

The DataFrame to apply the transformation to. If not provided, the DataFrame has to be passed to the transform method.

required Example
from koheesio.steps.transformations import Transformation\nfrom pyspark.sql import functions as f\n\n\nclass AddOne(Transformation):\n    def execute(self):\n        self.output.df = self.df.withColumn(\"new_column\", f.col(\"old_column\") + 1)\n

In the example above, the execute method is implemented to add 1 to the values of the old_column and store the result in a new column called new_column.

In order to use this transformation, we can call the transform method:

from pyspark.sql import SparkSession\n\n# create a DataFrame with 3 rows\ndf = SparkSession.builder.getOrCreate().range(3)\n\noutput_df = AddOne().transform(df)\n

The output_df will now contain the original DataFrame with an additional column called new_column with the values of old_column + 1.

output_df:

id new_column
0  1
1  2
2  3
...

Alternatively, we can pass the DataFrame to the constructor and call the execute or transform method without any arguments:

output_df = AddOne(df).transform()\n# or\noutput_df = AddOne(df).execute().output.df\n

Note that the transform method was not implemented explicitly in the AddOne class. This is because the transform method is already implemented in the Transformation class. This means that all classes that inherit from the Transformation class have the transform method available; only the execute method needs to be implemented.

"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.Transformation.df","title":"df class-attribute instance-attribute","text":"
df: Optional[DataFrame] = Field(default=None, description='The Spark DataFrame')\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.Transformation.execute","title":"execute abstractmethod","text":"
execute() -> Output\n

Execute on a Transformation should handle self.df (input) and set self.output.df (output)

This method should be implemented in the child class. The input DataFrame is available as self.df and the output DataFrame should be stored in self.output.df.

For example:

def execute(self):\n    self.output.df = self.df.withColumn(\"new_column\", f.col(\"old_column\") + 1)\n

The transform method will call this method and return the output DataFrame.

Source code in src/koheesio/spark/transformations/__init__.py
@abstractmethod\ndef execute(self) -> SparkStep.Output:\n    \"\"\"Execute on a Transformation should handle self.df (input) and set self.output.df (output)\n\n    This method should be implemented in the child class. The input DataFrame is available as `self.df` and the\n    output DataFrame should be stored in `self.output.df`.\n\n    For example:\n    ```python\n    def execute(self):\n        self.output.df = self.df.withColumn(\"new_column\", f.col(\"old_column\") + 1)\n    ```\n\n    The transform method will call this method and return the output DataFrame.\n    \"\"\"\n    # self.df  # input dataframe\n    # self.output.df # output dataframe\n    self.output.df = ...  # implement the transformation logic\n    raise NotImplementedError\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.Transformation.transform","title":"transform","text":"
transform(df: Optional[DataFrame] = None) -> DataFrame\n

Execute the transformation and return the output DataFrame

Note: when creating a child from this, don't implement this transform method. Instead, implement execute!

See Also

Transformation.execute

Parameters:

Name Type Description Default df Optional[DataFrame]

The DataFrame to apply the transformation to. If not provided, the DataFrame passed to the constructor will be used.

None

Returns:

Type Description DataFrame

The transformed DataFrame

Source code in src/koheesio/spark/transformations/__init__.py
def transform(self, df: Optional[DataFrame] = None) -> DataFrame:\n    \"\"\"Execute the transformation and return the output DataFrame\n\n    Note: when creating a child from this, don't implement this transform method. Instead, implement execute!\n\n    See Also\n    --------\n    `Transformation.execute`\n\n    Parameters\n    ----------\n    df: Optional[DataFrame]\n        The DataFrame to apply the transformation to. If not provided, the DataFrame passed to the constructor\n        will be used.\n\n    Returns\n    -------\n    DataFrame\n        The transformed DataFrame\n    \"\"\"\n    self.df = df or self.df\n    if not self.df:\n        raise RuntimeError(\"No valid Dataframe was passed\")\n    self.execute()\n    return self.output.df\n
"},{"location":"api_reference/spark/transformations/arrays.html","title":"Arrays","text":"

A collection of classes for performing various transformations on arrays in PySpark.

These transformations include operations such as removing duplicates, exploding arrays into separate rows, reversing the order of elements, sorting elements, removing certain values, and calculating aggregate statistics like minimum, maximum, sum, mean, and median.

Concept
  • Every transformation in this module is implemented as a class that inherits from the ArrayTransformation class.
  • The ArrayTransformation class is a subclass of ColumnsTransformationWithTarget
  • The ArrayTransformation class implements the func method, which is used to define the transformation logic.
  • The func method takes a column as input and returns a Column object.
  • The Column object is a PySpark column that can be used to perform transformations on a DataFrame column.
  • The ArrayTransformation limits the data type of the transformation to array by setting the ColumnConfig class to run_for_all_data_type = [SparkDatatype.ARRAY] and limit_data_type = [SparkDatatype.ARRAY].
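
To make the concept above concrete, here is a minimal, hypothetical array transformation (the class name and behavior are invented purely for illustration; only the func method needs to be implemented):

from pyspark.sql import Column\nfrom pyspark.sql import functions as F\n\nfrom koheesio.spark.transformations.arrays import ArrayTransformation\n\n\nclass ArrayFirstTwo(ArrayTransformation):\n    \"\"\"Hypothetical transformation that keeps only the first two elements of an array.\"\"\"\n\n    def func(self, column: Column) -> Column:\n        # F.slice(column, start, length) uses a 1-based start index\n        return F.slice(column, 1, 2)\n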
See Also
  • koheesio.spark.transformations Module containing all transformation classes.
  • koheesio.spark.transformations.ColumnsTransformationWithTarget Base class for all transformations that operate on columns and have a target column.
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySortAsc","title":"koheesio.spark.transformations.arrays.ArraySortAsc module-attribute","text":"
ArraySortAsc = ArraySort\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayDistinct","title":"koheesio.spark.transformations.arrays.ArrayDistinct","text":"

Remove duplicates from array

Example
ArrayDistinct(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayDistinct.filter_empty","title":"filter_empty class-attribute instance-attribute","text":"
filter_empty: bool = Field(default=True, description='Remove null, nan, and empty values from array. Default is True.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayDistinct.func","title":"func","text":"
func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n    _fn = F.array_distinct(column)\n\n    # noinspection PyUnresolvedReferences\n    element_type = self.column_type_of_col(column, None, False).elementType\n    is_numeric = spark_data_type_is_numeric(element_type)\n\n    if self.filter_empty:\n        # Remove null values from array\n        if spark_minor_version >= 3.4:\n            # Run array_compact if spark version is 3.4 or higher\n            # https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.array_compact.html\n            # pylint: disable=E0611\n            from pyspark.sql.functions import array_compact as _array_compact\n\n            _fn = _array_compact(_fn)\n            # pylint: enable=E0611\n        else:\n            # Otherwise, remove null from array using array_except\n            _fn = F.array_except(_fn, F.array(F.lit(None)))\n\n        # Remove nan or empty values from array (depends on the type of the elements in array)\n        if is_numeric:\n            # Remove nan from array (float/int/numbers)\n            _fn = F.array_except(_fn, F.array(F.lit(float(\"nan\")).cast(element_type)))\n        else:\n            # Remove empty values from array (string/text)\n            _fn = F.array_except(_fn, F.array(F.lit(\"\"), F.lit(\" \")))\n\n    return _fn\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMax","title":"koheesio.spark.transformations.arrays.ArrayMax","text":"

Return the maximum value in the array

Example
ArrayMax(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMax.func","title":"func","text":"
func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n    # Call for processing of nan values\n    column = super().func(column)\n\n    return F.array_max(column)\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMean","title":"koheesio.spark.transformations.arrays.ArrayMean","text":"

Return the mean of the values in the array.

Note: Only numeric values are supported for calculating the mean.

Example
ArrayMean(column=\"array_column\", target_column=\"average\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMean.func","title":"func","text":"
func(column: Column) -> Column\n

Calculate the mean of the values in the array

Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n    \"\"\"Calculate the mean of the values in the array\"\"\"\n    # raise an error if the array contains non-numeric elements\n    element_type = self.column_type_of_col(col=column, df=None, simple_return_mode=False).elementType\n\n    if not spark_data_type_is_numeric(element_type):\n        raise ValueError(\n            f\"{column = } contains non-numeric values. The array type is {element_type}. \"\n            f\"Only numeric values are supported for calculating a mean.\"\n        )\n\n    _sum = ArraySum.from_step(self).func(column)\n    # Call for processing of nan values\n    column = super().func(column)\n    _size = F.size(column)\n    # return 0 if the size of the array is 0 to avoid division by zero\n    return F.when(_size == 0, F.lit(0)).otherwise(_sum / _size)\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMedian","title":"koheesio.spark.transformations.arrays.ArrayMedian","text":"

Return the median of the values in the array.

The median is the middle value in a sorted, ascending or descending, list of numbers.

  • If the size of the array is even, the median is the average of the two middle numbers.
  • If the size of the array is odd, the median is the middle number.

Note: Only numeric values are supported for calculating the median.

Example
ArrayMedian(column=\"array_column\", target_column=\"median\")\n
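
A brief, hedged sketch of the even/odd behavior on concrete data (the values are invented for this example):

from pyspark.sql import SparkSession\n\nspark = SparkSession.builder.getOrCreate()\ndf = spark.createDataFrame([(1, [1.0, 3.0, 5.0, 7.0]), (2, [1.0, 3.0, 5.0])], [\"id\", \"values\"])\n\n# even-sized array -> average of the two middle elements (4.0); odd-sized -> the middle element (3.0)\noutput_df = ArrayMedian(column=\"values\", target_column=\"median\").transform(df)\n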
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMedian.func","title":"func","text":"
func(column: Column) -> Column\n

Calculate the median of the values in the array

Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n    \"\"\"Calculate the median of the values in the array\"\"\"\n    # Call for processing of nan values\n    column = super().func(column)\n\n    sorted_array = ArraySort.from_step(self).func(column)\n    _size: Column = F.size(sorted_array)\n\n    # Calculate the middle index. If the size is odd, PySpark discards the fractional part.\n    # Use floor function to ensure the result is an integer\n    middle: Column = F.floor((_size + 1) / 2).cast(\"int\")\n\n    # Define conditions\n    is_size_zero: Column = _size == 0\n    is_column_null: Column = column.isNull()\n    is_size_even: Column = _size % 2 == 0\n\n    # Define actions / responses\n    # For even-sized arrays, calculate the average of the two middle elements\n    average_of_middle_elements = (F.element_at(sorted_array, middle) + F.element_at(sorted_array, middle + 1)) / 2\n    # For odd-sized arrays, select the middle element\n    middle_element = F.element_at(sorted_array, middle)\n    # In case the array is empty, return either None or 0\n    none_value = F.lit(None)\n    zero_value = F.lit(0)\n\n    median = (\n        # Check if the size of the array is 0\n        F.when(\n            is_size_zero,\n            # If the size of the array is 0 and the column is null, return None\n            # If the size of the array is 0 and the column is not null, return 0\n            F.when(is_column_null, none_value).otherwise(zero_value),\n        ).otherwise(\n            # If the size of the array is not 0, calculate the median\n            F.when(is_size_even, average_of_middle_elements).otherwise(middle_element)\n        )\n    )\n\n    return median\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMin","title":"koheesio.spark.transformations.arrays.ArrayMin","text":"

Return the minimum value in the array

Example
ArrayMin(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMin.func","title":"func","text":"
func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n    return F.array_min(column)\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayNullNanProcess","title":"koheesio.spark.transformations.arrays.ArrayNullNanProcess","text":"

Process an array by removing NaN and/or NULL values from elements.

Parameters:

Name Type Description Default keep_nan bool

Whether to keep NaN values in the array. If set to True, the NaN values will be kept in the array.

False keep_null bool

Whether to keep NULL values in the array. If set to True, the NULL values will be kept in the array.

False

Returns:

Name Type Description column Column

The processed column with NaN and/or NULL values removed from elements.

Examples:

>>> input_data = [(1, [1.1, 2.1, 4.1, float(\"nan\")])]\n>>> input_schema = StructType([StructField(\"id\", IntegerType(), True),\n    StructField(\"array_float\", ArrayType(FloatType()), True),\n])\n>>> spark = SparkSession.builder.getOrCreate()\n>>> df = spark.createDataFrame(input_data, schema=input_schema)\n>>> transformer = ArrayNullNanProcess(column=\"array_float\", keep_nan=False)\n>>> transformer.transform(df)\n>>> print(transformer.output.df.collect()[0].asDict()[\"array_float\"])\n[1.1, 2.1, 4.1]\n\n>>> input_data = [(1, [1.1, 2.2, 4.1, float(\"nan\")])]\n>>> input_schema = StructType([StructField(\"id\", IntegerType(), True),\n    StructField(\"array_float\", ArrayType(FloatType()), True),\n])\n>>> spark = SparkSession.builder.getOrCreate()\n>>> df = spark.createDataFrame(input_data, schema=input_schema)\n>>> transformer = ArrayNullNanProcess(column=\"array_float\", keep_nan=True)\n>>> transformer.transform(df)\n>>> print(transformer.output.df.collect()[0].asDict()[\"array_float\"])\n[1.1, 2.2, 4.1, nan]\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayNullNanProcess.keep_nan","title":"keep_nan class-attribute instance-attribute","text":"
keep_nan: bool = Field(False, description='Whether to keep nan values in the array. Default is False. If set to True, the nan values will be kept in the array.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayNullNanProcess.keep_null","title":"keep_null class-attribute instance-attribute","text":"
keep_null: bool = Field(False, description='Whether to keep null values in the array. Default is False. If set to True, the null values will be kept in the array.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayNullNanProcess.func","title":"func","text":"
func(column: Column) -> Column\n

Process the given column by removing NaN and/or NULL values from elements.

Parameters:

column : Column The column to be processed.

Returns:

column : Column The processed column with NaN and/or NULL values removed from elements.

Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n    \"\"\"\n    Process the given column by removing NaN and/or NULL values from elements.\n\n    Parameters:\n    -----------\n    column : Column\n        The column to be processed.\n\n    Returns:\n    --------\n    column : Column\n        The processed column with NaN and/or NULL values removed from elements.\n    \"\"\"\n\n    def apply_logic(x: Column):\n        if self.keep_nan is False and self.keep_null is False:\n            logic = x.isNotNull() & ~F.isnan(x)\n        elif self.keep_nan is False:\n            logic = ~F.isnan(x)\n        elif self.keep_null is False:\n            logic = x.isNotNull()\n\n        return logic\n\n    if self.keep_nan is False or self.keep_null is False:\n        column = F.filter(column, apply_logic)\n\n    return column\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayRemove","title":"koheesio.spark.transformations.arrays.ArrayRemove","text":"

Remove a certain value from the array

Parameters:

Name Type Description Default keep_nan bool

Whether to keep NaN values in the array. If set to True, the NaN values will be kept in the array.

False keep_null bool

Whether to keep NULL values in the array. If set to True, the NULL values will be kept in the array.

False Example
ArrayRemove(column=\"array_column\", value=\"value_to_remove\")\n
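
Judging from the func implementation shown further below, value may also be an iterable (list, tuple, or set) of values to remove; a hedged sketch combining this with the make_distinct field:

ArrayRemove(column=\"array_column\", value=[\"a\", \"b\"], make_distinct=True)\n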
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayRemove.make_distinct","title":"make_distinct class-attribute instance-attribute","text":"
make_distinct: bool = Field(default=False, description='Whether to remove duplicates from the array.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayRemove.value","title":"value class-attribute instance-attribute","text":"
value: Any = Field(default=None, description='The value to remove from the array.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayRemove.func","title":"func","text":"
func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n    value = self.value\n\n    column = super().func(column)\n\n    def filter_logic(x: Column, _val: Any):\n        if self.keep_null and self.keep_nan:\n            logic = (x != F.lit(_val)) | x.isNull() | F.isnan(x)\n        elif self.keep_null:\n            logic = (x != F.lit(_val)) | x.isNull()\n        elif self.keep_nan:\n            logic = (x != F.lit(_val)) | F.isnan(x)\n        else:\n            logic = x != F.lit(_val)\n\n        return logic\n\n    # Check if the value is iterable (i.e., a list, tuple, or set)\n    if isinstance(value, (list, tuple, set)):\n        result = reduce(lambda res, val: F.filter(res, lambda x: filter_logic(x, val)), value, column)\n    else:\n        # If the value is not iterable, simply remove the value from the array\n        result = F.filter(column, lambda x: filter_logic(x, value))\n\n    if self.make_distinct:\n        result = F.array_distinct(result)\n\n    return result\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayReverse","title":"koheesio.spark.transformations.arrays.ArrayReverse","text":"

Reverse the order of elements in the array

Example
ArrayReverse(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayReverse.func","title":"func","text":"
func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n    return F.reverse(column)\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySort","title":"koheesio.spark.transformations.arrays.ArraySort","text":"

Sort the elements in the array

By default, the elements are sorted in ascending order. To sort the elements in descending order, set the reverse parameter to True.

Example
ArraySort(column=\"array_column\")\n
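
To sort in descending order instead, set the reverse field documented below, or use the ArraySortDesc alias; a sketch:

ArraySort(column=\"array_column\", reverse=True)\n# equivalent:\nArraySortDesc(column=\"array_column\")\n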
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySort.reverse","title":"reverse class-attribute instance-attribute","text":"
reverse: bool = Field(default=False, description='Sort the elements in the array in a descending order. Default is False.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySort.func","title":"func","text":"
func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n    column = F.array_sort(column)\n    if self.reverse:\n        # Reverse the order of elements in the array\n        column = ArrayReverse.from_step(self).func(column)\n    return column\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySortDesc","title":"koheesio.spark.transformations.arrays.ArraySortDesc","text":"

Sort the elements in the array in descending order

"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySortDesc.reverse","title":"reverse class-attribute instance-attribute","text":"
reverse: bool = True\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySum","title":"koheesio.spark.transformations.arrays.ArraySum","text":"

Return the sum of the values in the array

Parameters:

Name Type Description Default keep_nan bool

Whether to keep NaN values in the array. If set to True, the NaN values will be kept in the array.

False keep_null bool

Whether to keep NULL values in the array. If set to True, the NULL values will be kept in the array.

False Example
ArraySum(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySum.func","title":"func","text":"
func(column: Column) -> Column\n

Using the aggregate function to sum the values in the array

Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n    \"\"\"Using the `aggregate` function to sum the values in the array\"\"\"\n    # raise an error if the array contains non-numeric elements\n    element_type = self.column_type_of_col(column, None, False).elementType\n    if not spark_data_type_is_numeric(element_type):\n        raise ValueError(\n            f\"{column = } contains non-numeric values. The array type is {element_type}. \"\n            f\"Only numeric values are supported for summing.\"\n        )\n\n    # remove na values from array.\n    column = super().func(column)\n\n    # Using the `aggregate` function to sum the values in the array by providing the initial value as 0.0 and the\n    # lambda function to add the elements together. Pyspark will automatically infer the type of the initial value\n    # making 0.0 valid for both integer and float types.\n    initial_value = F.lit(0.0)\n    return F.aggregate(column, initial_value, lambda accumulator, x: accumulator + x)\n
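
A brief sketch of the summing behavior on concrete data (values invented for the example):

from pyspark.sql import SparkSession\n\nspark = SparkSession.builder.getOrCreate()\ndf = spark.createDataFrame([(1, [1.0, 2.0, 3.0])], [\"id\", \"values\"])\n\n# the aggregate above should yield 6.0 in the \"total\" column\noutput_df = ArraySum(column=\"values\", target_column=\"total\").transform(df)\n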
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayTransformation","title":"koheesio.spark.transformations.arrays.ArrayTransformation","text":"

Base class for array transformations

"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayTransformation.ColumnConfig","title":"ColumnConfig","text":"

Set the data type of the Transformation to array

"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayTransformation.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type = [ARRAY]\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayTransformation.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type = [ARRAY]\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayTransformation.func","title":"func","text":"
func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n    raise NotImplementedError(\"This is an abstract class\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.Explode","title":"koheesio.spark.transformations.arrays.Explode","text":"

Explode the array into separate rows

Example
Explode(column=\"array_column\")\n
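
A variant using the distinct and preserve_nulls fields documented below (a sketch; the behavior follows the func implementation shown further down):

Explode(column=\"array_column\", distinct=True, preserve_nulls=False)\n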
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.Explode.distinct","title":"distinct class-attribute instance-attribute","text":"
distinct: bool = Field(False, description='Remove duplicates from the exploded array. Default is False.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.Explode.preserve_nulls","title":"preserve_nulls class-attribute instance-attribute","text":"
preserve_nulls: bool = Field(True, description='Preserve rows with null values in the exploded array by using explode_outer instead of explode.Default is True.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.Explode.func","title":"func","text":"
func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n    if self.distinct:\n        column = ArrayDistinct.from_step(self).func(column)\n    return F.explode_outer(column) if self.preserve_nulls else F.explode(column)\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ExplodeDistinct","title":"koheesio.spark.transformations.arrays.ExplodeDistinct","text":"

Explode the array into separate rows while removing duplicates and empty values

Example
ExplodeDistinct(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ExplodeDistinct.distinct","title":"distinct class-attribute instance-attribute","text":"
distinct: bool = True\n
"},{"location":"api_reference/spark/transformations/camel_to_snake.html","title":"Camel to snake","text":"

Class for converting DataFrame column names from camel case to snake case.

"},{"location":"api_reference/spark/transformations/camel_to_snake.html#koheesio.spark.transformations.camel_to_snake.camel_to_snake_re","title":"koheesio.spark.transformations.camel_to_snake.camel_to_snake_re module-attribute","text":"
camel_to_snake_re = compile('([a-z0-9])([A-Z])')\n
"},{"location":"api_reference/spark/transformations/camel_to_snake.html#koheesio.spark.transformations.camel_to_snake.CamelToSnakeTransformation","title":"koheesio.spark.transformations.camel_to_snake.CamelToSnakeTransformation","text":"

Converts column names from camel case to snake cases

Parameters:

Name Type Description Default columns Optional[ListOfColumns]

The column or columns to convert. If no columns are specified, all columns will be converted. A list of columns or a single column can be specified. For example: [\"column1\", \"column2\"] or \"column1\"

None Example

input_df:

camelCaseColumn snake_case_column
...             ...
output_df = CamelToSnakeTransformation(column=\"camelCaseColumn\").transform(input_df)\n

output_df:

camel_case_column snake_case_column
...               ...

In this example, the column camelCaseColumn is converted to camel_case_column.

Note: the data in the columns is not changed, only the column names.

"},{"location":"api_reference/spark/transformations/camel_to_snake.html#koheesio.spark.transformations.camel_to_snake.CamelToSnakeTransformation.columns","title":"columns class-attribute instance-attribute","text":"
columns: Optional[ListOfColumns] = Field(default='', alias='column', description=\"The column or columns to convert. If no columns are specified, all columns will be converted. A list of columns or a single column can be specified. For example: `['column1', 'column2']` or `'column1'` \")\n
"},{"location":"api_reference/spark/transformations/camel_to_snake.html#koheesio.spark.transformations.camel_to_snake.CamelToSnakeTransformation.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/transformations/camel_to_snake.py
def execute(self):\n    _df = self.df\n\n    # Prepare columns input:\n    columns = self.df.columns if self.columns == [\"*\"] else self.columns\n\n    for column in columns:\n        _df = _df.withColumnRenamed(column, convert_camel_to_snake(column))\n\n    self.output.df = _df\n
"},{"location":"api_reference/spark/transformations/camel_to_snake.html#koheesio.spark.transformations.camel_to_snake.convert_camel_to_snake","title":"koheesio.spark.transformations.camel_to_snake.convert_camel_to_snake","text":"
convert_camel_to_snake(name: str)\n

Converts a string from camelCase to snake_case.

Parameters:

name : str The string to be converted.

Returns:

str The converted string in snake_case.

Source code in src/koheesio/spark/transformations/camel_to_snake.py
def convert_camel_to_snake(name: str):\n    \"\"\"\n    Converts a string from camelCase to snake_case.\n\n    Parameters:\n    ----------\n    name : str\n        The string to be converted.\n\n    Returns:\n    --------\n    str\n        The converted string in snake_case.\n    \"\"\"\n    return camel_to_snake_re.sub(r\"\\1_\\2\", name).lower()\n
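
A quick illustration of the regex above (the input string is invented for the example):

convert_camel_to_snake(\"myColumnName\")  # -> \"my_column_name\"\n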
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html","title":"Cast to datatype","text":"

Transformations to cast a column or set of columns to a given datatype.

Each one of these has been vetted to throw warnings when wrong datatypes are passed (rather than erroring out any job or pipeline).

Furthermore, detailed tests have been added to ensure that types are actually compatible as prescribed.

Concept
  • One can use the CastToDataType class directly, or use one of the more specific subclasses.
  • Each subclass is compatible with 'run_for_all' - meaning that if no column or columns are specified, it will be run for all compatible data types.
  • Compatible Data types are specified in the ColumnConfig class of each subclass, and are documented in the docstring of each subclass.

See class docstrings for more information
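
As a hedged sketch of the 'run_for_all' behavior described in the Concept above, assuming the columns argument can simply be omitted (column names invented for the example):

from pyspark.sql import SparkSession\n\nfrom koheesio.spark.transformations.cast_to_datatype import CastToString\n\ndf = SparkSession.builder.getOrCreate().createDataFrame([(1, 2.5)], [\"an_int\", \"a_double\"])\n\n# with no columns given, the cast should run for every column whose data type is listed\n# in CastToString.ColumnConfig.run_for_all_data_type\noutput_df = CastToString().transform(df)\n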

Note

Dates, Arrays and Maps are not supported by this module.

  • for dates, use the koheesio.spark.transformations.date_time module
  • for arrays, use the koheesio.spark.transformations.arrays module

Classes:

CastToDatatype

Cast a column or set of columns to a given datatype

CastToByte

Cast to Byte (a.k.a. tinyint)

CastToShort

Cast to Short (a.k.a. smallint)

CastToInteger

Cast to Integer (a.k.a. int)

CastToLong

Cast to Long (a.k.a. bigint)

CastToFloat

Cast to Float (a.k.a. real)

CastToDouble

Cast to Double

CastToDecimal

Cast to Decimal (a.k.a. decimal, numeric, dec, BigDecimal)

CastToString

Cast to String

CastToBinary

Cast to Binary (a.k.a. byte array)

CastToBoolean

Cast to Boolean

CastToTimestamp

Cast to Timestamp

Note

The following parameters are common to all classes in this module:

Parameters:

Name Type Description Default columns ListOfColumns

Name of the source column(s). Alias: column

required target_column str

Name of the target column or alias if more than one column is specified. Alias: target_alias

required datatype str or SparkDatatype

Datatype to cast to. Choose from SparkDatatype Enum (only needs to be specified in CastToDatatype, all other classes have a fixed datatype)

required"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBinary","title":"koheesio.spark.transformations.cast_to_datatype.CastToBinary","text":"

Cast to Binary (a.k.a. byte array)

Unsupported datatypes:

The following casts are not supported and will raise an error in Spark:

  • float
  • double
  • decimal
  • boolean
  • timestamp
  • date
  • array<...>
  • map<...,...>
Supported datatypes:

Following casts are supported:

  • byte
  • short
  • integer
  • long
  • string

Koheesio will skip the transformation for these unless specific columns of these types are given as input:

  • void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBinary.datatype","title":"datatype class-attribute instance-attribute","text":"
datatype: Union[str, SparkDatatype] = BINARY\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBinary.ColumnConfig","title":"ColumnConfig","text":"

Set the data types that are compatible with the CastToBinary class.

"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBinary.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type = [*run_for_all_data_type, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBinary.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, STRING]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBoolean","title":"koheesio.spark.transformations.cast_to_datatype.CastToBoolean","text":"

Cast to Boolean

Unsupported datatypes:

Following casts are not supported

  • binary
  • array<...>
  • map<...,...>
Supported datatypes:

Following casts are supported:

  • byte
  • short
  • integer
  • long
  • float
  • double
  • decimal
  • timestamp

Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:

  • string converts to null
  • date converts to null
  • void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBoolean.datatype","title":"datatype class-attribute instance-attribute","text":"
datatype: Union[str, SparkDatatype] = BOOLEAN\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBoolean.ColumnConfig","title":"ColumnConfig","text":"

Set the data types that are compatible with the CastToBoolean class.

"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBoolean.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBoolean.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, TIMESTAMP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToByte","title":"koheesio.spark.transformations.cast_to_datatype.CastToByte","text":"

Cast to Byte (a.k.a. tinyint)

Represents 1-byte signed integer numbers. The range of numbers is from -128 to 127.

Unsupported datatypes:

The following casts are not supported and will raise an error in Spark:

  • binary
  • array<...>
  • map<...,...>
Supported datatypes:

Following casts are supported:

  • boolean
  • timestamp
  • decimal
  • double
  • float
  • long
  • integer
  • short

Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:

  • string converts to null
  • timestamp range of values too small for timestamp to have any meaning
  • date converts to null
  • void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToByte.datatype","title":"datatype class-attribute instance-attribute","text":"
datatype: Union[str, SparkDatatype] = BYTE\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToByte.ColumnConfig","title":"ColumnConfig","text":"

Set the data types that are compatible with the CastToByte class.

"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToByte.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type = [*run_for_all_data_type, STRING, TIMESTAMP, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToByte.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type = [SHORT, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, BOOLEAN]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDatatype","title":"koheesio.spark.transformations.cast_to_datatype.CastToDatatype","text":"

Cast a column or set of columns to a given datatype

Wrapper around pyspark.sql.Column.cast

Concept

This class acts as the base class for all the other CastTo* classes. It is compatible with 'run_for_all' - meaning that if no column or columns are specified, it will be run for all compatible data types.

Example

input_df:

c1 c2
1  2
3  4
output_df = CastToDatatype(\n    column=\"c1\",\n    datatype=\"string\",\n    target_alias=\"c1\",\n).transform(input_df)\n

output_df:

c1    c2
\"1\"  2
\"3\"  4

In the example above, the column c1 is cast to a string datatype. The column c2 is not affected.

Parameters:

Name Type Description Default columns ListOfColumns

Name of the source column(s). Alias: column

required datatype str or SparkDatatype

Datatype to cast to. Choose from SparkDatatype Enum

required target_column str

Name of the target column or alias if more than one column is specified. Alias: target_alias

required"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDatatype.datatype","title":"datatype class-attribute instance-attribute","text":"
datatype: Union[str, SparkDatatype] = Field(default=..., description='Datatype. Choose from SparkDatatype Enum')\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDatatype.func","title":"func","text":"
func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/cast_to_datatype.py
def func(self, column: Column) -> Column:\n    # This is to let the IDE explicitly know that the datatype is not a string, but a `SparkDatatype` Enum\n    datatype: SparkDatatype = self.datatype\n    return column.cast(datatype.spark_type())\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDatatype.validate_datatype","title":"validate_datatype","text":"
validate_datatype(datatype_value) -> SparkDatatype\n

Validate the datatype.

Source code in src/koheesio/spark/transformations/cast_to_datatype.py
@field_validator(\"datatype\")\ndef validate_datatype(cls, datatype_value) -> SparkDatatype:\n    \"\"\"Validate the datatype.\"\"\"\n    # handle string input\n    try:\n        if isinstance(datatype_value, str):\n            datatype_value: SparkDatatype = SparkDatatype.from_string(datatype_value)\n            return datatype_value\n\n        # and let SparkDatatype handle the rest\n        datatype_value: SparkDatatype = SparkDatatype.from_string(datatype_value.value)\n\n    except AttributeError as e:\n        raise AttributeError(f\"Invalid datatype: {datatype_value}\") from e\n\n    return datatype_value\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal","title":"koheesio.spark.transformations.cast_to_datatype.CastToDecimal","text":"

Cast to Decimal (a.k.a. decimal, numeric, dec, BigDecimal)

Represents arbitrary-precision signed decimal numbers. Backed internally by java.math.BigDecimal. A BigDecimal consists of an arbitrary precision integer unscaled value and a 32-bit integer scale.

The DecimalType must have fixed precision (the maximum total number of digits) and scale (the number of digits on the right of dot). For example, (5, 2) can support the value from [-999.99 to 999.99].

The precision can be up to 38, the scale must be less or equal to precision.

Spark sets the default precision and scale to (10, 0) when creating a DecimalType. However, when inferring schema from decimal.Decimal objects, it will be DecimalType(38, 18).

For compatibility reasons, Koheesio sets the default precision and scale to (38, 18) when creating a DecimalType.
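
For example, a cast to DECIMAL(10, 2), overriding the (38, 18) default described above (a sketch; the column name is invented):

CastToDecimal(column=\"amount\", precision=10, scale=2)\n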

Unsupported datatypes:

The following casts are not supported and will raise an error in Spark:

  • binary
  • array<...>
  • map<...,...>
Supported datatypes:

Following casts are supported:

  • byte
  • short
  • integer
  • long
  • float
  • double
  • boolean
  • timestamp
  • date
  • string
  • void
  • decimal Spark will convert existing decimals to null if the precision and scale don't fit the data

Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:

  • string converts to null
  • date converts to null
  • void skipped by default

Parameters:

Name Type Description Default columns ListOfColumns

Name of the source column(s). Alias: column

* target_column str

Name of the target column or alias if more than one column is specified. Alias: target_alias

required precision conint(gt=0, le=38)

the maximum (i.e. total) number of digits (default: 38). Must be > 0.

38 scale conint(ge=0, le=18)

the number of digits on right side of dot. (default: 18). Must be >= 0.

18"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.datatype","title":"datatype class-attribute instance-attribute","text":"
datatype: Union[str, SparkDatatype] = DECIMAL\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.precision","title":"precision class-attribute instance-attribute","text":"
precision: conint(gt=0, le=38) = Field(default=38, description='The maximum total number of digits (precision) of the decimal. Must be > 0. Default is 38')\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.scale","title":"scale class-attribute instance-attribute","text":"
scale: conint(ge=0, le=18) = Field(default=18, description='The number of digits to the right of the decimal point (scale). Must be >= 0. Default is 18')\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.ColumnConfig","title":"ColumnConfig","text":"

Set the data types that are compatible with the CastToDecimal class.

"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, BOOLEAN, TIMESTAMP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.func","title":"func","text":"
func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/cast_to_datatype.py
def func(self, column: Column) -> Column:\n    return column.cast(self.datatype.spark_type(precision=self.precision, scale=self.scale))\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.validate_scale_and_precisions","title":"validate_scale_and_precisions","text":"
validate_scale_and_precisions()\n

Validate the precision and scale values.

Source code in src/koheesio/spark/transformations/cast_to_datatype.py
@model_validator(mode=\"after\")\ndef validate_scale_and_precisions(self):\n    \"\"\"Validate the precision and scale values.\"\"\"\n    precision_value = self.precision\n    scale_value = self.scale\n\n    if scale_value == precision_value:\n        self.log.warning(\"scale and precision are equal, this will result in a null value\")\n    if scale_value > precision_value:\n        raise ValueError(\"scale must be < precision\")\n\n    return self\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDouble","title":"koheesio.spark.transformations.cast_to_datatype.CastToDouble","text":"

Cast to Double

Represents 8-byte double-precision floating point numbers. The range of numbers is from -1.7976931348623157E308 to 1.7976931348623157E308.

Unsupported datatypes:

The following casts are not supported and will raise an error in Spark:

  • binary
  • array<...>
  • map<...,...>
Supported datatypes:

Following casts are supported:

  • byte
  • short
  • integer
  • long
  • float
  • decimal
  • boolean
  • timestamp

Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:

  • string converts to null
  • date converts to null
  • void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDouble.datatype","title":"datatype class-attribute instance-attribute","text":"
datatype: Union[str, SparkDatatype] = DOUBLE\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDouble.ColumnConfig","title":"ColumnConfig","text":"

Set the data types that are compatible with the CastToDouble class.

"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDouble.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDouble.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, FLOAT, DECIMAL, BOOLEAN, TIMESTAMP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToFloat","title":"koheesio.spark.transformations.cast_to_datatype.CastToFloat","text":"

Cast to Float (a.k.a. real)

Represents 4-byte single-precision floating point numbers. The range of numbers is from -3.402823E38 to 3.402823E38.

Unsupported datatypes:

The following casts are not supported and will raise an error in Spark:

  • binary
  • array<...>
  • map<...,...>
Supported datatypes:

Following casts are supported:

  • byte
  • short
  • integer
  • long
  • double
  • decimal
  • boolean

Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:

  • timestamp precision is lost (use CastToDouble instead)
  • string converts to null
  • date converts to null
  • void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToFloat.datatype","title":"datatype class-attribute instance-attribute","text":"
datatype: Union[str, SparkDatatype] = FLOAT\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToFloat.ColumnConfig","title":"ColumnConfig","text":"

Set the data types that are compatible with the CastToFloat class.

"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToFloat.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToFloat.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, DOUBLE, DECIMAL, BOOLEAN]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToInteger","title":"koheesio.spark.transformations.cast_to_datatype.CastToInteger","text":"

Cast to Integer (a.k.a. int)

Represents 4-byte signed integer numbers. The range of numbers is from -2147483648 to 2147483647.

Unsupported datatypes:

The following casts are not supported and will raise an error in Spark:

  • binary
  • array<...>
  • map<...,...>
Supported datatypes:

Following casts are supported:

  • byte
  • short
  • long
  • float
  • double
  • decimal
  • boolean
  • timestamp

Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:

  • string converts to null
  • date converts to null
  • void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToInteger.datatype","title":"datatype class-attribute instance-attribute","text":"
datatype: Union[str, SparkDatatype] = INTEGER\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToInteger.ColumnConfig","title":"ColumnConfig","text":"

Set the data types that are compatible with the CastToInteger class.

"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToInteger.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToInteger.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type = [BYTE, SHORT, LONG, FLOAT, DOUBLE, DECIMAL, BOOLEAN, TIMESTAMP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToLong","title":"koheesio.spark.transformations.cast_to_datatype.CastToLong","text":"

Cast to Long (a.k.a. bigint)

Represents 8-byte signed integer numbers. The range of numbers is from -9223372036854775808 to 9223372036854775807.

Unsupported datatypes:

The following casts are not supported and will raise an error in Spark:

  • binary
  • array<...>
  • map<...,...>
Supported datatypes:

The following casts are supported:

  • byte
  • short
  • long
  • float
  • double
  • decimal
  • boolean
  • timestamp

Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:

  • string converts to null
  • date converts to null
  • void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToLong.datatype","title":"datatype class-attribute instance-attribute","text":"
datatype: Union[str, SparkDatatype] = LONG\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToLong.ColumnConfig","title":"ColumnConfig","text":"

Set the data types that are compatible with the CastToLong class.

"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToLong.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToLong.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type = [BYTE, SHORT, INTEGER, FLOAT, DOUBLE, DECIMAL, BOOLEAN, TIMESTAMP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToShort","title":"koheesio.spark.transformations.cast_to_datatype.CastToShort","text":"

Cast to Short (a.k.a. smallint)

Represents 2-byte signed integer numbers. The range of numbers is from -32768 to 32767.

Unsupported datatypes:

The following casts are not supported and will raise an error in Spark:

  • binary
  • array<...>
  • map<...,...>
Supported datatypes:

The following casts are supported:

  • byte
  • integer
  • long
  • float
  • double
  • decimal
  • string
  • boolean
  • timestamp
  • date
  • void

Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:

  • string converts to null
  • timestamp range of values too small for timestamp to have any meaning
  • date converts to null
  • void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToShort.datatype","title":"datatype class-attribute instance-attribute","text":"
datatype: Union[str, SparkDatatype] = SHORT\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToShort.ColumnConfig","title":"ColumnConfig","text":"

Set the data types that are compatible with the CastToShort class.

"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToShort.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type = [*run_for_all_data_type, STRING, TIMESTAMP, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToShort.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type = [BYTE, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, BOOLEAN]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToString","title":"koheesio.spark.transformations.cast_to_datatype.CastToString","text":"

Cast to String

Supported datatypes:

The following casts are supported:

  • byte
  • short
  • integer
  • long
  • float
  • double
  • decimal
  • binary
  • boolean
  • timestamp
  • date
  • array
  • map
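Example (illustrative sketch, not part of the original reference - this assumes CastToString accepts the column input inherited from ColumnsTransformation):

from pyspark.sql import SparkSession\n\nfrom koheesio.spark.transformations.cast_to_datatype import CastToString\n\nspark = SparkSession.builder.getOrCreate()\ndf = spark.createDataFrame([(1, [1, 2, 3])], [\"id\", \"content\"])\n\n# the array column 'content' is cast to its string representation, e.g. \"[1, 2, 3]\"\noutput_df = CastToString(column=\"content\").transform(df)\n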
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToString.datatype","title":"datatype class-attribute instance-attribute","text":"
datatype: Union[str, SparkDatatype] = STRING\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToString.ColumnConfig","title":"ColumnConfig","text":"

Set the data types that are compatible with the CastToString class.

"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToString.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type = [*run_for_all_data_type, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToString.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, BINARY, BOOLEAN, TIMESTAMP, DATE, ARRAY, MAP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToTimestamp","title":"koheesio.spark.transformations.cast_to_datatype.CastToTimestamp","text":"

Cast to Timestamp

A numeric timestamp is the number of seconds since January 1, 1970 00:00:00.000000000 UTC. It is not advised to use this on small integer types, as their range of values is too small for a timestamp to have any meaning.

For more fine-grained control over the timestamp format, use the date_time module. This allows for parsing strings to timestamps and vice versa.

See Also
  • koheesio.spark.transformations.date_time
  • https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#timestamp-pattern
Unsupported datatypes:

The following casts are not supported:

  • binary
  • array<...>
  • map<...,...>
Supported datatypes:

The following casts are supported:

  • integer
  • long
  • float
  • double
  • decimal
  • date

Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:

  • boolean: range of values too small for timestamp to have any meaning
  • byte: range of values too small for timestamp to have any meaning
  • string: converts to null in most cases, use date_time module instead
  • short: range of values too small for timestamp to have any meaning
  • void: skipped by default
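Example (illustrative sketch, not part of the original reference - this assumes CastToTimestamp accepts the column input inherited from ColumnsTransformation):

from pyspark.sql import SparkSession\n\nfrom koheesio.spark.transformations.cast_to_datatype import CastToTimestamp\n\nspark = SparkSession.builder.getOrCreate()\ndf = spark.createDataFrame([(1, 1672531200)], [\"id\", \"epoch_seconds\"])\n\n# 1672531200 seconds since 1970-01-01 UTC is interpreted as 2023-01-01 00:00:00 (UTC)\noutput_df = CastToTimestamp(column=\"epoch_seconds\").transform(df)\n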
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToTimestamp.datatype","title":"datatype class-attribute instance-attribute","text":"
datatype: Union[str, SparkDatatype] = TIMESTAMP\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToTimestamp.ColumnConfig","title":"ColumnConfig","text":"

Set the data types that are compatible with the CastToTimestamp class.

"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToTimestamp.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type = [*run_for_all_data_type, BOOLEAN, BYTE, SHORT, STRING, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToTimestamp.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type = [INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, DATE]\n
"},{"location":"api_reference/spark/transformations/drop_column.html","title":"Drop column","text":"

This module defines the DropColumn class, a subclass of ColumnsTransformation.

"},{"location":"api_reference/spark/transformations/drop_column.html#koheesio.spark.transformations.drop_column.DropColumn","title":"koheesio.spark.transformations.drop_column.DropColumn","text":"

Drop one or more columns

The DropColumn class is used to drop one or more columns from a PySpark DataFrame. It wraps the pyspark.DataFrame.drop function and can handle either a single string or a list of strings as input.

If a specified column does not exist in the DataFrame, no error or warning is thrown, and all existing columns will remain.

Expected behavior
  • When the column does not exist, all columns will remain (no error or warning is thrown)
  • Either a single string, or a list of strings can be specified
Example

df:

product | amount | country
Banana lemon orange | 1000 | USA
Carrots Blueberries | 1500 | USA
Beans | 1600 | USA
output_df = DropColumn(column=\"product\").transform(df)\n

output_df:

amount | country
1000 | USA
1500 | USA
1600 | USA

In this example, the product column is dropped from the DataFrame df.

"},{"location":"api_reference/spark/transformations/drop_column.html#koheesio.spark.transformations.drop_column.DropColumn.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/transformations/drop_column.py
def execute(self):\n    self.log.info(f\"{self.column=}\")\n    self.output.df = self.df.drop(*self.columns)\n
"},{"location":"api_reference/spark/transformations/dummy.html","title":"Dummy","text":"

Dummy transformation for testing purposes.

"},{"location":"api_reference/spark/transformations/dummy.html#koheesio.spark.transformations.dummy.DummyTransformation","title":"koheesio.spark.transformations.dummy.DummyTransformation","text":"

Dummy transformation for testing purposes.

This transformation adds a new column hello to the DataFrame with the value world.

It is intended for testing purposes or for use in examples or reference documentation.

Example

input_df:

id
1
output_df = DummyTransformation().transform(input_df)\n

output_df:

id | hello
1 | world

In this example, the hello column is added to the DataFrame input_df.

"},{"location":"api_reference/spark/transformations/dummy.html#koheesio.spark.transformations.dummy.DummyTransformation.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/transformations/dummy.py
def execute(self):\n    self.output.df = self.df.withColumn(\"hello\", lit(\"world\"))\n
"},{"location":"api_reference/spark/transformations/get_item.html","title":"Get item","text":"

Transformation to wrap around the pyspark getItem function

"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem","title":"koheesio.spark.transformations.get_item.GetItem","text":"

Get item from list or map (dictionary)

Wrapper around pyspark.sql.functions.getItem

GetItem is strict about the data type of the column. If the column is not a list or a map, an error will be raised.

Note

Only MapType and ArrayType are supported.

Parameters:

  • columns (Union[str, List[str]]), required: The column (or list of columns) to get the item from. Alias: column
  • target_column (Optional[str], default: None): The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
  • key (Union[int, str]), required: The key (or index) to get from the list or map. If the column is a list (ArrayType), this should be an integer. If the column is a dict (MapType), this should be a string. Alias: index
Example"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem--example-with-list-arraytype","title":"Example with list (ArrayType)","text":"

By specifying an integer for the parameter \"key\", getItem knows to get the element at index n of a list (index starts at 0).

input_df:

id | content
1 | [1, 2, 3]
2 | [4, 5]
3 | [6]
4 | []
output_df = GetItem(\n    column=\"content\",\n    index=1,  # get the second element of the list\n    target_column=\"item\",\n).transform(input_df)\n

output_df:

id | content | item
1 | [1, 2, 3] | 2
2 | [4, 5] | 5
3 | [6] | null
4 | [] | null
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem--example-with-a-dict-maptype","title":"Example with a dict (MapType)","text":"

input_df:

id | content
1 | {key1 -> value1}
2 | {key1 -> value2}
3 | {key2 -> hello}
4 | {key2 -> world}

output_df = GetItem(\n    column=\"content\",\n    key=\"key2\",\n    target_column=\"item\",\n).transform(input_df)\n
As we request the key \"key2\", the first 2 rows will be null, because those rows do not contain \"key2\".

output_df:

id | content | item
1 | {key1 -> value1} | null
2 | {key1 -> value2} | null
3 | {key2 -> hello} | hello
4 | {key2 -> world} | world
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.key","title":"key class-attribute instance-attribute","text":"
key: Union[int, str] = Field(default=..., alias='index', description='The key (or index) to get from the list or map. If the column is a list (ArrayType), this should be an integer. If the column is a dict (MapType), this should be a string. Alias: index')\n
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.ColumnConfig","title":"ColumnConfig","text":"

Limit the data types to ArrayType and MapType.

"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.ColumnConfig.data_type_strict_mode","title":"data_type_strict_mode class-attribute instance-attribute","text":"
data_type_strict_mode = True\n
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type = run_for_all_data_type\n
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type = [ARRAY, MAP]\n
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.func","title":"func","text":"
func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/get_item.py
def func(self, column: Column) -> Column:\n    return get_item(column, self.key)\n
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.get_item","title":"koheesio.spark.transformations.get_item.get_item","text":"
get_item(column: Column, key: Union[str, int])\n

Wrapper around pyspark.sql.functions.getItem

Parameters:

  • column (Column), required: The column to get the item from
  • key (Union[str, int]), required: The key (or index) to get from the list or map. If the column is a list (ArrayType), this should be an integer. If the column is a dict (MapType), this should be a string.

Returns:

  • Column: The column with the item

Source code in src/koheesio/spark/transformations/get_item.py
def get_item(column: Column, key: Union[str, int]):\n    \"\"\"\n    Wrapper around pyspark.sql.functions.getItem\n\n    Parameters\n    ----------\n    column : Column\n        The column to get the item from\n    key : Union[str, int]\n        The key (or index) to get from the list or map. If the column is a list (ArrayType), this should be an integer.\n        If the column is a dict (MapType), this should be a string.\n\n    Returns\n    -------\n    Column\n        The column with the item\n    \"\"\"\n    return column.getItem(key)\n
"},{"location":"api_reference/spark/transformations/hash.html","title":"Hash","text":"

Module for hashing data using SHA-2 family of hash functions

See the docstring of the Sha2Hash class for more information.

"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.HASH_ALGORITHM","title":"koheesio.spark.transformations.hash.HASH_ALGORITHM module-attribute","text":"
HASH_ALGORITHM = Literal[224, 256, 384, 512]\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.STRING","title":"koheesio.spark.transformations.hash.STRING module-attribute","text":"
STRING = STRING\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.Sha2Hash","title":"koheesio.spark.transformations.hash.Sha2Hash","text":"

Hash the value of one or more columns using the SHA-2 family of hash functions

Mild wrapper around pyspark.sql.functions.sha2

  • https://spark.apache.org/docs/3.3.2/api/python/reference/api/pyspark.sql.functions.sha2.html

Returns the hex string result of SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512).

Note

This function allows concatenating the values of multiple columns together prior to hashing.

Parameters:

  • columns (Union[str, List[str]]), required: The column (or list of columns) to hash. Alias: column
  • delimiter (Optional[str], default: '|'): Optional separator for the string that will eventually be hashed. Defaults to '|'
  • num_bits (Optional[HASH_ALGORITHM], default: 256): Algorithm to use for sha2 hash. Defaults to 256. Should be one of 224, 256, 384, 512
  • target_column (str), required: The generated hash will be written to the column name specified here
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.Sha2Hash.delimiter","title":"delimiter class-attribute instance-attribute","text":"
delimiter: Optional[str] = Field(default='|', description=\"Optional separator for the string that will eventually be hashed. Defaults to '|'\")\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.Sha2Hash.num_bits","title":"num_bits class-attribute instance-attribute","text":"
num_bits: Optional[HASH_ALGORITHM] = Field(default=256, description='Algorithm to use for sha2 hash. Defaults to 256. Should be one of 224, 256, 384, 512')\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.Sha2Hash.target_column","title":"target_column class-attribute instance-attribute","text":"
target_column: str = Field(default=..., description='The generated hash will be written to the column name specified here')\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.Sha2Hash.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/transformations/hash.py
def execute(self):\n    columns = list(self.get_columns())\n    self.output.df = (\n        self.df.withColumn(\n            self.target_column, sha2_hash(columns=columns, delimiter=self.delimiter, num_bits=self.num_bits)\n        )\n        if columns\n        else self.df\n    )\n
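Example (illustrative sketch based on the parameters documented above; the data is made up):

from pyspark.sql import SparkSession\n\nfrom koheesio.spark.transformations.hash import Sha2Hash\n\nspark = SparkSession.builder.getOrCreate()\ndf = spark.createDataFrame([(\"John\", \"Doe\")], [\"first_name\", \"last_name\"])\n\n# concatenate first_name and last_name with the default '|' delimiter and\n# store the SHA-256 hex digest in the 'hash' column\noutput_df = Sha2Hash(\n    columns=[\"first_name\", \"last_name\"],\n    target_column=\"hash\",\n    num_bits=256,\n).transform(df)\n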
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.sha2_hash","title":"koheesio.spark.transformations.hash.sha2_hash","text":"
sha2_hash(columns: List[str], delimiter: Optional[str] = '|', num_bits: Optional[HASH_ALGORITHM] = 256)\n

Hash the value of one or more columns using the SHA-2 family of hash functions

Mild wrapper around pyspark.sql.functions.sha2

  • https://spark.apache.org/docs/3.3.2/api/python/reference/api/pyspark.sql.functions.sha2.html

Returns the hex string result of SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512). This function allows concatenating the values of multiple columns together prior to hashing.

If a null is passed, the result will also be null.

Parameters:

  • columns (List[str]), required: The columns to hash
  • delimiter (Optional[str], default: '|'): Optional separator for the string that will eventually be hashed. Defaults to '|'
  • num_bits (Optional[HASH_ALGORITHM], default: 256): Algorithm to use for sha2 hash. Defaults to 256. Should be one of 224, 256, 384, 512

Source code in src/koheesio/spark/transformations/hash.py
def sha2_hash(columns: List[str], delimiter: Optional[str] = \"|\", num_bits: Optional[HASH_ALGORITHM] = 256):\n    \"\"\"\n    hash the value of 1 or more columns using SHA-2 family of hash functions\n\n    Mild wrapper around pyspark.sql.functions.sha2\n\n    - https://spark.apache.org/docs/3.3.2/api/python/reference/api/pyspark.sql.functions.sha2.html\n\n    Returns the hex string result of SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512).\n    This function allows concatenating the values of multiple columns together prior to hashing.\n\n    If a null is passed, the result will also be null.\n\n    Parameters\n    ----------\n    columns : List[str]\n        The columns to hash\n    delimiter : Optional[str], optional, default=|\n        Optional separator for the string that will eventually be hashed. Defaults to '|'\n    num_bits : Optional[HASH_ALGORITHM], optional, default=256\n        Algorithm to use for sha2 hash. Defaults to 256. Should be one of 224, 256, 384, 512\n    \"\"\"\n    # make sure all columns are of type pyspark.sql.Column and cast to string\n    _columns = []\n    for c in columns:\n        if isinstance(c, str):\n            c: Column = col(c)\n        _columns.append(c.cast(STRING.spark_type()))\n\n    # concatenate columns if more than 1 column is provided\n    if len(_columns) > 1:\n        column = concat_ws(delimiter, *_columns)\n    else:\n        column = _columns[0]\n\n    return sha2(column, num_bits)\n
"},{"location":"api_reference/spark/transformations/lookup.html","title":"Lookup","text":"

Lookup transformation for joining two dataframes together

Classes:

  • JoinMapping
  • TargetColumn
  • JoinType
  • JoinHint
  • DataframeLookup
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup","title":"koheesio.spark.transformations.lookup.DataframeLookup","text":"

Lookup transformation for joining two dataframes together

Parameters:

  • df (DataFrame), required: The left Spark DataFrame
  • other (DataFrame), required: The right Spark DataFrame
  • on (List[JoinMapping] | JoinMapping), required: List of join mappings. If only one mapping is passed, it can be passed as a single object.
  • targets (List[TargetColumn] | TargetColumn), required: List of target columns. If only one target is passed, it can be passed as a single object.
  • how (JoinType), required: What type of join to perform. Defaults to left. See JoinType for more information.
  • hint (JoinHint), required: What type of join hint to use. Defaults to None. See JoinHint for more information.
Example
from pyspark.sql import SparkSession\nfrom koheesio.spark.transformations.lookup import (\n    DataframeLookup,\n    JoinMapping,\n    TargetColumn,\n    JoinType,\n)\n\nspark = SparkSession.builder.getOrCreate()\n\n# create the dataframes\nleft_df = spark.createDataFrame([(1, \"A\"), (2, \"B\")], [\"id\", \"value\"])\nright_df = spark.createDataFrame([(1, \"A\"), (3, \"C\")], [\"id\", \"value\"])\n\n# perform the lookup\nlookup = DataframeLookup(\n    df=left_df,\n    other=right_df,\n    on=JoinMapping(source_column=\"id\", joined_column=\"id\"),\n    targets=TargetColumn(target_column=\"value\", target_column_alias=\"right_value\"),\n    how=JoinType.LEFT,\n)\n\noutput_df = lookup.transform()\n

output_df:

id | value | right_value
1 | A | A
2 | B | null

In this example, the left_df and right_df dataframes are joined together using the id column. The value column from the right_df is aliased as right_value in the output dataframe.

"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.df","title":"df class-attribute instance-attribute","text":"
df: DataFrame = Field(default=None, description='The left Spark DataFrame')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.hint","title":"hint class-attribute instance-attribute","text":"
hint: Optional[JoinHint] = Field(default=None, description='What type of join hint to use. Defaults to None. ' + __doc__)\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.how","title":"how class-attribute instance-attribute","text":"
how: Optional[JoinType] = Field(default=LEFT, description='What type of join to perform. Defaults to left. ' + __doc__)\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.on","title":"on class-attribute instance-attribute","text":"
on: Union[List[JoinMapping], JoinMapping] = Field(default=..., alias='join_mapping', description='List of join mappings. If only one mapping is passed, it can be passed as a single object.')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.other","title":"other class-attribute instance-attribute","text":"
other: DataFrame = Field(default=None, description='The right Spark DataFrame')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.targets","title":"targets class-attribute instance-attribute","text":"
targets: Union[List[TargetColumn], TargetColumn] = Field(default=..., alias='target_columns', description='List of target columns. If only one target is passed, it can be passed as a single object.')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.Output","title":"Output","text":"

Output for the lookup transformation

"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.Output.left_df","title":"left_df class-attribute instance-attribute","text":"
left_df: DataFrame = Field(default=..., description='The left Spark DataFrame')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.Output.right_df","title":"right_df class-attribute instance-attribute","text":"
right_df: DataFrame = Field(default=..., description='The right Spark DataFrame')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.execute","title":"execute","text":"
execute() -> Output\n

Execute the lookup transformation

Source code in src/koheesio/spark/transformations/lookup.py
def execute(self) -> Output:\n    \"\"\"Execute the lookup transformation\"\"\"\n    # prepare the right dataframe\n    prepared_right_df = self.get_right_df().select(\n        *[join_mapping.column for join_mapping in self.on],\n        *[target.column for target in self.targets],\n    )\n    if self.hint:\n        prepared_right_df = prepared_right_df.hint(self.hint)\n\n    # generate the output\n    self.output.left_df = self.df\n    self.output.right_df = prepared_right_df\n    self.output.df = self.df.join(\n        prepared_right_df,\n        on=[join_mapping.source_column for join_mapping in self.on],\n        how=self.how,\n    )\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.get_right_df","title":"get_right_df","text":"
get_right_df() -> DataFrame\n

Get the right side dataframe

Source code in src/koheesio/spark/transformations/lookup.py
def get_right_df(self) -> DataFrame:\n    \"\"\"Get the right side dataframe\"\"\"\n    return self.other\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.set_list","title":"set_list","text":"
set_list(value)\n

Ensure that we can pass either a single object, or a list of objects

Source code in src/koheesio/spark/transformations/lookup.py
@field_validator(\"on\", \"targets\")\ndef set_list(cls, value):\n    \"\"\"Ensure that we can pass either a single object, or a list of objects\"\"\"\n    return [value] if not isinstance(value, list) else value\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinHint","title":"koheesio.spark.transformations.lookup.JoinHint","text":"

Supported join hints

"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinHint.BROADCAST","title":"BROADCAST class-attribute instance-attribute","text":"
BROADCAST = 'broadcast'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinHint.MERGE","title":"MERGE class-attribute instance-attribute","text":"
MERGE = 'merge'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinMapping","title":"koheesio.spark.transformations.lookup.JoinMapping","text":"

Mapping for joining two dataframes together

"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinMapping.column","title":"column property","text":"
column: Column\n

Get the join mapping as a pyspark.sql.Column object

"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinMapping.other_column","title":"other_column instance-attribute","text":"
other_column: str\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinMapping.source_column","title":"source_column instance-attribute","text":"
source_column: str\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType","title":"koheesio.spark.transformations.lookup.JoinType","text":"

Supported join types

"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.ANTI","title":"ANTI class-attribute instance-attribute","text":"
ANTI = 'anti'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.CROSS","title":"CROSS class-attribute instance-attribute","text":"
CROSS = 'cross'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.FULL","title":"FULL class-attribute instance-attribute","text":"
FULL = 'full'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.INNER","title":"INNER class-attribute instance-attribute","text":"
INNER = 'inner'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.LEFT","title":"LEFT class-attribute instance-attribute","text":"
LEFT = 'left'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.RIGHT","title":"RIGHT class-attribute instance-attribute","text":"
RIGHT = 'right'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.SEMI","title":"SEMI class-attribute instance-attribute","text":"
SEMI = 'semi'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.TargetColumn","title":"koheesio.spark.transformations.lookup.TargetColumn","text":"

Target column for the joined dataframe

"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.TargetColumn.column","title":"column property","text":"
column: Column\n

Get the target column as a pyspark.sql.Column object

"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.TargetColumn.target_column","title":"target_column instance-attribute","text":"
target_column: str\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.TargetColumn.target_column_alias","title":"target_column_alias instance-attribute","text":"
target_column_alias: str\n
"},{"location":"api_reference/spark/transformations/repartition.html","title":"Repartition","text":"

Repartition Transformation

"},{"location":"api_reference/spark/transformations/repartition.html#koheesio.spark.transformations.repartition.Repartition","title":"koheesio.spark.transformations.repartition.Repartition","text":"

Wrapper around DataFrame.repartition

With repartition, the number of partitions can be given as an optional value. If this is not provided, a default value is used. The default number of partitions is defined by the spark config 'spark.sql.shuffle.partitions' (200 by default), and will never exceed the number of rows in the DataFrame (whichever value is lower).

If columns are omitted, the entire DataFrame is repartitioned without considering the particular values in the columns.

Parameters:

  • column (Optional[Union[str, List[str]]], default: None): Name of the source column(s). If omitted, the entire DataFrame is repartitioned without considering the particular values in the columns. Alias: columns
  • num_partitions (Optional[int], default: None): The number of partitions to repartition to. If omitted, the default number of partitions is used as defined by the spark config 'spark.sql.shuffle.partitions'.
Example
Repartition(column=[\"c1\", \"c2\"], num_partitions=3)  # results in 3 partitions\nRepartition(column=\"c1\", num_partitions=2)  # results in 2 partitions\nRepartition(column=[\"c1\", \"c2\"])  # results in <= 200 partitions\nRepartition(num_partitions=5)  # results in 5 partitions\n
"},{"location":"api_reference/spark/transformations/repartition.html#koheesio.spark.transformations.repartition.Repartition.columns","title":"columns class-attribute instance-attribute","text":"
columns: Optional[ListOfColumns] = Field(default='', alias='column', description='Name of the source column(s)')\n
"},{"location":"api_reference/spark/transformations/repartition.html#koheesio.spark.transformations.repartition.Repartition.numPartitions","title":"numPartitions class-attribute instance-attribute","text":"
numPartitions: Optional[int] = Field(default=None, alias='num_partitions', description=\"The number of partitions to repartition to. If omitted, the default number of partitions is used as defined by the spark config 'spark.sql.shuffle.partitions'.\")\n
"},{"location":"api_reference/spark/transformations/repartition.html#koheesio.spark.transformations.repartition.Repartition.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/transformations/repartition.py
def execute(self):\n    # Prepare columns input:\n    columns = self.df.columns if self.columns == [\"*\"] else self.columns\n    # Prepare repartition input:\n    #  num_partitions comes first, but if it is not provided it should not be included as None.\n    repartition_inputs = [i for i in [self.numPartitions, *columns] if i]\n    self.output.df = self.df.repartition(*repartition_inputs)\n
"},{"location":"api_reference/spark/transformations/replace.html","title":"Replace","text":"

Transformation to replace a particular value in a column with another one

"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace","title":"koheesio.spark.transformations.replace.Replace","text":"

Replace a particular value in a column with another one

Can handle empty strings (\"\") as well as NULL / None values.

Unsupported datatypes:

The following casts are not supported and will raise an error in Spark:

  • binary
  • boolean
  • array<...>
  • map<...,...>
Supported datatypes:

The following casts are supported:

  • byte
  • short
  • integer
  • long
  • float
  • double
  • decimal
  • timestamp
  • date
  • string
  • void skipped by default

Any supported non-string datatype will be cast to string before the replacement is done.

Example

input_df:

id | string
1 | hello
2 | world
3 |
output_df = Replace(\n    column=\"string\",\n    from_value=\"hello\",\n    to_value=\"programmer\",\n).transform(input_df)\n

output_df:

id | string
1 | programmer
2 | world
3 |

In this example, the value \"hello\" in the column \"string\" is replaced with \"programmer\".

"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.from_value","title":"from_value class-attribute instance-attribute","text":"
from_value: Optional[str] = Field(default=None, alias='from', description=\"The original value that needs to be replaced. If no value is given, all 'null' values will be replaced with the to_value\")\n
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.to_value","title":"to_value class-attribute instance-attribute","text":"
to_value: str = Field(default=..., alias='to', description='The new value to replace this with')\n
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.ColumnConfig","title":"ColumnConfig","text":"

Column type configurations for the column to be replaced

"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type = [*run_for_all_data_type, VOID]\n
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, STRING, TIMESTAMP, DATE]\n
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.func","title":"func","text":"
func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/replace.py
def func(self, column: Column) -> Column:\n    return replace(column=column, from_value=self.from_value, to_value=self.to_value)\n
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.replace","title":"koheesio.spark.transformations.replace.replace","text":"
replace(column: Union[Column, str], to_value: str, from_value: Optional[str] = None)\n

Function to replace a particular value in a column with another one

Source code in src/koheesio/spark/transformations/replace.py
def replace(column: Union[Column, str], to_value: str, from_value: Optional[str] = None):\n    \"\"\"Function to replace a particular value in a column with another one\"\"\"\n    # make sure we have a Column object\n    if isinstance(column, str):\n        column = col(column)\n\n    if not from_value:\n        condition = column.isNull()\n    else:\n        condition = column == from_value\n\n    return when(condition, lit(to_value)).otherwise(column)\n
"},{"location":"api_reference/spark/transformations/row_number_dedup.html","title":"Row number dedup","text":"

This module contains the RowNumberDedup class, which performs a row_number deduplication operation on a DataFrame.

See the docstring of the RowNumberDedup class for more information.

"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup","title":"koheesio.spark.transformations.row_number_dedup.RowNumberDedup","text":"

A class used to perform a row_number deduplication operation on a DataFrame.

This class is a specialized transformation that extends the ColumnsTransformation class. It sorts the DataFrame based on the provided sort columns and assigns a row_number to each row. It then filters the DataFrame to keep only the top-row_number row for each group of duplicates. The row_number of each row can be stored in a specified target column or a default column named "meta_row_number_column". The class also provides an option to preserve meta columns (like the row_number column) in the output DataFrame.

Attributes:

  • columns (list): List of columns to apply the transformation to. If a single '*' is passed as a column name or if the columns parameter is omitted, the transformation will be applied to all columns of the data types specified in run_for_all_data_type of the ColumnConfig. (inherited from ColumnsTransformation)
  • sort_columns (list): List of columns that the DataFrame will be sorted by.
  • target_column (str, optional): Column where the row_number of each row will be stored.
  • preserve_meta (bool, optional): Flag that determines whether the meta columns should be kept in the output DataFrame.
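Example (illustrative sketch based on the attributes above; the column and sort column names are assumptions):

from pyspark.sql import SparkSession\n\nfrom koheesio.spark.transformations.row_number_dedup import RowNumberDedup\n\nspark = SparkSession.builder.getOrCreate()\ndf = spark.createDataFrame(\n    [(1, \"2024-01-01\"), (1, \"2024-02-01\"), (2, \"2024-01-15\")],\n    [\"id\", \"updated_at\"],\n)\n\n# keep one row per 'id'; string sort columns are ordered DESC (see window_spec below),\n# so the most recent 'updated_at' per id is retained\noutput_df = RowNumberDedup(columns=[\"id\"], sort_columns=[\"updated_at\"]).transform(df)\n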

"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.preserve_meta","title":"preserve_meta class-attribute instance-attribute","text":"
preserve_meta: bool = Field(default=False, description=\"If true, meta columns are kept in output dataframe. Defaults to 'False'\")\n
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.sort_columns","title":"sort_columns class-attribute instance-attribute","text":"
sort_columns: conlist(Union[str, Column], min_length=0) = Field(default_factory=list, alias='sort_column', description='List of orderBy columns. If only one column is passed, it can be passed as a single object.')\n
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.target_column","title":"target_column class-attribute instance-attribute","text":"
target_column: Optional[Union[str, Column]] = Field(default='meta_row_number_column', alias='target_suffix', description='The column to store the result in. If not provided, the result will be stored in the sourcecolumn. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix')\n
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.window_spec","title":"window_spec property","text":"
window_spec: WindowSpec\n

Builds a WindowSpec object based on the columns defined in the configuration.

The WindowSpec object is used to define a window frame over which functions are applied in Spark. This method partitions the data by the columns returned by the get_columns method and then orders the partitions by the columns specified in sort_columns.

Notes

The order of the columns in the WindowSpec object is preserved. If a column is passed as a string, it is converted to a Column object with DESC ordering.

Returns:

  • WindowSpec: A WindowSpec object that can be used to define a window frame in Spark.

"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.execute","title":"execute","text":"
execute() -> Output\n

Performs the row_number deduplication operation on the DataFrame.

This method sorts the DataFrame based on the provided sort columns, assigns a row_number to each row, and then filters the DataFrame to keep only the top-row_number row for each group of duplicates. The row_number of each row is stored in the target column. If preserve_meta is False, the method also drops the target column from the DataFrame.

Source code in src/koheesio/spark/transformations/row_number_dedup.py
def execute(self) -> RowNumberDedup.Output:\n    \"\"\"\n    Performs the row_number deduplication operation on the DataFrame.\n\n    This method sorts the DataFrame based on the provided sort columns, assigns a row_number to each row,\n    and then filters the DataFrame to keep only the top-row_number row for each group of duplicates.\n    The row_number of each row is stored in the target column. If preserve_meta is False,\n    the method also drops the target column from the DataFrame.\n    \"\"\"\n    df = self.df\n    window_spec = self.window_spec\n\n    # if target_column is a string, convert it to a Column object\n    if isinstance((target_column := self.target_column), str):\n        target_column = col(target_column)\n\n    # dedup the dataframe based on the window spec\n    df = df.withColumn(self.target_column, row_number().over(window_spec)).filter(target_column == 1).select(\"*\")\n\n    if not self.preserve_meta:\n        df = df.drop(target_column)\n\n    self.output.df = df\n
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.set_sort_columns","title":"set_sort_columns","text":"
set_sort_columns(columns_value)\n

Validates and optimizes the sort_columns parameter.

This method ensures that sort_columns is a list (or single object) of unique strings or Column objects. It removes any empty strings or None values from the list and deduplicates the columns.

Parameters:

  • columns_value (Union[str, Column, List[Union[str, Column]]]), required: The value of the sort_columns parameter.

Returns:

  • List[Union[str, Column]]: The optimized and deduplicated list of sort columns.

Source code in src/koheesio/spark/transformations/row_number_dedup.py
@field_validator(\"sort_columns\", mode=\"before\")\ndef set_sort_columns(cls, columns_value):\n    \"\"\"\n    Validates and optimizes the sort_columns parameter.\n\n    This method ensures that sort_columns is a list (or single object) of unique strings or Column objects.\n    It removes any empty strings or None values from the list and deduplicates the columns.\n\n    Parameters\n    ----------\n    columns_value : Union[str, Column, List[Union[str, Column]]]\n        The value of the sort_columns parameter.\n\n    Returns\n    -------\n    List[Union[str, Column]]\n        The optimized and deduplicated list of sort columns.\n    \"\"\"\n    # Convert single string or Column object to a list\n    columns = [columns_value] if isinstance(columns_value, (str, Column)) else [*columns_value]\n\n    # Remove empty strings, None, etc.\n    columns = [c for c in columns if (isinstance(c, Column) and c is not None) or (isinstance(c, str) and c)]\n\n    dedup_columns = []\n    seen = set()\n\n    # Deduplicate the columns while preserving the order\n    for column in columns:\n        if str(column) not in seen:\n            dedup_columns.append(column)\n            seen.add(str(column))\n\n    return dedup_columns\n
"},{"location":"api_reference/spark/transformations/sql_transform.html","title":"Sql transform","text":"

SQL Transform module

SQL Transform module provides an easy interface to transform a dataframe using SQL. This SQL can originate from a string or a file and may contain placeholders for templating.

"},{"location":"api_reference/spark/transformations/sql_transform.html#koheesio.spark.transformations.sql_transform.SqlTransform","title":"koheesio.spark.transformations.sql_transform.SqlTransform","text":"

SQL Transform module provides an easy interface to transform a dataframe using SQL.

This SQL can originate from a string or a file and may contain placeholder (parameters) for templating.

  • Placeholders are identified with ${placeholder}.
  • Placeholders can be passed as explicit params (params) or as implicit params (kwargs).

Example sql script:

SELECT id, id + 1 AS incremented_id, ${dynamic_column} AS extra_column\nFROM ${table_name}\n
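Example (illustrative sketch; passing the SQL through a sql argument is an assumption based on Koheesio's SQL step base class - verify against the actual signature):

from pyspark.sql import SparkSession\n\nfrom koheesio.spark.transformations.sql_transform import SqlTransform\n\nspark = SparkSession.builder.getOrCreate()\ndf = spark.createDataFrame([(1,), (2,)], [\"id\"])\n\n# ${table_name} is filled in by SqlTransform itself; ${dynamic_column} is resolved from the kwargs\n# NOTE: the 'sql' keyword is an assumption here - check the SQL base step for the exact parameter\noutput_df = SqlTransform(\n    sql=\"SELECT id, id + 1 AS incremented_id, ${dynamic_column} AS extra_column FROM ${table_name}\",\n    dynamic_column=\"'foo'\",\n).transform(df)\n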
"},{"location":"api_reference/spark/transformations/sql_transform.html#koheesio.spark.transformations.sql_transform.SqlTransform.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/transformations/sql_transform.py
def execute(self):\n    table_name = get_random_string(prefix=\"sql_transform\")\n    self.params = {**self.params, \"table_name\": table_name}\n\n    df = self.df\n    df.createOrReplaceTempView(table_name)\n    query = self.query\n\n    self.output.df = self.spark.sql(query)\n
"},{"location":"api_reference/spark/transformations/transform.html","title":"Transform","text":"

Transform module

Transform aims to provide an easy interface for calling transformations on a Spark DataFrame, where the transformation is a function that accepts a DataFrame (df) and any number of keyword args.

"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform","title":"koheesio.spark.transformations.transform.Transform","text":"
Transform(func: Callable, params: Dict = None, df: DataFrame = None, **kwargs)\n

Transform aims to provide an easy interface for calling transformations on a Spark DataFrame, where the transformation is a function that accepts a DataFrame (df) and any number of keyword args.

The implementation is inspired by and based upon: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.transform.html

Parameters:

  • func (Callable), required: The function to be called on the DataFrame.
  • params (Dict, default: None): The keyword arguments to be passed to the function. Defaults to None. Alternatively, keyword arguments can be passed directly as keyword arguments - they will be merged with the params dictionary.

Example

Source code in src/koheesio/spark/transformations/transform.py
def __init__(self, func: Callable, params: Dict = None, df: DataFrame = None, **kwargs):\n    params = {**(params or {}), **kwargs}\n    super().__init__(func=func, params=params, df=df)\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform--a-function-compatible-with-transform","title":"a function compatible with Transform:","text":"
def some_func(df, a: str, b: str):\n    return df.withColumn(a, f.lit(b))\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform--verbose-style-input-in-transform","title":"verbose style input in Transform","text":"
Transform(func=some_func, params={\"a\": \"foo\", \"b\": \"bar\"})\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform--shortened-style-notation-easier-to-read","title":"shortened style notation (easier to read)","text":"
Transform(some_func, a=\"foo\", b=\"bar\")\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform--when-too-much-input-is-given-transform-will-ignore-extra-input","title":"when too much input is given, Transform will ignore extra input","text":"
Transform(\n    some_func,\n    a=\"foo\",\n    # ignored input\n    c=\"baz\",\n    title=42,\n    author=\"Adams\",\n    # order of params input should not matter\n    b=\"bar\",\n)\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform--using-the-from_func-classmethod","title":"using the from_func classmethod","text":"
SomeFunc = Transform.from_func(some_func, a=\"foo\")\nsome_func = SomeFunc(b=\"bar\")\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform.func","title":"func class-attribute instance-attribute","text":"
func: Callable = Field(default=None, description='The function to be called on the DataFrame.')\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform.execute","title":"execute","text":"
execute()\n

Call the function on the DataFrame with the given keyword arguments.

Source code in src/koheesio/spark/transformations/transform.py
def execute(self):\n    \"\"\"Call the function on the DataFrame with the given keyword arguments.\"\"\"\n    func, kwargs = get_args_for_func(self.func, self.params)\n    self.output.df = self.df.transform(func=func, **kwargs)\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform.from_func","title":"from_func classmethod","text":"
from_func(func: Callable, **kwargs) -> Callable[..., Transform]\n

Create a Transform class from a function. Useful for creating a new class with a different name.

This method uses the functools.partial function to create a new class with the given function and keyword arguments. This way you can pre-define some of the keyword arguments for the function that might be needed for the specific use case.

Example
CustomTransform = Transform.from_func(some_func, a=\"foo\")\nsome_func = CustomTransform(b=\"bar\")\n

In this example, CustomTransform is a Transform class with the function some_func and the keyword argument a set to \"foo\". When calling some_func(b=\"bar\"), the function some_func will be called with the keyword arguments a=\"foo\" and b=\"bar\".

Source code in src/koheesio/spark/transformations/transform.py
@classmethod\ndef from_func(cls, func: Callable, **kwargs) -> Callable[..., Transform]:\n    \"\"\"Create a Transform class from a function. Useful for creating a new class with a different name.\n\n    This method uses the `functools.partial` function to create a new class with the given function and keyword\n    arguments. This way you can pre-define some of the keyword arguments for the function that might be needed for\n    the specific use case.\n\n    Example\n    -------\n    ```python\n    CustomTransform = Transform.from_func(some_func, a=\"foo\")\n    some_func = CustomTransform(b=\"bar\")\n    ```\n\n    In this example, `CustomTransform` is a Transform class with the function `some_func` and the keyword argument\n    `a` set to \"foo\". When calling `some_func(b=\"bar\")`, the function `some_func` will be called with the keyword\n    arguments `a=\"foo\"` and `b=\"bar\"`.\n    \"\"\"\n    return partial(cls, func=func, **kwargs)\n
"},{"location":"api_reference/spark/transformations/uuid5.html","title":"Uuid5","text":"

Ability to generate UUID5 using native pyspark (no udf)

"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5","title":"koheesio.spark.transformations.uuid5.HashUUID5","text":"

Generate a UUID with the UUID5 algorithm

Spark does not provide an inbuilt API to generate version 5 UUIDs, hence a custom implementation is used to provide this capability.

Prerequisites: this function has no side effects. But be aware that in most cases, the expectation is that your data is clean (e.g. trimmed of leading and trailing spaces)

Concept

UUID5 is based on the SHA-1 hash of a namespace identifier (which is a UUID) and a name (which is a string). https://docs.python.org/3/library/uuid.html#uuid.uuid5

Based on https://github.com/MrPowers/quinn/pull/96, with the difference that, since Spark 3.0.0, the OVERLAY function from ANSI SQL 2016 is available; this saves code and string allocation(s) compared to CONCAT + SUBSTRING.

For more info on OVERLAY, see: https://docs.databricks.com/sql/language-manual/functions/overlay.html
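
For intuition only, the per-row result should be equivalent to what Python's uuid module produces for the delimited string (a sketch assuming the default delimiter '|', an empty extra_string and an empty namespace):

import uuid\n\n# with the defaults, a row with id=1 and string='hello' is hashed as '1|hello'\n# against uuid.NAMESPACE_DNS, just like the pure python hash_uuid5 helper below\nstr(uuid.uuid5(uuid.NAMESPACE_DNS, \"1|hello\"))\n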

Example

Input is a DataFrame with two columns:

id  string
1   hello
2   world
3

Input parameters:

  • source_columns = [\"id\", \"string\"]
  • target_column = \"uuid5\"

Result:

id  string  uuid5
1   hello   f3e99bbd-85ae-5dc3-bf6e-cd0022a0ebe6
2   world   b48e880f-c289-5c94-b51f-b9d21f9616c0
3           2193a99d-222e-5a0c-a7d6-48fbe78d2708

In code:

HashUUID5(source_columns=[\"id\", \"string\"], target_column=\"uuid5\").transform(input_df)\n

In this example, the id and string columns are concatenated and hashed using the UUID5 algorithm. The result is stored in the uuid5 column.

"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.delimiter","title":"delimiter class-attribute instance-attribute","text":"
delimiter: Optional[str] = Field(default='|', description='Separator for the string that will eventually be hashed')\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.description","title":"description class-attribute instance-attribute","text":"
description: str = 'Generate a UUID with the UUID5 algorithm'\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.extra_string","title":"extra_string class-attribute instance-attribute","text":"
extra_string: Optional[str] = Field(default='', description='In case of collisions, one can pass an extra string to hash on.')\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.namespace","title":"namespace class-attribute instance-attribute","text":"
namespace: Optional[Union[str, UUID]] = Field(default='', description='Namespace DNS')\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.source_columns","title":"source_columns class-attribute instance-attribute","text":"
source_columns: ListOfColumns = Field(default=..., description=\"List of columns that should be hashed. Should contain the name of at least 1 column. A list of columns or a single column can be specified. For example: `['column1', 'column2']` or `'column1'`\")\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.target_column","title":"target_column class-attribute instance-attribute","text":"
target_column: str = Field(default=..., description='The generated UUID will be written to the column name specified here')\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.execute","title":"execute","text":"
execute() -> None\n
Source code in src/koheesio/spark/transformations/uuid5.py
def execute(self) -> None:\n    ns = f.lit(uuid5_namespace(self.namespace).bytes)\n    self.log.info(f\"UUID5 namespace '{ns}' derived from '{self.namespace}'\")\n    cols_to_hash = f.concat_ws(self.delimiter, *self.source_columns)\n    cols_to_hash = f.concat(f.lit(self.extra_string), cols_to_hash)\n    cols_to_hash = f.encode(cols_to_hash, \"utf-8\")\n    cols_to_hash = f.concat(ns, cols_to_hash)\n    source_columns_sha1 = f.sha1(cols_to_hash)\n    variant_part = f.substring(source_columns_sha1, 17, 4)\n    variant_part = f.conv(variant_part, 16, 2)\n    variant_part = f.lpad(variant_part, 16, \"0\")\n    variant_part = f.overlay(variant_part, f.lit(\"10\"), 1, 2)  # RFC 4122 variant.\n    variant_part = f.lower(f.conv(variant_part, 2, 16))\n    target_col_uuid = f.concat_ws(\n        \"-\",\n        f.substring(source_columns_sha1, 1, 8),\n        f.substring(source_columns_sha1, 9, 4),\n        f.concat(f.lit(\"5\"), f.substring(source_columns_sha1, 14, 3)),  # Set version.\n        variant_part,\n        f.substring(source_columns_sha1, 21, 12),\n    )\n    # Applying the transformation to the input df, storing the result in the column specified in `target_column`.\n    self.output.df = self.df.withColumn(self.target_column, target_col_uuid)\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.hash_uuid5","title":"koheesio.spark.transformations.uuid5.hash_uuid5","text":"
hash_uuid5(input_value: str, namespace: Optional[Union[str, UUID]] = '', extra_string: Optional[str] = '')\n

pure python implementation of HashUUID5

See: https://docs.python.org/3/library/uuid.html#uuid.uuid5

Parameters:

Name Type Description Default input_value str

value that will be hashed

required namespace Optional[str | UUID]

namespace DNS

'' extra_string Optional[str]

optional extra string that will be prepended to the input_value

''

Returns:

Type Description str

uuid.UUID (uuid5) cast to string
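
A short usage sketch (the values are illustrative):

# defaults: the empty namespace resolves to uuid.NAMESPACE_DNS, no extra string\nhash_uuid5(\"hello\")\n\n# explicit namespace string and an extra string prepended to the input value\nhash_uuid5(\"hello\", namespace=\"my-namespace\", extra_string=\"v1-\")\n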

Source code in src/koheesio/spark/transformations/uuid5.py
def hash_uuid5(\n    input_value: str,\n    namespace: Optional[Union[str, uuid.UUID]] = \"\",\n    extra_string: Optional[str] = \"\",\n):\n    \"\"\"pure python implementation of HashUUID5\n\n    See: https://docs.python.org/3/library/uuid.html#uuid.uuid5\n\n    Parameters\n    ----------\n    input_value : str\n        value that will be hashed\n    namespace : Optional[str | uuid.UUID]\n        namespace DNS\n    extra_string : Optional[str]\n        optional extra string that will be prepended to the input_value\n\n    Returns\n    -------\n    str\n        uuid.UUID (uuid5) cast to string\n    \"\"\"\n    if not isinstance(namespace, uuid.UUID):\n        hashed_namespace = uuid5_namespace(namespace)\n    else:\n        hashed_namespace = namespace\n    return str(uuid.uuid5(hashed_namespace, (extra_string + input_value)))\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.uuid5_namespace","title":"koheesio.spark.transformations.uuid5.uuid5_namespace","text":"
uuid5_namespace(ns: Optional[Union[str, UUID]]) -> UUID\n

Helper function used to provide a UUID5 hashed namespace based on the passed str

Parameters:

Name Type Description Default ns Optional[Union[str, UUID]]

A str, an empty string (or None), or an existing UUID can be passed

required

Returns:

Type Description UUID

UUID5 hashed namespace
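
For example (a small sketch of the three cases handled in the source below):

import uuid\n\nuuid5_namespace(None)  # empty / None input returns uuid.NAMESPACE_DNS\nuuid5_namespace(uuid.NAMESPACE_URL)  # an existing UUID is returned as-is\nuuid5_namespace(\"my-namespace\")  # a string is hashed against NAMESPACE_DNS\n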

Source code in src/koheesio/spark/transformations/uuid5.py
def uuid5_namespace(ns: Optional[Union[str, uuid.UUID]]) -> uuid.UUID:\n    \"\"\"Helper function used to provide a UUID5 hashed namespace based on the passed str\n\n    Parameters\n    ----------\n    ns : Optional[Union[str, uuid.UUID]]\n        A str, an empty string (or None), or an existing UUID can be passed\n\n    Returns\n    -------\n    uuid.UUID\n        UUID5 hashed namespace\n    \"\"\"\n    # if we already have a UUID, we just return it\n    if isinstance(ns, uuid.UUID):\n        return ns\n\n    # if ns is empty or none, we simply return the default NAMESPACE_DNS\n    if not ns:\n        ns = uuid.NAMESPACE_DNS\n        return ns\n\n    # else we hash the string against the NAMESPACE_DNS\n    ns = uuid.uuid5(uuid.NAMESPACE_DNS, ns)\n    return ns\n
"},{"location":"api_reference/spark/transformations/date_time/index.html","title":"Date time","text":"

Module that holds the transformations that can be used for date and time related operations.

"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone","title":"koheesio.spark.transformations.date_time.ChangeTimeZone","text":"

Allows for the value of a column to be changed from one timezone to another

Adding useful metadata

When add_target_timezone is enabled (default), an additional column is created documenting which timezone a field has been converted to. Additionally, the suffix added to this column can be customized (default value is _timezone).

Example

Input:

target_column = \"some_column_name\"\ntarget_timezone = \"EST\"\nadd_target_timezone = True  # default value\ntimezone_column_suffix = \"_timezone\"  # default value\n

Output:

column name  = \"some_column_name_timezone\"  # notice the suffix\ncolumn value = \"EST\"\n
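
Putting this together, a hedged usage sketch (the column and DataFrame names are made up; source_timezone and target_timezone are aliases for from_timezone and to_timezone):

output_df = ChangeTimeZone(\n    column=\"some_column_name\",\n    from_timezone=\"UTC\",\n    to_timezone=\"EST\",\n).transform(input_df)\n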

"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.add_target_timezone","title":"add_target_timezone class-attribute instance-attribute","text":"
add_target_timezone: bool = Field(default=True, description='Toggles whether the target timezone is added as a column. True by default.')\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.from_timezone","title":"from_timezone class-attribute instance-attribute","text":"
from_timezone: str = Field(default=..., alias='source_timezone', description='Timezone of the source_column value. Timezone fields are validated against the `TZ database name` column in this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones')\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.target_timezone_column_suffix","title":"target_timezone_column_suffix class-attribute instance-attribute","text":"
target_timezone_column_suffix: Optional[str] = Field(default='_timezone', alias='suffix', description=\"Allows to customize the suffix that is added to the target_timezone column. Defaults to '_timezone'. Note: this will be ignored if 'add_target_timezone' is set to False\")\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.to_timezone","title":"to_timezone class-attribute instance-attribute","text":"
to_timezone: str = Field(default=..., alias='target_timezone', description='Target timezone. Timezone fields are validated against the `TZ database name` column in this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones')\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/transformations/date_time/__init__.py
def execute(self):\n    df = self.df\n\n    for target_column, column in self.get_columns_with_target():\n        func = self.func  # select the applicable function\n        df = df.withColumn(\n            target_column,\n            func(f.col(column)),\n        )\n\n        # document which timezone a field has been converted to\n        if self.add_target_timezone:\n            df = df.withColumn(f\"{target_column}{self.target_timezone_column_suffix}\", f.lit(self.to_timezone))\n\n    self.output.df = df\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.func","title":"func","text":"
func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/date_time/__init__.py
def func(self, column: Column) -> Column:\n    return change_timezone(column=column, source_timezone=self.from_timezone, target_timezone=self.to_timezone)\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.validate_no_duplicate_timezones","title":"validate_no_duplicate_timezones","text":"
validate_no_duplicate_timezones(values)\n

Validate that source and target timezone are not the same

Source code in src/koheesio/spark/transformations/date_time/__init__.py
@model_validator(mode=\"before\")\ndef validate_no_duplicate_timezones(cls, values):\n    \"\"\"Validate that source and target timezone are not the same\"\"\"\n    from_timezone_value = values.get(\"from_timezone\")\n    to_timezone_value = values.get(\"to_timezone\")\n\n    if from_timezone_value == to_timezone_value:\n        raise ValueError(\"Timezone conversions from and to the same timezones are not valid.\")\n\n    return values\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.validate_timezone","title":"validate_timezone","text":"
validate_timezone(timezone_value)\n

Validate that the timezone is a valid timezone.

Source code in src/koheesio/spark/transformations/date_time/__init__.py
@field_validator(\"from_timezone\", \"to_timezone\")\ndef validate_timezone(cls, timezone_value):\n    \"\"\"Validate that the timezone is a valid timezone.\"\"\"\n    if timezone_value not in all_timezones_set:\n        raise ValueError(\n            \"Not a valid timezone. Refer to the `TZ database name` column here: \"\n            \"https://en.wikipedia.org/wiki/List_of_tz_database_time_zones\"\n        )\n    return timezone_value\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.DateFormat","title":"koheesio.spark.transformations.date_time.DateFormat","text":"

wrapper around pyspark.sql.functions.date_format

See Also
  • https://spark.apache.org/docs/3.3.2/api/python/reference/pyspark.sql/api/pyspark.sql.functions.date_format.html
  • https://spark.apache.org/docs/3.3.2/sql-ref-datetime-pattern.html
Concept

This Transformation allows you to convert a date/timestamp/string into a string value in the format specified by the given date format.

A pattern could be for instance dd.MM.yyyy and could return a string like \u201818.03.1993\u2019. All pattern letters of datetime pattern can be used, see: https://spark.apache.org/docs/3.3.2/sql-ref-datetime-pattern.html

How to use

If more than one column is passed, the behavior of the Class changes this way

  • the transformation will be run in a loop against all the given columns
  • the target_column will be used as a suffix. Leaving this blank will result in the original columns being renamed.
Example
source_column value: datetime.date(2020, 1, 1)\ntarget: \"yyyyMMdd HH:mm\"\noutput: \"20200101 00:00\"\n
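
In code, the example above would look roughly like this (column names are illustrative):

output_df = DateFormat(\n    column=\"my_date\",\n    target_column=\"my_date_formatted\",\n    format=\"yyyyMMdd HH:mm\",\n).transform(input_df)\n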
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.DateFormat.format","title":"format class-attribute instance-attribute","text":"
format: str = Field(..., description='The format for the resulting string. See https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html')\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.DateFormat.func","title":"func","text":"
func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/date_time/__init__.py
def func(self, column: Column) -> Column:\n    return date_format(column, self.format)\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ToTimestamp","title":"koheesio.spark.transformations.date_time.ToTimestamp","text":"

wrapper around pyspark.sql.functions.to_timestamp

Converts a Column (or set of Columns) into pyspark.sql.types.TimestampType using the specified format. Specify formats according to the datetime pattern reference: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html.

Functionally equivalent to col.cast(\"timestamp\").

See Also

Related Koheesio classes:

  • koheesio.spark.transformations.ColumnsTransformation : Base class for ColumnsTransformation. Defines column / columns field + recursive logic
  • koheesio.spark.transformations.ColumnsTransformationWithTarget : Defines target_column / target_suffix field

pyspark.sql.functions:

  • datetime pattern : https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
Example"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ToTimestamp--basic-usage-example","title":"Basic usage example:","text":"

input_df:

t
\"1997-02-28 10:30:00\"

t is a string

tts = ToTimestamp(\n    # since the source column is the same as the target in this example, 't' will be overwritten\n    column=\"t\",\n    target_column=\"t\",\n    format=\"yyyy-MM-dd HH:mm:ss\",\n)\noutput_df = tts.transform(input_df)\n

output_df:

t
datetime.datetime(1997, 2, 28, 10, 30)

Now t is a timestamp

"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ToTimestamp--multiple-columns-at-once","title":"Multiple columns at once:","text":"

input_df:

t1                       t2
\"1997-02-28 10:30:00\"  \"2007-03-31 11:40:10\"

t1 and t2 are strings

tts = ToTimestamp(\n    columns=[\"t1\", \"t2\"],\n    # 'target_suffix' is synonymous with 'target_column'\n    target_suffix=\"new\",\n    format=\"yyyy-MM-dd HH:mm:ss\",\n)\noutput_df = tts.transform(input_df).select(\"t1_new\", \"t2_new\")\n

output_df:

t1_new                                   t2_new
datetime.datetime(1997, 2, 28, 10, 30)   datetime.datetime(2007, 3, 31, 11, 40)

Now t1_new and t2_new are both timestamps

"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ToTimestamp.format","title":"format class-attribute instance-attribute","text":"
format: str = Field(default=..., description='The date format for of the timestamp field. See https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html')\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ToTimestamp.func","title":"func","text":"
func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/date_time/__init__.py
def func(self, column: Column) -> Column:\n    # convert string to timestamp\n    converted_col = to_timestamp(column, self.format)\n    return when(column.isNull(), lit(None)).otherwise(converted_col)\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.change_timezone","title":"koheesio.spark.transformations.date_time.change_timezone","text":"
change_timezone(column: Union[str, Column], source_timezone: str, target_timezone: str)\n

Helper function to change from one timezone to another

wrapper around pyspark.sql.functions.from_utc_timestamp and to_utc_timestamp

Parameters:

Name Type Description Default column Union[str, Column]

The column to change the timezone of

required source_timezone str

The timezone of the source_column value. Timezone fields are validated against the TZ database name column in this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones

required target_timezone str

The target timezone. Timezone fields are validated against the TZ database name column in this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones

required Source code in src/koheesio/spark/transformations/date_time/__init__.py
def change_timezone(column: Union[str, Column], source_timezone: str, target_timezone: str):\n    \"\"\"Helper function to change from one timezone to another\n\n    wrapper around `pyspark.sql.functions.from_utc_timestamp` and `to_utc_timestamp`\n\n    Parameters\n    ----------\n    column : Union[str, Column]\n        The column to change the timezone of\n    source_timezone : str\n        The timezone of the source_column value. Timezone fields are validated against the `TZ database name` column in\n        this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones\n    target_timezone : str\n        The target timezone. Timezone fields are validated against the `TZ database name` column in this list:\n        https://en.wikipedia.org/wiki/List_of_tz_database_time_zones\n\n    \"\"\"\n    column = col(column) if isinstance(column, str) else column\n    return from_utc_timestamp((to_utc_timestamp(column, source_timezone)), target_timezone)\n
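
A usage sketch of the helper on its own (column names are illustrative):

df = df.withColumn(\n    \"event_time_est\",\n    change_timezone(\"event_time_utc\", source_timezone=\"UTC\", target_timezone=\"EST\"),\n)\n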
"},{"location":"api_reference/spark/transformations/date_time/interval.html","title":"Interval","text":"

This module provides a DateTimeColumn class that extends the Column class from PySpark. It allows for adding or subtracting an interval value from a datetime column.

This can be used to reflect a change in a given date / time column in a more human-readable way.

Please refer to the Spark SQL documentation for a list of valid interval values: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#interval-literal

Background

The aim is to easily add or subtract an 'interval' value to a datetime column. An interval value is a string that represents a time interval. For example, '1 day', '1 month', '5 years', '1 minute 30 seconds', '10 milliseconds', etc. These can be used to reflect a change in a given date / time column in a more human-readable way.

Typically, this can be done using the date_add() and date_sub() functions in Spark SQL. However, these functions only support adding or subtracting a single unit of time measured in days. Using an interval gives us much more flexibility; however, Spark SQL does not expose a function to add or subtract an interval value from a datetime column directly through the Python API, so we have to fall back on the expr() function and express the operation in SQL.
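
For context, the underlying SQL-through-expr() approach that this module wraps looks roughly like this (a sketch; it relies on the try_add SQL function referenced below):

from pyspark.sql.functions import expr\n\n# add one day to 'my_column' by embedding an interval literal in a SQL expression\ndf = df.withColumn(\"one_day_later\", expr(\"try_add(my_column, interval '1 day')\"))\n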

This module provides a DateTimeColumn class that extends the Column class from PySpark. It allows for adding or subtracting an interval value from a datetime column using the + and - operators.

Additionally, this module provides two transformation classes that can be used as a transformation step in a pipeline:

  • DateTimeAddInterval: adds an interval value to a datetime column
  • DateTimeSubtractInterval: subtracts an interval value from a datetime column

These classes are subclasses of ColumnsTransformationWithTarget and hence can be used to perform transformations on multiple columns at once.

The above transformations both use the provided adjust_time() function to perform the actual transformation.

See also:

Related Koheesio classes:

  • koheesio.spark.transformations.ColumnsTransformation : Base class for ColumnsTransformation. Defines column / columns field + recursive logic
  • koheesio.spark.transformations.ColumnsTransformationWithTarget : Defines target_column / target_suffix field

pyspark.sql.functions:

  • https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#interval-literal
  • https://spark.apache.org/docs/latest/api/sql/index.html
  • https://spark.apache.org/docs/latest/api/sql/#try_add
  • https://spark.apache.org/docs/latest/api/sql/#try_subtract

Classes:

Name Description DateTimeColumn

A datetime column that can be adjusted by adding or subtracting an interval value using the + and - operators.

DateTimeAddInterval

A transformation that adds an interval value to a datetime column. This class is a subclass of ColumnsTransformationWithTarget and hence can be used as a transformation step in a pipeline. See ColumnsTransformationWithTarget for more information.

DateTimeSubtractInterval

A transformation that subtracts an interval value from a datetime column. This class is a subclass of ColumnsTransformationWithTarget and hence can be used as a transformation step in a pipeline. See ColumnsTransformationWithTarget for more information.

Note

the DateTimeAddInterval and DateTimeSubtractInterval classes are very similar. The only difference is that one adds an interval value to a datetime column, while the other subtracts an interval value from a datetime column.

Functions:

Name Description dt_column

Converts a column to a DateTimeColumn. This function aims to be a drop-in replacement for pyspark.sql.functions.col that returns a DateTimeColumn instead of a Column.

adjust_time

Adjusts a datetime column by adding or subtracting an interval value.

validate_interval

Validates a given interval string.

Example"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval--various-ways-to-create-and-interact-with-datetimecolumn","title":"Various ways to create and interact with DateTimeColumn:","text":"
  • Create a DateTimeColumn from a string: dt_column(\"my_column\")
  • Create a DateTimeColumn from a Column: dt_column(df.my_column)
  • Use the + and - operators to add or subtract an interval value from a DateTimeColumn:
    • dt_column(\"my_column\") + \"1 day\"
    • dt_column(\"my_column\") - \"1 month\"
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval--functional-examples-using-adjust_time","title":"Functional examples using adjust_time():","text":"
  • Add 1 day to a column: adjust_time(\"my_column\", operation=\"add\", interval=\"1 day\")
  • Subtract 1 month from a column: adjust_time(\"my_column\", operation=\"subtract\", interval=\"1 month\")
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval--as-a-transformation-step","title":"As a transformation step:","text":"

from koheesio.spark.transformations.date_time.interval import (\n    DateTimeAddInterval,\n)\n\ninput_df = spark.createDataFrame([(1, \"2022-01-01 00:00:00\")], [\"id\", \"my_column\"])\n\n# add 1 day to my_column and store the result in a new column called 'one_day_later'\noutput_df = DateTimeAddInterval(column=\"my_column\", target_column=\"one_day_later\", interval=\"1 day\").transform(input_df)\n
output_df:

id  my_column            one_day_later
1   2022-01-01 00:00:00  2022-01-02 00:00:00

DateTimeSubtractInterval works in a similar way, but subtracts an interval value from a datetime column.

"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.Operations","title":"koheesio.spark.transformations.date_time.interval.Operations module-attribute","text":"
Operations = Literal['add', 'subtract']\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval","title":"koheesio.spark.transformations.date_time.interval.DateTimeAddInterval","text":"

A transformation that adds or subtracts a specified interval from a datetime column.

See also:

pyspark.sql.functions:

  • https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#interval-literal
  • https://spark.apache.org/docs/latest/api/sql/index.html#interval

Parameters:

Name Type Description Default interval str

The interval to add to the datetime column.

required operation Operations

The operation to perform. Must be either 'add' or 'subtract'.

add Example"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval--add-1-day-to-a-column","title":"add 1 day to a column","text":"
DateTimeAddInterval(\n    column=\"my_column\",\n    interval=\"1 day\",\n).transform(df)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval--subtract-1-month-from-my_column-and-store-the-result-in-a-new-column-called-one_month_earlier","title":"subtract 1 month from my_column and store the result in a new column called one_month_earlier","text":"
DateTimeSubtractInterval(\n    column=\"my_column\",\n    target_column=\"one_month_earlier\",\n    interval=\"1 month\",\n)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval.interval","title":"interval class-attribute instance-attribute","text":"
interval: str = Field(default=..., description='The interval to add to the datetime column.', examples=['1 day', '5 years', '3 months'])\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval.operation","title":"operation class-attribute instance-attribute","text":"
operation: Operations = Field(default='add', description=\"The operation to perform. Must be either 'add' or 'subtract'.\")\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval.validate_interval","title":"validate_interval class-attribute instance-attribute","text":"
validate_interval = field_validator('interval')(validate_interval)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval.func","title":"func","text":"
func(column: Column)\n
Source code in src/koheesio/spark/transformations/date_time/interval.py
def func(self, column: Column):\n    return adjust_time(column, operation=self.operation, interval=self.interval)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeColumn","title":"koheesio.spark.transformations.date_time.interval.DateTimeColumn","text":"

A datetime column that can be adjusted by adding or subtracting an interval value using the + and - operators.
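
For instance, combined with withColumn (a sketch; dt_column is documented further down on this page):

df = df.withColumn(\"one_day_later\", dt_column(\"my_column\") + \"1 day\")\n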

"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeColumn.from_column","title":"from_column classmethod","text":"
from_column(column: Column)\n

Create a DateTimeColumn from an existing Column

Source code in src/koheesio/spark/transformations/date_time/interval.py
@classmethod\ndef from_column(cls, column: Column):\n    \"\"\"Create a DateTimeColumn from an existing Column\"\"\"\n    return cls(column._jc)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeSubtractInterval","title":"koheesio.spark.transformations.date_time.interval.DateTimeSubtractInterval","text":"

Subtracts a specified interval from a datetime column.

Works in the same way as DateTimeAddInterval, but subtracts the specified interval from the datetime column. See DateTimeAddInterval for more information.

"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeSubtractInterval.operation","title":"operation class-attribute instance-attribute","text":"
operation: Operations = Field(default='subtract', description=\"The operation to perform. Must be either 'add' or 'subtract'.\")\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.adjust_time","title":"koheesio.spark.transformations.date_time.interval.adjust_time","text":"
adjust_time(column: Column, operation: Operations, interval: str) -> Column\n

Adjusts a datetime column by adding or subtracting an interval value.

This can be used to reflect a change in a given date / time column in a more human-readable way.

See also

Please refer to the Spark SQL documentation for a list of valid interval values: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#interval-literal

Example

Parameters:

Name Type Description Default column Column

The datetime column to adjust.

required operation Operations

The operation to perform. Must be either 'add' or 'subtract'.

required interval str

The value to add or subtract. Must be a valid interval string.

required

Returns:

Type Description Column

The adjusted datetime column.

Source code in src/koheesio/spark/transformations/date_time/interval.py
def adjust_time(column: Column, operation: Operations, interval: str) -> Column:\n    \"\"\"\n    Adjusts a datetime column by adding or subtracting an interval value.\n\n    This can be used to reflect a change in a given date / time column in a more human-readable way.\n\n\n    See also\n    --------\n    Please refer to the Spark SQL documentation for a list of valid interval values:\n    https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#interval-literal\n\n    ### pyspark.sql.functions:\n\n    * https://spark.apache.org/docs/latest/api/sql/index.html#interval\n    * https://spark.apache.org/docs/latest/api/sql/#try_add\n    * https://spark.apache.org/docs/latest/api/sql/#try_subtract\n\n    Example\n    --------\n    ### add 1 day to a column\n    ```python\n    adjust_time(\"my_column\", operation=\"add\", interval=\"1 day\")\n    ```\n\n    ### subtract 1 month from a column\n    ```python\n    adjust_time(\"my_column\", operation=\"subtract\", interval=\"1 month\")\n    ```\n\n    ### or, a much more complicated example\n\n    In this example, we add 5 days, 3 hours, 7 minutes, 30 seconds, and 1 millisecond to a column called `my_column`.\n    ```python\n    adjust_time(\n        \"my_column\",\n        operation=\"add\",\n        interval=\"5 days 3 hours 7 minutes 30 seconds 1 millisecond\",\n    )\n    ```\n\n    Parameters\n    ----------\n    column : Column\n        The datetime column to adjust.\n    operation : Operations\n        The operation to perform. Must be either 'add' or 'subtract'.\n    interval : str\n        The value to add or subtract. Must be a valid interval string.\n\n    Returns\n    -------\n    Column\n        The adjusted datetime column.\n    \"\"\"\n\n    # check that value is a valid interval\n    interval = validate_interval(interval)\n\n    column_name = column._jc.toString()\n\n    # determine the operation to perform\n    try:\n        operation = {\n            \"add\": \"try_add\",\n            \"subtract\": \"try_subtract\",\n        }[operation]\n    except KeyError as e:\n        raise ValueError(f\"Operation '{operation}' is not valid. Must be either 'add' or 'subtract'.\") from e\n\n    # perform the operation\n    _expression = f\"{operation}({column_name}, interval '{interval}')\"\n    column = expr(_expression)\n\n    return column\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.adjust_time--pysparksqlfunctions","title":"pyspark.sql.functions:","text":"
  • https://spark.apache.org/docs/latest/api/sql/index.html#interval
  • https://spark.apache.org/docs/latest/api/sql/#try_add
  • https://spark.apache.org/docs/latest/api/sql/#try_subtract
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.adjust_time--add-1-day-to-a-column","title":"add 1 day to a column","text":"
adjust_time(\"my_column\", operation=\"add\", interval=\"1 day\")\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.adjust_time--subtract-1-month-from-a-column","title":"subtract 1 month from a column","text":"
adjust_time(\"my_column\", operation=\"subtract\", interval=\"1 month\")\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.adjust_time--or-a-much-more-complicated-example","title":"or, a much more complicated example","text":"

In this example, we add 5 days, 3 hours, 7 minutes, 30 seconds, and 1 millisecond to a column called my_column.

adjust_time(\n    \"my_column\",\n    operation=\"add\",\n    interval=\"5 days 3 hours 7 minutes 30 seconds 1 millisecond\",\n)\n

"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.dt_column","title":"koheesio.spark.transformations.date_time.interval.dt_column","text":"
dt_column(column: Union[str, Column]) -> DateTimeColumn\n

Convert a column to a DateTimeColumn

Aims to be a drop-in replacement for pyspark.sql.functions.col that returns a DateTimeColumn instead of a Column.

Example

Parameters:

Name Type Description Default column Union[str, Column]

The column (or name of the column) to convert to a DateTimeColumn

required Source code in src/koheesio/spark/transformations/date_time/interval.py
def dt_column(column: Union[str, Column]) -> DateTimeColumn:\n    \"\"\"Convert a column to a DateTimeColumn\n\n    Aims to be a drop-in replacement for `pyspark.sql.functions.col` that returns a DateTimeColumn instead of a Column.\n\n    Example\n    --------\n    ### create a DateTimeColumn from a string\n    ```python\n    dt_column(\"my_column\")\n    ```\n\n    ### create a DateTimeColumn from a Column\n    ```python\n    dt_column(df.my_column)\n    ```\n\n    Parameters\n    ----------\n    column : Union[str, Column]\n        The column (or name of the column) to convert to a DateTimeColumn\n    \"\"\"\n    if isinstance(column, str):\n        column = col(column)\n    elif not isinstance(column, Column):\n        raise TypeError(f\"Expected column to be of type str or Column, got {type(column)} instead.\")\n    return DateTimeColumn.from_column(column)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.dt_column--create-a-datetimecolumn-from-a-string","title":"create a DateTimeColumn from a string","text":"
dt_column(\"my_column\")\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.dt_column--create-a-datetimecolumn-from-a-column","title":"create a DateTimeColumn from a Column","text":"
dt_column(df.my_column)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.validate_interval","title":"koheesio.spark.transformations.date_time.interval.validate_interval","text":"
validate_interval(interval: str)\n

Validate an interval string

Parameters:

Name Type Description Default interval str

The interval string to validate

required

Raises:

Type Description ValueError

If the interval string is invalid
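
A small sketch (note that validation goes through expr(), so an active Spark session is assumed):

validate_interval(\"5 days 3 hours\")  # returns the string unchanged\nvalidate_interval(\"not an interval\")  # raises ValueError\n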

Source code in src/koheesio/spark/transformations/date_time/interval.py
def validate_interval(interval: str):\n    \"\"\"Validate an interval string\n\n    Parameters\n    ----------\n    interval : str\n        The interval string to validate\n\n    Raises\n    ------\n    ValueError\n        If the interval string is invalid\n    \"\"\"\n    try:\n        expr(f\"interval '{interval}'\")\n    except ParseException as e:\n        raise ValueError(f\"Value '{interval}' is not a valid interval.\") from e\n    return interval\n
"},{"location":"api_reference/spark/transformations/strings/index.html","title":"Strings","text":"

Adds a number of Transformations that are intended to be used with StringType column input. Some will work with other types as well, but will output StringType or an array of StringType.

These Transformations take full advantage of Koheesio's ColumnsTransformationWithTarget class, allowing a user to apply column transformations to multiple columns at once. See the class docstrings for more information.

The following Transformations are included:

change_case:

  • Lower Converts a string column to lower case.
  • Upper Converts a string column to upper case.
  • TitleCase or InitCap Converts a string column to title case, where each word starts with a capital letter.

concat:

  • Concat Concatenates multiple input columns together into a single column, optionally using the given separator.

pad:

  • Pad Pads the values of source_column with the character up until it reaches length of characters
  • LPad Pad with a character on the left side of the string.
  • RPad Pad with a character on the right side of the string.

regexp:

  • RegexpExtract Extract a specific group matched by a Java regexp from the specified string column.
  • RegexpReplace Searches for the given regexp and replaces all instances with what is in 'replacement'.

replace:

  • Replace Replace all instances of a string in a column with another string.

split:

  • SplitAll Splits the contents of a column on basis of a split_pattern.
  • SplitAtFirstMatch Like SplitAll, but only splits the string once. You can specify whether you want the first or second part.

substring:

  • Substring Extracts a substring from a string column starting at the given position.

trim:

  • Trim Trim whitespace from the beginning and/or end of a string.
  • LTrim Trim whitespace from the beginning of a string.
  • RTrim Trim whitespace from the end of a string.
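
As a quick sketch of the multiple-columns-at-once behaviour mentioned above (column names and the target_suffix value are illustrative):

output_df = LowerCase(\n    columns=[\"first_name\", \"last_name\"],\n    # 'target_suffix' is synonymous with 'target_column' when multiple columns are given\n    target_suffix=\"lower\",\n).transform(input_df)\n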
"},{"location":"api_reference/spark/transformations/strings/change_case.html","title":"Change case","text":"

Convert the case of a string column to upper case, lower case, or title case

Classes:

Name Description `Lower`

Converts a string column to lower case.

`Upper`

Converts a string column to upper case.

`TitleCase` or `InitCap`

Converts a string column to title case, where each word starts with a capital letter.

"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.InitCap","title":"koheesio.spark.transformations.strings.change_case.InitCap module-attribute","text":"
InitCap = TitleCase\n
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.LowerCase","title":"koheesio.spark.transformations.strings.change_case.LowerCase","text":"

This function makes the contents of a column lower case.

Wraps the pyspark.sql.functions.lower function.

Warnings

If the type of the column is not string, LowerCase will not be run. A Warning will be thrown indicating this.

Parameters:

Name Type Description Default columns Union[str, List[str]]

The name of the column or columns to convert to lower case. Alias: column. Lower case will be applied to all columns in the list. Column is required to be of string type.

required target_column

The name of the column to store the result in. If None, the result will be stored in the same column as the input.

required Example

input_df:

product              amount  country
Banana lemon orange  1000    USA
Carrots Blueberries  1500    USA
Beans                1600    USA
output_df = LowerCase(column=\"product\", target_column=\"product_lower\").transform(df)\n

output_df:

product              amount  country  product_lower
Banana lemon orange  1000    USA      banana lemon orange
Carrots Blueberries  1500    USA      carrots blueberries
Beans                1600    USA      beans

In this example, the column product is converted to product_lower and the contents of this column are converted to lower case.

"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.LowerCase.ColumnConfig","title":"ColumnConfig","text":"

Limit data type to string

"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.LowerCase.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type = [STRING]\n
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.LowerCase.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type = [STRING]\n
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.LowerCase.func","title":"func","text":"
func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/change_case.py
def func(self, column: Column):\n    return lower(column)\n
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.TitleCase","title":"koheesio.spark.transformations.strings.change_case.TitleCase","text":"

This function makes the contents of a column title case. This means that every word starts with an upper case.

Wraps the pyspark.sql.functions.initcap function.

Warnings

If the type of the column is not string, TitleCase will not be run. A Warning will be thrown indicating this.

Parameters:

Name Type Description Default columns

The name of the column or columns to convert to title case. Alias: column. Title case will be applied to all columns in the list. Column is required to be of string type.

required target_column

The name of the column to store the result in. If None, the result will be stored in the same column as the input.

required Example

input_df:

product              amount  country
Banana lemon orange  1000    USA
Carrots blueberries  1500    USA
Beans                1600    USA
output_df = TitleCase(column=\"product\", target_column=\"product_title\").transform(df)\n

output_df:

product              amount  country  product_title
Banana lemon orange  1000    USA      Banana Lemon Orange
Carrots blueberries  1500    USA      Carrots Blueberries
Beans                1600    USA      Beans

In this example, the column product is converted to product_title and the contents of this column are converted to title case (each word now starts with an upper case).

"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.TitleCase.func","title":"func","text":"
func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/change_case.py
def func(self, column: Column):\n    return initcap(column)\n
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.UpperCase","title":"koheesio.spark.transformations.strings.change_case.UpperCase","text":"

This function makes the contents of a column upper case.

Wraps the pyspark.sql.functions.upper function.

Warnings

If the type of the column is not string, UpperCase will not be run. A Warning will be thrown indicating this.

Parameters:

Name Type Description Default columns Union[str, List[str]]

The name of the column or columns to convert to upper case. Alias: column. Upper case will be applied to all columns in the list. Column is required to be of string type.

required target_column

The name of the column to store the result in. If None, the result will be stored in the same column as the input.

required

Examples:

input_df:

product              amount  country
Banana lemon orange  1000    USA
Carrots Blueberries  1500    USA
Beans                1600    USA
output_df = UpperCase(column=\"product\", target_column=\"product_upper\").transform(df)\n

output_df:

product              amount  country  product_upper
Banana lemon orange  1000    USA      BANANA LEMON ORANGE
Carrots Blueberries  1500    USA      CARROTS BLUEBERRIES
Beans                1600    USA      BEANS

In this example, the column product is converted to product_upper and the contents of this column are converted to upper case.

"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.UpperCase.func","title":"func","text":"
func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/change_case.py
def func(self, column: Column):\n    return upper(column)\n
"},{"location":"api_reference/spark/transformations/strings/concat.html","title":"Concat","text":"

Concatenates multiple input columns together into a single column, optionally using a given separator.

"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat","title":"koheesio.spark.transformations.strings.concat.Concat","text":"

This is a wrapper around PySpark concat() and concat_ws() functions

Concatenates multiple input columns together into a single column, optionally using a given separator. The function works with strings, date/timestamps, binary, and compatible array columns.

Concept

When working with arrays, the function will return the result of the concatenation of the elements in the array.

  • If a spacer is used, the resulting output will be a string with the elements of the array separated by the spacer.
  • If no spacer is used, the resulting output will be an array with the elements of the array concatenated together.

When working with date/timestamps, the function will return the result of the concatenation of the values; the timestamp is converted to a string using the default format of yyyy-MM-dd HH:mm:ss.

Parameters:

Name Type Description Default columns Union[str, List[str]]

The column (or list of columns) to concatenate. Alias: column. If at least one of the values is None or null, the resulting string will also be None/null (except when using arrays). Columns can be of any type, but should ideally be of the same type. Different types can be used, but the function will convert them to string values first.

required target_column Optional[str]

Target column name. When this is left blank, a name will be generated by concatenating the names of the source columns with an '_'.

None spacer Optional[str]

Optional spacer / separator symbol. Defaults to None. When left blank, no spacer is used

None Example"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat--example-using-a-string-column-and-a-timestamp-column","title":"Example using a string column and a timestamp column","text":"

input_df:

column_a  column_b
text      1997-02-28 10:30:00
output_df = Concat(\n    columns=[\"column_a\", \"column_b\"],\n    target_column=\"concatenated_column\",\n    spacer=\"--\",\n).transform(input_df)\n

output_df:

column_a  column_b             concatenated_column
text      1997-02-28 10:30:00  text--1997-02-28 10:30:00

In the example above, the resulting column is a string column.

If we had left out the spacer, the resulting column would have had the value of text1997-02-28 10:30:00 (a string). Note that the timestamp is converted to a string using the default format of yyyy-MM-dd HH:mm:ss.

"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat--example-using-two-array-columns","title":"Example using two array columns","text":"

input_df:

array_col_1     array_col_2
[text1, text2]  [text3, text4]
output_df = Concat(\n    columns=[\"array_col_1\", \"array_col_2\"],\n    target_column=\"concatenated_column\",\n    spacer=\"--\",\n).transform(input_df)\n

output_df:

array_col_1     array_col_2     concatenated_column
[text1, text2]  [text3, text4]  \"text1--text2--text3\"

Note that the null value in the second array is ignored. If we had left out the spacer, the resulting column would have been an array with the values of [\"text1\", \"text2\", \"text3\"].

Array columns can only be concatenated with another array column. If you want to concatenate an array column with a non-array value, you will have to convert said column to an array first.

"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat.spacer","title":"spacer class-attribute instance-attribute","text":"
spacer: Optional[str] = Field(default=None, description='Optional spacer / separator symbol. Defaults to None. When left blank, no spacer is used', alias='sep')\n
"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat.target_column","title":"target_column class-attribute instance-attribute","text":"
target_column: Optional[str] = Field(default=None, description=\"Target column name. When this is left blank, a name will be generated by concatenating the names of the source columns with an '_'.\")\n
"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat.execute","title":"execute","text":"
execute() -> DataFrame\n
Source code in src/koheesio/spark/transformations/strings/concat.py
def execute(self) -> DataFrame:\n    columns = [col(s) for s in self.get_columns()]\n    self.output.df = self.df.withColumn(\n        self.target_column, concat_ws(self.spacer, *columns) if self.spacer else concat(*columns)\n    )\n
"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat.get_target_column","title":"get_target_column","text":"
get_target_column(target_column_value, values)\n

Get the target column name if it is not provided.

If not provided, a name will be generated by concatenating the names of the source columns with an '_'.

Source code in src/koheesio/spark/transformations/strings/concat.py
@field_validator(\"target_column\")\ndef get_target_column(cls, target_column_value, values):\n    \"\"\"Get the target column name if it is not provided.\n\n    If not provided, a name will be generated by concatenating the names of the source columns with an '_'.\"\"\"\n    if not target_column_value:\n        columns_value: List = values[\"columns\"]\n        columns = list(dict.fromkeys(columns_value))  # dict.fromkeys is used to dedup while maintaining order\n        return \"_\".join(columns)\n\n    return target_column_value\n
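
For example, when no target_column is given, the generated name is simply the (de-duplicated) source column names joined with an underscore:

# the result lands in a column named 'column_a_column_b'\noutput_df = Concat(columns=[\"column_a\", \"column_b\"]).transform(input_df)\n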
"},{"location":"api_reference/spark/transformations/strings/pad.html","title":"Pad","text":"

Pad the values of a column with a character up until it reaches a certain length.

Classes:

Name Description Pad

Pads the values of source_column with the character up until it reaches length of characters

LPad

Pad with a character on the left side of the string.

RPad

Pad with a character on the right side of the string.

"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.LPad","title":"koheesio.spark.transformations.strings.pad.LPad module-attribute","text":"
LPad = Pad\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.pad_directions","title":"koheesio.spark.transformations.strings.pad.pad_directions module-attribute","text":"
pad_directions = Literal['left', 'right']\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.Pad","title":"koheesio.spark.transformations.strings.pad.Pad","text":"

Pads the values of source_column with the given character until it reaches the specified length of characters. The direction param can be changed to apply either a left or a right pad. Defaults to left pad.

Wraps the lpad and rpad functions from PySpark.

Parameters:

Name Type Description Default columns Union[str, List[str]]

The column (or list of columns) to pad. Alias: column

required target_column Optional[str]

The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.

None character constr(min_length=1)

The character to use for padding

required length PositiveInt

Positive integer to indicate the intended length

required direction Optional[pad_directions]

On which side to add the characters. Either \"left\" or \"right\". Defaults to \"left\"

left Example

input_df:

column
hello
world
output_df = Pad(\n    column=\"column\",\n    target_column=\"padded_column\",\n    character=\"*\",\n    length=10,\n    direction=\"right\",\n).transform(input_df)\n

output_df:

column  padded_column
hello   hello*****
world   world*****

Note: in the example above, we could have used the RPad class instead of Pad with direction=\"right\" to achieve the same result.

"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.Pad.character","title":"character class-attribute instance-attribute","text":"
character: constr(min_length=1) = Field(default=..., description='The character to use for padding')\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.Pad.direction","title":"direction class-attribute instance-attribute","text":"
direction: Optional[pad_directions] = Field(default='left', description='On which side to add the characters . Either \"left\" or \"right\". Defaults to \"left\"')\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.Pad.length","title":"length class-attribute instance-attribute","text":"
length: PositiveInt = Field(default=..., description='Positive integer to indicate the intended length')\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.Pad.func","title":"func","text":"
func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/pad.py
def func(self, column: Column):\n    func = lpad if self.direction == \"left\" else rpad\n    return func(column, self.length, self.character)\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.RPad","title":"koheesio.spark.transformations.strings.pad.RPad","text":"

Pad with a character on the right side of the string.

See Pad class docstring for more information.

"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.RPad.direction","title":"direction class-attribute instance-attribute","text":"
direction: Optional[pad_directions] = 'right'\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html","title":"Regexp","text":"

String transformations using regular expressions.

This module contains transformations that use regular expressions to transform strings.

Classes:

Name Description RegexpExtract

Extract a specific group matched by a Java regexp from the specified string column.

RegexpReplace

Searches for the given regexp and replaces all instances with what is in 'replacement'.

"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract","title":"koheesio.spark.transformations.strings.regexp.RegexpExtract","text":"

Extract a specific group matched by a Java regexp from the specified string column. If the regexp did not match, or the specified group did not match, an empty string is returned.

A wrapper around the pyspark regexp_extract function

Parameters:

Name Type Description Default columns Union[str, List[str]]

The column (or list of columns) to extract from. Alias: column

required target_column Optional[str]

The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.

None regexp str

The Java regular expression to extract

required index Optional[int]

When there are more groups in the match, you can indicate which one you want. 0 means the whole match. 1 and above are groups within that match.

0 Example"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract--extracting-the-year-and-week-number-from-a-string","title":"Extracting the year and week number from a string","text":"

Let's say we have a column containing the year and week in a format like Y## W# and we would like to extract the week numbers.

input_df:

YWK 2020 W1 2021 WK2
output_df = RegexpExtract(\n    column=\"YWK\",\n    target_column=\"week_number\",\n    regexp=\"Y([0-9]+) ?WK?([0-9]+)\",\n    index=2,  # remember that this is 1-indexed! So 2 will get the week number in this example.\n).transform(input_df)\n

output_df:

YWK week_number 2020 W1 1 2021 WK2 2"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract--using-the-same-example-but-extracting-the-year-instead","title":"Using the same example, but extracting the year instead","text":"

If you want to extract the year, you can use index=1.

output_df = RegexpExtract(\n    column=\"YWK\",\n    target_column=\"year\",\n    regexp=\"Y([0-9]+) ?WK?([0-9]+)\",\n    index=1,  # remember that this is 1-indexed! So 1 will get the year in this example.\n).transform(input_df)\n

output_df:

YWK year 2020 W1 2020 2021 WK2 2021"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract.index","title":"index class-attribute instance-attribute","text":"
index: Optional[int] = Field(default=0, description='When there are more groups in the match, you can indicate which one you want. 0 means the whole match. 1 and above are groups within that match.')\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract.regexp","title":"regexp class-attribute instance-attribute","text":"
regexp: str = Field(default=..., description='The Java regular expression to extract')\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract.func","title":"func","text":"
func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/regexp.py
def func(self, column: Column):\n    return regexp_extract(column, self.regexp, self.index)\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpReplace","title":"koheesio.spark.transformations.strings.regexp.RegexpReplace","text":"

Searches for the given regexp and replaces all instances with what is in 'replacement'.

A wrapper around the pyspark regexp_replace function

Parameters:

Name Type Description Default columns

The column (or list of columns) to replace in. Alias: column

required target_column

The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.

required regexp

The regular expression to replace

required replacement

String to replace matched pattern with.

required

Examples:

input_df: | content | |------------| | hello world|

Let's say you want to replace 'hello'.

output_df = RegexpReplace(\n    column=\"content\",\n    target_column=\"replaced\",\n    regexp=\"hello\",\n    replacement=\"gutentag\",\n).transform(input_df)\n

output_df: | content | replaced | |------------|---------------| | hello world| gutentag world|

"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpReplace.regexp","title":"regexp class-attribute instance-attribute","text":"
regexp: str = Field(default=..., description='The regular expression to replace')\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpReplace.replacement","title":"replacement class-attribute instance-attribute","text":"
replacement: str = Field(default=..., description='String to replace matched pattern with.')\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpReplace.func","title":"func","text":"
func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/regexp.py
def func(self, column: Column):\n    return regexp_replace(column, self.regexp, self.replacement)\n
"},{"location":"api_reference/spark/transformations/strings/replace.html","title":"Replace","text":"

String replacements without using regular expressions.

"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace","title":"koheesio.spark.transformations.strings.replace.Replace","text":"

Replace all instances of a string in a column with another string.

This transformation uses PySpark when().otherwise() functions.

Notes
  • If original_value is not set, the transformation will replace all null values with new_value
  • If original_value is set, the transformation will replace all values matching original_value with new_value
  • Numeric values are supported, but will be cast to string in the process
  • Replace is meant for simple string replacements. If more advanced replacements are needed, use the RegexpReplace transformation instead.

Parameters:

Name Type Description Default columns Union[str, List[str]]

The column (or list of columns) to replace values in. Alias: column

required target_column Optional[str]

The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.

None original_value Optional[str]

The original value that needs to be replaced. Alias: from

None new_value str

The new value to replace this with. Alias: to

required

Examples:

input_df:

column hello world None"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace--replace-all-null-values-with-a-new-value","title":"Replace all null values with a new value","text":"
output_df = Replace(\n    column=\"column\",\n    target_column=\"replaced_column\",\n    original_value=None,  # This is the default value, so it can be omitted\n    new_value=\"programmer\",\n).transform(input_df)\n

output_df:

column replaced_column hello hello world world None programmer"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace--replace-all-instances-of-a-string-in-a-column-with-another-string","title":"Replace all instances of a string in a column with another string","text":"
output_df = Replace(\n    column=\"column\",\n    target_column=\"replaced_column\",\n    original_value=\"world\",\n    new_value=\"programmer\",\n).transform(input_df)\n

output_df:

column replaced_column hello hello world programmer None None"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace.new_value","title":"new_value class-attribute instance-attribute","text":"
new_value: str = Field(default=..., alias='to', description='The new value to replace this with')\n
"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace.original_value","title":"original_value class-attribute instance-attribute","text":"
original_value: Optional[str] = Field(default=None, alias='from', description='The original value that needs to be replaced')\n
"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace.cast_values_to_str","title":"cast_values_to_str","text":"
cast_values_to_str(value)\n

Cast values to string if they are not None

Source code in src/koheesio/spark/transformations/strings/replace.py
@field_validator(\"original_value\", \"new_value\", mode=\"before\")\ndef cast_values_to_str(cls, value):\n    \"\"\"Cast values to string if they are not None\"\"\"\n    if value:\n        return str(value)\n
"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace.func","title":"func","text":"
func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/replace.py
def func(self, column: Column):\n    when_statement = (\n        when(column.isNull(), lit(self.new_value))\n        if not self.original_value\n        else when(\n            column == self.original_value,\n            lit(self.new_value),\n        )\n    )\n    return when_statement.otherwise(column)\n
"},{"location":"api_reference/spark/transformations/strings/split.html","title":"Split","text":"

Splits the contents of a column on basis of a split_pattern

Classes:

Name Description SplitAll

Splits the contents of a column on basis of a split_pattern.

SplitAtFirstMatch

Like SplitAll, but only splits the string once. You can specify whether you want the first or second part.

"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAll","title":"koheesio.spark.transformations.strings.split.SplitAll","text":"

This function splits the contents of a column on basis of a split_pattern.

It splits at all the locations the pattern is found. The new column will be of ArrayType.

Wraps the pyspark.sql.functions.split function.

Parameters:

Name Type Description Default columns Union[str, List[str]]

The column (or list of columns) to split. Alias: column

required target_column Optional[str]

The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.

None split_pattern str

This is the pattern that will be used to split the column contents.

required Example"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAll--splitting-with-a-space-as-a-pattern","title":"Splitting with a space as a pattern:","text":"

input_df:

product amount country Banana lemon orange 1000 USA Carrots Blueberries 1500 USA Beans 1600 USA
output_df = SplitAll(column=\"product\", target_column=\"split\", split_pattern=\" \").transform(input_df)\n

output_df:

product amount country split Banana lemon orange 1000 USA [\"Banana\", \"lemon\" \"orange\"] Carrots Blueberries 1500 USA [\"Carrots\", \"Blueberries\"] Beans 1600 USA [\"Beans\"]"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAll.split_pattern","title":"split_pattern class-attribute instance-attribute","text":"
split_pattern: str = Field(default=..., description='The pattern to split the column contents.')\n
"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAll.func","title":"func","text":"
func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/split.py
def func(self, column: Column):\n    return split(column, pattern=self.split_pattern)\n
"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAtFirstMatch","title":"koheesio.spark.transformations.strings.split.SplitAtFirstMatch","text":"

Like SplitAll, but only splits the string once. You can specify whether you want the first or second part.

Note
  • SplitAtFirstMatch splits the string only once. To specify whether you want the first part or the second you can toggle the parameter retrieve_first_part.
  • The new column will be of StringType.
  • If you want to split a column more than once, you should call this function multiple times.

Parameters:

Name Type Description Default columns Union[str, List[str]]

The column (or list of columns) to split. Alias: column

required target_column Optional[str]

The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.

None split_pattern str

This is the pattern that will be used to split the column contents.

required retrieve_first_part Optional[bool]

Takes the first part of the split when true, the second part when False. Other parts are ignored. Defaults to True.

True Example"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAtFirstMatch--splitting-with-a-space-as-a-pattern","title":"Splitting with a space as a pattern:","text":"

input_df:

product amount country Banana lemon orange 1000 USA Carrots Blueberries 1500 USA Beans 1600 USA
output_df = SplitAtFirstMatch(column=\"product\", target_column=\"split_first\", split_pattern=\"an\").transform(input_df)\n

output_df:

product amount country split_first Banana lemon orange 1000 USA B Carrots Blueberries 1500 USA Carrots Blueberries Beans 1600 USA Be"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAtFirstMatch.retrieve_first_part","title":"retrieve_first_part class-attribute instance-attribute","text":"
retrieve_first_part: Optional[bool] = Field(default=True, description='Takes the first part of the split when true, the second part when False. Other parts are ignored.')\n
"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAtFirstMatch.func","title":"func","text":"
func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/split.py
def func(self, column: Column):\n    split_func = split(column, pattern=self.split_pattern)\n\n    # first part\n    if self.retrieve_first_part:\n        return split_func.getItem(0)\n\n    # or, second part\n    return coalesce(split_func.getItem(1), lit(\"\"))\n
"},{"location":"api_reference/spark/transformations/strings/substring.html","title":"Substring","text":"

Extracts a substring from a string column starting at the given position.

"},{"location":"api_reference/spark/transformations/strings/substring.html#koheesio.spark.transformations.strings.substring.Substring","title":"koheesio.spark.transformations.strings.substring.Substring","text":"

Extracts a substring from a string column starting at the given position.

This is a wrapper around PySpark substring() function

Notes
  • Numeric columns will be cast to string
  • start is 1-indexed, not 0-indexed!

Parameters:

Name Type Description Default columns Union[str, List[str]]

The column (or list of columns) to substring. Alias: column

required target_column Optional[str]

The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.

None start PositiveInt

Positive int. Defines where to begin the substring from. The first character of the field has index 1!

required length Optional[int]

Optional. If not provided, the substring will go until the end of the string.

-1 Example"},{"location":"api_reference/spark/transformations/strings/substring.html#koheesio.spark.transformations.strings.substring.Substring--extract-a-substring-from-a-string-column-starting-at-the-given-position","title":"Extract a substring from a string column starting at the given position.","text":"

input_df:

column skyscraper
output_df = Substring(\n    column=\"column\",\n    target_column=\"substring_column\",\n    start=3,  # 1-indexed! So this will start at the 3rd character\n    length=4,\n).transform(input_df)\n

output_df:

column substring_column skyscraper yscr"},{"location":"api_reference/spark/transformations/strings/substring.html#koheesio.spark.transformations.strings.substring.Substring.length","title":"length class-attribute instance-attribute","text":"
length: Optional[int] = Field(default=-1, description='The target length for the string. use -1 to perform until end')\n
"},{"location":"api_reference/spark/transformations/strings/substring.html#koheesio.spark.transformations.strings.substring.Substring.start","title":"start class-attribute instance-attribute","text":"
start: PositiveInt = Field(default=..., description='The starting position')\n
"},{"location":"api_reference/spark/transformations/strings/substring.html#koheesio.spark.transformations.strings.substring.Substring.func","title":"func","text":"
func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/substring.py
def func(self, column: Column):\n    return when(column.isNull(), None).otherwise(substring(column, self.start, self.length)).cast(StringType())\n
"},{"location":"api_reference/spark/transformations/strings/trim.html","title":"Trim","text":"

Trim whitespace from the beginning and/or end of a string.

Classes:

Name Description - `Trim`

Trim whitespace from the beginning and/or end of a string.

- `LTrim`

Trim whitespace from the beginning of a string.

- `RTrim`

Trim whitespace from the end of a string.

See class docstrings for more information."},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.trim_type","title":"koheesio.spark.transformations.strings.trim.trim_type module-attribute","text":"
trim_type = Literal['left', 'right', 'left-right']\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.LTrim","title":"koheesio.spark.transformations.strings.trim.LTrim","text":"

Trim whitespace from the beginning of a string. Alias: LeftTrim
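
A minimal sketch, assuming an input_df with a string column named \"column\"; LTrim behaves like Trim with the direction fixed to \"left\".

output_df = LTrim(column=\"column\", target_column=\"trimmed_column\").transform(input_df)\n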

"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.LTrim.direction","title":"direction class-attribute instance-attribute","text":"
direction: trim_type = 'left'\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.RTrim","title":"koheesio.spark.transformations.strings.trim.RTrim","text":"

Trim whitespace from the end of a string. Alias: RightTrim

"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.RTrim.direction","title":"direction class-attribute instance-attribute","text":"
direction: trim_type = 'right'\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim","title":"koheesio.spark.transformations.strings.trim.Trim","text":"

Trim whitespace from the beginning and/or end of a string.

This is a wrapper around PySpark ltrim() and rtrim() functions

The direction parameter can be changed to apply either a left or a right trim. Defaults to left AND right trim.

Note: If the type of the column is not string, Trim will not be run. A warning will be raised to indicate this.

Parameters:

Name Type Description Default columns

The column (or list of columns) to trim. Alias: column. If no columns are provided, all string columns will be trimmed.

required target_column

The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.

required direction

On which side to remove the spaces. Either \"left\", \"right\" or \"left-right\". Defaults to \"left-right\"

required

Examples:

input_df: | column | |-----------| | \" hello \" |

"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim--trim-whitespace-from-the-beginning-of-a-string","title":"Trim whitespace from the beginning of a string","text":"
output_df = Trim(column=\"column\", target_column=\"trimmed_column\", direction=\"left\").transform(input_df)\n

output_df: | column | trimmed_column | |-----------|----------------| | \" hello \" | \"hello \" |

"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim--trim-whitespace-from-both-sides-of-a-string","title":"Trim whitespace from both sides of a string","text":"
output_df = Trim(\n    column=\"column\",\n    target_column=\"trimmed_column\",\n    direction=\"left-right\",  # default value\n).transform(input_df)\n

output_df: | column | trimmed_column | |-----------|----------------| | \" hello \" | \"hello\" |

"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim--trim-whitespace-from-the-end-of-a-string","title":"Trim whitespace from the end of a string","text":"
output_df = Trim(column=\"column\", target_column=\"trimmed_column\", direction=\"right\").transform(input_df)\n

output_df: | column | trimmed_column | |-----------|----------------| | \" hello \" | \" hello\" |

"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.columns","title":"columns class-attribute instance-attribute","text":"
columns: ListOfColumns = Field(default='*', alias='column', description='The column (or list of columns) to trim. Alias: column. If no columns are provided, all string columns will be trimmed.')\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.direction","title":"direction class-attribute instance-attribute","text":"
direction: trim_type = Field(default='left-right', description=\"On which side to remove the spaces. Either 'left', 'right' or 'left-right'\")\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.ColumnConfig","title":"ColumnConfig","text":"

Limit data types to string only.

"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type = [STRING]\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type = [STRING]\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.func","title":"func","text":"
func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/trim.py
def func(self, column: Column):\n    if self.direction == \"left\":\n        return f.ltrim(column)\n\n    if self.direction == \"right\":\n        return f.rtrim(column)\n\n    # both (left-right)\n    return f.rtrim(f.ltrim(column))\n
"},{"location":"api_reference/spark/writers/index.html","title":"Writers","text":"

The Writer class is used to write the DataFrame to a target.

"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode","title":"koheesio.spark.writers.BatchOutputMode","text":"

For Batch:

  • append: Append the contents of the DataFrame to the output table, default option in Koheesio.
  • overwrite: overwrite the existing data.
  • ignore: ignore the operation (i.e. no-op).
  • error or errorifexists: throw an exception at runtime.
  • merge: update matching data in the table and insert rows that do not exist.
  • merge_all: update matching data in the table and insert rows that do not exist.
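
The modes above can be referenced through BatchOutputMode; a minimal sketch (how a concrete writer consumes the mode depends on that writer):

from koheesio.spark.writers import BatchOutputMode\n\nmode = BatchOutputMode.APPEND  # the default option in Koheesio\n# MERGEALL and MERGE_ALL are two spellings of the same \"merge_all\" mode\n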
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.APPEND","title":"APPEND class-attribute instance-attribute","text":"
APPEND = 'append'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.ERROR","title":"ERROR class-attribute instance-attribute","text":"
ERROR = 'error'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.ERRORIFEXISTS","title":"ERRORIFEXISTS class-attribute instance-attribute","text":"
ERRORIFEXISTS = 'error'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.IGNORE","title":"IGNORE class-attribute instance-attribute","text":"
IGNORE = 'ignore'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.MERGE","title":"MERGE class-attribute instance-attribute","text":"
MERGE = 'merge'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.MERGEALL","title":"MERGEALL class-attribute instance-attribute","text":"
MERGEALL = 'merge_all'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.MERGE_ALL","title":"MERGE_ALL class-attribute instance-attribute","text":"
MERGE_ALL = 'merge_all'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.OVERWRITE","title":"OVERWRITE class-attribute instance-attribute","text":"
OVERWRITE = 'overwrite'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.StreamingOutputMode","title":"koheesio.spark.writers.StreamingOutputMode","text":"

For Streaming:

  • append: only the new rows in the streaming DataFrame will be written to the sink.
  • complete: all the rows in the streaming DataFrame/Dataset will be written to the sink every time there are some updates.
  • update: only the rows that were updated in the streaming DataFrame/Dataset will be written to the sink every time there are some updates. If the query doesn't contain aggregations, it will be equivalent to append mode.
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.StreamingOutputMode.APPEND","title":"APPEND class-attribute instance-attribute","text":"
APPEND = 'append'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.StreamingOutputMode.COMPLETE","title":"COMPLETE class-attribute instance-attribute","text":"
COMPLETE = 'complete'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.StreamingOutputMode.UPDATE","title":"UPDATE class-attribute instance-attribute","text":"
UPDATE = 'update'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer","title":"koheesio.spark.writers.Writer","text":"

The Writer class is used to write the DataFrame to a target.

"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer.df","title":"df class-attribute instance-attribute","text":"
df: Optional[DataFrame] = Field(default=None, description='The Spark DataFrame')\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer.format","title":"format class-attribute instance-attribute","text":"
format: str = Field(default='delta', description='The format of the output')\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer.streaming","title":"streaming property","text":"
streaming: bool\n

Check if the DataFrame is a streaming DataFrame or not.

"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer.execute","title":"execute abstractmethod","text":"
execute()\n

Execute on a Writer should handle writing of the self.df (input) as a minimum

Source code in src/koheesio/spark/writers/__init__.py
@abstractmethod\ndef execute(self):\n    \"\"\"Execute on a Writer should handle writing of the self.df (input) as a minimum\"\"\"\n    # self.df  # input dataframe\n    ...\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer.write","title":"write","text":"
write(df: Optional[DataFrame] = None) -> Output\n

Write the DataFrame to the output using execute() and return the output.

If no DataFrame is passed, the self.df will be used. If no self.df is set, a RuntimeError will be thrown.

Source code in src/koheesio/spark/writers/__init__.py
def write(self, df: Optional[DataFrame] = None) -> SparkStep.Output:\n    \"\"\"Write the DataFrame to the output using execute() and return the output.\n\n    If no DataFrame is passed, the self.df will be used.\n    If no self.df is set, a RuntimeError will be thrown.\n    \"\"\"\n    self.df = df or self.df\n    if not self.df:\n        raise RuntimeError(\"No valid Dataframe was passed\")\n    self.execute()\n    return self.output\n
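
A usage sketch; MyDeltaWriter is a hypothetical concrete Writer subclass and df a Spark DataFrame:

writer = MyDeltaWriter(format=\"delta\")  # hypothetical concrete Writer subclass\noutput = writer.write(df)  # sets writer.df, calls execute(), and returns writer.output\n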
"},{"location":"api_reference/spark/writers/buffer.html","title":"Buffer","text":"

This module contains classes for writing data to a buffer before writing to the final destination.

The BufferWriter class is a base class for writers that write to a buffer first. It provides methods for writing, reading, and resetting the buffer, as well as checking if the buffer is compressed and compressing the buffer.

The PandasCsvBufferWriter class is a subclass of BufferWriter that writes a Spark DataFrame to CSV file(s) using Pandas. It is not meant to be used for writing huge amounts of data, but rather for writing smaller amounts of data to more arbitrary file systems (e.g., SFTP).

The PandasJsonBufferWriter class is a subclass of BufferWriter that writes a Spark DataFrame to JSON file(s) using Pandas. It is not meant to be used for writing huge amounts of data, but rather for writing smaller amounts of data to more arbitrary file systems (e.g., SFTP).

"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter","title":"koheesio.spark.writers.buffer.BufferWriter","text":"

Base class for writers that write to a buffer first, before writing to the final destination.

execute() method should implement how the incoming DataFrame is written to the buffer object (e.g. BytesIO) in the output.

The default implementation uses a SpooledTemporaryFile as the buffer. This is a file-like object that starts off stored in memory and automatically rolls over to a temporary file on disk if it exceeds a certain size. A SpooledTemporaryFile behaves similarly to BytesIO, but with the added benefit of being able to handle larger amounts of data.

This approach provides a balance between speed and memory usage, allowing for fast in-memory operations for smaller amounts of data while still being able to handle larger amounts of data that would not otherwise fit in memory.
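
A minimal sketch using one of the concrete subclasses documented below, assuming a small Spark DataFrame df:

writer = PandasCsvBufferWriter(df=df)\noutput = writer.write()  # execute() fills the spooled buffer\ndata = output.read()  # rewinds the buffer and returns its contents\n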

"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output","title":"Output","text":"

Output class for BufferWriter

"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.buffer","title":"buffer class-attribute instance-attribute","text":"
buffer: InstanceOf[SpooledTemporaryFile] = Field(default_factory=partial(SpooledTemporaryFile, mode='w+b', max_size=0), exclude=True)\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.compress","title":"compress","text":"
compress()\n

Compress the file_buffer in place using GZIP

Source code in src/koheesio/spark/writers/buffer.py
def compress(self):\n    \"\"\"Compress the file_buffer in place using GZIP\"\"\"\n    # check if the buffer is already compressed\n    if self.is_compressed():\n        self.logger.warn(\"Buffer is already compressed. Nothing to compress...\")\n        return self\n\n    # compress the file_buffer\n    file_buffer = self.buffer\n    compressed = gzip.compress(file_buffer.read())\n\n    # write the compressed content back to the buffer\n    self.reset_buffer()\n    self.buffer.write(compressed)\n\n    return self  # to allow for chaining\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.is_compressed","title":"is_compressed","text":"
is_compressed()\n

Check if the buffer is compressed.

Source code in src/koheesio/spark/writers/buffer.py
def is_compressed(self):\n    \"\"\"Check if the buffer is compressed.\"\"\"\n    self.rewind_buffer()\n    magic_number_present = self.buffer.read(2) == b\"\\x1f\\x8b\"\n    self.rewind_buffer()\n    return magic_number_present\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.read","title":"read","text":"
read()\n

Read the buffer

Source code in src/koheesio/spark/writers/buffer.py
def read(self):\n    \"\"\"Read the buffer\"\"\"\n    self.rewind_buffer()\n    data = self.buffer.read()\n    self.rewind_buffer()\n    return data\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.reset_buffer","title":"reset_buffer","text":"
reset_buffer()\n

Reset the buffer

Source code in src/koheesio/spark/writers/buffer.py
def reset_buffer(self):\n    \"\"\"Reset the buffer\"\"\"\n    self.buffer.truncate(0)\n    self.rewind_buffer()\n    return self\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.rewind_buffer","title":"rewind_buffer","text":"
rewind_buffer()\n

Rewind the buffer

Source code in src/koheesio/spark/writers/buffer.py
def rewind_buffer(self):\n    \"\"\"Rewind the buffer\"\"\"\n    self.buffer.seek(0)\n    return self\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.write","title":"write","text":"
write(df=None) -> Output\n

Write the DataFrame to the buffer

Source code in src/koheesio/spark/writers/buffer.py
def write(self, df=None) -> Output:\n    \"\"\"Write the DataFrame to the buffer\"\"\"\n    self.df = df or self.df\n    if not self.df:\n        raise RuntimeError(\"No valid Dataframe was passed\")\n    self.output.reset_buffer()\n    self.execute()\n    return self.output\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter","title":"koheesio.spark.writers.buffer.PandasCsvBufferWriter","text":"

Write a Spark DataFrame to CSV file(s) using Pandas.

Takes inspiration from https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html

See also: https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option

Note

This class is not meant to be used for writing huge amounts of data. It is meant to be used for writing smaller amounts of data to more arbitrary file systems (e.g. SFTP).
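
A minimal usage sketch, assuming a small Spark DataFrame df; the option names follow the PySpark-style spelling from the mapping table below:

output = PandasCsvBufferWriter(\n    df=df,\n    sep=\";\",\n    header=True,\n    compression=\"gzip\",\n).write()\ncsv_data = output.read()\n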

Pyspark vs Pandas

The following table shows the mapping between Pyspark, Pandas, and Koheesio properties. Note that the default values are mostly the same as Pyspark's DataFrameWriter implementation, with some exceptions (see below).

This class implements the most commonly used properties. If a property is not explicitly implemented, it can be accessed through params.

PySpark Property Default PySpark Pandas Property Default Pandas Koheesio Property Default Koheesio Notes maxRecordsPerFile ... chunksize None max_records_per_file ... Spark property name: spark.sql.files.maxRecordsPerFile sep , sep , sep , lineSep \\n line_terminator os.linesep lineSep (alias=line_terminator) \\n N/A ... index True index False Determines whether row labels (index) are included in the output header False header True header True quote \" quotechar \" quote (alias=quotechar) \" quoteAll False doublequote True quoteAll (alias=doublequote) False escape \\ escapechar None escapechar (alias=escape) \\ escapeQuotes True N/A N/A N/A ... Not available in Pandas ignoreLeadingWhiteSpace True N/A N/A N/A ... Not available in Pandas ignoreTrailingWhiteSpace True N/A N/A N/A ... Not available in Pandas charToEscapeQuoteEscaping escape or \u0000 N/A N/A N/A ... Not available in Pandas dateFormat yyyy-MM-dd N/A N/A N/A ... Pandas implements Timestamp, not Date timestampFormat yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX] date_format N/A timestampFormat (alias=date_format) yyyy-MM-dd'T'HHss.SSS Follows PySpark defaults timestampNTZFormat yyyy-MM-dd'T'HH:mm:ss[.SSS] N/A N/A N/A ... Pandas implements Timestamp, see above compression None compression infer compression None encoding utf-8 encoding utf-8 N/A ... Not explicitly implemented nullValue \"\" na_rep \"\" N/A \"\" Not explicitly implemented emptyValue \"\" na_rep \"\" N/A \"\" Not explicitly implemented N/A ... float_format N/A N/A ... Not explicitly implemented N/A ... decimal N/A N/A ... Not explicitly implemented N/A ... index_label None N/A ... Not explicitly implemented N/A ... columns N/A N/A ... Not explicitly implemented N/A ... mode N/A N/A ... Not explicitly implemented N/A ... quoting N/A N/A ... Not explicitly implemented N/A ... errors N/A N/A ... Not explicitly implemented N/A ... storage_options N/A N/A ... Not explicitly implemented differences with Pyspark:
  • dateFormat -> Pandas implements Timestamp, not just Date. Hence, Koheesio sets the default to the python equivalent of PySpark's default.
  • compression -> Spark does not compress by default, hence Koheesio does not compress by default. Compression can be provided though.

Parameters:

Name Type Description Default header bool

Whether to write the names of columns as the first line. In Pandas, a list of strings can be given, assumed to be aliases for the column names; this is not supported in this class. Instead, the column names as used in the dataframe are used as the header. Koheesio sets this default to True to match Pandas' default.

True sep str

Field delimiter for the output file. Default is ','.

, quote str

String of length 1. Character used to quote fields. In PySpark, this is called 'quote', in Pandas this is called 'quotechar'. Default is '\"'.

\" quoteAll bool

A flag indicating whether all values should always be enclosed in quotes in a field. Koheesio sets the default (False) to only escape values containing a quote character - this is Pyspark's default behavior. In Pandas, this is called 'doublequote'. Default is False.

False escape str

String of length 1. Character used to escape sep and quotechar when appropriate. Koheesio sets this default to \\ to match Pyspark's default behavior. In Pandas, this field is called 'escapechar', and defaults to None. Default is '\\'.

\\ timestampFormat str

Sets the string that indicates a date format for datetime objects. Koheesio sets this default to a close equivalent of Pyspark's default (excluding timezone information). In Pandas, this field is called 'date_format'. Note: Pandas does not support Java Timestamps, only Python Timestamps. The Pyspark default is yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX] which mimics the iso8601 format (datetime.isoformat()). Default is '%Y-%m-%dT%H:%M:%S.%f'.

yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX] lineSep str, optional, default=

String of length 1. Defines the character used as line separator that should be used for writing. Default is os.linesep.

required compression Optional[Literal['infer', 'gzip', 'bz2', 'zip', 'xz', 'zstd', 'tar']]

A string representing the compression to use for on-the-fly compression of the output data. Note: Pandas sets this default to 'infer', Koheesio sets this default to 'None' leaving the data uncompressed by default. Can be set to one of 'infer', 'gzip', 'bz2', 'zip', 'xz', 'zstd', or 'tar'. See Pandas documentation for more details.

None"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.compression","title":"compression class-attribute instance-attribute","text":"
compression: Optional[CompressionOptions] = Field(default=None, description=\"A string representing the compression to use for on-the-fly compression of the output data.Note: Pandas sets this default to 'infer', Koheesio sets this default to 'None' leaving the data uncompressed by default. Can be set to one of 'infer', 'gzip', 'bz2', 'zip', 'xz', 'zstd', or 'tar'. See Pandas documentation for more details.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.escape","title":"escape class-attribute instance-attribute","text":"
escape: constr(max_length=1) = Field(default='\\\\', description=\"String of length 1. Character used to escape sep and quotechar when appropriate. Koheesio sets this default to `\\\\` to match Pyspark's default behavior. In Pandas, this is called 'escapechar', and defaults to None.\", alias='escapechar')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.header","title":"header class-attribute instance-attribute","text":"
header: bool = Field(default=True, description=\"Whether to write the names of columns as the first line. In Pandas a list of strings can be given assumed to be aliases for the column names - this is not supported in this class. Instead, the column names as used in the dataframe are used as the header. Koheesio sets this default to True to match Pandas' default.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.index","title":"index class-attribute instance-attribute","text":"
index: bool = Field(default=False, description='Toggles whether to write row names (index). Default False in Koheesio - pandas default is True.')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.lineSep","title":"lineSep class-attribute instance-attribute","text":"
lineSep: Optional[constr(max_length=1)] = Field(default=linesep, description='String of length 1. Defines the character used as line separator that should be used for writing.', alias='line_terminator')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.quote","title":"quote class-attribute instance-attribute","text":"
quote: constr(max_length=1) = Field(default='\"', description=\"String of length 1. Character used to quote fields. In PySpark, this is called 'quote', in Pandas this is called 'quotechar'.\", alias='quotechar')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.quoteAll","title":"quoteAll class-attribute instance-attribute","text":"
quoteAll: bool = Field(default=False, description=\"A flag indicating whether all values should always be enclosed in quotes in a field. Koheesio set the default (False) to only escape values containing a quote character - this is Pyspark's default behavior. In Pandas, this is called 'doublequote'.\", alias='doublequote')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.sep","title":"sep class-attribute instance-attribute","text":"
sep: constr(max_length=1) = Field(default=',', description='Field delimiter for the output file')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.timestampFormat","title":"timestampFormat class-attribute instance-attribute","text":"
timestampFormat: str = Field(default='%Y-%m-%dT%H:%M:%S.%f', description=\"Sets the string that indicates a date format for datetime objects. Koheesio sets this default to a close equivalent of Pyspark's default (excluding timezone information). In Pandas, this field is called 'date_format'. Note: Pandas does not support Java Timestamps, only Python Timestamps. The Pyspark default is `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]` which mimics the iso8601 format (`datetime.isoformat()`).\", alias='date_format')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.Output","title":"Output","text":"

Output class for PandasCsvBufferWriter

"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.Output.pandas_df","title":"pandas_df class-attribute instance-attribute","text":"
pandas_df: Optional[DataFrame] = Field(None, description='The Pandas DataFrame that was written')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.execute","title":"execute","text":"
execute()\n

Write the DataFrame to the buffer using Pandas to_csv() method. Compression is handled by pandas to_csv() method.

Source code in src/koheesio/spark/writers/buffer.py
def execute(self):\n    \"\"\"Write the DataFrame to the buffer using Pandas to_csv() method.\n    Compression is handled by pandas to_csv() method.\n    \"\"\"\n    # convert the Spark DataFrame to a Pandas DataFrame\n    self.output.pandas_df = self.df.toPandas()\n\n    # create csv file in memory\n    file_buffer = self.output.buffer\n    self.output.pandas_df.to_csv(file_buffer, **self.get_options(options_type=\"spark\"))\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.get_options","title":"get_options","text":"
get_options(options_type: str = 'csv')\n

Returns the options to pass to Pandas' to_csv() method.

Source code in src/koheesio/spark/writers/buffer.py
def get_options(self, options_type: str = \"csv\"):\n    \"\"\"Returns the options to pass to Pandas' to_csv() method.\"\"\"\n    try:\n        import pandas as _pd\n\n        # Get the pandas version as a tuple of integers\n        pandas_version = tuple(int(i) for i in _pd.__version__.split(\".\"))\n    except ImportError:\n        raise ImportError(\"Pandas is required to use this writer\")\n\n    # Use line_separator for pandas 2.0.0 and later\n    line_sep_option_naming = \"line_separator\" if pandas_version >= (2, 0, 0) else \"line_terminator\"\n\n    csv_options = {\n        \"header\": self.header,\n        \"sep\": self.sep,\n        \"quotechar\": self.quote,\n        \"doublequote\": self.quoteAll,\n        \"escapechar\": self.escape,\n        \"na_rep\": self.emptyValue or self.nullValue,\n        line_sep_option_naming: self.lineSep,\n        \"index\": self.index,\n        \"date_format\": self.timestampFormat,\n        \"compression\": self.compression,\n        **self.params,\n    }\n\n    if options_type == \"spark\":\n        csv_options[\"lineterminator\"] = csv_options.pop(line_sep_option_naming)\n    elif options_type == \"kohesio_pandas_buffer_writer\":\n        csv_options[\"line_terminator\"] = csv_options.pop(line_sep_option_naming)\n\n    return csv_options\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter","title":"koheesio.spark.writers.buffer.PandasJsonBufferWriter","text":"

Write a Spark DataFrame to JSON file(s) using Pandas.

Takes inspiration from https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html

Note

This class is not meant to be used for writing huge amounts of data. It is meant to be used for writing smaller amounts of data to more arbitrary file systems (e.g. SFTP).
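
A minimal usage sketch, assuming a small Spark DataFrame df; with the default orient='records' and lines=True, the buffer ends up holding one JSON object per line:

output = PandasJsonBufferWriter(\n    df=df,\n    orient=\"records\",\n    lines=True,\n).write()\njson_lines = output.read()\n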

Parameters:

Name Type Description Default orient

Format of the resulting JSON string. Default is 'records'.

required lines

Format output as one JSON object per line. Only used when orient='records'. Default is True. - If true, the output will be formatted as one JSON object per line. - If false, the output will be written as a single JSON object. Note: this value is only used when orient='records' and will be ignored otherwise.

required date_format

Type of date conversion. Default is 'iso'. See Date and Timestamp Formats for a detailed description and more information.

required double_precision

Number of decimal places for encoding floating point values. Default is 10.

required force_ascii

Force encoded string to be ASCII. Default is True.

required compression

A string representing the compression to use for on-the-fly compression of the output data. Koheesio sets this default to 'None' leaving the data uncompressed. Can be set to 'gzip' optionally. Other compression options are currently not supported by Koheesio for JSON output.

required"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.columns","title":"columns class-attribute instance-attribute","text":"
columns: Optional[list[str]] = Field(default=None, description='The columns to write. If None, all columns will be written.')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.compression","title":"compression class-attribute instance-attribute","text":"
compression: Optional[Literal['gzip']] = Field(default=None, description=\"A string representing the compression to use for on-the-fly compression of the output data.Koheesio sets this default to 'None' leaving the data uncompressed by default. Can be set to 'gzip' optionally.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.date_format","title":"date_format class-attribute instance-attribute","text":"
date_format: Literal['iso', 'epoch'] = Field(default='iso', description=\"Type of date conversion. Default is 'iso'.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.double_precision","title":"double_precision class-attribute instance-attribute","text":"
double_precision: int = Field(default=10, description='Number of decimal places for encoding floating point values. Default is 10.')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.force_ascii","title":"force_ascii class-attribute instance-attribute","text":"
force_ascii: bool = Field(default=True, description='Force encoded string to be ASCII. Default is True.')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.lines","title":"lines class-attribute instance-attribute","text":"
lines: bool = Field(default=True, description=\"Format output as one JSON object per line. Only used when orient='records'. Default is True.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.orient","title":"orient class-attribute instance-attribute","text":"
orient: Literal['split', 'records', 'index', 'columns', 'values', 'table'] = Field(default='records', description=\"Format of the resulting JSON string. Default is 'records'.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.Output","title":"Output","text":"

Output class for PandasJsonBufferWriter

"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.Output.pandas_df","title":"pandas_df class-attribute instance-attribute","text":"
pandas_df: Optional[DataFrame] = Field(None, description='The Pandas DataFrame that was written')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.execute","title":"execute","text":"
execute()\n

Write the DataFrame to the buffer using Pandas to_json() method.

Source code in src/koheesio/spark/writers/buffer.py
def execute(self):\n    \"\"\"Write the DataFrame to the buffer using Pandas to_json() method.\"\"\"\n    df = self.df\n    if self.columns:\n        df = df[self.columns]\n\n    # convert the Spark DataFrame to a Pandas DataFrame\n    self.output.pandas_df = df.toPandas()\n\n    # create json file in memory\n    file_buffer = self.output.buffer\n    self.output.pandas_df.to_json(file_buffer, **self.get_options())\n\n    # compress the buffer if compression is set\n    if self.compression:\n        self.output.compress()\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.get_options","title":"get_options","text":"
get_options()\n

Returns the options to pass to Pandas' to_json() method.

Source code in src/koheesio/spark/writers/buffer.py
def get_options(self):\n    \"\"\"Returns the options to pass to Pandas' to_json() method.\"\"\"\n    json_options = {\n        \"orient\": self.orient,\n        \"date_format\": self.date_format,\n        \"double_precision\": self.double_precision,\n        \"force_ascii\": self.force_ascii,\n        \"lines\": self.lines,\n        **self.params,\n    }\n\n    # ignore the 'lines' parameter if orient is not 'records'\n    if self.orient != \"records\":\n        del json_options[\"lines\"]\n\n    return json_options\n
"},{"location":"api_reference/spark/writers/dummy.html","title":"Dummy","text":"

Module for the DummyWriter class.

"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter","title":"koheesio.spark.writers.dummy.DummyWriter","text":"

A simple DummyWriter that performs the equivalent of a df.show() on the given DataFrame and returns the first row of data as a dict.

This Writer does not actually write anything to a source/destination, but is useful for debugging or testing purposes.
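
A minimal usage sketch, assuming a Spark DataFrame df:

result = DummyWriter(n=5, truncate=False).write(df)\nprint(result.df_content)  # the df.show()-style string\nfirst_row = result.head  # the first row of the DataFrame as a dict\n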

Parameters:

Name Type Description Default n PositiveInt

Number of rows to show.

20 truncate bool | PositiveInt

If set to True, truncates strings longer than 20 chars by default. If set to a number greater than one, truncates long strings to length truncate and aligns cells right.

True vertical bool

If set to True, print output rows vertically (one line per column value).

False"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.n","title":"n class-attribute instance-attribute","text":"
n: PositiveInt = Field(default=20, description='Number of rows to show.', gt=0)\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.truncate","title":"truncate class-attribute instance-attribute","text":"
truncate: Union[bool, PositiveInt] = Field(default=True, description='If set to ``True``, truncate strings longer than 20 chars by default.If set to a number greater than one, truncates long strings to length ``truncate`` and align cells right.')\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.vertical","title":"vertical class-attribute instance-attribute","text":"
vertical: bool = Field(default=False, description='If set to ``True``, print output rows vertically (one line per column value).')\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.Output","title":"Output","text":"

DummyWriter output

"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.Output.df_content","title":"df_content class-attribute instance-attribute","text":"
df_content: str = Field(default=..., description='The content of the DataFrame as a string')\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.Output.head","title":"head class-attribute instance-attribute","text":"
head: Dict[str, Any] = Field(default=..., description='The first row of the DataFrame as a dict')\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.execute","title":"execute","text":"
execute() -> Output\n

Execute the DummyWriter

Source code in src/koheesio/spark/writers/dummy.py
def execute(self) -> Output:\n    \"\"\"Execute the DummyWriter\"\"\"\n    df: DataFrame = self.df\n\n    # noinspection PyProtectedMember\n    df_content = df._jdf.showString(self.n, self.truncate, self.vertical)\n\n    # logs the equivalent of doing df.show()\n    self.log.info(f\"content of df that was passed to DummyWriter:\\n{df_content}\")\n\n    self.output.head = self.df.head().asDict()\n    self.output.df_content = df_content\n
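A minimal usage sketch, assuming a DataFrame df is already available in the session:

from koheesio.spark.writers.dummy import DummyWriter\n\n# show up to 5 rows without truncation; head and df_content land on the step's output\nwriter = DummyWriter(n=5, truncate=False)\nwriter.write(df)  # df is an existing DataFrame (assumption)\nprint(writer.output.head)\n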
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.int_truncate","title":"int_truncate","text":"
int_truncate(truncate_value) -> int\n

Truncate is either a bool or an int.

Parameters:

truncate_value : int | bool, optional, default=True If int, specifies the maximum length of the string. If bool and True, defaults to a maximum length of 20 characters.

Returns:

int The maximum length of the string.

Source code in src/koheesio/spark/writers/dummy.py
@field_validator(\"truncate\")\ndef int_truncate(cls, truncate_value) -> int:\n    \"\"\"\n    Truncate is either a bool or an int.\n\n    Parameters:\n    -----------\n    truncate_value : int | bool, optional, default=True\n        If int, specifies the maximum length of the string.\n        If bool and True, defaults to a maximum length of 20 characters.\n\n    Returns:\n    --------\n    int\n        The maximum length of the string.\n\n    \"\"\"\n    # Same logic as what is inside DataFrame.show()\n    if isinstance(truncate_value, bool) and truncate_value is True:\n        return 20  # default is 20 chars\n    return int(truncate_value)  # otherwise 0, or whatever the user specified\n
"},{"location":"api_reference/spark/writers/kafka.html","title":"Kafka","text":"

Kafka writer to write batch or streaming data into kafka topics

"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter","title":"koheesio.spark.writers.kafka.KafkaWriter","text":"

Kafka writer to write batch or streaming data into kafka topics

All Kafka-specific options can be provided as additional init params

Parameters:

Name Type Description Default broker str

Broker URL of the Kafka cluster

required topic str

full topic name to write the data to

required trigger Optional[Union[Trigger, str, Dict]]

Optionally indicates how to stream the data into Kafka: continuous or batch

required checkpoint_location str

In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs.

required Example
KafkaWriter(\n    broker=\"broker.com:9500\",\n    topic=\"test-topic\",\n    trigger=Trigger(continuous=\"3 seconds\"),\n    checkpoint_location=\"s3://bucket/test-topic\",\n    # Kafka-specific options with dotted names are passed as additional init params\n    **{\n        \"includeHeaders\": \"true\",\n        \"key.serializer\": \"org.apache.kafka.common.serialization.StringSerializer\",\n        \"value.serializer\": \"org.apache.kafka.common.serialization.StringSerializer\",\n        \"kafka.group.id\": \"test-group\",\n    },\n)\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.batch_writer","title":"batch_writer property","text":"
batch_writer: DataFrameWriter\n

returns a batch writer

Returns:

Type Description DataFrameWriter"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.broker","title":"broker class-attribute instance-attribute","text":"
broker: str = Field(default=..., description='Kafka brokers to write to')\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.checkpoint_location","title":"checkpoint_location class-attribute instance-attribute","text":"
checkpoint_location: str = Field(default=..., alias='checkpointLocation', description='In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs. You can configure a query with a checkpoint location, and the query will save all the progress information and the running aggregates to the checkpoint location. This checkpoint location has to be a path in an HDFS compatible file system that is accessible by Spark (e.g. Databricks Unity Catalog External Location)')\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.format","title":"format class-attribute instance-attribute","text":"
format: str = 'kafka'\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.logged_option_keys","title":"logged_option_keys property","text":"
logged_option_keys\n

keys to be logged

"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.options","title":"options property","text":"
options\n

Retrieve the Kafka options, including topic and broker.

Returns:

Type Description dict

Dict being the combination of kafka options + topic + broker

"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.stream_writer","title":"stream_writer property","text":"
stream_writer: DataStreamWriter\n

returns a stream writer

Returns:

Type Description DataStreamWriter"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.streaming_query","title":"streaming_query property","text":"
streaming_query: Optional[Union[str, StreamingQuery]]\n

return the streaming query

"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.topic","title":"topic class-attribute instance-attribute","text":"
topic: str = Field(default=..., description='Kafka topic to write to')\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.trigger","title":"trigger class-attribute instance-attribute","text":"
trigger: Optional[Union[Trigger, str, Dict]] = Field(Trigger(available_now=True), description='Set the trigger for the stream query. If not set data is processed in batch')\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.writer","title":"writer property","text":"
writer: Union[DataStreamWriter, DataFrameWriter]\n

Returns the writer of the proper type according to whether the data to be written is a stream or not. This function will also set the trigger property in case of a data stream.

Returns:

Type Description Union[DataStreamWriter, DataFrameWriter]

In case of streaming data -> DataStreamWriter, else -> DataFrameWriter

"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.Output","title":"Output","text":"

Output of the KafkaWriter

"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.Output.streaming_query","title":"streaming_query class-attribute instance-attribute","text":"
streaming_query: Optional[Union[str, StreamingQuery]] = Field(default=None, description='Query ID of the stream query')\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.execute","title":"execute","text":"
execute()\n

Effectively write the data from the dataframe (streaming or batch) to the Kafka topic.

Returns:

Type Description Output

The streaming_query output can be used to gain insights on the running write.

Source code in src/koheesio/spark/writers/kafka.py
def execute(self):\n    \"\"\"Effectively write the data from the dataframe (streaming of batch) to kafka topic.\n\n    Returns\n    -------\n    KafkaWriter.Output\n        streaming_query function can be used to gain insights on running write.\n    \"\"\"\n    applied_options = {k: v for k, v in self.options.items() if k in self.logged_option_keys}\n    self.log.debug(f\"Applying options {applied_options}\")\n\n    self._validate_dataframe()\n\n    _writer = self.writer.format(self.format).options(**self.options)\n    self.output.streaming_query = _writer.start() if self.streaming else _writer.save()\n
"},{"location":"api_reference/spark/writers/snowflake.html","title":"Snowflake","text":"

This module contains the SnowflakeWriter class, which is used to write data to Snowflake.

"},{"location":"api_reference/spark/writers/stream.html","title":"Stream","text":"

Module that holds classes and functions for writing to a stream

Classes:

Name Description Trigger

class to set the trigger for a stream query

StreamWriter

abstract class for stream writers

ForEachBatchStreamWriter

class to run a writer for each batch

Functions:

Name Description writer_to_foreachbatch

function to be used as batch_function for StreamWriter (sub)classes

"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.ForEachBatchStreamWriter","title":"koheesio.spark.writers.stream.ForEachBatchStreamWriter","text":"

Runnable ForEachBatchWriter

"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.ForEachBatchStreamWriter.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/writers/stream.py
def execute(self):\n    self.streaming_query = self.writer.start()\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter","title":"koheesio.spark.writers.stream.StreamWriter","text":"

ABC Stream Writer

"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.batch_function","title":"batch_function class-attribute instance-attribute","text":"
batch_function: Optional[Callable] = Field(default=None, description='allows you to run custom batch functions for each micro batch', alias='batch_function_for_each_df')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.checkpoint_location","title":"checkpoint_location class-attribute instance-attribute","text":"
checkpoint_location: str = Field(default=..., alias='checkpointLocation', description='In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs. You can configure a query with a checkpoint location, and the query will save all the progress information and the running aggregates to the checkpoint location. This checkpoint location has to be a path in an HDFS compatible file system that is accessible by Spark (e.g. Databricks Unity Catalog External Location)')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.output_mode","title":"output_mode class-attribute instance-attribute","text":"
output_mode: StreamingOutputMode = Field(default=APPEND, alias='outputMode', description=__doc__)\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.stream_writer","title":"stream_writer property","text":"
stream_writer: DataStreamWriter\n

Returns the stream writer for the given DataFrame and settings

"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.streaming_query","title":"streaming_query class-attribute instance-attribute","text":"
streaming_query: Optional[Union[str, StreamingQuery]] = Field(default=None, description='Query ID of the stream query')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.trigger","title":"trigger class-attribute instance-attribute","text":"
trigger: Optional[Union[Trigger, str, Dict]] = Field(default=Trigger(available_now=True), description='Set the trigger for the stream query. If this is not set it process data as batch')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.writer","title":"writer property","text":"
writer\n

Returns the stream writer since we don't have a batch mode for streams

"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.await_termination","title":"await_termination","text":"
await_termination(timeout: Optional[int] = None)\n

Await termination of the stream query

Source code in src/koheesio/spark/writers/stream.py
def await_termination(self, timeout: Optional[int] = None):\n    \"\"\"Await termination of the stream query\"\"\"\n    self.streaming_query.awaitTermination(timeout=timeout)\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.execute","title":"execute abstractmethod","text":"
execute()\n
Source code in src/koheesio/spark/writers/stream.py
@abstractmethod\ndef execute(self):\n    raise NotImplementedError\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger","title":"koheesio.spark.writers.stream.Trigger","text":"

Trigger types for a stream query.

Only one trigger can be set!

Example
  • processingTime='5 seconds'
  • continuous='5 seconds'
  • availableNow=True
  • once=True
See Also
  • https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers
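A minimal construction sketch; exactly one trigger type is set, and the value property exposes it as a dictionary:

from koheesio.spark.writers.stream import Trigger\n\ntrigger = Trigger(processingTime=\"5 seconds\")\nprint(trigger.value)  # e.g. {'processingTime': '5 seconds'}\n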
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.available_now","title":"available_now class-attribute instance-attribute","text":"
available_now: Optional[bool] = Field(default=None, alias='availableNow', description='if set to True, set a trigger that processes all available data in multiple batches then terminates the query.')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.continuous","title":"continuous class-attribute instance-attribute","text":"
continuous: Optional[str] = Field(default=None, description=\"a time interval as a string, e.g. '5 seconds', '1 minute'.Set a trigger that runs a continuous query with a given checkpoint interval.\")\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.model_config","title":"model_config class-attribute instance-attribute","text":"
model_config = ConfigDict(validate_default=False, extra='forbid')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.once","title":"once class-attribute instance-attribute","text":"
once: Optional[bool] = Field(default=None, deprecated=True, description='if set to True, set a trigger that processes only one batch of data in a streaming query then terminates the query. use `available_now` instead of `once`.')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.processing_time","title":"processing_time class-attribute instance-attribute","text":"
processing_time: Optional[str] = Field(default=None, alias='processingTime', description=\"a processing time interval as a string, e.g. '5 seconds', '1 minute'.Set a trigger that runs a microbatch query periodically based on the processing time.\")\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.triggers","title":"triggers property","text":"
triggers\n

Returns a list of tuples with the value for each trigger

"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.value","title":"value property","text":"
value: Dict[str, str]\n

Returns the trigger value as a dictionary

"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.execute","title":"execute","text":"
execute()\n

Returns the trigger value as a dictionary. This method can be skipped, as the value can be accessed directly from the value property.

Source code in src/koheesio/spark/writers/stream.py
def execute(self):\n    \"\"\"Returns the trigger value as a dictionary\n    This method can be skipped, as the value can be accessed directly from the `value` property\n    \"\"\"\n    self.log.warning(\"Trigger.execute is deprecated. Use Trigger.value directly instead\")\n    self.output.value = self.value\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.from_any","title":"from_any classmethod","text":"
from_any(value)\n

Dynamically creates a Trigger class based on either another Trigger class, a passed string value, or a dictionary

This way Trigger.from_any can be used as part of a validator, without needing to worry about supported types

Source code in src/koheesio/spark/writers/stream.py
@classmethod\ndef from_any(cls, value):\n    \"\"\"Dynamically creates a Trigger class based on either another Trigger class, a passed string value, or a\n    dictionary\n\n    This way Trigger.from_any can be used as part of a validator, without needing to worry about supported types\n    \"\"\"\n    if isinstance(value, Trigger):\n        return value\n\n    if isinstance(value, str):\n        return cls.from_string(value)\n\n    if isinstance(value, dict):\n        return cls.from_dict(value)\n\n    raise RuntimeError(f\"Unable to create Trigger based on the given value: {value}\")\n
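A small sketch of the three input shapes accepted by from_any, per the source above:

from koheesio.spark.writers.stream import Trigger\n\n# all three forms resolve to an equivalent Trigger\nTrigger.from_any(Trigger(availableNow=True))\nTrigger.from_any(\"availableNow=True\")\nTrigger.from_any({\"availableNow\": True})\n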
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.from_dict","title":"from_dict classmethod","text":"
from_dict(_dict)\n

Creates a Trigger class based on a dictionary

Source code in src/koheesio/spark/writers/stream.py
@classmethod\ndef from_dict(cls, _dict):\n    \"\"\"Creates a Trigger class based on a dictionary\"\"\"\n    return cls(**_dict)\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.from_string","title":"from_string classmethod","text":"
from_string(trigger: str)\n

Creates a Trigger class based on a string

Example Source code in src/koheesio/spark/writers/stream.py
@classmethod\ndef from_string(cls, trigger: str):\n    \"\"\"Creates a Trigger class based on a string\n\n    Example\n    -------\n    ### happy flow\n\n    * processingTime='5 seconds'\n    * processing_time=\"5 hours\"\n    * processingTime=4 minutes\n    * once=True\n    * once=true\n    * available_now=true\n    * continuous='3 hours'\n    * once=TrUe\n    * once=TRUE\n\n    ### unhappy flow\n    valid values, but should fail the validation check of the class\n\n    * availableNow=False\n    * continuous=True\n    * once=false\n    \"\"\"\n    import re\n\n    trigger_from_string = re.compile(r\"(?P<triggerType>\\w+)=[\\'\\\"]?(?P<value>.+)[\\'\\\"]?\")\n    _match = trigger_from_string.match(trigger)\n\n    if _match is None:\n        raise ValueError(\n            f\"Cannot parse value for Trigger: '{trigger}'. \\n\"\n            f\"Valid types are {', '.join(cls._all_triggers_with_alias())}\"\n        )\n\n    trigger_type, value = _match.groups()\n\n    # strip the value of any quotes\n    value = value.strip(\"'\").strip('\"')\n\n    # making value a boolean when given\n    value = convert_str_to_bool(value)\n\n    return cls.from_dict({trigger_type: value})\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.from_string--happy-flow","title":"happy flow","text":"
  • processingTime='5 seconds'
  • processing_time=\"5 hours\"
  • processingTime=4 minutes
  • once=True
  • once=true
  • available_now=true
  • continuous='3 hours'
  • once=TrUe
  • once=TRUE
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.from_string--unhappy-flow","title":"unhappy flow","text":"

Valid values, but they should fail the validation check of the class

  • availableNow=False
  • continuous=True
  • once=false
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.validate_available_now","title":"validate_available_now","text":"
validate_available_now(available_now)\n

Validate the available_now trigger value

Source code in src/koheesio/spark/writers/stream.py
@field_validator(\"available_now\", mode=\"before\")\ndef validate_available_now(cls, available_now):\n    \"\"\"Validate the available_now trigger value\"\"\"\n    # making value a boolean when given\n    available_now = convert_str_to_bool(available_now)\n\n    # adapted from `pyspark.sql.streaming.readwriter.DataStreamWriter.trigger`\n    if available_now is not True:\n        raise ValueError(f\"Value for availableNow must be True. Got:{available_now}\")\n    return available_now\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.validate_continuous","title":"validate_continuous","text":"
validate_continuous(continuous)\n

Validate the continuous trigger value

Source code in src/koheesio/spark/writers/stream.py
@field_validator(\"continuous\", mode=\"before\")\ndef validate_continuous(cls, continuous):\n    \"\"\"Validate the continuous trigger value\"\"\"\n    # adapted from `pyspark.sql.streaming.readwriter.DataStreamWriter.trigger` except that the if statement is not\n    # split in two parts\n    if not isinstance(continuous, str):\n        raise ValueError(f\"Value for continuous must be a string. Got: {continuous}\")\n\n    if len(continuous.strip()) == 0:\n        raise ValueError(f\"Value for continuous must be a non empty string. Got: {continuous}\")\n    return continuous\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.validate_once","title":"validate_once","text":"
validate_once(once)\n

Validate the once trigger value

Source code in src/koheesio/spark/writers/stream.py
@field_validator(\"once\", mode=\"before\")\ndef validate_once(cls, once):\n    \"\"\"Validate the once trigger value\"\"\"\n    # making value a boolean when given\n    once = convert_str_to_bool(once)\n\n    # adapted from `pyspark.sql.streaming.readwriter.DataStreamWriter.trigger`\n    if once is not True:\n        raise ValueError(f\"Value for once must be True. Got: {once}\")\n    return once\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.validate_processing_time","title":"validate_processing_time","text":"
validate_processing_time(processing_time)\n

Validate the processing time trigger value

Source code in src/koheesio/spark/writers/stream.py
@field_validator(\"processing_time\", mode=\"before\")\ndef validate_processing_time(cls, processing_time):\n    \"\"\"Validate the processing time trigger value\"\"\"\n    # adapted from `pyspark.sql.streaming.readwriter.DataStreamWriter.trigger`\n    if not isinstance(processing_time, str):\n        raise ValueError(f\"Value for processing_time must be a string. Got: {processing_time}\")\n\n    if len(processing_time.strip()) == 0:\n        raise ValueError(f\"Value for processingTime must be a non empty string. Got: {processing_time}\")\n    return processing_time\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.validate_triggers","title":"validate_triggers","text":"
validate_triggers(triggers: Dict)\n

Validate the trigger value

Source code in src/koheesio/spark/writers/stream.py
@model_validator(mode=\"before\")\ndef validate_triggers(cls, triggers: Dict):\n    \"\"\"Validate the trigger value\"\"\"\n    params = [*triggers.values()]\n\n    # adapted from `pyspark.sql.streaming.readwriter.DataStreamWriter.trigger`; modified to work with pydantic v2\n    if not triggers:\n        raise ValueError(\"No trigger provided\")\n    if len(params) > 1:\n        raise ValueError(\"Multiple triggers not allowed.\")\n\n    return triggers\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.writer_to_foreachbatch","title":"koheesio.spark.writers.stream.writer_to_foreachbatch","text":"
writer_to_foreachbatch(writer: Writer)\n

Call writer.execute on each batch

To be passed as batch_function for StreamWriter (sub)classes.

Example Source code in src/koheesio/spark/writers/stream.py
def writer_to_foreachbatch(writer: Writer):\n    \"\"\"Call `writer.execute` on each batch\n\n    To be passed as batch_function for StreamWriter (sub)classes.\n\n    Example\n    -------\n    ### Writing to a Delta table and a Snowflake table\n    ```python\n    DeltaTableStreamWriter(\n        table=\"my_table\",\n        checkpointLocation=\"my_checkpointlocation\",\n        batch_function=writer_to_foreachbatch(\n            SnowflakeWriter(\n                **sfOptions,\n                table=\"snowflake_table\",\n                insert_type=SnowflakeWriter.InsertType.APPEND,\n            )\n        ),\n    )\n    ```\n    \"\"\"\n\n    def inner(df, batch_id: int):\n        \"\"\"Inner method\n\n        As per the Spark documentation:\n        In every micro-batch, the provided function will be called in every micro-batch with (i) the output rows as a\n        DataFrame and (ii) the batch identifier. The batchId can be used deduplicate and transactionally write the\n        output (that is, the provided Dataset) to external systems. The output DataFrame is guaranteed to exactly\n        same for the same batchId (assuming all operations are deterministic in the query).\n        \"\"\"\n        writer.log.debug(f\"Running batch function for batch {batch_id}\")\n        writer.write(df)\n\n    return inner\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.writer_to_foreachbatch--writing-to-a-delta-table-and-a-snowflake-table","title":"Writing to a Delta table and a Snowflake table","text":"
DeltaTableStreamWriter(\n    table=\"my_table\",\n    checkpointLocation=\"my_checkpointlocation\",\n    batch_function=writer_to_foreachbatch(\n        SnowflakeWriter(\n            **sfOptions,\n            table=\"snowflake_table\",\n            insert_type=SnowflakeWriter.InsertType.APPEND,\n        )\n    ),\n)\n
"},{"location":"api_reference/spark/writers/delta/index.html","title":"Delta","text":"

This module is the entry point for the koheesio.spark.writers.delta package.

It imports and exposes the DeltaTableWriter and DeltaTableStreamWriter classes for external use.

Classes: DeltaTableWriter: Class to write data in batch mode to a Delta table. DeltaTableStreamWriter: Class to write data in streaming mode to a Delta table.

"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode","title":"koheesio.spark.writers.delta.BatchOutputMode","text":"

For Batch:

  • append: Append the contents of the DataFrame to the output table, default option in Koheesio.
  • overwrite: overwrite the existing data.
  • ignore: ignore the operation (i.e. no-op).
  • error or errorifexists: throw an exception at runtime.
  • merge: update matching data in the table and insert rows that do not exist.
  • merge_all: update matching data in the table and insert rows that do not exist.
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.APPEND","title":"APPEND class-attribute instance-attribute","text":"
APPEND = 'append'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.ERROR","title":"ERROR class-attribute instance-attribute","text":"
ERROR = 'error'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.ERRORIFEXISTS","title":"ERRORIFEXISTS class-attribute instance-attribute","text":"
ERRORIFEXISTS = 'error'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.IGNORE","title":"IGNORE class-attribute instance-attribute","text":"
IGNORE = 'ignore'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.MERGE","title":"MERGE class-attribute instance-attribute","text":"
MERGE = 'merge'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.MERGEALL","title":"MERGEALL class-attribute instance-attribute","text":"
MERGEALL = 'merge_all'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.MERGE_ALL","title":"MERGE_ALL class-attribute instance-attribute","text":"
MERGE_ALL = 'merge_all'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.OVERWRITE","title":"OVERWRITE class-attribute instance-attribute","text":"
OVERWRITE = 'overwrite'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter","title":"koheesio.spark.writers.delta.DeltaTableStreamWriter","text":"

Delta table stream writer

"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter.Options","title":"Options","text":"

Options for DeltaTableStreamWriter

"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter.Options.allow_population_by_field_name","title":"allow_population_by_field_name class-attribute instance-attribute","text":"
allow_population_by_field_name: bool = Field(default=True, description=' To do convert to Field and pass as .options(**config)')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter.Options.maxBytesPerTrigger","title":"maxBytesPerTrigger class-attribute instance-attribute","text":"
maxBytesPerTrigger: Optional[str] = Field(default=None, description='How much data to be processed per trigger. The default is 1GB')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter.Options.maxFilesPerTrigger","title":"maxFilesPerTrigger class-attribute instance-attribute","text":"
maxFilesPerTrigger: int = Field(default=1000, description='The maximum number of new files to be considered in every trigger (default: 1000).')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/writers/delta/stream.py
def execute(self):\n    if self.batch_function:\n        self.streaming_query = self.writer.start()\n    else:\n        self.streaming_query = self.writer.toTable(tableName=self.table.table_name)\n
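A minimal usage sketch, assuming a streaming DataFrame streaming_df; without a batch_function the query is written with toTable:

from koheesio.spark.writers.delta import DeltaTableStreamWriter\n\nDeltaTableStreamWriter(\n    table=\"my_table\",\n    checkpointLocation=\"/tmp/checkpoints/my_table\",  # hypothetical location\n).write(streaming_df)\n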
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter","title":"koheesio.spark.writers.delta.DeltaTableWriter","text":"

Delta table Writer for both batch and streaming dataframes.

Example

Parameters:

Name Type Description Default table Union[DeltaTableStep, str]

The table to write to

required output_mode Optional[Union[str, BatchOutputMode, StreamingOutputMode]]

The output mode to use. Default is BatchOutputMode.APPEND. For streaming, use StreamingOutputMode.

required params Optional[dict]

Additional parameters to use for specific mode

required"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter--example-for-mergeall","title":"Example for MERGEALL","text":"
DeltaTableWriter(\n    table=\"test_table\",\n    output_mode=BatchOutputMode.MERGEALL,\n    output_mode_params={\n        \"merge_cond\": \"target.id=source.id\",\n        \"update_cond\": \"target.col1_val>=source.col1_val\",\n        \"insert_cond\": \"source.col_bk IS NOT NULL\",\n        \"target_alias\": \"target\",  # <------ DEFAULT, can be changed by providing custom value\n        \"source_alias\": \"source\",  # <------ DEFAULT, can be changed by providing custom value\n    },\n)\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter--example-for-merge","title":"Example for MERGE","text":"
DeltaTableWriter(\n    table=\"test_table\",\n    output_mode=BatchOutputMode.MERGE,\n    output_mode_params={\n        'merge_builder': (\n            DeltaTable\n            .forName(sparkSession=spark, tableOrViewName=<target_table_name>)\n            .alias(target_alias)\n            .merge(source=df, condition=merge_cond)\n            .whenMatchedUpdateAll(condition=update_cond)\n            .whenNotMatchedInsertAll(condition=insert_cond)\n            )\n        }\n    )\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter--example-for-merge_1","title":"Example for MERGE","text":"

In case the table isn't created yet, the first run will execute an APPEND operation

DeltaTableWriter(\n    table=\"test_table\",\n    output_mode=BatchOutputMode.MERGE,\n    output_mode_params={\n        \"merge_builder\": [\n            {\n                \"clause\": \"whenMatchedUpdate\",\n                \"set\": {\"value\": \"source.value\"},\n                \"condition\": \"<update_condition>\",\n            },\n            {\n                \"clause\": \"whenNotMatchedInsert\",\n                \"values\": {\"id\": \"source.id\", \"value\": \"source.value\"},\n                \"condition\": \"<insert_condition>\",\n            },\n        ],\n        \"merge_cond\": \"<merge_condition>\",\n    },\n)\n

"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter--example-for-append","title":"Example for APPEND","text":"

DataFrame writer options can be passed as keyword arguments

DeltaTableWriter(\n    table=\"test_table\",\n    output_mode=BatchOutputMode.APPEND,\n    partitionOverwriteMode=\"dynamic\",\n    mergeSchema=\"false\",\n)\n

"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.format","title":"format class-attribute instance-attribute","text":"
format: str = 'delta'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.output_mode","title":"output_mode class-attribute instance-attribute","text":"
output_mode: Optional[Union[BatchOutputMode, StreamingOutputMode]] = Field(default=APPEND, alias='outputMode', description=f'{__doc__}\n{__doc__}')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.params","title":"params class-attribute instance-attribute","text":"
params: Optional[dict] = Field(default_factory=dict, alias='output_mode_params', description='Additional parameters to use for specific mode')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.partition_by","title":"partition_by class-attribute instance-attribute","text":"
partition_by: Optional[List[str]] = Field(default=None, alias='partitionBy', description='The list of fields to partition the Delta table on')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.table","title":"table class-attribute instance-attribute","text":"
table: Union[DeltaTableStep, str] = Field(default=..., description='The table to write to')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.writer","title":"writer property","text":"
writer: Union[DeltaMergeBuilder, DataFrameWriter]\n

Specify DeltaTableWriter

"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/writers/delta/batch.py
def execute(self):\n    _writer = self.writer\n\n    if self.table.create_if_not_exists and not self.table.exists:\n        _writer = _writer.options(**self.table.default_create_properties)\n\n    if isinstance(_writer, DeltaMergeBuilder):\n        _writer.execute()\n    else:\n        if options := self.params:\n            # should we add options only if mode is not merge?\n            _writer = _writer.options(**options)\n        _writer.saveAsTable(self.table.table_name)\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.get_output_mode","title":"get_output_mode classmethod","text":"
get_output_mode(choice: str, options: Set[Type]) -> Union[BatchOutputMode, StreamingOutputMode]\n

Retrieve an OutputMode by validating choice against a set of option OutputModes.

Currently supported output modes can be found in:

  • BatchOutputMode
  • StreamingOutputMode
Source code in src/koheesio/spark/writers/delta/batch.py
@classmethod\ndef get_output_mode(cls, choice: str, options: Set[Type]) -> Union[BatchOutputMode, StreamingOutputMode]:\n    \"\"\"Retrieve an OutputMode by validating `choice` against a set of option OutputModes.\n\n    Currently supported output modes can be found in:\n\n    - BatchOutputMode\n    - StreamingOutputMode\n    \"\"\"\n    for enum_type in options:\n        if choice.upper() in [om.value.upper() for om in enum_type]:\n            return getattr(enum_type, choice.upper())\n    raise AttributeError(\n        f\"\"\"\n        Invalid outputMode specified '{choice}'. Allowed values are:\n        Batch Mode - {BatchOutputMode.__doc__}\n        Streaming Mode - {StreamingOutputMode.__doc__}\n        \"\"\"\n    )\n
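A small sketch of resolving an output mode from a string, per the source above:

from koheesio.spark.writers.delta import BatchOutputMode, DeltaTableWriter\n\nmode = DeltaTableWriter.get_output_mode(\"append\", {BatchOutputMode})\nprint(mode)  # BatchOutputMode.APPEND\n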
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter","title":"koheesio.spark.writers.delta.SCD2DeltaTableWriter","text":"

A class used to write Slowly Changing Dimension (SCD) Type 2 data to a Delta table.

Attributes:

Name Type Description table InstanceOf[DeltaTableStep]

The table to merge to.

merge_key str

The key used for merging data.

include_columns List[str]

Columns to be merged. Will be selected from DataFrame. Default is all columns.

exclude_columns List[str]

Columns to be excluded from DataFrame.

scd2_columns List[str]

List of attributes for SCD2 type (track changes).

scd2_timestamp_col Optional[Column]

Timestamp column for SCD2 type (track changes). Defaults to current_timestamp.

scd1_columns List[str]

List of attributes for SCD1 type (just update).

meta_scd2_struct_col_name str

SCD2 struct name.

meta_scd2_effective_time_col_name str

Effective col name.

meta_scd2_is_current_col_name str

Current col name.

meta_scd2_end_time_col_name str

End time col name.

target_auto_generated_columns List[str]

Auto generated columns from target Delta table. Will be used to exclude from merge logic.

"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.exclude_columns","title":"exclude_columns class-attribute instance-attribute","text":"
exclude_columns: List[str] = Field(default_factory=list, description='Columns to be excluded from DataFrame')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.include_columns","title":"include_columns class-attribute instance-attribute","text":"
include_columns: List[str] = Field(default_factory=list, description='Columns to be merged. Will be selected from DataFrame.Default is all columns')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.merge_key","title":"merge_key class-attribute instance-attribute","text":"
merge_key: str = Field(..., description='Merge key')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.meta_scd2_effective_time_col_name","title":"meta_scd2_effective_time_col_name class-attribute instance-attribute","text":"
meta_scd2_effective_time_col_name: str = Field(default='effective_time', description='Effective col name')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.meta_scd2_end_time_col_name","title":"meta_scd2_end_time_col_name class-attribute instance-attribute","text":"
meta_scd2_end_time_col_name: str = Field(default='end_time', description='End time col name')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.meta_scd2_is_current_col_name","title":"meta_scd2_is_current_col_name class-attribute instance-attribute","text":"
meta_scd2_is_current_col_name: str = Field(default='is_current', description='Current col name')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.meta_scd2_struct_col_name","title":"meta_scd2_struct_col_name class-attribute instance-attribute","text":"
meta_scd2_struct_col_name: str = Field(default='_scd2', description='SCD2 struct name')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.scd1_columns","title":"scd1_columns class-attribute instance-attribute","text":"
scd1_columns: List[str] = Field(default_factory=list, description='List of attributes for scd1 type (just update)')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.scd2_columns","title":"scd2_columns class-attribute instance-attribute","text":"
scd2_columns: List[str] = Field(default_factory=list, description='List of attributes for scd2 type (track changes)')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.scd2_timestamp_col","title":"scd2_timestamp_col class-attribute instance-attribute","text":"
scd2_timestamp_col: Optional[Column] = Field(default=None, description='Timestamp column for SCD2 type (track changes). Default to current_timestamp')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.table","title":"table class-attribute instance-attribute","text":"
table: InstanceOf[DeltaTableStep] = Field(..., description='The table to merge to')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.target_auto_generated_columns","title":"target_auto_generated_columns class-attribute instance-attribute","text":"
target_auto_generated_columns: List[str] = Field(default_factory=list, description='Auto generated columns from target Delta table. Will be used to exclude from merge logic')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.execute","title":"execute","text":"
execute() -> None\n

Execute the SCD Type 2 operation.

This method executes the SCD Type 2 operation on the DataFrame. It validates the existing Delta table, prepares the merge conditions, stages the data, and then performs the merge operation.

Raises:

Type Description TypeError

If the scd2_timestamp_col is not of date or timestamp type. If the source DataFrame is missing any of the required merge columns.

Source code in src/koheesio/spark/writers/delta/scd.py
def execute(self) -> None:\n    \"\"\"\n    Execute the SCD Type 2 operation.\n\n    This method executes the SCD Type 2 operation on the DataFrame.\n    It validates the existing Delta table, prepares the merge conditions, stages the data,\n    and then performs the merge operation.\n\n    Raises\n    ------\n    TypeError\n        If the scd2_timestamp_col is not of date or timestamp type.\n        If the source DataFrame is missing any of the required merge columns.\n\n    \"\"\"\n    self.df: DataFrame\n    self.spark: SparkSession\n    delta_table = DeltaTable.forName(sparkSession=self.spark, tableOrViewName=self.table.table_name)\n    src_alias, cross_alias, dest_alias = \"src\", \"cross\", \"tgt\"\n\n    # Prepare required merge columns\n    required_merge_columns = [self.merge_key]\n\n    if self.scd2_columns:\n        required_merge_columns += self.scd2_columns\n\n    if self.scd1_columns:\n        required_merge_columns += self.scd1_columns\n\n    if not all(c in self.df.columns for c in required_merge_columns):\n        missing_columns = [c for c in required_merge_columns if c not in self.df.columns]\n        raise TypeError(f\"The source DataFrame is missing the columns: {missing_columns!r}\")\n\n    # Check that required columns are present in the source DataFrame\n    if self.scd2_timestamp_col is not None:\n        timestamp_col_type = self.df.select(self.scd2_timestamp_col).schema.fields[0].dataType\n\n        if not isinstance(timestamp_col_type, (DateType, TimestampType)):\n            raise TypeError(\n                f\"The scd2_timestamp_col '{self.scd2_timestamp_col}' must be of date \"\n                f\"or timestamp type.Current type is {timestamp_col_type}\"\n            )\n\n    # Prepare columns to process\n    include_columns = self.include_columns if self.include_columns else self.df.columns\n    exclude_columns = self.exclude_columns\n    columns_to_process = [c for c in include_columns if c not in exclude_columns]\n\n    # Constructing column names for SCD2 attributes\n    meta_scd2_is_current_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_is_current_col_name}\"\n    meta_scd2_effective_time_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_effective_time_col_name}\"\n    meta_scd2_end_time_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_end_time_col_name}\"\n\n    # Constructing system merge action logic\n    system_merge_action = f\"CASE WHEN tgt.{self.merge_key} is NULL THEN 'I' \"\n\n    if updates_attrs_scd2 := self._prepare_attr_clause(\n        attrs=self.scd2_columns, src_alias=src_alias, dest_alias=dest_alias\n    ):\n        system_merge_action += f\" WHEN {updates_attrs_scd2} THEN 'UC' \"\n\n    if updates_attrs_scd1 := self._prepare_attr_clause(\n        attrs=self.scd1_columns, src_alias=src_alias, dest_alias=dest_alias\n    ):\n        system_merge_action += f\" WHEN {updates_attrs_scd1} THEN 'U' \"\n\n    system_merge_action += \" ELSE NULL END\"\n\n    # Prepare the staged DataFrame\n    staged = (\n        self.df.withColumn(\n            \"__meta_scd2_timestamp\",\n            self._scd2_timestamp(scd2_timestamp_col=self.scd2_timestamp_col, spark=self.spark),\n        )\n        .transform(\n            func=self._prepare_staging,\n            delta_table=delta_table,\n            merge_action_logic=F.expr(system_merge_action),\n            meta_scd2_is_current_col=meta_scd2_is_current_col,\n            columns_to_process=columns_to_process,\n            src_alias=src_alias,\n            dest_alias=dest_alias,\n 
           cross_alias=cross_alias,\n        )\n        .transform(\n            func=self._preserve_existing_target_values,\n            meta_scd2_struct_col_name=self.meta_scd2_struct_col_name,\n            target_auto_generated_columns=self.target_auto_generated_columns,\n            src_alias=src_alias,\n            cross_alias=cross_alias,\n            dest_alias=dest_alias,\n            logger=self.log,\n        )\n        .withColumn(\"__meta_scd2_end_time\", self._scd2_end_time(meta_scd2_end_time_col=meta_scd2_end_time_col))\n        .withColumn(\"__meta_scd2_is_current\", self._scd2_is_current())\n        .withColumn(\n            \"__meta_scd2_effective_time\",\n            self._scd2_effective_time(meta_scd2_effective_time_col=meta_scd2_effective_time_col),\n        )\n        .transform(\n            func=self._add_scd2_columns,\n            meta_scd2_struct_col_name=self.meta_scd2_struct_col_name,\n            meta_scd2_effective_time_col_name=self.meta_scd2_effective_time_col_name,\n            meta_scd2_end_time_col_name=self.meta_scd2_end_time_col_name,\n            meta_scd2_is_current_col_name=self.meta_scd2_is_current_col_name,\n        )\n    )\n\n    self._prepare_merge_builder(\n        delta_table=delta_table,\n        dest_alias=dest_alias,\n        staged=staged,\n        merge_key=self.merge_key,\n        columns_to_process=columns_to_process,\n        meta_scd2_effective_time_col=meta_scd2_effective_time_col,\n    ).execute()\n
"},{"location":"api_reference/spark/writers/delta/batch.html","title":"Batch","text":"

This module defines the DeltaTableWriter class, which is used to write both batch and streaming dataframes to Delta tables.

DeltaTableWriter supports two output modes: MERGEALL and MERGE.

  • The MERGEALL mode merges all incoming data with existing data in the table based on certain conditions.
  • The MERGE mode allows for more custom merging behavior using the DeltaMergeBuilder class from the delta.tables library.

The output_mode_params dictionary is used to specify conditions for merging, updating, and inserting data. The target_alias and source_alias keys are used to specify the aliases for the target and source dataframes in the merge conditions.

Classes:

Name Description DeltaTableWriter

A class for writing data to Delta tables.

DeltaTableStreamWriter

A class for writing streaming data to Delta tables.

Example
DeltaTableWriter(\n    table=\"test_table\",\n    output_mode=BatchOutputMode.MERGEALL,\n    output_mode_params={\n        \"merge_cond\": \"target.id=source.id\",\n        \"update_cond\": \"target.col1_val>=source.col1_val\",\n        \"insert_cond\": \"source.col_bk IS NOT NULL\",\n    },\n)\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter","title":"koheesio.spark.writers.delta.batch.DeltaTableWriter","text":"

Delta table Writer for both batch and streaming dataframes.

Example

Parameters:

Name Type Description Default table Union[DeltaTableStep, str]

The table to write to

required output_mode Optional[Union[str, BatchOutputMode, StreamingOutputMode]]

The output mode to use. Default is BatchOutputMode.APPEND. For streaming, use StreamingOutputMode.

required params Optional[dict]

Additional parameters to use for specific mode

required"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter--example-for-mergeall","title":"Example for MERGEALL","text":"
DeltaTableWriter(\n    table=\"test_table\",\n    output_mode=BatchOutputMode.MERGEALL,\n    output_mode_params={\n        \"merge_cond\": \"target.id=source.id\",\n        \"update_cond\": \"target.col1_val>=source.col1_val\",\n        \"insert_cond\": \"source.col_bk IS NOT NULL\",\n        \"target_alias\": \"target\",  # <------ DEFAULT, can be changed by providing custom value\n        \"source_alias\": \"source\",  # <------ DEFAULT, can be changed by providing custom value\n    },\n)\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter--example-for-merge","title":"Example for MERGE","text":"
DeltaTableWriter(\n    table=\"test_table\",\n    output_mode=BatchOutputMode.MERGE,\n    output_mode_params={\n        'merge_builder': (\n            DeltaTable\n            .forName(sparkSession=spark, tableOrViewName=<target_table_name>)\n            .alias(target_alias)\n            .merge(source=df, condition=merge_cond)\n            .whenMatchedUpdateAll(condition=update_cond)\n            .whenNotMatchedInsertAll(condition=insert_cond)\n            )\n        }\n    )\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter--example-for-merge_1","title":"Example for MERGE","text":"

In case the table isn't created yet, the first run will execute an APPEND operation

DeltaTableWriter(\n    table=\"test_table\",\n    output_mode=BatchOutputMode.MERGE,\n    output_mode_params={\n        \"merge_builder\": [\n            {\n                \"clause\": \"whenMatchedUpdate\",\n                \"set\": {\"value\": \"source.value\"},\n                \"condition\": \"<update_condition>\",\n            },\n            {\n                \"clause\": \"whenNotMatchedInsert\",\n                \"values\": {\"id\": \"source.id\", \"value\": \"source.value\"},\n                \"condition\": \"<insert_condition>\",\n            },\n        ],\n        \"merge_cond\": \"<merge_condition>\",\n    },\n)\n

"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter--example-for-append","title":"Example for APPEND","text":"

DataFrame writer options can be passed as keyword arguments

DeltaTableWriter(\n    table=\"test_table\",\n    output_mode=BatchOutputMode.APPEND,\n    partitionOverwriteMode=\"dynamic\",\n    mergeSchema=\"false\",\n)\n

"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.format","title":"format class-attribute instance-attribute","text":"
format: str = 'delta'\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.output_mode","title":"output_mode class-attribute instance-attribute","text":"
output_mode: Optional[Union[BatchOutputMode, StreamingOutputMode]] = Field(default=APPEND, alias='outputMode', description=f'{__doc__}\n{__doc__}')\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.params","title":"params class-attribute instance-attribute","text":"
params: Optional[dict] = Field(default_factory=dict, alias='output_mode_params', description='Additional parameters to use for specific mode')\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.partition_by","title":"partition_by class-attribute instance-attribute","text":"
partition_by: Optional[List[str]] = Field(default=None, alias='partitionBy', description='The list of fields to partition the Delta table on')\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.table","title":"table class-attribute instance-attribute","text":"
table: Union[DeltaTableStep, str] = Field(default=..., description='The table to write to')\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.writer","title":"writer property","text":"
writer: Union[DeltaMergeBuilder, DataFrameWriter]\n

Specify DeltaTableWriter

"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/writers/delta/batch.py
def execute(self):\n    _writer = self.writer\n\n    if self.table.create_if_not_exists and not self.table.exists:\n        _writer = _writer.options(**self.table.default_create_properties)\n\n    if isinstance(_writer, DeltaMergeBuilder):\n        _writer.execute()\n    else:\n        if options := self.params:\n            # should we add options only if mode is not merge?\n            _writer = _writer.options(**options)\n        _writer.saveAsTable(self.table.table_name)\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.get_output_mode","title":"get_output_mode classmethod","text":"
get_output_mode(choice: str, options: Set[Type]) -> Union[BatchOutputMode, StreamingOutputMode]\n

Retrieve an OutputMode by validating choice against a set of option OutputModes.

Currently supported output modes can be found in:

  • BatchOutputMode
  • StreamingOutputMode
Source code in src/koheesio/spark/writers/delta/batch.py
@classmethod\ndef get_output_mode(cls, choice: str, options: Set[Type]) -> Union[BatchOutputMode, StreamingOutputMode]:\n    \"\"\"Retrieve an OutputMode by validating `choice` against a set of option OutputModes.\n\n    Currently supported output modes can be found in:\n\n    - BatchOutputMode\n    - StreamingOutputMode\n    \"\"\"\n    for enum_type in options:\n        if choice.upper() in [om.value.upper() for om in enum_type]:\n            return getattr(enum_type, choice.upper())\n    raise AttributeError(\n        f\"\"\"\n        Invalid outputMode specified '{choice}'. Allowed values are:\n        Batch Mode - {BatchOutputMode.__doc__}\n        Streaming Mode - {StreamingOutputMode.__doc__}\n        \"\"\"\n    )\n
"},{"location":"api_reference/spark/writers/delta/scd.html","title":"Scd","text":"

This module defines writers to write Slowly Changing Dimension (SCD) Type 2 data to a Delta table.

Slowly Changing Dimension (SCD) is a technique used in data warehousing to handle changes to dimension data over time. SCD Type 2 is one of the most common types of SCD, where historical changes are tracked by creating new records for each change.

Koheesio is a powerful data processing framework that provides advanced capabilities for working with Delta tables in Apache Spark. It offers a convenient and efficient way to handle SCD Type 2 operations on Delta tables.

To learn more about Slowly Changing Dimension and SCD Type 2, you can refer to the following resources:

  • Slowly Changing Dimension (SCD) - Wikipedia

By using Koheesio, you can benefit from its efficient merge logic, support for SCD Type 2 and SCD Type 1 attributes, and seamless integration with Delta tables in Spark.

"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter","title":"koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter","text":"

A class used to write Slowly Changing Dimension (SCD) Type 2 data to a Delta table.

Attributes:

  • table (InstanceOf[DeltaTableStep]): The table to merge to.
  • merge_key (str): The key used for merging data.
  • include_columns (List[str]): Columns to be merged. Will be selected from the DataFrame. Default is all columns.
  • exclude_columns (List[str]): Columns to be excluded from the DataFrame.
  • scd2_columns (List[str]): List of attributes for SCD2 type (track changes).
  • scd2_timestamp_col (Optional[Column]): Timestamp column for SCD2 type (track changes). Defaults to current_timestamp.
  • scd1_columns (List[str]): List of attributes for SCD1 type (just update).
  • meta_scd2_struct_col_name (str): Name of the SCD2 struct column.
  • meta_scd2_effective_time_col_name (str): Name of the effective time column.
  • meta_scd2_is_current_col_name (str): Name of the is-current column.
  • meta_scd2_end_time_col_name (str): Name of the end time column.
  • target_auto_generated_columns (List[str]): Auto-generated columns from the target Delta table. Will be excluded from the merge logic.

"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.exclude_columns","title":"exclude_columns class-attribute instance-attribute","text":"
exclude_columns: List[str] = Field(default_factory=list, description='Columns to be excluded from DataFrame')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.include_columns","title":"include_columns class-attribute instance-attribute","text":"
include_columns: List[str] = Field(default_factory=list, description='Columns to be merged. Will be selected from DataFrame. Default is all columns')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.merge_key","title":"merge_key class-attribute instance-attribute","text":"
merge_key: str = Field(..., description='Merge key')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.meta_scd2_effective_time_col_name","title":"meta_scd2_effective_time_col_name class-attribute instance-attribute","text":"
meta_scd2_effective_time_col_name: str = Field(default='effective_time', description='Effective col name')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.meta_scd2_end_time_col_name","title":"meta_scd2_end_time_col_name class-attribute instance-attribute","text":"
meta_scd2_end_time_col_name: str = Field(default='end_time', description='End time col name')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.meta_scd2_is_current_col_name","title":"meta_scd2_is_current_col_name class-attribute instance-attribute","text":"
meta_scd2_is_current_col_name: str = Field(default='is_current', description='Current col name')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.meta_scd2_struct_col_name","title":"meta_scd2_struct_col_name class-attribute instance-attribute","text":"
meta_scd2_struct_col_name: str = Field(default='_scd2', description='SCD2 struct name')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.scd1_columns","title":"scd1_columns class-attribute instance-attribute","text":"
scd1_columns: List[str] = Field(default_factory=list, description='List of attributes for scd1 type (just update)')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.scd2_columns","title":"scd2_columns class-attribute instance-attribute","text":"
scd2_columns: List[str] = Field(default_factory=list, description='List of attributes for scd2 type (track changes)')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.scd2_timestamp_col","title":"scd2_timestamp_col class-attribute instance-attribute","text":"
scd2_timestamp_col: Optional[Column] = Field(default=None, description='Timestamp column for SCD2 type (track changes). Defaults to current_timestamp')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.table","title":"table class-attribute instance-attribute","text":"
table: InstanceOf[DeltaTableStep] = Field(..., description='The table to merge to')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.target_auto_generated_columns","title":"target_auto_generated_columns class-attribute instance-attribute","text":"
target_auto_generated_columns: List[str] = Field(default_factory=list, description='Auto generated columns from target Delta table. Will be used to exclude from merge logic')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.execute","title":"execute","text":"
execute() -> None\n

Execute the SCD Type 2 operation.

This method executes the SCD Type 2 operation on the DataFrame. It validates the existing Delta table, prepares the merge conditions, stages the data, and then performs the merge operation.

Raises:

  • TypeError: If the scd2_timestamp_col is not of date or timestamp type. If the source DataFrame is missing any of the required merge columns.

Source code in src/koheesio/spark/writers/delta/scd.py
def execute(self) -> None:\n    \"\"\"\n    Execute the SCD Type 2 operation.\n\n    This method executes the SCD Type 2 operation on the DataFrame.\n    It validates the existing Delta table, prepares the merge conditions, stages the data,\n    and then performs the merge operation.\n\n    Raises\n    ------\n    TypeError\n        If the scd2_timestamp_col is not of date or timestamp type.\n        If the source DataFrame is missing any of the required merge columns.\n\n    \"\"\"\n    self.df: DataFrame\n    self.spark: SparkSession\n    delta_table = DeltaTable.forName(sparkSession=self.spark, tableOrViewName=self.table.table_name)\n    src_alias, cross_alias, dest_alias = \"src\", \"cross\", \"tgt\"\n\n    # Prepare required merge columns\n    required_merge_columns = [self.merge_key]\n\n    if self.scd2_columns:\n        required_merge_columns += self.scd2_columns\n\n    if self.scd1_columns:\n        required_merge_columns += self.scd1_columns\n\n    if not all(c in self.df.columns for c in required_merge_columns):\n        missing_columns = [c for c in required_merge_columns if c not in self.df.columns]\n        raise TypeError(f\"The source DataFrame is missing the columns: {missing_columns!r}\")\n\n    # Check that required columns are present in the source DataFrame\n    if self.scd2_timestamp_col is not None:\n        timestamp_col_type = self.df.select(self.scd2_timestamp_col).schema.fields[0].dataType\n\n        if not isinstance(timestamp_col_type, (DateType, TimestampType)):\n            raise TypeError(\n                f\"The scd2_timestamp_col '{self.scd2_timestamp_col}' must be of date \"\n                f\"or timestamp type.Current type is {timestamp_col_type}\"\n            )\n\n    # Prepare columns to process\n    include_columns = self.include_columns if self.include_columns else self.df.columns\n    exclude_columns = self.exclude_columns\n    columns_to_process = [c for c in include_columns if c not in exclude_columns]\n\n    # Constructing column names for SCD2 attributes\n    meta_scd2_is_current_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_is_current_col_name}\"\n    meta_scd2_effective_time_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_effective_time_col_name}\"\n    meta_scd2_end_time_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_end_time_col_name}\"\n\n    # Constructing system merge action logic\n    system_merge_action = f\"CASE WHEN tgt.{self.merge_key} is NULL THEN 'I' \"\n\n    if updates_attrs_scd2 := self._prepare_attr_clause(\n        attrs=self.scd2_columns, src_alias=src_alias, dest_alias=dest_alias\n    ):\n        system_merge_action += f\" WHEN {updates_attrs_scd2} THEN 'UC' \"\n\n    if updates_attrs_scd1 := self._prepare_attr_clause(\n        attrs=self.scd1_columns, src_alias=src_alias, dest_alias=dest_alias\n    ):\n        system_merge_action += f\" WHEN {updates_attrs_scd1} THEN 'U' \"\n\n    system_merge_action += \" ELSE NULL END\"\n\n    # Prepare the staged DataFrame\n    staged = (\n        self.df.withColumn(\n            \"__meta_scd2_timestamp\",\n            self._scd2_timestamp(scd2_timestamp_col=self.scd2_timestamp_col, spark=self.spark),\n        )\n        .transform(\n            func=self._prepare_staging,\n            delta_table=delta_table,\n            merge_action_logic=F.expr(system_merge_action),\n            meta_scd2_is_current_col=meta_scd2_is_current_col,\n            columns_to_process=columns_to_process,\n            src_alias=src_alias,\n            dest_alias=dest_alias,\n 
           cross_alias=cross_alias,\n        )\n        .transform(\n            func=self._preserve_existing_target_values,\n            meta_scd2_struct_col_name=self.meta_scd2_struct_col_name,\n            target_auto_generated_columns=self.target_auto_generated_columns,\n            src_alias=src_alias,\n            cross_alias=cross_alias,\n            dest_alias=dest_alias,\n            logger=self.log,\n        )\n        .withColumn(\"__meta_scd2_end_time\", self._scd2_end_time(meta_scd2_end_time_col=meta_scd2_end_time_col))\n        .withColumn(\"__meta_scd2_is_current\", self._scd2_is_current())\n        .withColumn(\n            \"__meta_scd2_effective_time\",\n            self._scd2_effective_time(meta_scd2_effective_time_col=meta_scd2_effective_time_col),\n        )\n        .transform(\n            func=self._add_scd2_columns,\n            meta_scd2_struct_col_name=self.meta_scd2_struct_col_name,\n            meta_scd2_effective_time_col_name=self.meta_scd2_effective_time_col_name,\n            meta_scd2_end_time_col_name=self.meta_scd2_end_time_col_name,\n            meta_scd2_is_current_col_name=self.meta_scd2_is_current_col_name,\n        )\n    )\n\n    self._prepare_merge_builder(\n        delta_table=delta_table,\n        dest_alias=dest_alias,\n        staged=staged,\n        merge_key=self.merge_key,\n        columns_to_process=columns_to_process,\n        meta_scd2_effective_time_col=meta_scd2_effective_time_col,\n    ).execute()\n
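A hypothetical usage sketch built from the attributes above; the column names, the source_df DataFrame, and the DeltaTableStep constructor arguments are placeholders, not taken from the library's documentation:

writer = SCD2DeltaTableWriter(\n    df=source_df,  # incoming dimension data (placeholder DataFrame)\n    table=DeltaTableStep(table='customer_dim'),  # target table; DeltaTableStep arguments are assumed\n    merge_key='customer_id',\n    scd2_columns=['address'],  # changes are tracked with new records\n    scd1_columns=['email'],  # changes are updated in place\n)\nwriter.execute()\n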
"},{"location":"api_reference/spark/writers/delta/stream.html","title":"Stream","text":"

This module defines the DeltaTableStreamWriter class, which is used to write streaming dataframes to Delta tables.

"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter","title":"koheesio.spark.writers.delta.stream.DeltaTableStreamWriter","text":"

Delta table stream writer

"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter.Options","title":"Options","text":"

Options for DeltaTableStreamWriter

"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter.Options.allow_population_by_field_name","title":"allow_population_by_field_name class-attribute instance-attribute","text":"
allow_population_by_field_name: bool = Field(default=True, description='TODO: convert to Field and pass as .options(**config)')\n
"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter.Options.maxBytesPerTrigger","title":"maxBytesPerTrigger class-attribute instance-attribute","text":"
maxBytesPerTrigger: Optional[str] = Field(default=None, description='How much data to be processed per trigger. The default is 1GB')\n
"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter.Options.maxFilesPerTrigger","title":"maxFilesPerTrigger class-attribute instance-attribute","text":"
maxFilesPerTrigger: int = Field(default=1000, description='The maximum number of new files to be considered in every trigger (default: 1000).')\n
"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/writers/delta/stream.py
def execute(self):\n    if self.batch_function:\n        self.streaming_query = self.writer.start()\n    else:\n        self.streaming_query = self.writer.toTable(tableName=self.table.table_name)\n
"},{"location":"api_reference/spark/writers/delta/utils.html","title":"Utils","text":"

This module provides utility functions while working with delta framework.

"},{"location":"api_reference/spark/writers/delta/utils.html#koheesio.spark.writers.delta.utils.log_clauses","title":"koheesio.spark.writers.delta.utils.log_clauses","text":"
log_clauses(clauses: JavaObject, source_alias: str, target_alias: str) -> Optional[str]\n

Prepare log message for clauses of DeltaMergePlan statement.

Parameters:

  • clauses (JavaObject, required): The clauses of the DeltaMergePlan statement.
  • source_alias (str, required): The source alias.
  • target_alias (str, required): The target alias.

Returns:

  • Optional[str]: The log message if there are clauses, otherwise None.

Notes

This function prepares a log message for the clauses of a DeltaMergePlan statement. It iterates over the clauses, processes the conditions, and constructs the log message based on the clause type and columns.

If the condition is a value, it replaces the source and target aliases in the condition string. If the condition is None, it sets the condition_clause to \"No conditions required\".

The log message includes the clauses type, the clause type, the columns, and the condition.

Source code in src/koheesio/spark/writers/delta/utils.py
def log_clauses(clauses: JavaObject, source_alias: str, target_alias: str) -> Optional[str]:\n    \"\"\"\n    Prepare log message for clauses of DeltaMergePlan statement.\n\n    Parameters\n    ----------\n    clauses : JavaObject\n        The clauses of the DeltaMergePlan statement.\n    source_alias : str\n        The source alias.\n    target_alias : str\n        The target alias.\n\n    Returns\n    -------\n    Optional[str]\n        The log message if there are clauses, otherwise None.\n\n    Notes\n    -----\n    This function prepares a log message for the clauses of a DeltaMergePlan statement. It iterates over the clauses,\n    processes the conditions, and constructs the log message based on the clause type and columns.\n\n    If the condition is a value, it replaces the source and target aliases in the condition string. If the condition is\n    None, it sets the condition_clause to \"No conditions required\".\n\n    The log message includes the clauses type, the clause type, the columns, and the condition.\n    \"\"\"\n    log_message = None\n\n    if not clauses.isEmpty():\n        clauses_type = clauses.last().nodeName().replace(\"DeltaMergeInto\", \"\")\n        _processed_clauses = {}\n\n        for i in range(0, clauses.length()):\n            clause = clauses.apply(i)\n            condition = clause.condition()\n\n            if \"value\" in dir(condition):\n                condition_clause = (\n                    condition.value()\n                    .toString()\n                    .replace(f\"'{source_alias}\", source_alias)\n                    .replace(f\"'{target_alias}\", target_alias)\n                )\n            elif condition.toString() == \"None\":\n                condition_clause = \"No conditions required\"\n\n            clause_type: str = clause.clauseType().capitalize()\n            columns = \"ALL\" if clause_type == \"Delete\" else clause.actions().toList().apply(0).toString()\n\n            if clause_type.lower() not in _processed_clauses:\n                _processed_clauses[clause_type.lower()] = []\n\n            log_message = (\n                f\"{clauses_type} will perform action:{clause_type} columns ({columns}) if `{condition_clause}`\"\n            )\n\n    return log_message\n
"},{"location":"api_reference/sso/index.html","title":"Sso","text":""},{"location":"api_reference/sso/okta.html","title":"Okta","text":"

This module contains Okta integration steps.

"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.LoggerOktaTokenFilter","title":"koheesio.sso.okta.LoggerOktaTokenFilter","text":"
LoggerOktaTokenFilter(okta_object: OktaAccessToken, name: str = 'OktaToken')\n

Filter which hides token value from log.

Source code in src/koheesio/sso/okta.py
def __init__(self, okta_object: OktaAccessToken, name: str = \"OktaToken\"):\n    self.__okta_object = okta_object\n    super().__init__(name=name)\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.LoggerOktaTokenFilter.filter","title":"filter","text":"
filter(record)\n
Source code in src/koheesio/sso/okta.py
def filter(self, record):\n    # noinspection PyUnresolvedReferences\n    if token := self.__okta_object.output.token:\n        token_value = token.get_secret_value()\n        record.msg = record.msg.replace(token_value, \"<SECRET_TOKEN>\")\n\n    return True\n
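A minimal sketch of attaching the filter to a custom logger; note that OktaAccessToken already registers this filter on its own logger in its constructor, so this is only needed for additional loggers (logger name and credentials are placeholders):

import logging\n\nfrom pydantic import SecretStr\n\ntoken_step = OktaAccessToken(url='https://org.okta.com', client_id='client', client_secret=SecretStr('secret'))\n\nlogger = logging.getLogger('my_app')  # a custom application logger (placeholder name)\nlogger.addFilter(LoggerOktaTokenFilter(okta_object=token_step))\n# any record passing through 'my_app' that contains the token value is rewritten to '<SECRET_TOKEN>'\n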
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.Okta","title":"koheesio.sso.okta.Okta","text":"

Base Okta class

"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.Okta.client_id","title":"client_id class-attribute instance-attribute","text":"
client_id: str = Field(default=..., alias='okta_id', description='Okta account ID')\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.Okta.client_secret","title":"client_secret class-attribute instance-attribute","text":"
client_secret: SecretStr = Field(default=..., alias='okta_secret', description='Okta account secret', repr=False)\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.Okta.data","title":"data class-attribute instance-attribute","text":"
data: Optional[Union[Dict[str, str], str]] = Field(default={'grant_type': 'client_credentials'}, description='Data to be sent along with the token request')\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.OktaAccessToken","title":"koheesio.sso.okta.OktaAccessToken","text":"
OktaAccessToken(**kwargs)\n

Get Okta authorization token

Example:

token = (\n    OktaAccessToken(\n        url=\"https://org.okta.com\",\n        client_id=\"client\",\n        client_secret=SecretStr(\"secret\"),\n        params={\n            \"p1\": \"foo\",\n            \"p2\": \"bar\",\n        },\n    )\n    .execute()\n    .token\n)\n

Source code in src/koheesio/sso/okta.py
def __init__(self, **kwargs):\n    _logger = LoggingFactory.get_logger(name=self.__class__.__name__, inherit_from_koheesio=True)\n    logger_filter = LoggerOktaTokenFilter(okta_object=self)\n    _logger.addFilter(logger_filter)\n    super().__init__(**kwargs)\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.OktaAccessToken.Output","title":"Output","text":"

Output class for OktaAccessToken.

"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.OktaAccessToken.Output.token","title":"token class-attribute instance-attribute","text":"
token: Optional[SecretStr] = Field(default=None, description='Okta authentication token')\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.OktaAccessToken.execute","title":"execute","text":"
execute()\n

Execute an HTTP Post call to Okta service and retrieve the access token.

Source code in src/koheesio/sso/okta.py
def execute(self):\n    \"\"\"\n    Execute an HTTP Post call to Okta service and retrieve the access token.\n    \"\"\"\n    HttpPostStep.execute(self)\n\n    # noinspection PyUnresolvedReferences\n    status_code = self.output.status_code\n    # noinspection PyUnresolvedReferences\n    raw_payload = self.output.raw_payload\n\n    if status_code != 200:\n        raise HTTPError(f\"Request failed with '{status_code}' code. Payload: {raw_payload}\")\n\n    # noinspection PyUnresolvedReferences\n    json_payload = self.output.json_payload\n\n    if token := json_payload.get(\"access_token\"):\n        self.output.token = SecretStr(token)\n    else:\n        raise ValueError(f\"No 'access_token' found in the Okta response: {json_payload}\")\n
"},{"location":"api_reference/steps/index.html","title":"Steps","text":"

Steps Module

This module contains the definition of the Step class, which serves as the base class for custom units of logic that can be executed. It also includes the StepOutput class, which defines the output data model for a Step.

The Step class is designed to be subclassed for creating new steps in a data pipeline. Each subclass should implement the execute method, specifying the expected inputs and outputs.

This module also exports the SparkStep class for steps that interact with Spark.

Classes:
  • Step: Base class for a custom unit of logic that can be executed.
  • StepOutput: Defines the output data model for a Step.
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step","title":"koheesio.steps.Step","text":"

Base class for a step

A custom unit of logic that can be executed.

The Step class is designed to be subclassed. To create a new step, one would subclass Step and implement the def execute(self) method, specifying the expected inputs and outputs.

Note: since the Step class is meta-classed, the execute method is wrapped with the do_execute function making it always return the Step's output. Hence, an explicit return is not needed when implementing execute.

Methods and Attributes

The Step class has several attributes and methods.

Background

A Step is an atomic operation and serves as the building block of data pipelines built with the framework. Tasks typically consist of a series of Steps.

A step can be seen as an operation on a set of inputs that returns a set of outputs. This does not, however, imply that steps are stateless (e.g. data writes)!

The diagram serves to illustrate the concept of a Step:

\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 1 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502                  \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 1 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2502                  \u2502        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n                   \u2502                  \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u2502                  \u2502        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 2 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502       Step       \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 2 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2502                  \u2502        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n                   \u2502                  \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u2502                  \u2502        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 3 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502                  \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 3 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n

Steps are built on top of Pydantic, which is a data validation and settings management using python type annotations. This allows for the automatic validation of the inputs and outputs of a Step.

  • Step inherits from BaseModel, which is a Pydantic class used to define data models. This allows Step to automatically validate data against the defined fields and their types.
  • Step is metaclassed by StepMetaClass, which is a custom metaclass that wraps the execute method of the Step class with the _execute_wrapper function. This ensures that the execute method always returns the output of the Step along with providing logging and validation of the output.
  • Step has an Output class, which is a subclass of StepOutput. This class is used to validate the output of the Step. The Output class is defined as an inner class of the Step class. The Output class can be accessed through the Step.Output attribute.
  • The Output class can be extended to add additional fields to the output of the Step.

Examples:

class MyStep(Step):\n    a: str  # input\n\n    class Output(StepOutput):  # output\n        b: str\n\n    def execute(self) -> MyStep.Output:\n        self.output.b = f\"{self.a}-some-suffix\"\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step--input","title":"INPUT","text":"

The following fields are available by default on the Step class:

  • name: Name of the Step. If not set, the name of the class will be used.
  • description: Description of the Step. If not set, the docstring of the class will be used. If the docstring contains multiple lines, only the first line will be used.

When subclassing a Step, any additional pydantic field will be treated as input to the Step. See also the explanation on the .execute() method below.

"},{"location":"api_reference/steps/index.html#koheesio.steps.Step--output","title":"OUTPUT","text":"

Every Step has an Output class, which is a subclass of StepOutput. This class is used to validate the output of the Step. The Output class is defined as an inner class of the Step class and can be accessed through the Step.Output attribute. It can be extended to add additional fields to the output of the Step. See also the explanation of the .execute() method.

  • Output: A nested class representing the output of the Step used to validate the output of the Step and based on the StepOutput class.
  • output: Allows you to interact with the Output of the Step lazily (see above and StepOutput)

When subclassing a Step, any additional pydantic field added to the nested Output class will be treated as output of the Step. See also the description of StepOutput for more information.

"},{"location":"api_reference/steps/index.html#koheesio.steps.Step--methods","title":"Methods:","text":"
  • execute: Abstract method to implement for new steps.
    • The Inputs of the step can be accessed, using self.input_name.
    • The output of the step can be accessed, using self.output.output_name.
  • run: Alias to .execute() method. You can use this to run the step, but execute is preferred.
  • to_yaml: YAML dump the step
  • get_description: Get the description of the Step

When subclassing a Step, execute is the only method that needs to be implemented. Any additional method added to the class will be treated as a method of the Step.

Note: since the Step class is meta-classed, the execute method is automatically wrapped with the do_execute function making it always return a StepOutput. See also the explanation on the do_execute function.

"},{"location":"api_reference/steps/index.html#koheesio.steps.Step--class-methods","title":"class methods:","text":"
  • from_step: Returns a new Step instance based on the data of another Step instance. For example: MyStep.from_step(other_step, a=\"foo\"), as shown in the sketch below.
  • get_description: Get the description of the Step
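A small sketch of from_step, reusing the MyStep example from the class docstring above:

other_step = MyStep(a='foo')\nnew_step = MyStep.from_step(other_step, a='bar')  # copies the inputs of other_step, overriding a\n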
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step--dunder-methods","title":"dunder methods:","text":"
  • __getattr__: Allows input to be accessed through self.input_name
  • __repr__ and __str__: String representation of a step
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.output","title":"output property writable","text":"
output: Output\n

Interact with the output of the Step

"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.Output","title":"Output","text":"

Output class for Step

"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.execute","title":"execute abstractmethod","text":"
execute()\n

Abstract method to implement for new steps.

The Inputs of the step can be accessed, using self.input_name

Note: since the Step class is meta-classed, the execute method is wrapped with the do_execute function, making it always return the Step's output.

Source code in src/koheesio/steps/__init__.py
@abstractmethod\ndef execute(self):\n    \"\"\"Abstract method to implement for new steps.\n\n    The Inputs of the step can be accessed, using `self.input_name`\n\n    Note: since the Step class is meta-classed, the execute method is wrapped with the `do_execute` function making\n      it always return the Steps output\n    \"\"\"\n    raise NotImplementedError\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.from_step","title":"from_step classmethod","text":"
from_step(step: Step, **kwargs)\n

Returns a new Step instance based on the data of another Step or BaseModel instance

Source code in src/koheesio/steps/__init__.py
@classmethod\ndef from_step(cls, step: Step, **kwargs):\n    \"\"\"Returns a new Step instance based on the data of another Step or BaseModel instance\"\"\"\n    return cls.from_basemodel(step, **kwargs)\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.repr_json","title":"repr_json","text":"
repr_json(simple=False) -> str\n

dump the step to json, meant for representation

Note: use to_json if you want to dump the step to json for serialization. This method is meant for representation purposes only!

Examples:

>>> step = MyStep(a=\"foo\")\n>>> print(step.repr_json())\n{\"input\": {\"a\": \"foo\"}}\n

Parameters:

  • simple (bool, default: False): When toggled to True, a briefer output will be produced. This is friendlier for logging purposes.

Returns:

  • str: A string, which is valid json.

Source code in src/koheesio/steps/__init__.py
def repr_json(self, simple=False) -> str:\n    \"\"\"dump the step to json, meant for representation\n\n    Note: use to_json if you want to dump the step to json for serialization\n    This method is meant for representation purposes only!\n\n    Examples\n    --------\n    ```python\n    >>> step = MyStep(a=\"foo\")\n    >>> print(step.repr_json())\n    {\"input\": {\"a\": \"foo\"}}\n    ```\n\n    Parameters\n    ----------\n    simple: bool\n        When toggled to True, a briefer output will be produced. This is friendlier for logging purposes\n\n    Returns\n    -------\n    str\n        A string, which is valid json\n    \"\"\"\n    model_dump_options = dict(warnings=\"none\", exclude_unset=True)\n\n    _result = {}\n\n    # extract input\n    _input = self.model_dump(**model_dump_options)\n\n    # remove name and description from input and add to result if simple is not set\n    name = _input.pop(\"name\", None)\n    description = _input.pop(\"description\", None)\n    if not simple:\n        if name:\n            _result[\"name\"] = name\n        if description:\n            _result[\"description\"] = description\n    else:\n        model_dump_options[\"exclude\"] = {\"name\", \"description\"}\n\n    # extract output\n    _output = self.output.model_dump(**model_dump_options)\n\n    # add output to result\n    if _output:\n        _result[\"output\"] = _output\n\n    # add input to result\n    _result[\"input\"] = _input\n\n    class MyEncoder(json.JSONEncoder):\n        \"\"\"Custom JSON Encoder to handle non-serializable types\"\"\"\n\n        def default(self, o: Any) -> Any:\n            try:\n                return super().default(o)\n            except TypeError:\n                return o.__class__.__name__\n\n    # Use MyEncoder when converting the dictionary to a JSON string\n    json_str = json.dumps(_result, cls=MyEncoder)\n\n    return json_str\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.repr_yaml","title":"repr_yaml","text":"
repr_yaml(simple=False) -> str\n

dump the step to yaml, meant for representation

Note: use to_yaml if you want to dump the step to yaml for serialization. This method is meant for representation purposes only!

Examples:

>>> step = MyStep(a=\"foo\")\n>>> print(step.repr_yaml())\ninput:\n  a: foo\n

Parameters:

  • simple (bool, default: False): When toggled to True, a briefer output will be produced. This is friendlier for logging purposes.

Returns:

  • str: A string, which is valid yaml.

Source code in src/koheesio/steps/__init__.py
def repr_yaml(self, simple=False) -> str:\n    \"\"\"dump the step to yaml, meant for representation\n\n    Note: use to_yaml if you want to dump the step to yaml for serialization\n    This method is meant for representation purposes only!\n\n    Examples\n    --------\n    ```python\n    >>> step = MyStep(a=\"foo\")\n    >>> print(step.repr_yaml())\n    input:\n      a: foo\n    ```\n\n    Parameters\n    ----------\n    simple: bool\n        When toggled to True, a briefer output will be produced. This is friendlier for logging purposes\n\n    Returns\n    -------\n    str\n        A string, which is valid yaml\n    \"\"\"\n    json_str = self.repr_json(simple=simple)\n\n    # Parse the JSON string back into a dictionary\n    _result = json.loads(json_str)\n\n    return yaml.dump(_result)\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.run","title":"run","text":"
run()\n

Alias to .execute()

Source code in src/koheesio/steps/__init__.py
def run(self):\n    \"\"\"Alias to .execute()\"\"\"\n    return self.execute()\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.StepMetaClass","title":"koheesio.steps.StepMetaClass","text":"

StepMetaClass has to be set up as a metaclass extending ModelMetaclass so that Pydantic remains unaffected while the execute method is auto-decorated with do_execute.

"},{"location":"api_reference/steps/index.html#koheesio.steps.StepOutput","title":"koheesio.steps.StepOutput","text":"

Class for the StepOutput model

Usage

Setting up the StepOutputs class is done like this:

class YourOwnOutput(StepOutput):\n    a: str\n    b: int\n

"},{"location":"api_reference/steps/index.html#koheesio.steps.StepOutput.model_config","title":"model_config class-attribute instance-attribute","text":"
model_config = ConfigDict(validate_default=False, defer_build=True)\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.StepOutput.validate_output","title":"validate_output","text":"
validate_output() -> StepOutput\n

Validate the output of the Step

Essentially, this method is a wrapper around the validate method of the BaseModel class

Source code in src/koheesio/steps/__init__.py
def validate_output(self) -> StepOutput:\n    \"\"\"Validate the output of the Step\n\n    Essentially, this method is a wrapper around the validate method of the BaseModel class\n    \"\"\"\n    validated_model = self.validate()\n    return StepOutput.from_basemodel(validated_model)\n
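A brief sketch using the YourOwnOutput model from the usage example above (field values are placeholders):

output = YourOwnOutput(a='foo', b=42)\nvalidated = output.validate_output()  # runs model validation and returns a StepOutput\n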
"},{"location":"api_reference/steps/dummy.html","title":"Dummy","text":"

Dummy step for testing purposes.

This module contains a dummy step for testing purposes. It is used to test the Koheesio framework or to provide a simple example of how to create a new step.

Example

s = DummyStep(a=\"a\", b=2)\ns.execute()\n
In this case, s.output will be equivalent to the following dictionary:
{\"a\": \"a\", \"b\": 2, \"c\": \"aa\"}\n

"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyOutput","title":"koheesio.steps.dummy.DummyOutput","text":"

Dummy output for testing purposes.

"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyOutput.a","title":"a instance-attribute","text":"
a: str\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyOutput.b","title":"b instance-attribute","text":"
b: int\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep","title":"koheesio.steps.dummy.DummyStep","text":"

Dummy step for testing purposes.

"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep.a","title":"a instance-attribute","text":"
a: str\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep.b","title":"b instance-attribute","text":"
b: int\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep.Output","title":"Output","text":"

Dummy output for testing purposes.

"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep.Output.c","title":"c instance-attribute","text":"
c: str\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep.execute","title":"execute","text":"
execute()\n

Dummy execute for testing purposes.

Source code in src/koheesio/steps/dummy.py
def execute(self):\n    \"\"\"Dummy execute for testing purposes.\"\"\"\n    self.output.a = self.a\n    self.output.b = self.b\n    self.output.c = self.a * self.b\n
"},{"location":"api_reference/steps/http.html","title":"Http","text":"

This module contains a few simple HTTP Steps that can be used to perform API Calls to HTTP endpoints

Example
from koheesio.steps.http import HttpGetStep\n\nresponse = HttpGetStep(url=\"https://google.com\").execute().json_payload\n

In the above example, the response variable will contain the JSON response from the HTTP request.

"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpDeleteStep","title":"koheesio.steps.http.HttpDeleteStep","text":"

send DELETE requests

"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpDeleteStep.method","title":"method class-attribute instance-attribute","text":"
method: HttpMethod = DELETE\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpGetStep","title":"koheesio.steps.http.HttpGetStep","text":"

send GET requests

Example

response = HttpGetStep(url=\"https://google.com\").execute().json_payload\n
In the above example, the response variable will contain the JSON response from the HTTP request.

"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpGetStep.method","title":"method class-attribute instance-attribute","text":"
method: HttpMethod = GET\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod","title":"koheesio.steps.http.HttpMethod","text":"

Enumeration of allowed http methods

"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod.DELETE","title":"DELETE class-attribute instance-attribute","text":"
DELETE = 'delete'\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod.GET","title":"GET class-attribute instance-attribute","text":"
GET = 'get'\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod.POST","title":"POST class-attribute instance-attribute","text":"
POST = 'post'\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod.PUT","title":"PUT class-attribute instance-attribute","text":"
PUT = 'put'\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod.from_string","title":"from_string classmethod","text":"
from_string(value: str)\n

Allows for getting the right Method Enum by simply passing a string value. This method is not case-sensitive.

Source code in src/koheesio/steps/http.py
@classmethod\ndef from_string(cls, value: str):\n    \"\"\"Allows for getting the right Method Enum by simply passing a string value\n    This method is not case-sensitive\n    \"\"\"\n    return getattr(cls, value.upper())\n
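For example, mirroring the enum members listed above:

method = HttpMethod.from_string('delete')  # case-insensitive lookup\nassert method == HttpMethod.DELETE\n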
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpPostStep","title":"koheesio.steps.http.HttpPostStep","text":"

send POST requests

"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpPostStep.method","title":"method class-attribute instance-attribute","text":"
method: HttpMethod = POST\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpPutStep","title":"koheesio.steps.http.HttpPutStep","text":"

send PUT requests

"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpPutStep.method","title":"method class-attribute instance-attribute","text":"
method: HttpMethod = PUT\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep","title":"koheesio.steps.http.HttpStep","text":"

Can be used to perform API Calls to HTTP endpoints

Understanding Retries

This class includes a built-in retry mechanism for handling temporary issues, such as network errors or server downtime, that might cause the HTTP request to fail. The retry mechanism is controlled by three parameters: max_retries, initial_delay, and backoff.

  • max_retries determines the number of retries after the initial request. For example, if max_retries is set to 4, the request will be attempted a total of 5 times (1 initial attempt + 4 retries). If max_retries is set to 0, no retries will be attempted, and the request will be tried only once.

  • initial_delay sets the waiting period before the first retry. If initial_delay is set to 3, the delay before the first retry will be 3 seconds. Changing the initial_delay value directly affects the amount of delay before each retry.

  • backoff controls the rate at which the delay increases for each subsequent retry. If backoff is set to 2 (the default), the delay will double with each retry. If backoff is set to 1, the delay between retries will remain constant. Changing the backoff value affects how quickly the delay increases.

Given the default values of max_retries=3, initial_delay=2, and backoff=2, the delays between retries would be 2 seconds, 4 seconds, and 8 seconds, respectively. This results in a total delay of 14 seconds before all retries are exhausted.

For example, if you set initial_delay=3 and backoff=2, the delays before the retries would be 3 seconds, 6 seconds, and 12 seconds. If you set initial_delay=2 and backoff=3, the delays before the retries would be 2 seconds, 6 seconds, and 18 seconds. If you set initial_delay=2 and backoff=1, the delays before the retries would be 2 seconds, 2 seconds, and 2 seconds.

"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.data","title":"data class-attribute instance-attribute","text":"
data: Optional[Union[Dict[str, str], str]] = Field(default_factory=dict, description='[Optional] Data to be sent along with the request', alias='body')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.headers","title":"headers class-attribute instance-attribute","text":"
headers: Optional[Dict[str, Union[str, SecretStr]]] = Field(default_factory=dict, description='Request headers', alias='header')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.method","title":"method class-attribute instance-attribute","text":"
method: Union[str, HttpMethod] = Field(default=GET, description=\"What type of Http call to perform. One of 'get', 'post', 'put', 'delete'. Defaults to 'get'.\")\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.params","title":"params class-attribute instance-attribute","text":"
params: Optional[Dict[str, Any]] = Field(default_factory=dict, description='[Optional] Set of extra parameters that should be passed to HTTP request')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.session","title":"session class-attribute instance-attribute","text":"
session: Session = Field(default_factory=Session, description='Requests session object to be used for making HTTP requests', exclude=True, repr=False)\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.timeout","title":"timeout class-attribute instance-attribute","text":"
timeout: Optional[int] = Field(default=3, description='[Optional] Request timeout')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.url","title":"url class-attribute instance-attribute","text":"
url: str = Field(default=..., description='API endpoint URL', alias='uri')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output","title":"Output","text":"

Output class for HttpStep

"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output.json_payload","title":"json_payload property","text":"
json_payload\n

Alias for response_json

"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output.raw_payload","title":"raw_payload class-attribute instance-attribute","text":"
raw_payload: Optional[str] = Field(default=None, alias='response_text', description='The raw response for the request')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output.response_json","title":"response_json class-attribute instance-attribute","text":"
response_json: Optional[Union[Dict, List]] = Field(default=None, alias='json_payload', description='The JSON response for the request')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output.response_raw","title":"response_raw class-attribute instance-attribute","text":"
response_raw: Optional[Response] = Field(default=None, alias='response', description='The raw requests.Response object returned by the appropriate requests.request() call')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output.status_code","title":"status_code class-attribute instance-attribute","text":"
status_code: Optional[int] = Field(default=None, description='The status return code of the request')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.decode_sensitive_headers","title":"decode_sensitive_headers","text":"
decode_sensitive_headers(headers)\n

Authorization headers are converted into SecretStr under the hood by the encode_sensitive_headers method to avoid dumping any sensitive content into logs.

However, when calling the get_headers method, the SecretStr should be converted back to a string, otherwise the sensitive info would be rendered as '**********'.

This method decodes values of the headers dictionary that are of type SecretStr into plain text.

Source code in src/koheesio/steps/http.py
@field_serializer(\"headers\", when_used=\"json\")\ndef decode_sensitive_headers(self, headers):\n    \"\"\"\n    Authorization headers are being converted into SecretStr under the hood to avoid dumping any\n    sensitive content into logs by the `encode_sensitive_headers` method.\n\n    However, when calling the `get_headers` method, the SecretStr should be converted back to\n    string, otherwise sensitive info would have looked like '**********'.\n\n    This method decodes values of the `headers` dictionary that are of type SecretStr into plain text.\n    \"\"\"\n    for k, v in headers.items():\n        headers[k] = v.get_secret_value() if isinstance(v, SecretStr) else v\n    return headers\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.delete","title":"delete","text":"
delete() -> Response\n

Execute an HTTP DELETE call

Source code in src/koheesio/steps/http.py
def delete(self) -> requests.Response:\n    \"\"\"Execute an HTTP DELETE call\"\"\"\n    self.method = HttpMethod.DELETE\n    return self.request()\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.encode_sensitive_headers","title":"encode_sensitive_headers","text":"
encode_sensitive_headers(headers)\n

Encode potentially sensitive data into pydantic.SecretStr class to prevent them being displayed as plain text in logs.

Source code in src/koheesio/steps/http.py
@field_validator(\"headers\", mode=\"before\")\ndef encode_sensitive_headers(cls, headers):\n    \"\"\"\n    Encode potentially sensitive data into pydantic.SecretStr class to prevent them\n    being displayed as plain text in logs.\n    \"\"\"\n    if auth := headers.get(\"Authorization\"):\n        headers[\"Authorization\"] = auth if isinstance(auth, SecretStr) else SecretStr(auth)\n    return headers\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.execute","title":"execute","text":"
execute() -> Output\n

Executes the HTTP request.

This method simply calls self.request(), which includes the retry logic. If self.request() raises an exception, it will be propagated to the caller of this method.

Raises:

  • requests.RequestException, requests.HTTPError: The last exception that was caught if self.request() fails after self.max_retries attempts.

Source code in src/koheesio/steps/http.py
def execute(self) -> Output:\n    \"\"\"\n    Executes the HTTP request.\n\n    This method simply calls `self.request()`, which includes the retry logic. If `self.request()` raises an\n    exception, it will be propagated to the caller of this method.\n\n    Raises\n    ------\n    requests.RequestException, requests.HTTPError\n        The last exception that was caught if `self.request()` fails after `self.max_retries` attempts.\n    \"\"\"\n    self.request()\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.get","title":"get","text":"
get() -> Response\n

Execute an HTTP GET call

Source code in src/koheesio/steps/http.py
def get(self) -> requests.Response:\n    \"\"\"Execute an HTTP GET call\"\"\"\n    self.method = HttpMethod.GET\n    return self.request()\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.get_headers","title":"get_headers","text":"
get_headers()\n

Dump headers into JSON without SecretStr masking.

Source code in src/koheesio/steps/http.py
def get_headers(self):\n    \"\"\"\n    Dump headers into JSON without SecretStr masking.\n    \"\"\"\n    return json.loads(self.model_dump_json()).get(\"headers\")\n
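A small sketch of the encode/decode round trip (the URL and token are placeholders):

step = HttpStep(url='https://api.example.com', headers={'Authorization': 'Bearer my-token'})  # placeholders\nprint(step.headers['Authorization'])  # '**********' -- stored as SecretStr, masked in logs and dumps\nprint(step.get_headers()['Authorization'])  # 'Bearer my-token' -- decoded for the actual request\n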
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.get_options","title":"get_options","text":"
get_options()\n

options to be passed to requests.request()

Source code in src/koheesio/steps/http.py
def get_options(self):\n    \"\"\"options to be passed to requests.request()\"\"\"\n    return {\n        \"url\": self.url,\n        \"headers\": self.get_headers(),\n        \"data\": self.data,\n        \"timeout\": self.timeout,\n        **self.params,  # type: ignore\n    }\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.get_proper_http_method_from_str_value","title":"get_proper_http_method_from_str_value","text":"
get_proper_http_method_from_str_value(method_value)\n

Converts string value to HttpMethod enum value

Source code in src/koheesio/steps/http.py
@field_validator(\"method\")\ndef get_proper_http_method_from_str_value(cls, method_value):\n    \"\"\"Converts string value to HttpMethod enum value\"\"\"\n    if isinstance(method_value, str):\n        try:\n            method_value = HttpMethod.from_string(method_value)\n        except AttributeError as e:\n            raise AttributeError(\n                \"Only values from HttpMethod class are allowed! \"\n                f\"Provided value: '{method_value}', allowed values: {', '.join(HttpMethod.__members__.keys())}\"\n            ) from e\n\n    return method_value\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.post","title":"post","text":"
post() -> Response\n

Execute an HTTP POST call

Source code in src/koheesio/steps/http.py
def post(self) -> requests.Response:\n    \"\"\"Execute an HTTP POST call\"\"\"\n    self.method = HttpMethod.POST\n    return self.request()\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.put","title":"put","text":"
put() -> Response\n

Execute an HTTP PUT call

Source code in src/koheesio/steps/http.py
def put(self) -> requests.Response:\n    \"\"\"Execute an HTTP PUT call\"\"\"\n    self.method = HttpMethod.PUT\n    return self.request()\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.request","title":"request","text":"
request(method: Optional[HttpMethod] = None) -> Response\n

Executes the HTTP request with retry logic.

Actual http_method execution is abstracted into this method to avoid unnecessary code duplication. This allows logging, setting outputs, and validation to be handled centrally.

This method will try to execute requests.request up to self.max_retries times. If self.request() raises an exception, it logs a warning message and the error message, then waits for self.initial_delay * (self.backoff ** i) seconds before retrying. The delay increases exponentially after each failed attempt due to the self.backoff ** i term.

If self.request() still fails after self.max_retries attempts, it logs an error message and re-raises the last exception that was caught.

This is a good way to handle temporary issues that might cause self.request() to fail, such as network errors or server downtime. The exponential backoff ensures that you're not constantly bombarding a server with requests if it's struggling to respond.

Parameters:

  • method (HttpMethod, default: None): Optional parameter that allows calls to different HTTP methods and bypasses the class-level method parameter.

Raises:

  • requests.RequestException, requests.HTTPError: The last exception that was caught if requests.request() fails after self.max_retries attempts.

Source code in src/koheesio/steps/http.py
def request(self, method: Optional[HttpMethod] = None) -> requests.Response:\n    \"\"\"\n    Executes the HTTP request with retry logic.\n\n    Actual http_method execution is abstracted into this method.\n    This is to avoid unnecessary code duplication. Allows to centrally log, set outputs, and validated.\n\n    This method will try to execute `requests.request` up to `self.max_retries` times. If `self.request()` raises\n    an exception, it logs a warning message and the error message, then waits for\n    `self.initial_delay * (self.backoff ** i)` seconds before retrying. The delay increases exponentially\n    after each failed attempt due to the `self.backoff ** i` term.\n\n    If `self.request()` still fails after `self.max_retries` attempts, it logs an error message and re-raises the\n    last exception that was caught.\n\n    This is a good way to handle temporary issues that might cause `self.request()` to fail, such as network errors\n    or server downtime. The exponential backoff ensures that you're not constantly bombarding a server with\n    requests if it's struggling to respond.\n\n    Parameters\n    ----------\n    method : HttpMethod\n        Optional parameter that allows calls to different HTTP methods and bypassing class level `method`\n        parameter.\n\n    Raises\n    ------\n    requests.RequestException, requests.HTTPError\n        The last exception that was caught if `requests.request()` fails after `self.max_retries` attempts.\n    \"\"\"\n    _method = (method or self.method).value.upper()\n    options = self.get_options()\n\n    self.log.debug(f\"Making {_method} request to {options['url']} with headers {options['headers']}\")\n\n    response = self.session.request(method=_method, **options)\n    response.raise_for_status()\n\n    self.log.debug(f\"Received response with status code {response.status_code} and body {response.text}\")\n    self.set_outputs(response)\n\n    return response\n
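A hedged sketch of calling request directly versus through one of the convenience methods; the endpoint is a placeholder:

step = HttpStep(url='https://api.example.com/data')  # placeholder endpoint\nstep.get()  # sets method to HttpMethod.GET and delegates to request()\nstep.request(method=HttpMethod.POST)  # bypasses the class-level method for a single call\nprint(step.output.status_code, step.output.json_payload)\n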
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.set_outputs","title":"set_outputs","text":"
set_outputs(response)\n

Set the various types of response output on the Step's Output.

Source code in src/koheesio/steps/http.py
def set_outputs(self, response):\n    \"\"\"\n    Types of response output\n    \"\"\"\n    self.output.response_raw = response\n    self.output.raw_payload = response.text\n    self.output.status_code = response.status_code\n\n    # Only decode non empty payloads to avoid triggering decoding error unnecessarily.\n    if self.output.raw_payload:\n        try:\n            self.output.response_json = response.json()\n\n        except json.decoder.JSONDecodeError as e:\n            self.log.info(f\"An error occurred while processing the JSON payload. Error message:\\n{e.msg}\")\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep","title":"koheesio.steps.http.PaginatedHtppGetStep","text":"

Represents a paginated HTTP GET step.

Parameters:

  • paginate (bool): Whether to paginate the API response. Defaults to False.
  • pages (int): Number of pages to paginate. Defaults to 1.
  • offset (int): Offset for paginated API calls. Offset determines the starting page. Defaults to 1.
  • limit (int): Limit for paginated API calls. Defaults to 100.
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.limit","title":"limit class-attribute instance-attribute","text":"
limit: Optional[int] = Field(default=100, description='Limit for paginated API calls. The url should (optionally) contain a named limit parameter, for example: api.example.com/data?limit={limit}')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.offset","title":"offset class-attribute instance-attribute","text":"
offset: Optional[int] = Field(default=1, description=\"Offset for paginated API calls. Offset determines the starting page. Defaults to 1. The url can (optionally) contain a named 'offset' parameter, for example: api.example.com/data?offset={offset}\")\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.pages","title":"pages class-attribute instance-attribute","text":"
pages: Optional[int] = Field(default=1, description='Number of pages to paginate. Defaults to 1')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.paginate","title":"paginate class-attribute instance-attribute","text":"
paginate: Optional[bool] = Field(default=False, description=\"Whether to paginate the API response. Defaults to False. When set to True, the API response will be paginated. The url should contain a named 'page' parameter for example: api.example.com/data?page={page}\")\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.execute","title":"execute","text":"
execute() -> Output\n

Executes the HTTP GET request and handles pagination.

Returns:

| Type | Description |
|------|-------------|
| Output | The output of the HTTP GET request. |

Source code in src/koheesio/steps/http.py
def execute(self) -> HttpGetStep.Output:\n    \"\"\"\n    Executes the HTTP GET request and handles pagination.\n\n    Returns\n    -------\n    HttpGetStep.Output\n        The output of the HTTP GET request.\n    \"\"\"\n    # Set up pagination parameters\n    offset, pages = (self.offset, self.pages + 1) if self.paginate else (1, 1)  # type: ignore\n    data = []\n    _basic_url = self.url\n\n    for page in range(offset, pages):\n        if self.paginate:\n            self.log.info(f\"Fetching page {page} of {pages - 1}\")\n\n        self.url = self._url(basic_url=_basic_url, page=page)\n        self.request()\n\n        if isinstance(self.output.response_json, list):\n            data += self.output.response_json\n        else:\n            data.append(self.output.response_json)\n\n    self.url = _basic_url\n    self.output.response_json = data\n    self.output.response_raw = None\n    self.output.raw_payload = None\n    self.output.status_code = None\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.get_options","title":"get_options","text":"
get_options()\n

Returns the options to be passed to the requests.request() function.

Returns:

| Type | Description |
|------|-------------|
| dict | The options. |

Source code in src/koheesio/steps/http.py
def get_options(self):\n    \"\"\"\n    Returns the options to be passed to the requests.request() function.\n\n    Returns\n    -------\n    dict\n        The options.\n    \"\"\"\n    options = {\n        \"url\": self.url,\n        \"headers\": self.get_headers(),\n        \"data\": self.data,\n        \"timeout\": self.timeout,\n        **self._adjust_params(),  # type: ignore\n    }\n\n    return options\n
"},{"location":"community/approach-documentation.html","title":"Approach documentation","text":"","tags":["doctype/explanation"]},{"location":"community/approach-documentation.html#scope","title":"Scope","text":"","tags":["doctype/explanation"]},{"location":"community/approach-documentation.html#the-system","title":"The System","text":"

We will be adopting \"The Documentation System\".

From documentation.divio.com:

There is a secret that needs to be understood in order to write good software documentation: there isn\u2019t one thing called documentation, there are four.

They are: tutorials, how-to guides, technical reference and explanation. They represent four different purposes or functions, and require four different approaches to their creation. Understanding the implications of this will help improve most documentation - often immensely.

About the system The documentation system outlined here is a simple, comprehensive and nearly universally-applicable scheme. It is proven in practice across a wide variety of fields and applications.

There are some very simple principles that govern documentation that are very rarely if ever spelled out. They seem to be a secret, though they shouldn\u2019t be.

If you can put these principles into practice, it will make your documentation better and your project, product or team more successful - that\u2019s a promise.

The system is widely adopted for large and small, open and proprietary documentation projects.

Video Presentation on YouTube:

","tags":["doctype/explanation"]},{"location":"community/contribute.html","title":"Contribute","text":""},{"location":"community/contribute.html#how-to-contribute","title":"How to contribute","text":"

There are a few guidelines that we need contributors to follow so that we are able to process requests as efficiently as possible. If you have any questions or concerns please feel free to contact us at opensource@nike.com.

"},{"location":"community/contribute.html#getting-started","title":"Getting Started","text":"
  • Review our Code of Conduct
  • Make sure you have a GitHub account
  • Submit a ticket for your issue, assuming one does not already exist.
    • Clearly describe the issue including steps to reproduce when it is a bug.
    • Make sure you fill in the earliest version that you know has the issue.
  • Fork the repository on GitHub
"},{"location":"community/contribute.html#making-changes","title":"Making Changes","text":"
  • Create a feature branch off of main before you start your work.
    • Please avoid working directly on the main branch.
  • Set up the required package manager: hatch
  • Set up the dev environment (see below)
  • Make commits of logical units.
    • You may be asked to squash unnecessary commits down to logical units.
  • Check for unnecessary whitespace with git diff --check before committing.
  • Write meaningful, descriptive commit messages.
  • Please follow existing code conventions when working on a file
  • Make sure to check the code standards (see below)
  • Make sure to test the code before you push changes (see below)
"},{"location":"community/contribute.html#submitting-changes","title":"\ud83e\udd1d Submitting Changes","text":"
  • Push your changes to a topic branch in your fork of the repository.
  • Submit a pull request to the repository in the Nike-Inc organization.
  • After feedback has been given we expect responses within two weeks. After two weeks we may close the pull request if it isn't showing any activity.
  • Bug fixes or features that lack appropriate tests may not be considered for merge.
  • Changes that lower test coverage may not be considered for merge.
"},{"location":"community/contribute.html#make-commands","title":"\ud83d\udd28 Make commands","text":"

We use make for managing different steps of setup and maintenance in the project. You can install make by following the instructions here

For a full list of available make commands, you can run:

make help\n
"},{"location":"community/contribute.html#package-manager","title":"\ud83d\udce6 Package manager","text":"

We use hatch as our package manager.

Note: Please DO NOT use pip or conda to install the dependencies. Instead, use hatch.

To install hatch, run the following command:

make init\n

or,

make hatch-install\n

This will install hatch using brew if you are on a Mac.

If you are on a different OS, you can follow the instructions here

"},{"location":"community/contribute.html#dev-environment-setup","title":"\ud83d\udccc Dev Environment Setup","text":"

To ensure our standards are met, install the required packages.

make dev\n

This will install all the required packages for development in the project under the .venv directory. Use this virtual environment to run the code and tests during local development.

"},{"location":"community/contribute.html#linting-and-standards","title":"\ud83e\uddf9 Linting and Standards","text":"

We use ruff, pylint, isort, black and mypy to maintain standards in the codebase.

Run the following two commands to check the codebase for any issues:

make check\n
This will run all the checks including pylint and mypy.

make fmt\n
This will format the codebase using black, isort, and ruff.

Make sure that the linters and formatters do not report any errors or warnings before submitting a pull request.

"},{"location":"community/contribute.html#testing","title":"\ud83e\uddea Testing","text":"

We use pytest to test our code.

You can run the tests by running one of the following commands:

make cov  # to run the tests and check the coverage\nmake all-tests  # to run all the tests\nmake spark-tests  # to run the spark tests\nmake non-spark-tests  # to run the non-spark tests\n

Make sure that all tests pass and that you have adequate coverage before submitting a pull request.

"},{"location":"community/contribute.html#additional-resources","title":"Additional Resources","text":"
  • General GitHub documentation
  • GitHub pull request documentation
  • Nike's Code of Conduct
  • Nike's Individual Contributor License Agreement
  • Nike OSS
"},{"location":"includes/glossary.html","title":"Glossary","text":""},{"location":"includes/glossary.html#pydantic","title":"Pydantic","text":"

Pydantic is a Python library for data validation and settings management using Python type annotations. It allows Koheesio to bring in strong typing and a high level of type safety. Essentially, it allows Koheesio to consider configurations of a pipeline (i.e. the settings used inside Steps, Tasks, etc.) as data that can be validated and structured.

"},{"location":"includes/glossary.html#pyspark","title":"PySpark","text":"

PySpark is a Python library for Apache Spark, a powerful open-source data processing engine. It allows Koheesio to handle large-scale data processing tasks efficiently.

"},{"location":"misc/info.html","title":"Info","text":"

{{ macros_info() }}

"},{"location":"reference/concepts/concepts.html","title":"Concepts","text":"

The framework architecture is built from a set of core components. Each of the implementations that the framework provides out of the box can be swapped out for custom implementations, as long as they match the API.

The core components are the following:

Note: click on a 'Concept' below to go to the corresponding module. The module documentation has greater detail on the specifics of the implementation.

"},{"location":"reference/concepts/concepts.html#step","title":"Step","text":"

A custom unit of logic that can be executed. A Step is an atomic operation and serves as the building block of data pipelines built with the framework. A step can be seen as an operation on a set of inputs that returns a set of outputs. This does not imply that steps are stateless (e.g. data writes)! This concept is visualized in the figure below.

\nflowchart LR\n\n%% Should render like this\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 1 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502                  \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 1 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2502                  \u2502        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n%%                    \u2502                  \u2502\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u2502                  \u2502        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 2 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502       Step       \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 2 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2502                  \u2502        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n%%                    \u2502                  \u2502\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u2502                  \u2502        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 3 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502                  \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 3 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n\n%% &nbsp; is for increasing the box size without having to mess with CSS settings\nStep[\"\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;\n&nbsp;\n&nbsp;\nStep\n&nbsp;\n&nbsp;\n&nbsp;\n\"]\n\nI1[\"Input 1\"] ---> Step\nI2[\"Input 2\"] ---> Step\nI3[\"Input 3\"] ---> Step\n\nStep ---> O1[\"Output 1\"]\nStep ---> O2[\"Output 2\"]\nStep ---> O3[\"Output 3\"]\n

Step is the core abstraction of the framework, meaning it is the core building block used to define all the operations that can be executed.

Please see the Step documentation for more details.

"},{"location":"reference/concepts/concepts.html#task","title":"Task","text":"

The unit of work of one execution of the framework.

An execution usually consists of an Extract - Transform - Load approach applied to one data object. Tasks typically consist of a series of Steps.

Please see the Task documentation for more details.

"},{"location":"reference/concepts/concepts.html#context","title":"Context","text":"

The Context is used to configure the environment where a Task or Step runs.

It is often based on configuration files and can be used to adapt behaviour of a Task or Step based on the environment it runs in.

Please see the Context documentation for more details.

"},{"location":"reference/concepts/concepts.html#logger","title":"logger","text":"

A logger object to log messages with different levels.

Please see the Logging documentation for more details.

The interactions between the base concepts of the model are visible in the diagram below:

---\ntitle: Koheesio Class Diagram\n---\nclassDiagram\n    Step .. Task\n    Step .. Transformation\n    Step .. Reader\n    Step .. Writer\n\n    class Context\n\n    class LoggingFactory\n\n    class Task{\n        <<abstract>>\n        + List~Step~ steps\n        ...\n        + execute() Output\n    }\n\n    class Step{\n        <<abstract>>\n        ...\n        Output: ...\n        + execute() Output\n    }\n\n    class Transformation{\n        <<abstract>>\n        + df: DataFrame\n        ...\n        Output:\n        + df: DataFrame\n        + transform(df: DataFrame) DataFrame\n    }\n\n    class Reader{\n        <<abstract>>\n        ...\n        Output:\n        + df: DataFrame\n        + read() DataFrame\n    }\n\n    class Writer{\n        <<abstract>>\n        + df: DataFrame\n        ...\n        + write(df: DataFrame)\n    }
"},{"location":"reference/concepts/context.html","title":"Context in Koheesio","text":"

In the Koheesio framework, the Context class plays a pivotal role. It serves as a flexible and powerful tool for managing configuration data and shared variables across tasks and steps in your application.

Context behaves much like a Python dictionary, but with additional features that enhance its usability and flexibility. It allows you to store and retrieve values, including complex Python objects, with ease. You can access these values using dictionary-like methods or as class attributes, providing a simple and intuitive interface.

Moreover, Context supports nested keys and recursive merging of contexts, making it a versatile tool for managing complex configurations. It also provides serialization and deserialization capabilities, allowing you to easily save and load configurations in JSON, YAML, or TOML formats.

Whether you're setting up the environment for a Task or Step, or managing variables shared across multiple tasks, Context provides a robust and efficient solution.

This document will guide you through its key features and show you how to leverage its capabilities in your Koheesio applications.

"},{"location":"reference/concepts/context.html#api-reference","title":"API Reference","text":"

See API Reference for a detailed description of the Context class and its methods.

"},{"location":"reference/concepts/context.html#key-features","title":"Key Features","text":"
  • Accessing Values: Context simplifies accessing configuration values. You can access them using dictionary-like methods or as class attributes. This allows for a more intuitive interaction with the Context object. For example:

    context = Context({\"bronze_table\": \"catalog.schema.table_name\"})\nprint(context.bronze_table)  # Outputs: catalog.schema.table_name\n
  • Nested Keys: Context supports nested keys, allowing you to access and add nested keys in a straightforward way. This is useful when dealing with complex configurations that require a hierarchical structure. For example:

    context = Context({\"bronze\": {\"table\": \"catalog.schema.table_name\"}})\nprint(context.bronze.table)  # Outputs: catalog.schema.table_name\n
  • Merging Contexts: You can merge two Contexts together, with the incoming Context having priority. Recursive merging is also supported. This is particularly useful when you want to update a Context with new data without losing the existing values. For example:

    context1 = Context({\"bronze_table\": \"catalog.schema.table_name\"})\ncontext2 = Context({\"silver_table\": \"catalog.schema.table_name\"})\ncontext1.merge(context2)\nprint(context1.silver_table)  # Outputs: catalog.schema.table_name\n
  • Adding Keys: You can add keys to a Context by using the add method. This allows you to dynamically update the Context as needed. For example:

    context.add(\"silver_table\", \"catalog.schema.table_name\")\n
  • Checking Key Existence: You can check if a key exists in a Context by using the contains method. This is useful when you want to ensure a key is present before attempting to access its value. For example:

    context.contains(\"silver_table\")  # Returns: True\n
  • Getting Key-Value Pair: You can get a key-value pair from a Context by using the get_item method. This can be useful when you want to extract a specific piece of data from the Context. For example:

    context.get_item(\"silver_table\")  # Returns: {\"silver_table\": \"catalog.schema.table_name\"}\n
  • Converting to Dictionary: You can convert a Context to a dictionary by using the to_dict method. This can be useful when you need to interact with code that expects a standard Python dictionary. For example:

    context_dict = context.to_dict()\n
  • Creating from Dictionary: You can create a Context from a dictionary by using the from_dict method. This allows you to easily convert existing data structures into a Context. For example:

    context = Context.from_dict({\"bronze_table\": \"catalog.schema.table_name\"})\n
"},{"location":"reference/concepts/context.html#advantages-over-a-dictionary","title":"Advantages over a Dictionary","text":"

While a dictionary can be used to store configuration values, Context provides several advantages:

  • Support for nested keys: Unlike a standard Python dictionary, Context allows you to access nested keys as if they were attributes. This makes it easier to work with complex, hierarchical data.

  • Recursive merging of two Contexts: Context allows you to merge two Contexts together, with the incoming Context having priority. This is useful when you want to update a Context with new data without losing the existing values.

  • Accessing keys as if they were class attributes: This provides a more intuitive way to interact with the Context, as you can use dot notation to access values.

  • Code completion in IDEs: Because you can access keys as if they were attributes, IDEs can provide code completion for Context keys. This can make your coding process more efficient and less error-prone.

  • Easy creation from a YAML, JSON, or TOML file: Context provides methods to easily load data from YAML, JSON, or TOML files, making it a great tool for managing configuration data.

"},{"location":"reference/concepts/context.html#data-formats-and-serialization","title":"Data Formats and Serialization","text":"

Context leverages JSON, YAML, and TOML for serialization and deserialization. These formats are widely used in the industry and provide a balance between readability and ease of use.

  • JSON: A lightweight data-interchange format that is easy for humans to read and write and easy for machines to parse and generate. It's widely used for APIs and web-based applications.

  • YAML: A human-friendly data serialization standard often used for configuration files. It's more readable than JSON and supports complex data structures.

  • TOML: A minimal configuration file format that's easy to read due to its clear and simple syntax. It's often used for configuration files in Python applications.

"},{"location":"reference/concepts/context.html#examples","title":"Examples","text":"

In this section, we provide a variety of examples to demonstrate the capabilities of the Context class in Koheesio.

"},{"location":"reference/concepts/context.html#basic-operations","title":"Basic Operations","text":"

Here are some basic operations you can perform with Context. These operations form the foundation of how you interact with a Context object:

# Create a Context\ncontext = Context({\"bronze_table\": \"catalog.schema.table_name\"})\n\n# Access a value\nvalue = context.bronze_table\n\n# Add a key\ncontext.add(\"silver_table\", \"catalog.schema.table_name\")\n\n# Merge two Contexts\ncontext.merge(Context({\"silver_table\": \"catalog.schema.table_name\"}))\n
"},{"location":"reference/concepts/context.html#serialization-and-deserialization","title":"Serialization and Deserialization","text":"

Context supports serialization and deserialization to and from JSON, YAML, and TOML formats. This allows you to easily save and load Context data:

# Load context from a JSON file\ncontext = Context.from_json(\"path/to/context.json\")\n\n# Save context to a JSON file\ncontext.to_json(\"path/to/context.json\")\n\n# Load context from a YAML file\ncontext = Context.from_yaml(\"path/to/context.yaml\")\n\n# Save context to a YAML file\ncontext.to_yaml(\"path/to/context.yaml\")\n\n# Load context from a TOML file\ncontext = Context.from_toml(\"path/to/context.toml\")\n\n# Save context to a TOML file\ncontext.to_toml(\"path/to/context.toml\")\n
"},{"location":"reference/concepts/context.html#nested-keys","title":"Nested Keys","text":"

Context supports nested keys, allowing you to create hierarchical configurations. This is useful when dealing with complex data structures:

# Create a Context with nested keys\ncontext = Context({\n    \"database\": {\n        \"bronze_table\": \"catalog.schema.bronze_table\",\n        \"silver_table\": \"catalog.schema.silver_table\",\n        \"gold_table\": \"catalog.schema.gold_table\"\n    }\n})\n\n# Access a nested key\nprint(context.database.bronze_table)  # Outputs: catalog.schema.bronze_table\n
"},{"location":"reference/concepts/context.html#recursive-merging","title":"Recursive Merging","text":"

Context also supports recursive merging, allowing you to merge two Contexts together at all levels of their hierarchy. This is particularly useful when you want to update a Context with new data without losing the existing values:

# Create two Contexts with nested keys\ncontext1 = Context({\n    \"database\": {\n        \"bronze_table\": \"catalog.schema.bronze_table\",\n        \"silver_table\": \"catalog.schema.silver_table\"\n    }\n})\n\ncontext2 = Context({\n    \"database\": {\n        \"silver_table\": \"catalog.schema.new_silver_table\",\n        \"gold_table\": \"catalog.schema.gold_table\"\n    }\n})\n\n# Merge the two Contexts\ncontext1.merge(context2)\n\n# Print the merged Context\nprint(context1.to_dict())  \n# Outputs: \n# {\n#     \"database\": {\n#         \"bronze_table\": \"catalog.schema.bronze_table\",\n#         \"silver_table\": \"catalog.schema.new_silver_table\",\n#         \"gold_table\": \"catalog.schema.gold_table\"\n#     }\n# }\n
"},{"location":"reference/concepts/context.html#jsonpickle-and-complex-python-objects","title":"Jsonpickle and Complex Python Objects","text":"

The Context class in Koheesio also uses jsonpickle for serialization and deserialization of complex Python objects to and from JSON. This allows you to convert complex Python objects, including custom classes, into a format that can be easily stored and transferred.

Here's an example of how this works:

# Import necessary modules\nfrom koheesio.context import Context\n\n# Initialize SnowflakeReader and store in a Context\nsnowflake_reader = SnowflakeReader(...)  # fill in with necessary arguments\ncontext = Context({\"snowflake_reader\": snowflake_reader})\n\n# Serialize the Context to a JSON string\njson_str = context.to_json()\n\n# Print the serialized Context\nprint(json_str)\n\n# Deserialize the JSON string back into a Context\ndeserialized_context = Context.from_json(json_str)\n\n# Access the deserialized SnowflakeReader\ndeserialized_snowflake_reader = deserialized_context.snowflake_reader\n\n# Now you can use the deserialized SnowflakeReader as you would the original\n

This feature is particularly useful when you need to save the state of your application, transfer it over a network, or store it in a database. When you're ready to use the stored data, you can easily convert it back into the original Python objects.

However, there are a few things to keep in mind:

  1. The classes you're serializing must be importable (i.e., they must be in the Python path) when you're deserializing the JSON. jsonpickle needs to be able to import the class to reconstruct the object. This holds true for most Koheesio classes, as they are designed to be importable and reconstructible.

  2. Not all Python objects can be serialized. For example, objects that hold a reference to a file or a network connection can't be serialized because their state can't be easily captured in a static file.

  3. As mentioned in the code comments, jsonpickle is not secure against malicious data. You should only deserialize data that you trust.

So, while the Context class provides a powerful tool for handling complex Python objects, it's important to be aware of these limitations.

"},{"location":"reference/concepts/context.html#conclusion","title":"Conclusion","text":"

In this document, we've covered the key features of the Context class in the Koheesio framework, including its ability to handle complex Python objects, support for nested keys and recursive merging, and its serialization and deserialization capabilities.

Whether you're setting up the environment for a Task or Step, or managing variables shared across multiple tasks, Context provides a robust and efficient solution.

"},{"location":"reference/concepts/context.html#further-reading","title":"Further Reading","text":"

For more information, you can refer to the following resources:

  • Python jsonpickle Documentation
  • Python JSON Documentation
  • Python YAML Documentation
  • Python TOML Documentation

Refer to the API documentation for more details on the Context class and its methods.

"},{"location":"reference/concepts/logger.html","title":"Python Logger Code Instructions","text":"

Here you can find instructions on how to use the Koheesio Logging Factory.

"},{"location":"reference/concepts/logger.html#logging-factory","title":"Logging Factory","text":"

The LoggingFactory class is a factory for creating and configuring loggers. To use it, follow these steps:

  1. Import the necessary modules:

    from koheesio.logger import LoggingFactory\n
  2. Initialize logging factory for koheesio modules:

    factory = LoggingFactory(name=\"replace_koheesio_parent_name\", env=\"local\", logger_id=\"your_run_id\")\n# Or use default \nfactory = LoggingFactory()\n# Or just specify log level for koheesio modules\nfactory = LoggingFactory(level=\"DEBUG\")\n
  3. Create a logger by calling the create_logger method of the LoggingFactory class; you can inherit from the koheesio logger:

    logger = LoggingFactory.get_logger(name=factory.LOGGER_NAME)\n# Or for koheesio modules\nlogger = LoggingFactory.get_logger(name=factory.LOGGER_NAME, inherit_from_koheesio=True)\n

  4. You can now use the logger object to log messages:

    logger.debug(\"Debug message\")\nlogger.info(\"Info message\")\nlogger.warning(\"Warning message\")\nlogger.error(\"Error message\")\nlogger.critical(\"Critical message\")\n
  5. (Optional) You can add additional handlers to the logger by calling the add_handlers method of the LoggingFactory class:

    handlers = [\n    (\"your_handler_module.YourHandlerClass\", {\"level\": \"INFO\"}),\n    # Add more handlers if needed\n]\nfactory.add_handlers(handlers)\n
  6. (Optional) You can create child loggers based on the parent logger by calling the get_logger method of the LoggingFactory class:

    child_logger = factory.get_logger(name=\"your_child_logger_name\")\n
  7. (Optional) Get an independent logger without inheritance

    If you need an independent logger without inheriting from the LoggingFactory logger, you can use the get_logger method:

    your_logger = factory.get_logger(name=\"your_logger_name\", inherit=False)\n

By setting inherit to False, you obtain a logger that is not tied to the LoggingFactory logger hierarchy; only the message format will be the same, and you can change that as well. This allows you to have an independent logger with its own configuration. You can use the your_logger object to log messages:

your_logger.debug(\"Debug message\")\nyour_logger.info(\"Info message\")\nyour_logger.warning(\"Warning message\")\nyour_logger.error(\"Error message\")\nyour_logger.critical(\"Critical message\")\n
  8. (Optional) You can use Masked types to mask secrets/tokens/passwords in output. The Masked types are special types provided by the koheesio library to handle sensitive data that should not be logged or printed in plain text. They are used to wrap sensitive data and override their string representation to prevent accidental exposure of the data. Here are some examples of how to use Masked types:

    import logging\nfrom koheesio.logger import MaskedString, MaskedInt, MaskedFloat, MaskedDict\n\n# Set up logging\nlogger = logging.getLogger(__name__)\nlogging.basicConfig(level=logging.INFO)\n\n# Using MaskedString\nmasked_string = MaskedString(\"my secret string\")\nlogger.info(masked_string)  # This will not log the actual string\n\n# Using MaskedInt\nmasked_int = MaskedInt(12345)\nlogger.info(masked_int)  # This will not log the actual integer\n\n# Using MaskedFloat\nmasked_float = MaskedFloat(3.14159)\nlogger.info(masked_float)  # This will not log the actual float\n\n# Using MaskedDict\nmasked_dict = MaskedDict({\"key\": \"value\"})\nlogger.info(masked_dict)  # This will not log the actual dictionary\n

Please make sure to replace \"your_logger_name\", \"your_run_id\", \"your_handler_module.YourHandlerClass\", \"your_child_logger_name\", and other placeholders with your own values according to your application's requirements.

By following these steps, you can obtain an independent logger without inheriting from the LoggingFactory logger. This allows you to customize the logger configuration and use it separately in your code.

Note: Ensure that you have imported the necessary modules, instantiated the LoggingFactory class, and customized the logger name and other parameters according to your application's requirements.

"},{"location":"reference/concepts/logger.html#example","title":"Example","text":"
import logging\n\n# Step 2: Instantiate the LoggingFactory class\nfactory = LoggingFactory(env=\"local\")\n\n# Step 3: Create an independent logger with a custom log level\nyour_logger = factory.get_logger(\"your_logger\", inherit_from_koheesio=False)\nyour_logger.setLevel(logging.DEBUG)\n\n# Step 4: Create a logger using the create_logger method from LoggingFactory with a different log level\nfactory_logger = LoggingFactory(level=\"WARNING\").get_logger(name=factory.LOGGER_NAME)\n\n# Step 5: Create a child logger with a debug level\nchild_logger = factory.get_logger(name=\"child\")\nchild_logger.setLevel(logging.DEBUG)\n\nchild2_logger = factory.get_logger(name=\"child2\")\nchild2_logger.setLevel(logging.INFO)\n\n# Step 6: Log messages at different levels for both loggers\nyour_logger.debug(\"Debug message\")  # This message will be displayed\nyour_logger.info(\"Info message\")  # This message will be displayed\nyour_logger.warning(\"Warning message\")  # This message will be displayed\nyour_logger.error(\"Error message\")  # This message will be displayed\nyour_logger.critical(\"Critical message\")  # This message will be displayed\n\nfactory_logger.debug(\"Debug message\")  # This message will not be displayed\nfactory_logger.info(\"Info message\")  # This message will not be displayed\nfactory_logger.warning(\"Warning message\")  # This message will be displayed\nfactory_logger.error(\"Error message\")  # This message will be displayed\nfactory_logger.critical(\"Critical message\")  # This message will be displayed\n\nchild_logger.debug(\"Debug message\")  # This message will be displayed\nchild_logger.info(\"Info message\")  # This message will be displayed\nchild_logger.warning(\"Warning message\")  # This message will be displayed\nchild_logger.error(\"Error message\")  # This message will be displayed\nchild_logger.critical(\"Critical message\")  # This message will be displayed\n\nchild2_logger.debug(\"Debug message\")  # This message will be displayed\nchild2_logger.info(\"Info message\")  # This message will be displayed\nchild2_logger.warning(\"Warning message\")  # This message will be displayed\nchild2_logger.error(\"Error message\")  # This message will be displayed\nchild2_logger.critical(\"Critical message\")  # This message will be displayed\n

Output:

[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [DEBUG] [your_logger] {__init__.py:<module>:118} - Debug message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [INFO] [your_logger] {__init__.py:<module>:119} - Info message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [WARNING] [your_logger] {__init__.py:<module>:120} - Warning message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [ERROR] [your_logger] {__init__.py:<module>:121} - Error message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [CRITICAL] [your_logger] {__init__.py:<module>:122} - Critical message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [WARNING] [koheesio] {__init__.py:<module>:126} - Warning message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [ERROR] [koheesio] {__init__.py:<module>:127} - Error message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [CRITICAL] [koheesio] {__init__.py:<module>:128} - Critical message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [DEBUG] [koheesio.child] {__init__.py:<module>:130} - Debug message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [INFO] [koheesio.child] {__init__.py:<module>:131} - Info message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [WARNING] [koheesio.child] {__init__.py:<module>:132} - Warning message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [ERROR] [koheesio.child] {__init__.py:<module>:133} - Error message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [CRITICAL] [koheesio.child] {__init__.py:<module>:134} - Critical message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [INFO] [koheesio.child2] {__init__.py:<module>:137} - Info message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [WARNING] [koheesio.child2] {__init__.py:<module>:138} - Warning message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [ERROR] [koheesio.child2] {__init__.py:<module>:139} - Error message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [CRITICAL] [koheesio.child2] {__init__.py:<module>:140} - Critical message\n
"},{"location":"reference/concepts/logger.html#loggeridfilter-class","title":"LoggerIDFilter Class","text":"

The LoggerIDFilter class is a filter that injects run_id information into the log. To use it, follow these steps:

  1. Import the necessary modules:

    import logging\nfrom koheesio.logger import LoggerIDFilter  # the filter class used in the next steps\n
  2. Create an instance of the LoggerIDFilter class:

    logger_filter = LoggerIDFilter()\n
  3. Set the LOGGER_ID attribute of the LoggerIDFilter class to the desired run ID:

    LoggerIDFilter.LOGGER_ID = \"your_run_id\"\n
  4. Add the logger_filter to your logger or handler:

    logger = logging.getLogger(\"your_logger_name\")\nlogger.addFilter(logger_filter)\n
"},{"location":"reference/concepts/logger.html#loggingfactory-set-up-optional","title":"LoggingFactory Set Up (Optional)","text":"
  1. Import the LoggingFactory class in your application code.

  2. Set the value for the LOGGER_FILTER variable:

    • If you want to assign a specific logging.Filter instance, replace None with your desired filter instance.
    • If you want to keep the default value of None, leave it unchanged.

  3. Set the value for the LOGGER_LEVEL variable:

    • If you want to use the value from the \"KOHEESIO_LOGGING_LEVEL\" environment variable, leave the code as is.
    • If you want to use a different environment variable or a specific default value, modify the code accordingly.

  4. Set the value for the LOGGER_ENV variable:

    • Replace \"local\" with your desired environment name.

  5. Set the value for the LOGGER_FORMAT variable:

    • If you want to customize the log message format, modify the value within the double quotes.
    • The format should follow the desired log message format pattern.

  6. Set the value for the LOGGER_FORMATTER variable:

    • If you want to assign a specific Formatter instance, replace Formatter(LOGGER_FORMAT) with your desired formatter instance.
    • If you want to keep the default formatter with the defined log message format, leave it unchanged.

  7. Set the value for the CONSOLE_HANDLER variable:

    • If you want to assign a specific logging.Handler instance, replace None with your desired handler instance.
    • If you want to keep the default value of None, leave it unchanged.

  8. Set the value for the ENV variable:

    • Replace None with your desired environment value if applicable.
    • If you don't need to set this variable, leave it as None.

  9. Save the changes to the file. A sketch of these settings follows below.
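A minimal sketch of what such a configuration could look like, assuming the variables above are class-level attributes on LoggingFactory (the values shown are placeholders, not defaults):

from logging import Formatter\n\nfrom koheesio.logger import LoggingFactory\n\n# Placeholder values -- adjust them to your application's requirements\nLoggingFactory.LOGGER_ENV = \"dev\"\nLoggingFactory.LOGGER_FORMAT = \"[%(asctime)s] [%(levelname)s] [%(name)s] - %(message)s\"\nLoggingFactory.LOGGER_FORMATTER = Formatter(LoggingFactory.LOGGER_FORMAT)\n\nfactory = LoggingFactory(env=\"dev\")\nlogger = factory.get_logger(name=\"my_app\")\n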

"},{"location":"reference/concepts/step.html","title":"Steps in Koheesio","text":"

In the Koheesio framework, the Step class and its derivatives play a crucial role. They serve as the building blocks for creating data pipelines, allowing you to define custom units of logic that can be executed. This document will guide you through its key features and show you how to leverage its capabilities in your Koheesio applications.

Several types of Steps are available in Koheesio, including Reader, Transformation, Writer, and Task.

"},{"location":"reference/concepts/step.html#what-is-a-step","title":"What is a Step?","text":"

A Step is an atomic operation serving as the building block of data pipelines built with the Koheesio framework. Tasks typically consist of a series of Steps.

A step can be seen as an operation on a set of inputs that returns a set of outputs. This does not imply that steps are stateless (e.g. data writes)! This concept is visualized in the figure below.

\nflowchart LR\n\n%% Should render like this\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 1 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502                  \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 1 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2502                  \u2502        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n%%                    \u2502                  \u2502\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u2502                  \u2502        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 2 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502       Step       \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 2 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2502                  \u2502        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n%%                    \u2502                  \u2502\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u2502                  \u2502        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 3 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502                  \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 3 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n\n%% &nbsp; is for increasing the box size without having to mess with CSS settings\nStep[\"\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;\n&nbsp;\n&nbsp;\nStep\n&nbsp;\n&nbsp;\n&nbsp;\n\"]\n\nI1[\"Input 1\"] ---> Step\nI2[\"Input 2\"] ---> Step\nI3[\"Input 3\"] ---> Step\n\nStep ---> O1[\"Output 1\"]\nStep ---> O2[\"Output 2\"]\nStep ---> O3[\"Output 3\"]\n
"},{"location":"reference/concepts/step.html#how-to-read-a-step","title":"How to Read a Step?","text":"

A Step in Koheesio is a class that represents a unit of work in a data pipeline. It's similar to a Python built-in data class, but with additional features for execution, validation, and logging.

When you look at a Step, you'll typically see the following components:

  1. Class Definition: The Step is defined as a class that inherits from the base Step class in Koheesio. For example, class MyStep(Step):.

  2. Input Fields: These are defined as class attributes with type annotations, similar to attributes in a Python data class. These fields represent the inputs to the Step. For example, a: str defines an input field a of type str. Additionally, you will often see these fields defined using Pydantic's Field class, which allows for more detailed validation and documentation as well as default values and aliasing.

  3. Output Fields: These are defined in a nested class called Output that inherits from StepOutput. This class represents the output of the Step. For example, class Output(StepOutput): b: str defines an output field b of type str.

  4. Execute Method: This is a method that you need to implement when you create a new Step. It contains the logic of the Step and is where you use the input fields and populate the output fields. For example, def execute(self): self.output.b = f\"{self.a}-some-suffix\".

Here's an example of a Step:

class MyStep(Step):\n    a: str  # input\n\n    class Output(StepOutput):  # output\n        b: str\n\n    def execute(self) -> MyStep.Output:\n        self.output.b = f\"{self.a}-some-suffix\"\n

In this Step, a is an input field of type str, b is an output field of type str, and the execute method appends -some-suffix to the input a and assigns it to the output b.

When you see a Step, you can think of it as a function where the class attributes are the inputs, the Output class defines the outputs, and the execute method is the function body. The main difference is that a Step also includes automatic validation of inputs and outputs (thanks to Pydantic), logging, and error handling.

"},{"location":"reference/concepts/step.html#understanding-inheritance-in-steps","title":"Understanding Inheritance in Steps","text":"

Inheritance is a core concept in object-oriented programming where a class (child or subclass) inherits properties and methods from another class (parent or superclass). In the context of Koheesio, when you create a new Step, you're creating a subclass that inherits from the base Step class.

When a new Step is defined (like class MyStep(Step):), it inherits all the properties and methods from the Step class. This includes the execute method, which is then overridden to provide the specific functionality for that Step.

Here's a simple breakdown:

  1. Parent Class (Superclass): This is the Step class in Koheesio. It provides the basic structure and functionalities of a Step, including input and output validation, logging, and error handling.

  2. Child Class (Subclass): This is the new Step you define, like MyStep. It inherits all the properties and methods from the Step class and can add or override them as needed.

  3. Inheritance: This is the process where MyStep inherits the properties and methods from the Step class. In Python, this is done by mentioning the parent class in parentheses when defining the child class, like class MyStep(Step):.

  4. Overriding: This is when you provide a new implementation of a method in the child class that is already defined in the parent class. In the case of Steps, you override the execute method to define the specific logic of your Step.

Understanding inheritance is key to understanding how Steps work in Koheesio. It allows you to leverage the functionalities provided by the Step class and focus on implementing the specific logic of your Step.

"},{"location":"reference/concepts/step.html#benefits-of-using-steps-in-data-pipelines","title":"Benefits of Using Steps in Data Pipelines","text":"

The concept of a Step is beneficial when creating Data Pipelines or Data Products for several reasons:

  1. Modularity: Each Step represents a self-contained unit of work, which makes the pipeline modular. This makes it easier to understand, test, and maintain the pipeline. If a problem arises, you can pinpoint which step is causing the issue.

  2. Reusability: Steps can be reused across different pipelines. Once a Step is defined, it can be used in any number of pipelines. This promotes code reuse and consistency across projects.

  3. Readability: Steps make the pipeline code more readable. Each Step has a clear input, output, and execution logic, which makes it easier to understand what each part of the pipeline is doing.

  4. Validation: Steps automatically validate their inputs and outputs. This ensures that the data flowing into and out of each step is of the expected type and format, which can help catch errors early.

  5. Logging: Steps automatically log the start and end of their execution, along with the input and output data. This can be very useful for debugging and understanding the flow of data through the pipeline.

  6. Error Handling: Steps provide built-in error handling. If an error occurs during the execution of a step, it is caught, logged, and then re-raised. This provides a clear indication of where the error occurred.

  7. Scalability: Steps can be easily parallelized or distributed, which is crucial for processing large datasets. This is especially true for steps that are designed to work with distributed computing frameworks like Apache Spark.

By using the concept of a Step, you can create data pipelines that are modular, reusable, readable, and robust, while also being easier to debug and scale.

"},{"location":"reference/concepts/step.html#compared-to-a-regular-pydantic-basemodel","title":"Compared to a regular Pydantic Basemodel","text":"

A Step in Koheesio, while built on top of Pydantic's BaseModel, provides additional features specifically designed for creating data pipelines. Here are some key differences:

  1. Execution Method: A Step includes an execute method that needs to be implemented. This method contains the logic of the step and is automatically decorated with functionalities such as logging and output validation.

  2. Input and Output Validation: A Step uses Pydantic models to define and validate its inputs and outputs. This ensures that the data flowing into and out of the step is of the expected type and format.

  3. Automatic Logging: A Step automatically logs the start and end of its execution, along with the input and output data. This is done through the do_execute decorator applied to the execute method.

  4. Error Handling: A Step provides built-in error handling. If an error occurs during the execution of the step, it is caught, logged, and then re-raised. This should help in debugging and understanding the flow of data.

  5. Serialization: A Step can be serialized to a YAML string using the to_yaml method. This can be useful for saving and loading steps.

  6. Lazy Mode Support: The StepOutput class in a Step supports lazy mode, which allows validation of the items stored in the class to be called at will instead of being forced to run it upfront.

In contrast, a regular Pydantic BaseModel is a simple data validation model that doesn't include these additional features. It's used for data parsing and validation, but doesn't include methods for execution, automatic logging, error handling, or serialization to YAML.

"},{"location":"reference/concepts/step.html#key-features-of-a-step","title":"Key Features of a Step","text":""},{"location":"reference/concepts/step.html#defining-a-step","title":"Defining a Step","text":"

To define a new step, you subclass the Step class and implement the execute method. The inputs of the step can be accessed using self.input_name. The output of the step can be accessed using self.output.output_name. For example:

class MyStep(Step):\n    input1: str = Field(...)\n    input2: int = Field(...)\n\n    class Output(StepOutput):\n        output1: str = Field(...)\n\n    def execute(self):\n        # Your logic here\n        self.output.output1 = \"result\"\n
"},{"location":"reference/concepts/step.html#running-a-step","title":"Running a Step","text":"

To run a step, you can call the execute method. You can also use the run method, which is an alias to execute. For example:

step = MyStep(input1=\"value1\", input2=2)\nstep.execute()\n
"},{"location":"reference/concepts/step.html#accessing-step-output","title":"Accessing Step Output","text":"

The output of a step can be accessed using self.output.output_name. For example:

step = MyStep(input1=\"value1\", input2=2)\nstep.execute()\nprint(step.output.output1)  # Outputs: \"result\"\n
"},{"location":"reference/concepts/step.html#serializing-a-step","title":"Serializing a Step","text":"

You can serialize a step to a YAML string using the to_yaml method. For example:

step = MyStep(input1=\"value1\", input2=2)\nyaml_str = step.to_yaml()\n
"},{"location":"reference/concepts/step.html#getting-step-description","title":"Getting Step Description","text":"

You can get the description of a step using the get_description method. For example:

step = MyStep(input1=\"value1\", input2=2)\ndescription = step.get_description()\n
"},{"location":"reference/concepts/step.html#defining-a-step-with-multiple-inputs-and-outputs","title":"Defining a Step with Multiple Inputs and Outputs","text":"

Here's an example of how to define a new step with multiple inputs and outputs:

class MyStep(Step):\n    input1: str = Field(...)\n    input2: int = Field(...)\n    input3: int = Field(...)\n\n    class Output(StepOutput):\n        output1: str = Field(...)\n        output2: int = Field(...)\n\n    def execute(self):\n        # Your logic here\n        self.output.output1 = \"result\"\n        self.output.output2 = self.input2 + self.input3\n
"},{"location":"reference/concepts/step.html#running-a-step-with-multiple-inputs","title":"Running a Step with Multiple Inputs","text":"

To run a step with multiple inputs, you can do the following:

step = MyStep(input1=\"value1\", input2=2, input3=3)\nstep.execute()\n
"},{"location":"reference/concepts/step.html#accessing-multiple-step-outputs","title":"Accessing Multiple Step Outputs","text":"

The outputs of a step can be accessed using self.output.output_name. For example:

step = MyStep(input1=\"value1\", input2=2, input3=3)\nstep.execute()\nprint(step.output.output1)  # Outputs: \"result\"\nprint(step.output.output2)  # Outputs: 5\n
"},{"location":"reference/concepts/step.html#special-features","title":"Special Features","text":""},{"location":"reference/concepts/step.html#the-execute-method","title":"The Execute method","text":"

The execute method in the Step class is automatically decorated with the StepMetaClass._execute_wrapper function due to the metaclass StepMetaClass. This provides several advantages:

  1. Automatic Output Validation: The decorator ensures that the output of the execute method is always a StepOutput instance. This means that the output is automatically validated against the defined output model, ensuring data integrity and consistency.

  2. Logging: The decorator provides automatic logging at the start and end of the execute method. This includes logging the input and output of the step, which can be useful for debugging and understanding the flow of data.

  3. Error Handling: If an error occurs during the execution of the Step, the decorator catches the exception and logs an error message before re-raising the exception. This provides a clear indication of where the error occurred.

  4. Simplifies Step Implementation: Since the decorator handles output validation, logging, and error handling, the user can focus on implementing the logic of the execute method without worrying about these aspects.

  5. Consistency: By automatically decorating the execute method, the library ensures that these features are consistently applied across all steps, regardless of who implements them or how they are used. This makes the behavior of steps predictable and consistent.

  6. Prevents Double Wrapping: The decorator checks if the function is already wrapped with StepMetaClass._execute_wrapper and prevents double wrapping. This ensures that the decorator doesn't interfere with itself if execute is overridden in subclasses.

Notice that you never have to explicitly return anything from the execute method. The StepMetaClass._execute_wrapper decorator takes care of that for you.

Implementation examples for a custom metaclass that can be used to override the default behavior of StepMetaClass._execute_wrapper:

    class MyMetaClass(StepMetaClass):\n        @classmethod\n        def _log_end_message(cls, step: Step, skip_logging: bool = False, *args, **kwargs):\n            print(\"It's me from custom meta class\")\n            super()._log_end_message(step, skip_logging, *args, **kwargs)\n\n    class MyMetaClass2(StepMetaClass):\n        @classmethod\n        def _validate_output(cls, step: Step, skip_validating: bool = False, *args, **kwargs):\n            # i want always have a dummy value in the output\n            step.output.dummy_value = \"dummy\"\n\n    class YourClassWithCustomMeta(Step, metaclass=MyMetaClass):\n        def execute(self):\n            self.log.info(f\"This is from the execute method of {self.__class__.__name__}\")\n\n    class YourClassWithCustomMeta2(Step, metaclass=MyMetaClass2):\n        def execute(self):\n            self.log.info(f\"This is from the execute method of {self.__class__.__name__}\")\n
"},{"location":"reference/concepts/step.html#sparkstep","title":"SparkStep","text":"

The SparkStep class is a subclass of Step designed for steps that interact with Spark. It extends the Step class with SparkSession support: the spark property gives access to the active SparkSession instance. The output of a SparkStep is typically a Spark DataFrame, although providing one is optional.

"},{"location":"reference/concepts/step.html#using-a-sparkstep","title":"Using a SparkStep","text":"

Here's an example of how to use a SparkStep:

class MySparkStep(SparkStep):\n    input1: str = Field(...)\n\n    class Output(StepOutput):\n        output1: DataFrame = Field(...)\n\n    def execute(self):\n        # Your logic here\n        df = self.spark.read.text(self.input1)\n        self.output.output1 = df\n

To run a SparkStep, you can do the following:

step = MySparkStep(input1=\"path/to/textfile\")\nstep.execute()\n

To access the output of a SparkStep, you can do the following:

step = MySparkStep(input1=\"path/to/textfile\")\nstep.execute()\ndf = step.output.output1\ndf.show()\n
"},{"location":"reference/concepts/step.html#conclusion","title":"Conclusion","text":"

In this document, we've covered the key features of the Step class in the Koheesio framework, including its ability to define custom units of logic, manage inputs and outputs, and support for serialization. The automatic decoration of the execute method provides several advantages that simplify step implementation and ensure consistency across all steps.

Whether you're defining a new operation in your data pipeline or managing the flow of data between steps, Step provides a robust and efficient solution.

"},{"location":"reference/concepts/step.html#further-reading","title":"Further Reading","text":"

For more information, you can refer to the following resources:

  • Python Pydantic Documentation
  • Python YAML Documentation

Refer to the API documentation for more details on the Step class and its methods.

"},{"location":"reference/spark/readers.html","title":"Reader Module","text":"

The Reader module in Koheesio provides a set of classes for reading data from various sources. A Reader is a type of SparkStep that reads data from a source based on the input parameters and stores the result in self.output.df for subsequent steps.

"},{"location":"reference/spark/readers.html#what-is-a-reader","title":"What is a Reader?","text":"

A Reader is a subclass of SparkStep that reads data from a source and stores the result. The source could be a file, a database, a web API, or any other data source. The result is stored in a DataFrame, which is accessible through the df property of the Reader.

"},{"location":"reference/spark/readers.html#api-reference","title":"API Reference","text":"

See API Reference for a detailed description of the Reader class and its methods.

"},{"location":"reference/spark/readers.html#key-features-of-a-reader","title":"Key Features of a Reader","text":"
  1. Read Method: The Reader class provides a read method that calls the execute method and returns the result. Essentially, calling .read() is shorthand for running execute() and then returning the resulting output.df. This allows you to read data from a Reader without having to call the execute method directly; it is a convenience method that simplifies the usage of a Reader.

Here's an example of how to use the .read() method:

# Instantiate the Reader\nmy_reader = MyReader()\n\n# Use the .read() method to get the data as a DataFrame\ndf = my_reader.read()\n\n# Now df is a DataFrame with the data read by MyReader\n

In this example, MyReader is a subclass of Reader that you've defined. After creating an instance of MyReader, you call the .read() method to read the data and get it back as a DataFrame. The DataFrame df now contains the data read by MyReader.

  2. DataFrame Property: The Reader class provides a df property as a shorthand for accessing self.output.df. If self.output.df is None, the execute method is run first. This property ensures that the data is loaded and ready to be used, even if the execute method hasn't been explicitly called.

Here's an example of how to use the df property:

# Instantiate the Reader\nmy_reader = MyReader()\n\n# Use the df property to get the data as a DataFrame\ndf = my_reader.df\n\n# Now df is a DataFrame with the data read by MyReader\n

In this example, MyReader is a subclass of Reader that you've defined. After creating an instance of MyReader, you access the df property to get the data as a DataFrame. The DataFrame df now contains the data read by MyReader.

  3. SparkSession: Every Reader has a SparkSession available as self.spark. This is the currently active SparkSession, which can be used to perform distributed data processing tasks.

Here's an example of how to use the spark property:

# Instantiate the Reader\nmy_reader = MyReader()\n\n# Use the spark property to get the SparkSession\nspark = my_reader.spark\n\n# Now spark is the SparkSession associated with MyReader\n

In this example, MyReader is a subclass of Reader that you've defined. After creating an instance of MyReader, you access the spark property to get the SparkSession. The SparkSession spark can now be used to perform distributed data processing tasks.

"},{"location":"reference/spark/readers.html#how-to-define-a-reader","title":"How to Define a Reader?","text":"

To define a Reader, you create a subclass of the Reader class and implement the execute method. The execute method should read from the source and store the result in self.output.df. This is an abstract method, which means it must be implemented in any subclass of Reader.

Here's an example of a Reader:

class MyReader(Reader):\n    def execute(self):\n        # read data from source\n        data = read_from_source()\n        # store result in self.output.df\n        self.output.df = data\n
"},{"location":"reference/spark/readers.html#understanding-inheritance-in-readers","title":"Understanding Inheritance in Readers","text":"

Just like a Step, a Reader is defined as a subclass that inherits from the base Reader class. This means it inherits all the properties and methods from the Reader class and can add or override them as needed. The main method that needs to be overridden is the execute method, which should implement the logic for reading data from the source and storing it in self.output.df.

"},{"location":"reference/spark/readers.html#benefits-of-using-readers-in-data-pipelines","title":"Benefits of Using Readers in Data Pipelines","text":"

Using Reader classes in your data pipelines has several benefits:

  1. Simplicity: Readers abstract away the details of reading data from various sources, allowing you to focus on the logic of your pipeline.

  2. Consistency: By using Readers, you ensure that data is read in a consistent manner across different parts of your pipeline.

  3. Flexibility: Readers can be easily swapped out for different data sources without changing the rest of your pipeline.

  4. Efficiency: Readers automatically manage resources like connections and file handles, ensuring efficient use of resources.

By using the concept of a Reader, you can create data pipelines that are simple, consistent, flexible, and efficient.

"},{"location":"reference/spark/readers.html#examples-of-reader-classes-in-koheesio","title":"Examples of Reader Classes in Koheesio","text":"

Koheesio provides a variety of Reader subclasses for reading data from different sources. Here are just a few examples:

  1. Teradata Reader: A Reader subclass for reading data from Teradata databases. It's defined in the koheesio/steps/readers/teradata.py file.

  2. Snowflake Reader: A Reader subclass for reading data from Snowflake databases. It's defined in the koheesio/steps/readers/snowflake.py file.

  3. Box Reader: A Reader subclass for reading data from Box. It's defined in the koheesio/steps/integrations/box.py file.

These are just a few examples of the many Reader subclasses available in Koheesio. Each Reader subclass is designed to read data from a specific source. They all inherit from the base Reader class and implement the execute method to read data from their respective sources and store it in self.output.df.

Please note that this is not an exhaustive list. Koheesio provides many more Reader subclasses for a wide range of data sources. For a complete list, please refer to the Koheesio documentation or the source code.

More readers can be found in the koheesio/steps/readers module.

"},{"location":"reference/spark/transformations.html","title":"Transformation Module","text":"

The Transformation module in Koheesio provides a set of classes for transforming data within a DataFrame. A Transformation is a type of SparkStep that takes a DataFrame as input, applies a transformation, and returns a DataFrame as output. The transformation logic is implemented in the execute method of each Transformation subclass.

"},{"location":"reference/spark/transformations.html#what-is-a-transformation","title":"What is a Transformation?","text":"

A Transformation is a subclass of SparkStep that applies a transformation to a DataFrame and stores the result. The transformation could be any operation that modifies the data or structure of the DataFrame, such as adding a new column, filtering rows, or aggregating data.

Using Transformation classes ensures that data is transformed in a consistent manner across different parts of your pipeline. This can help avoid errors and make your code easier to understand and maintain.

"},{"location":"reference/spark/transformations.html#api-reference","title":"API Reference","text":"

See API Reference for a detailed description of the Transformation classes and their methods.

"},{"location":"reference/spark/transformations.html#types-of-transformations","title":"Types of Transformations","text":"

There are three main types of transformations in Koheesio:

  1. Transformation: This is the base class for all transformations. It takes a DataFrame as input and returns a DataFrame as output. The transformation logic is implemented in the execute method.

  2. ColumnsTransformation: This is an extended Transformation class with a preset validator for handling column(s) data. It standardizes the input for a single column or multiple columns. If more than one column is passed, the transformation will be run in a loop against all the given columns.

  3. ColumnsTransformationWithTarget: This is an extended ColumnsTransformation class with an additional target_column field. This field can be used to store the result of the transformation in a new column. If the target_column is not provided, the result will be stored in the source column.

Each type of transformation has its own use cases and advantages. The right one to use depends on the specific requirements of your data pipeline.

"},{"location":"reference/spark/transformations.html#how-to-define-a-transformation","title":"How to Define a Transformation","text":"

To define a Transformation, you create a subclass of the Transformation class and implement the execute method. The execute method should take a DataFrame from self.input.df, apply a transformation, and store the result in self.output.df.

Transformation classes abstract away some of the details of transforming data, allowing you to focus on the logic of your pipeline. This can make your code cleaner and easier to read.

Here's an example of a Transformation:

class MyTransformation(Transformation):\n    def execute(self):\n        # get data from self.input.df\n        data = self.input.df\n        # apply transformation\n        transformed_data = apply_transformation(data)\n        # store result in self.output.df\n        self.output.df = transformed_data\n

In this example, MyTransformation is a subclass of Transformation that you've defined. The execute method gets the data from self.input.df, applies a transformation called apply_transformation (undefined in this example), and stores the result in self.output.df.

"},{"location":"reference/spark/transformations.html#how-to-define-a-columnstransformation","title":"How to Define a ColumnsTransformation","text":"

To define a ColumnsTransformation, you create a subclass of the ColumnsTransformation class and implement the execute method. The execute method should apply a transformation to the specified columns of the DataFrame.

ColumnsTransformation classes can be easily swapped out for different data transformations without changing the rest of your pipeline. This can make your pipeline more flexible and easier to modify or extend.

Here's an example of a ColumnsTransformation:

from pyspark.sql import functions as f\nfrom koheesio.steps.transformations import ColumnsTransformation\n\nclass AddOne(ColumnsTransformation):\n    def execute(self):\n        # accumulate the changes so every selected column gets transformed\n        df = self.df\n        for column in self.get_columns():\n            df = df.withColumn(column, f.col(column) + 1)\n        self.output.df = df\n

In this example, AddOne is a subclass of ColumnsTransformation that you've defined. The execute method adds 1 to each column in self.get_columns().

The ColumnsTransformation class has a ColumnConfig class that can be used to configure the behavior of the class. This class has the following fields:

  • run_for_all_data_type: Allows running the transformation for all columns of a given data type.
  • limit_data_type: Allows limiting the transformation to a specific data type.
  • data_type_strict_mode: Toggles strict mode for data type validation. Only takes effect if limit_data_type is set.

Note that data types need to be specified as a SparkDatatype enum. Users should not have to interact with the ColumnConfig class directly.
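
As an illustration only (the nested ColumnConfig override and the SparkDatatype import path shown here are assumptions, not an excerpt from the Koheesio source), limiting a transformation to integer columns might look roughly like this:

from pyspark.sql import functions as f\nfrom koheesio.steps.transformations import ColumnsTransformation\nfrom koheesio.utils import SparkDatatype  # import path is an assumption\n\nclass AddOneToIntegers(ColumnsTransformation):\n    class ColumnConfig(ColumnsTransformation.ColumnConfig):\n        # only run against integer columns, and fail loudly on anything else\n        limit_data_type = [SparkDatatype.INTEGER]\n        data_type_strict_mode = True\n\n    def execute(self):\n        df = self.df\n        for column in self.get_columns():\n            df = df.withColumn(column, f.col(column) + 1)\n        self.output.df = df\n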

"},{"location":"reference/spark/transformations.html#how-to-define-a-columnstransformationwithtarget","title":"How to Define a ColumnsTransformationWithTarget","text":"

To define a ColumnsTransformationWithTarget, you create a subclass of the ColumnsTransformationWithTarget class and implement the func method. The func method should return the transformation that will be applied to the column(s). The execute method, which is already preset, will use the get_columns_with_target method to loop over all the columns and apply this function to transform the DataFrame.

Here's an example of a ColumnsTransformationWithTarget:

from pyspark.sql import Column\nfrom koheesio.steps.transformations import ColumnsTransformationWithTarget\n\nclass AddOneWithTarget(ColumnsTransformationWithTarget):\n    def func(self, col: Column):\n        return col + 1\n

In this example, AddOneWithTarget is a subclass of ColumnsTransformationWithTarget that you've defined. The func method adds 1 to the values of a given column.

The ColumnsTransformationWithTarget class has an additional target_column field. This field can be used to store the result of the transformation in a new column. If the target_column is not provided, the result will be stored in the source column. If more than one column is passed, the target_column will be used as a suffix. Leaving this blank will result in the original columns being renamed.

The ColumnsTransformationWithTarget class also has a get_columns_with_target method. This method returns an iterator of the columns and handles the target_column as well.
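
A possible way to use the AddOneWithTarget class defined above (a sketch; the df, column and target_column parameter names are taken from the fields described on this page, and the sample data is made up):

from pyspark.sql import SparkSession\n\nspark = SparkSession.builder.getOrCreate()\ninput_df = spark.createDataFrame([(1,), (2,)], [\"age\"])\n\nadd_one = AddOneWithTarget(\n    df=input_df,\n    column=\"age\",                  # column(s) to transform\n    target_column=\"age_plus_one\",  # store the result in a new column\n)\noutput_df = add_one.execute().df\noutput_df.show()\n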

"},{"location":"reference/spark/transformations.html#key-features-of-a-transformation","title":"Key Features of a Transformation","text":"
  1. Execute Method: The Transformation class provides an execute method to implement in your subclass. This method should take a DataFrame from self.input.df, apply a transformation, and store the result in self.output.df.

    For ColumnsTransformation and ColumnsTransformationWithTarget, the execute method is already implemented in the base class. Instead of overriding execute, you implement a func method in your subclass. This func method should return the transformation to be applied to each column. The execute method will then apply this func to each column in a loop.

  2. DataFrame Property: The Transformation class provides a df property as a shorthand for accessing self.input.df. This property ensures that the data is ready to be transformed, even if the execute method hasn't been explicitly called. This is useful for 'early validation' of the input data.

  3. SparkSession: Every Transformation has a SparkSession available as self.spark. This is the currently active SparkSession, which can be used to perform distributed data processing tasks.

  4. Columns Property: The ColumnsTransformation and ColumnsTransformationWithTarget classes provide a columns property. This property standardizes the input for a single column or multiple columns. If more than one column is passed, the transformation will be run in a loop against all the given columns.

  5. Target Column Property: The ColumnsTransformationWithTarget class provides a target_column property. This field can be used to store the result of the transformation in a new column. If the target_column is not provided, the result will be stored in the source column.

"},{"location":"reference/spark/transformations.html#examples-of-transformation-classes-in-koheesio","title":"Examples of Transformation Classes in Koheesio","text":"

Koheesio provides a variety of Transformation subclasses for transforming data in different ways. Here are some examples:

  • DataframeLookup: This transformation joins two dataframes together based on a list of join mappings. It allows you to specify the join type and join hint, and it supports selecting specific target columns from the right dataframe.

    Here's an example of how to use the DataframeLookup transformation:

    from pyspark.sql import SparkSession\nfrom koheesio.steps.transformations import DataframeLookup, JoinMapping, TargetColumn, JoinType\n\nspark = SparkSession.builder.getOrCreate()\nleft_df = spark.createDataFrame([(1, \"A\"), (2, \"B\")], [\"id\", \"value\"])\nright_df = spark.createDataFrame([(1, \"A\"), (3, \"C\")], [\"id\", \"value\"])\n\nlookup = DataframeLookup(\n    df=left_df,\n    other=right_df,\n    on=JoinMapping(source_column=\"id\", other_column=\"id\"),\n    targets=TargetColumn(target_column=\"value\", target_column_alias=\"right_value\"),\n    how=JoinType.LEFT,\n)\n\noutput_df = lookup.execute().df\n
  • HashUUID5: This transformation is a subclass of Transformation and provides an interface to generate a UUID5 hash for each row in the DataFrame. The hash is generated based on the values of the specified source columns.

    Here's an example of how to use the HashUUID5 transformation:

    from pyspark.sql import SparkSession\nfrom koheesio.steps.transformations import HashUUID5\n\nspark = SparkSession.builder.getOrCreate()\ndf = spark.createDataFrame([(1, \"A\"), (2, \"B\")], [\"id\", \"value\"])\n\nhash_transform = HashUUID5(\n    df=df,\n    source_columns=[\"id\", \"value\"],\n    target_column=\"hash\"\n)\n\noutput_df = hash_transform.execute().df\n

In this example, HashUUID5 is a subclass of Transformation. After creating an instance of HashUUID5, you call the execute method to apply the transformation. The execute method generates a UUID5 hash for each row in the DataFrame based on the values of the id and value columns and stores the result in a new column named hash.

"},{"location":"reference/spark/transformations.html#benefits-of-using-koheesio-transformations","title":"Benefits of using Koheesio Transformations","text":"

Using a Koheesio Transformation over plain Spark provides several benefits:

  1. Consistency: By using Transformation classes, you ensure that data is transformed in a consistent manner across different parts of your pipeline. This can help avoid errors and make your code easier to understand and maintain.

  2. Abstraction: Transformation classes abstract away the details of transforming data, allowing you to focus on the logic of your pipeline. This can make your code cleaner and easier to read.

  3. Flexibility: Transformation classes can be easily swapped out for different data transformations without changing the rest of your pipeline. This can make your pipeline more flexible and easier to modify or extend.

  4. Early Input Validation: As a Transformation is a type of SparkStep, which in turn is a Step and a type of Pydantic BaseModel, all inputs are validated when an instance of a Transformation class is created. This early validation helps catch errors related to invalid input, such as an invalid column name, before the PySpark pipeline starts executing. This can help avoid unnecessary computation and make your data pipelines more robust and reliable.

  5. Ease of Testing: Transformation classes are designed to be easily testable. This can make it easier to write unit tests for your data pipeline, helping to ensure its correctness and reliability.

  6. Robustness: Koheesio has been extensively tested with hundreds of unit tests, ensuring that the Transformation classes work as expected under a wide range of conditions. This makes your data pipelines more robust and less likely to fail due to unexpected inputs or edge cases.

By using the concept of a Transformation, you can create data pipelines that are simple, consistent, flexible, efficient, and reliable.

"},{"location":"reference/spark/transformations.html#advanced-usage-of-transformations","title":"Advanced Usage of Transformations","text":"

Transformations can be combined and chained together to create complex data processing pipelines. Here's an example of how to chain transformations:

from pyspark.sql import SparkSession\nfrom koheesio.steps.transformations import HashUUID5\nfrom koheesio.steps.transformations import DataframeLookup, JoinMapping, TargetColumn, JoinType\n\n# Create a SparkSession\nspark = SparkSession.builder.getOrCreate()\n\n# Define two DataFrames\ndf1 = spark.createDataFrame([(1, \"A\"), (2, \"B\")], [\"id\", \"value\"])\ndf2 = spark.createDataFrame([(1, \"C\"), (3, \"D\")], [\"id\", \"value\"])\n\n# Define the first transformation\nlookup = DataframeLookup(\n    other=df2,\n    on=JoinMapping(source_column=\"id\", other_column=\"id\"),\n    targets=TargetColumn(target_column=\"value\", target_column_alias=\"right_value\"),\n    how=JoinType.LEFT,\n)\n\n# Apply the first transformation\noutput_df = lookup.transform(df1)\n\n# Define the second transformation\nhash_transform = HashUUID5(\n    source_columns=[\"id\", \"value\", \"right_value\"],\n    target_column=\"hash\"\n)\n\n# Apply the second transformation\noutput_df2 = hash_transform.transform(output_df)\n

In this example, DataframeLookup is a subclass of ColumnsTransformation and HashUUID5 is a subclass of Transformation. After creating instances of DataframeLookup and HashUUID5, you call the transform method to apply each transformation. The transform method of DataframeLookup performs a left join with df2 on the id column and adds the value column from df2 to the result DataFrame as right_value. The transform method of HashUUID5 generates a UUID5 hash for each row in the DataFrame based on the values of the id, value, and right_value columns and stores the result in a new column named hash.

"},{"location":"reference/spark/transformations.html#troubleshooting-transformations","title":"Troubleshooting Transformations","text":"

If you encounter an error when using a transformation, here are some steps you can take to troubleshoot:

  1. Check the Input Data: Make sure the input DataFrame to the transformation is correct. You can use the show method of the DataFrame to print the first few rows of the DataFrame.

  2. Check the Transformation Parameters: Make sure the parameters passed to the transformation are correct. For example, if you're using a DataframeLookup, make sure the join mappings and target columns are correctly specified.

  3. Check the Transformation Logic: If the input data and parameters are correct, there might be an issue with the transformation logic. You can use PySpark's logging utilities to log intermediate results and debug the transformation logic.

  4. Check the Output Data: If the transformation executes without errors but the output data is not as expected, you can use the show method of the DataFrame to print the first few rows of the output DataFrame. This can help you identify any issues with the transformation logic.
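
For example, a quick interactive sanity check could look like the following sketch, where input_df and my_transformation are placeholder names:

# inspect the input before running the transformation\ninput_df.show(5)\n\n# run the transformation and inspect the result\noutput_df = my_transformation.execute().df\noutput_df.show(5)\noutput_df.printSchema()\n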

"},{"location":"reference/spark/transformations.html#conclusion","title":"Conclusion","text":"

The Transformation module in Koheesio provides a powerful and flexible way to transform data in a DataFrame. By using Transformation classes, you can create data pipelines that are simple, consistent, flexible, efficient, and reliable. Whether you're performing simple transformations like adding a new column, or complex transformations like joining multiple DataFrames, the Transformation module has you covered.

"},{"location":"reference/spark/writers.html","title":"Writer Module","text":"

The Writer module in Koheesio provides a set of classes for writing data to various destinations. A Writer is a type of SparkStep that takes data from self.input.df and writes it to a destination based on the output parameters.

"},{"location":"reference/spark/writers.html#what-is-a-writer","title":"What is a Writer?","text":"

A Writer is a subclass of SparkStep that writes data to a destination. The data to be written is taken from a DataFrame, which is accessible through the df property of the Writer.

"},{"location":"reference/spark/writers.html#how-to-define-a-writer","title":"How to Define a Writer?","text":"

To define a Writer, you create a subclass of the Writer class and implement the execute method. The execute method should take data from self.input.df and write it to the destination.

Here's an example of a Writer:

class MyWriter(Writer):\n    def execute(self):\n        # get data from self.input.df\n        data = self.input.df\n        # write data to destination\n        write_to_destination(data)\n
"},{"location":"reference/spark/writers.html#key-features-of-a-writer","title":"Key Features of a Writer","text":"
  1. Write Method: The Writer class provides a write method that calls the execute method and writes the data to the destination. Essentially, calling .write() is shorthand for running execute(), which writes the DataFrame to the destination. This allows you to write data with a Writer without having to call the execute method directly; it is a convenience method that simplifies the usage of a Writer.

    Here's an example of how to use the .write() method:

    # Instantiate the Writer\nmy_writer = MyWriter()\n\n# Use the .write() method to write the data\nmy_writer.write()\n\n# The data from MyWriter's DataFrame is now written to the destination\n

    In this example, MyWriter is a subclass of Writer that you've defined. After creating an instance of MyWriter, you call the .write() method to write the data to the destination. The data from MyWriter's DataFrame is now written to the destination.

  2. DataFrame Property: The Writer class provides a df property as a shorthand for accessing self.input.df. This property ensures that the data is ready to be written, even if the execute method hasn't been explicitly called.

    Here's an example of how to use the df property:

    # Instantiate the Writer\nmy_writer = MyWriter()\n\n# Use the df property to get the data as a DataFrame\ndf = my_writer.df\n\n# Now df is a DataFrame with the data that will be written by MyWriter\n

    In this example, MyWriter is a subclass of Writer that you've defined. After creating an instance of MyWriter, you access the df property to get the data as a DataFrame. The DataFrame df now contains the data that will be written by MyWriter.

  3. SparkSession: Every Writer has a SparkSession available as self.spark. This is the currently active SparkSession, which can be used to perform distributed data processing tasks.

    Here's an example of how to use the spark property:

    # Instantiate the Writer\nmy_writer = MyWriter()\n\n# Use the spark property to get the SparkSession\nspark = my_writer.spark\n\n# Now spark is the SparkSession associated with MyWriter\n

    In this example, MyWriter is a subclass of Writer that you've defined. After creating an instance of MyWriter, you access the spark property to get the SparkSession. The SparkSession spark can now be used to perform distributed data processing tasks.

"},{"location":"reference/spark/writers.html#understanding-inheritance-in-writers","title":"Understanding Inheritance in Writers","text":"

Just like a Step, a Writer is defined as a subclass that inherits from the base Writer class. This means it inherits all the properties and methods from the Writer class and can add or override them as needed. The main method that needs to be overridden is the execute method, which should implement the logic for writing data from self.input.df to the destination.

"},{"location":"reference/spark/writers.html#examples-of-writer-classes-in-koheesio","title":"Examples of Writer Classes in Koheesio","text":"

Koheesio provides a variety of Writer subclasses for writing data to different destinations. Here are just a few examples:

  • BoxFileWriter
  • DeltaTableStreamWriter
  • DeltaTableWriter
  • DummyWriter
  • ForEachBatchStreamWriter
  • KafkaWriter
  • SnowflakeWriter
  • StreamWriter

Please note that this is not an exhaustive list. Koheesio provides many more Writer subclasses for a wide range of data destinations. For a complete list, please refer to the Koheesio documentation or the source code.

"},{"location":"reference/spark/writers.html#benefits-of-using-writers-in-data-pipelines","title":"Benefits of Using Writers in Data Pipelines","text":"

Using Writer classes in your data pipelines has several benefits:

  1. Simplicity: Writers abstract away the details of writing data to various destinations, allowing you to focus on the logic of your pipeline.
  2. Consistency: By using Writers, you ensure that data is written in a consistent manner across different parts of your pipeline.
  3. Flexibility: Writers can be easily swapped out for different data destinations without changing the rest of your pipeline.
  4. Efficiency: Writers automatically manage resources like connections and file handles, ensuring efficient use of resources.
  5. Early Input Validation: As a Writer is a type of SparkStep, which in turn is a Step and a type of Pydantic BaseModel, all inputs are validated when an instance of a Writer class is created. This early validation helps catch errors related to invalid input, such as an invalid URL for a database, before the PySpark pipeline starts executing. This can help avoid unnecessary computation and make your data pipelines more robust and reliable.

By using the concept of a Writer, you can create data pipelines that are simple, consistent, flexible, efficient, and reliable.

"},{"location":"tutorials/advanced-data-processing.html","title":"Advanced Data Processing with Koheesio","text":"

In this guide, we will explore some advanced data processing techniques using Koheesio. We will cover topics such as complex transformations, handling large datasets, and optimizing performance.

"},{"location":"tutorials/advanced-data-processing.html#complex-transformations","title":"Complex Transformations","text":"

Koheesio provides a variety of built-in transformations, but sometimes you may need to perform more complex operations on your data. In such cases, you can create custom transformations.

Here's an example of a custom transformation that normalizes a column in a DataFrame:

from pyspark.sql import DataFrame\nfrom koheesio.spark.transformations.transform import Transform\n\ndef normalize_column(df: DataFrame, column: str) -> DataFrame:\n    max_value = df.agg({column: \"max\"}).collect()[0][0]\n    min_value = df.agg({column: \"min\"}).collect()[0][0]\n    return df.withColumn(column, (df[column] - min_value) / (max_value - min_value))\n\n\nclass NormalizeColumnTransform(Transform):\n    column: str\n\n    def transform(self, df: DataFrame) -> DataFrame:\n        return normalize_column(df, self.column)\n
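
A possible usage of this custom transformation (a sketch that simply calls the transform method defined above on a small made-up DataFrame):

from pyspark.sql import SparkSession\n\nspark = SparkSession.builder.getOrCreate()\ndf = spark.createDataFrame([(1, 10.0), (2, 20.0), (3, 30.0)], [\"id\", \"salary\"])\n\n# normalize the salary column to the 0-1 range\nnormalized_df = NormalizeColumnTransform(column=\"salary\").transform(df)\nnormalized_df.show()\n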
"},{"location":"tutorials/advanced-data-processing.html#handling-large-datasets","title":"Handling Large Datasets","text":"

When working with large datasets, it's important to manage resources effectively to ensure good performance. Koheesio provides several features to help with this.

"},{"location":"tutorials/advanced-data-processing.html#partitioning","title":"Partitioning","text":"

Partitioning is a technique that divides your data into smaller, more manageable pieces, called partitions. Koheesio allows you to specify the partitioning scheme for your data when writing it to a target.

from koheesio.steps.writers.delta import DeltaTableWriter\nfrom koheesio.tasks.etl_task import EtlTask\n\nclass MyTask(EtlTask):\n    target = DeltaTableWriter(table=\"my_table\", partitionBy=[\"column1\", \"column2\"])\n
"},{"location":"tutorials/getting-started.html","title":"Getting Started with Koheesio","text":""},{"location":"tutorials/getting-started.html#requirements","title":"Requirements","text":"
  • Python 3.9+
"},{"location":"tutorials/getting-started.html#installation","title":"Installation","text":""},{"location":"tutorials/getting-started.html#poetry","title":"Poetry","text":"

If you're using Poetry, add the following entry to the pyproject.toml file:

pyproject.toml
[[tool.poetry.source]]\nname = \"nike\"\nurl = \"https://artifactory.nike.com/artifactory/api/pypi/python-virtual/simple\"\nsecondary = true\n
poetry add koheesio\n
"},{"location":"tutorials/getting-started.html#pip","title":"pip","text":"

If you're using pip, run the following command to install Koheesio:

pip install koheesio\n
"},{"location":"tutorials/getting-started.html#basic-usage","title":"Basic Usage","text":"

Once you've installed Koheesio, you can start using it in your Python scripts. Here's a basic example:

from koheesio import Step\n\n# Define a step\nclass MyStep(Step):\n    def execute(self):\n        # Your step logic here\n        ...\n\n# Create an instance of the step\nstep = MyStep()\n\n# Run the step\nstep.execute()\n
"},{"location":"tutorials/getting-started.html#advanced-usage","title":"Advanced Usage","text":"
from pyspark.sql.functions import lit\nfrom pyspark.sql import DataFrame, SparkSession\n\n# Step 1: import Koheesio dependencies\nfrom koheesio.context import Context\nfrom koheesio.steps.readers.dummy import DummyReader\nfrom koheesio.steps.transformations.camel_to_snake import CamelToSnakeTransformation\nfrom koheesio.steps.writers.dummy import DummyWriter\nfrom koheesio.tasks.etl_task import EtlTask\n\n# Step 2: Set up a SparkSession\nspark = SparkSession.builder.getOrCreate()\n\n# Step 3: Configure your Context\ncontext = Context({\n    \"source\": DummyReader(),\n    \"transformations\": [CamelToSnakeTransformation()],\n    \"target\": DummyWriter(),\n    \"my_favorite_movie\": \"inception\",\n})\n\n# Step 4: Create a Task\nclass MyFavoriteMovieTask(EtlTask):\n    my_favorite_movie: str\n\n    def transform(self, df: DataFrame = None) -> DataFrame:\n        df = df.withColumn(\"MyFavoriteMovie\", lit(self.my_favorite_movie))\n        return super().transform(df)\n\n# Step 5: Run your Task\ntask = MyFavoriteMovieTask(**context)\ntask.run()\n
"},{"location":"tutorials/getting-started.html#contributing","title":"Contributing","text":"

If you want to contribute to Koheesio, check out the CONTRIBUTING.md file in this repository. It contains guidelines for contributing, including how to submit issues and pull requests.

"},{"location":"tutorials/getting-started.html#testing","title":"Testing","text":"

To run the tests for Koheesio, use the following command:

make dev-test\n

This will run all the tests in the tests directory.

"},{"location":"tutorials/hello-world.html","title":"Simple Examples","text":""},{"location":"tutorials/hello-world.html#creating-a-custom-step","title":"Creating a Custom Step","text":"

This example demonstrates how to use the SparkStep class from the koheesio library to create a custom step named HelloWorldStep.

"},{"location":"tutorials/hello-world.html#code","title":"Code","text":"
from koheesio.steps.step import SparkStep\n\nclass HelloWorldStep(SparkStep):\n    message: str\n\n    def execute(self) -> SparkStep.Output:\n        # create a DataFrame with a single row containing the message\n        self.output.df = self.spark.createDataFrame([(1, self.message)], [\"id\", \"message\"])\n
"},{"location":"tutorials/hello-world.html#usage","title":"Usage","text":"
hello_world_step = HelloWorldStep(message=\"Hello, World!\")\nhello_world_step.execute()\n\nhello_world_step.output.df.show()\n
"},{"location":"tutorials/hello-world.html#understanding-the-code","title":"Understanding the Code","text":"

The HelloWorldStep class is a SparkStep in Koheesio, designed to generate a DataFrame with a single row containing a custom message. Here's a more detailed overview:

  • HelloWorldStep inherits from SparkStep, a fundamental building block in Koheesio for creating data processing steps with Apache Spark.
  • It has a message attribute. When creating an instance of HelloWorldStep, you can pass a custom message that will be used in the DataFrame.
  • SparkStep has a spark attribute, which is the active SparkSession. This is the entry point for any Spark functionality, allowing the step to interact with the Spark cluster.
  • SparkStep also includes an Output class, used to store the output of the step. In this case, Output has a df attribute to store the output DataFrame.
  • The execute method creates a DataFrame with the custom message and stores it in output.df. It doesn't return a value explicitly; instead, the output DataFrame can be accessed via output.df.
  • Koheesio uses pydantic for automatic validation of the step's input and output, ensuring they are correctly defined and of the correct types.

Note: Pydantic is a data validation library that provides a way to validate that the data (in this case, the input and output of the step) conforms to the expected format.

"},{"location":"tutorials/hello-world.html#creating-a-custom-task","title":"Creating a Custom Task","text":"

This example demonstrates how to use the EtlTask from the koheesio library to create a custom task named MyFavoriteMovieTask.

"},{"location":"tutorials/hello-world.html#code_1","title":"Code","text":"
from typing import Any, Optional\nfrom pyspark.sql import DataFrame, functions as f\nfrom koheesio.steps.transformations import Transform\nfrom koheesio.tasks.etl_task import EtlTask\n\n\ndef add_column(df: DataFrame, target_column: str, value: Any):\n    return df.withColumn(target_column, f.lit(value))\n\n\nclass MyFavoriteMovieTask(EtlTask):\n    my_favorite_movie: str\n\n    def transform(self, df: Optional[DataFrame] = None) -> DataFrame:\n        df = df or self.extract()\n\n        # pre-transformations specific to this class\n        pre_transformations = [\n            Transform(add_column, target_column=\"myFavoriteMovie\", value=self.my_favorite_movie)\n        ]\n\n        # execute transformations one by one\n        for t in pre_transformations:\n            df = t.transform(df)\n\n        self.output.transform_df = df\n        return df\n
"},{"location":"tutorials/hello-world.html#configuration","title":"Configuration","text":"

Here is the sample.yaml configuration file used in this example:

raw_layer:\n  catalog: development\n  schema: my_favorite_team\n  table: some_random_table\nmovies:\n  favorite: Office Space\nhash_settings:\n  source_columns:\n  - id\n  - foo\n  target_column: hash_uuid5\nsource:\n  range: 4\n
"},{"location":"tutorials/hello-world.html#usage_1","title":"Usage","text":"
from pyspark.sql import SparkSession\nfrom koheesio.context import Context\nfrom koheesio.steps.readers import DummyReader\nfrom koheesio.steps.writers.dummy import DummyWriter\n\ncontext = Context.from_yaml(\"sample.yaml\")\n\nSparkSession.builder.getOrCreate()\n\nmy_fav_mov_task = MyFavoriteMovieTask(\n    source=DummyReader(**context.raw_layer),\n    target=DummyWriter(truncate=False),\n    my_favorite_movie=context.movies.favorite,\n)\nmy_fav_mov_task.execute()\n
"},{"location":"tutorials/hello-world.html#understanding-the-code_1","title":"Understanding the Code","text":"

This example creates a MyFavoriteMovieTask that adds a column named myFavoriteMovie to the DataFrame. The value for this column is provided when the task is instantiated.

The MyFavoriteMovieTask class is a custom task that extends the EtlTask from the koheesio library. It demonstrates how to add a custom transformation to a DataFrame. Here's a detailed breakdown:

  • MyFavoriteMovieTask inherits from EtlTask, a base class in Koheesio for creating Extract-Transform-Load (ETL) tasks with Apache Spark.

  • It has a my_favorite_movie attribute. When creating an instance of MyFavoriteMovieTask, you can pass a custom movie title that will be used in the DataFrame.

  • The transform method is where the main logic of the task is implemented. It first extracts the data (if not already provided), then applies a series of transformations to the DataFrame.

  • In this case, the transformation is adding a new column to the DataFrame named myFavoriteMovie, with the value set to the my_favorite_movie attribute. This is done using the add_column function and the Transform class from Koheesio.

  • The transformed DataFrame is then stored in self.output.transform_df.

  • The sample.yaml configuration file is used to provide the context for the task, including the source data and the favorite movie title.

  • In the usage example, an instance of MyFavoriteMovieTask is created with a DummyReader as the source, a DummyWriter as the target, and the favorite movie title from the context. The task is then executed, which runs the transformations and stores the result in self.output.transform_df.

"},{"location":"tutorials/learn-koheesio.html","title":"Learn Koheesio","text":"

Koheesio is designed to simplify the development of data engineering pipelines. It provides a structured way to define and execute data processing tasks, making it easier to build, test, and maintain complex data workflows.

"},{"location":"tutorials/learn-koheesio.html#core-concepts","title":"Core Concepts","text":"

Koheesio is built around several core concepts:

  • Step: The fundamental unit of work in Koheesio. It represents a single operation in a data pipeline, taking in inputs and producing outputs.

    See the Step documentation for more information.

  • Context: A configuration class used to set up the environment for a Task. It can be used to share variables across tasks and adapt the behavior of a Task based on its environment.

    See the Context documentation for more information.

  • Logger: A class for logging messages at different levels.

    See the Logger documentation for more information.

The Logger and Context classes provide support, enabling detailed logging of the pipeline's execution and customization of the pipeline's behavior based on the environment, respectively.

"},{"location":"tutorials/learn-koheesio.html#implementations","title":"Implementations","text":"

In the context of Koheesio, an implementation refers to a specific way of executing Steps, the fundamental units of work in Koheesio. Each implementation uses a different technology or approach to process data and comes with its own set of Steps, designed to work with that specific technology or approach.

For example, the Spark implementation includes Steps for reading data from a Spark DataFrame, transforming the data using Spark operations, and writing the data to a Spark-supported destination.

Currently, Koheesio supports two implementations: Spark and AsyncIO.

"},{"location":"tutorials/learn-koheesio.html#spark","title":"Spark","text":"

  • Requires: Apache Spark (pyspark)
  • Installation: pip install koheesio[spark]
  • Module: koheesio.spark

This implementation uses Apache Spark, a powerful open-source unified analytics engine for large-scale data processing.

Steps that use this implementation can leverage Spark's capabilities for distributed data processing, making it suitable for handling large volumes of data. The Spark implementation includes the following types of Steps:

  • Reader: from koheesio.spark.readers import Reader A type of Step that reads data from a source and stores the result (to make it available for subsequent steps). For more information, see the Reader documentation.

  • Writer: from koheesio.spark.writers import Writer This controls how data is written to the output in both batch and streaming contexts. For more information, see the Writer documentation.

  • Transformation: from koheesio.spark.transformations import Transformation A type of Step that takes a DataFrame as input and returns a DataFrame as output. For more information, see the Transformation documentation.

In any given pipeline, you can expect to use Readers, Writers, and Transformations to express the ETL logic. Readers are responsible for extracting data from various sources, such as databases, files, or APIs. Transformations then process this data, performing operations like filtering, aggregation, or conversion. Finally, Writers handle the loading of the transformed data to the desired destination, which could be a database, a file, or a data stream.

"},{"location":"tutorials/learn-koheesio.html#async","title":"Async","text":"

Module: koheesio.asyncio

This implementation uses Python's asyncio library for writing single-threaded concurrent code using coroutines, multiplexing I/O access over sockets and other resources, running network clients and servers, and other related primitives. Steps that use this implementation can perform data processing tasks asynchronously, which can be beneficial for IO-bound tasks.
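
To illustrate the kind of IO-bound concurrency this targets, here is a plain asyncio sketch; it deliberately uses no Koheesio classes, since the specific koheesio.asyncio step names are not covered on this page:

import asyncio\n\nasync def fetch(name: str) -> str:\n    # stand-in for an IO-bound call such as an HTTP request\n    await asyncio.sleep(0.1)\n    return f\"payload for {name}\"\n\nasync def main():\n    # run several IO-bound calls concurrently instead of one after another\n    results = await asyncio.gather(*(fetch(n) for n in [\"a\", \"b\", \"c\"]))\n    print(results)\n\nasyncio.run(main())\n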

"},{"location":"tutorials/learn-koheesio.html#best-practices","title":"Best Practices","text":"

Here are some best practices for using Koheesio:

  1. Use Context: The Context class in Koheesio is designed to behave like a dictionary, but with added features. It's a good practice to use Context to customize the behavior of a task. This allows you to share variables across tasks and adapt the behavior of a task based on its environment; for example, by changing the source or target of the data between development and production environments.

  2. Modular Design: Each step in the pipeline (reading, transformation, writing) should be encapsulated in its own class, making the code easier to understand and maintain. This also promotes re-usability as steps can be reused across different tasks.

  3. Error Handling: Koheesio provides a consistent way to handle errors and exceptions in data processing tasks. Make sure to leverage this feature to make your pipelines robust and fault-tolerant.

  4. Logging: Use the built-in logging feature in Koheesio to log information and errors in data processing tasks. This can be very helpful for debugging and monitoring the pipeline. Koheesio sets the log level to WARNING by default, but you can change it to INFO or DEBUG as needed; see the sketch after this list.

  5. Testing: Each step can be tested independently, making it easier to write unit tests. It's a good practice to write tests for your steps to ensure they are working as expected.

  6. Use Transformations: The Transform class in Koheesio allows you to define transformations on your data. It's a good practice to encapsulate your transformation logic in Transform classes for better readability and maintainability.

  7. Consistent Structure: Koheesio enforces a consistent structure for data processing tasks. Stick to this structure to make your codebase easier to understand for new developers.

  8. Use Readers and Writers: Use the built-in Reader and Writer classes in Koheesio to handle data extraction and loading. This not only simplifies your code but also makes it more robust and efficient.
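
As a sketch for point 4, assuming Koheesio's loggers participate in Python's standard logging hierarchy under a logger named koheesio (adjust the logger name to your setup):

import logging\n\n# send log output to the console and raise verbosity while debugging\nlogging.basicConfig(level=logging.INFO)\nlogging.getLogger(\"koheesio\").setLevel(logging.DEBUG)  # logger name is an assumption\n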

Remember, these are general best practices and might need to be adapted based on your specific use case and requirements.

"},{"location":"tutorials/learn-koheesio.html#pydantic","title":"Pydantic","text":"

Koheesio Steps are Pydantic models, which means they can be validated and serialized. This makes it easy to define the inputs and outputs of a Step, and to validate them before running the Step. Pydantic models also provide a consistent way to define the schema of the data that a Step expects and produces, making it easier to understand and maintain the code.
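
A small sketch of the validation side of this; the Multiply step below is hypothetical, and serialization follows the usual Pydantic model API:

from koheesio import Step\n\nclass Multiply(Step):\n    factor: int  # typed input, validated when the step is instantiated\n\n    class Output(Step.Output):\n        result: int\n\n    def execute(self):\n        self.output.result = self.factor * 2\n\ntry:\n    Multiply(factor=\"not a number\")  # invalid input is rejected before execute() ever runs\nexcept Exception as err:  # pydantic raises a validation error here\n    print(err)\n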

Learn more about Pydantic here.

"},{"location":"tutorials/onboarding.html","title":"Onboarding","text":"

"},{"location":"tutorials/onboarding.html#onboarding-to-koheesio","title":"Onboarding to Koheesio","text":"

Koheesio is a Python library that simplifies the development of data engineering pipelines. It provides a structured way to define and execute data processing tasks, making it easier to build, test, and maintain complex data workflows.

This guide will walk you through the process of transforming a traditional Spark application into a Koheesio pipeline along with explaining the advantages of using Koheesio over raw Spark.

"},{"location":"tutorials/onboarding.html#traditional-spark-application","title":"Traditional Spark Application","text":"

First let's create a simple Spark application that you might use to process data.

The following Spark application reads a CSV file, performs a transformation, and writes the result to a Delta table. The transformation includes filtering data where age is greater than 18 and performing an aggregation to calculate the average salary per country. The result is then written to a Delta table partitioned by country.

from pyspark.sql import SparkSession\nfrom pyspark.sql.functions import col, avg\n\nspark = SparkSession.builder.getOrCreate()\n\n# Read data from CSV file\ndf = spark.read.csv(\"input.csv\", header=True, inferSchema=True)\n\n# Filter data where age is greater than 18\ndf = df.filter(col(\"age\") > 18)\n\n# Perform aggregation\ndf = df.groupBy(\"country\").agg(avg(\"salary\").alias(\"average_salary\"))\n\n# Write data to Delta table with partitioning\ndf.write.format(\"delta\").partitionBy(\"country\").save(\"/path/to/delta_table\")\n
"},{"location":"tutorials/onboarding.html#transforming-to-koheesio","title":"Transforming to Koheesio","text":"

The same pipeline can be rewritten using Koheesio's EtlTask. In this version, each step (reading, transformations, writing) is encapsulated in its own class, making the code easier to understand and maintain.

First, a CsvReader is defined to read the input CSV file. Then, a DeltaTableWriter is defined to write the result to a Delta table partitioned by country.

Two transformations are defined: one to filter data where age is greater than 18, and another to calculate the average salary per country.

These transformations are then passed to an EtlTask along with the reader and writer. Finally, the EtlTask is executed to run the pipeline.

from koheesio.spark.etl_task import EtlTask\nfrom koheesio.spark.readers.file_loader import CsvReader\nfrom koheesio.spark.writers.delta.batch import DeltaTableWriter\nfrom koheesio.spark.transformations.transform import Transform\nfrom pyspark.sql.functions import col, avg\n\n# Define reader\nreader = CsvReader(path=\"input.csv\", header=True, inferSchema=True)\n\n# Define writer\nwriter = DeltaTableWriter(table=\"delta_table\", partition_by=[\"country\"])\n\n# Define transformations\nage_transformation = Transform(\n    func=lambda df: df.filter(col(\"age\") > 18)\n)\navg_salary_per_country = Transform(\n    func=lambda df: df.groupBy(\"country\").agg(avg(\"salary\").alias(\"average_salary\"))\n)\n\n# Define and execute EtlTask\ntask = EtlTask(\n    source=reader, \n    target=writer, \n    transformations=[\n        age_transformation,\n        avg_salary_per_country\n    ]\n)\ntask.execute()\n
This approach with Koheesio provides several advantages. It makes the code more modular and easier to test. Each step can be tested independently and reused across different tasks. It also makes the pipeline more readable and easier to maintain.

"},{"location":"tutorials/onboarding.html#advantages-of-koheesio","title":"Advantages of Koheesio","text":"

Using Koheesio instead of raw Spark has several advantages:

  • Modularity: Each step in the pipeline (reading, transformation, writing) is encapsulated in its own class, making the code easier to understand and maintain.
  • Reusability: Steps can be reused across different tasks, reducing code duplication.
  • Testability: Each step can be tested independently, making it easier to write unit tests.
  • Flexibility: The behavior of a task can be customized using a Context class.
  • Consistency: Koheesio enforces a consistent structure for data processing tasks, making it easier for new developers to understand the codebase.
  • Error Handling: Koheesio provides a consistent way to handle errors and exceptions in data processing tasks.
  • Logging: Koheesio provides a consistent way to log information and errors in data processing tasks.

In contrast, using the plain PySpark API for transformations can lead to more verbose and less structured code, which can be harder to understand, maintain, and test. It also doesn't provide the same level of error handling, logging, and flexibility as the Koheesio Transform class.

"},{"location":"tutorials/onboarding.html#using-a-context-class","title":"Using a Context Class","text":"

Here's a simple example of how to use a Context class to customize the behavior of a task. The Context class in Koheesio is designed to behave like a dictionary, but with added features.

from koheesio import Context\nfrom koheesio.spark.etl_task import EtlTask\nfrom koheesio.spark.readers.file_loader import CsvReader\nfrom koheesio.spark.writers.delta import DeltaTableWriter\nfrom koheesio.spark.transformations.transform import Transform\n\ncontext = Context({  # this could be stored in a JSON or YAML\n    \"age_threshold\": 18,\n    \"reader_options\": {\n        \"path\": \"input.csv\",\n        \"header\": True,\n        \"inferSchema\": True\n    },\n    \"writer_options\": {\n        \"table\": \"delta_table\",\n        \"partition_by\": [\"country\"]\n    }\n})\n\ntask = EtlTask(\n    source = CsvReader(**context.reader_options),\n    target = DeltaTableWriter(**context.writer_options),\n    transformations = [\n        Transform(func=lambda df: df.filter(df[\"age\"] > context.age_threshold))\n    ]\n)\n\ntask.execute()\n

In this example, we're using CsvReader to read the input data, DeltaTableWriter to write the output data, and a Transform step to filter the data based on the age threshold. The options for the reader and writer are stored in a Context object, which can be easily updated or loaded from a JSON or YAML file.

"},{"location":"tutorials/testing-koheesio-steps.html","title":"Testing Koheesio Tasks","text":"

Testing is a crucial part of any software development process. Koheesio provides a structured way to define and execute data processing tasks, which makes it easier to build, test, and maintain complex data workflows. This guide will walk you through the process of testing Koheesio tasks.

"},{"location":"tutorials/testing-koheesio-steps.html#unit-testing","title":"Unit Testing","text":"

Unit testing involves testing individual components of the software in isolation. In the context of Koheesio, this means testing individual tasks or steps.

Here's an example of how to unit test a Koheesio task:

from koheesio.tasks.etl_task import EtlTask\nfrom koheesio.steps.readers import DummyReader\nfrom koheesio.steps.writers.dummy import DummyWriter\nfrom koheesio.steps.transformations import Transform\nfrom pyspark.sql import SparkSession, DataFrame\nfrom pyspark.sql.functions import col\n\n\ndef filter_age(df: DataFrame) -> DataFrame:\n    return df.filter(col(\"Age\") > 18)\n\n\ndef test_etl_task():\n    # Initialize SparkSession\n    spark = SparkSession.builder.getOrCreate()\n\n    # Create a DataFrame for testing\n    data = [(\"John\", 19), (\"Anna\", 20), (\"Tom\", 18)]\n    df = spark.createDataFrame(data, [\"Name\", \"Age\"])\n\n    # Define the task\n    task = EtlTask(\n        source=DummyReader(df=df),\n        target=DummyWriter(),\n        transformations=[\n            Transform(filter_age)\n        ]\n    )\n\n    # Execute the task\n    task.execute()\n\n    # Assert the result\n    result_df = task.output.df\n    assert result_df.count() == 2\n    assert result_df.filter(\"Name == 'Tom'\").count() == 0\n

In this example, we're testing an EtlTask that reads data from a DataFrame, applies a filter transformation, and writes the result to another DataFrame. The test asserts that the task correctly filters out rows where the age is less than or equal to 18.

"},{"location":"tutorials/testing-koheesio-steps.html#integration-testing","title":"Integration Testing","text":"

Integration testing involves testing the interactions between different components of the software. In the context of Koheesio, this means testing the full flow of data through one or more tasks.

We'll create a simple test for a hypothetical EtlTask that uses DeltaReader and DeltaWriter. We'll use pytest and unittest.mock to mock the responses of the reader and writer. First, let's assume that you have an EtlTask defined in a module named my_module. This task reads data from a Delta table, applies some transformations, and writes the result to another Delta table.

Here's an example of how to write an integration test for this task:

# my_module.py\nfrom koheesio.tasks.etl_task import EtlTask\nfrom koheesio.spark.readers.delta import DeltaReader\nfrom koheesio.steps.writers.delta import DeltaWriter\nfrom koheesio.steps.transformations import Transform\nfrom koheesio.context import Context\nfrom pyspark.sql.functions import col\n\n\ndef filter_age(df):\n    return df.filter(col(\"Age\") > 18)\n\n\ncontext = Context({\n    \"reader_options\": {\n        \"table\": \"input_table\"\n    },\n    \"writer_options\": {\n        \"table\": \"output_table\"\n    }\n})\n\ntask = EtlTask(\n    source=DeltaReader(**context.reader_options),\n    target=DeltaWriter(**context.writer_options),\n    transformations=[\n        Transform(filter_age)\n    ]\n)\n

Now, let's create the test itself. We'll use pytest fixtures to provide a SparkSession, a test context, and a test DataFrame, and unittest.mock to patch the read and write methods of the reader and writer.

# test_my_module.py\nimport pytest\nfrom unittest.mock import MagicMock, patch\nfrom pyspark.sql import SparkSession\nfrom koheesio.context import Context\nfrom koheesio.steps.readers import Reader\nfrom koheesio.steps.writers import Writer\n\nfrom my_module import task\n\n@pytest.fixture(scope=\"module\")\ndef spark():\n    return SparkSession.builder.getOrCreate()\n\n@pytest.fixture(scope=\"module\")\ndef test_context():\n    return Context({\n        \"reader_options\": {\n            \"table\": \"test_input_table\"\n        },\n        \"writer_options\": {\n            \"table\": \"test_output_table\"\n        }\n    })\n\n@pytest.fixture(scope=\"module\")\ndef test_df(spark):\n    data = [(\"John\", 19), (\"Anna\", 20), (\"Tom\", 18)]\n    return spark.createDataFrame(data, [\"Name\", \"Age\"])\n\ndef test_etl_task(spark, test_context, test_df):\n    # Mock the read method of the Reader class\n    with patch.object(Reader, \"read\", return_value=test_df):\n        # Mock the write method of the Writer class\n        with patch.object(Writer, \"write\") as mock_write:\n            # Execute the task\n            task.execute()\n\n            # Assert the result\n            result_df = task.output.df\n            assert result_df.count() == 2\n            assert result_df.filter(\"Name == 'Tom'\").count() == 0\n\n            # Assert that the reader and writer were called with the correct arguments\n            Reader.read.assert_called_once_with(**test_context.reader_options)\n            mock_write.assert_called_once_with(**test_context.writer_options)\n

In this test, we're patching the read and write methods of the Reader and Writer base classes so that the DeltaReader returns a test DataFrame and the DeltaWriter call can be inspected, and we check that both are called with the correct arguments. We're also asserting that the task correctly filters out rows where the age is less than or equal to 18.

"},{"location":"misc/tags.html","title":"{{ page.title }}","text":""},{"location":"misc/tags.html#doctypeexplanation","title":"doctype/explanation","text":"
  • Approach documentation
"},{"location":"misc/tags.html#doctypehow-to","title":"doctype/how-to","text":"
  • How to
"}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"index.html","title":"Home","text":""},{"location":"index.html#koheesio","title":"Koheesio","text":"CI/CD Package Meta

Koheesio, named after the Finnish word for cohesion, is a robust Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.

The framework is versatile, aiming to support multiple implementations and working seamlessly with various data processing libraries or frameworks. This ensures that Koheesio can handle any data processing task, regardless of the underlying technology or data scale.

Koheesio uses Pydantic for strong typing, data validation, and settings management, ensuring a high level of type safety and structured configurations within pipeline components.

Koheesio's goal is to ensure predictable pipeline execution through a solid foundation of well-tested code and a rich set of features, making it an excellent choice for developers and organizations seeking to build robust and adaptable Data Pipelines.

"},{"location":"index.html#what-sets-koheesio-apart-from-other-libraries","title":"What sets Koheesio apart from other libraries?\"","text":"

Koheesio encapsulates years of data engineering expertise, fostering a collaborative and innovative community. While similar libraries exist, Koheesio's focus on data pipelines, integration with PySpark, and specific design for tasks like data transformation, ETL jobs, data validation, and large-scale data processing sets it apart.

Koheesio aims to provide a rich set of features including readers, writers, and transformations for any type of Data processing. Koheesio is not in competition with other libraries. Its aim is to offer wide-ranging support and focus on utility in a multitude of scenarios. Our preference is for integration, not competition...

We invite contributions from all, promoting collaboration and innovation in the data engineering community.

"},{"location":"index.html#koheesio-core-components","title":"Koheesio Core Components","text":"

Here are the key components included in Koheesio:

  • Step: This is the fundamental unit of work in Koheesio. It represents a single operation in a data pipeline, taking in inputs and producing outputs.
    \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 1 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502                  \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Output 1 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2502                  \u2502        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n                   \u2502                  \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u2502                  \u2502        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 2 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502       Step       \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Output 2 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2502                  \u2502        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n                   \u2502                  \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u2502                  \u2502        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 3 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502                  \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Output 3 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n
  • Context: This is a configuration class used to set up the environment for a Task. It can be used to share variables across tasks and adapt the behavior of a Task based on its environment.
  • Logger: This is a class for logging messages at different levels.
"},{"location":"index.html#installation","title":"Installation","text":"

You can install Koheesio using pip, or add it to your project with Hatch or Poetry.

"},{"location":"index.html#using-pip","title":"Using Pip","text":"

To install Koheesio using pip, run the following command in your terminal:

pip install koheesio\n
"},{"location":"index.html#using-hatch","title":"Using Hatch","text":"

If you're using Hatch for package management, you can add Koheesio to your project by adding koheesio to the dependencies in your pyproject.toml.
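
For example, assuming the standard PEP 621 project layout that Hatch uses by default, the entry could look roughly like this:

[project]\ndependencies = [\n    \"koheesio\"\n]\n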

"},{"location":"index.html#using-poetry","title":"Using Poetry","text":"

If you're using Poetry for package management, you can add Koheesio to your project with the following command:

poetry add koheesio\n

or add the following line to your pyproject.toml (under [tool.poetry.dependencies]), making sure to replace ... with the version you want to install:

koheesio = {version = \"...\"}\n
"},{"location":"index.html#extras","title":"Extras","text":"

Koheesio also provides some additional features that can be useful in certain scenarios. These include:

  • Spark Expectations: Available through the koheesio.steps.integration.spark.dq.spark_expectations module; installable through the se extra.

    • Spark Expectations (SE) provides data quality checks for Spark DataFrames.
    • For more information, refer to the Spark Expectations docs.
  • Box: Available through the koheesio.steps.integration.box module; installable through the box extra.

    • Box is a cloud content management and file sharing service for businesses.
  • SFTP: Available through the koheesio.steps.integration.spark.sftp module; installable through the sftp extra.

    • SFTP is a network protocol used for secure file transfer over SSH (Secure Shell).

Note: Some of the steps require extra dependencies. See the Extras section for additional info. Extras can be added to Poetry by adding extras=['name_of_the_extra'] to the toml entry mentioned above.
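
For example, using the se extra from the list above, the Poetry entry could look roughly like this:

koheesio = {version = \"...\", extras = [\"se\"]}\n

When installing with pip, the same extras can typically be selected with bracket syntax, e.g. pip install koheesio[se].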

"},{"location":"index.html#contributing","title":"Contributing","text":""},{"location":"index.html#how-to-contribute","title":"How to Contribute","text":"

We welcome contributions to our project! Here's a brief overview of our development process:

  • Code Standards: We use pylint, black, and mypy to maintain code standards. Please ensure your code passes these checks by running make check. No errors or warnings should be reported by the linter before you submit a pull request.

  • Testing: We use pytest for testing. Run the tests with make test and ensure all tests pass before submitting a pull request.

  • Release Process: We aim for frequent releases. Typically when we have a new feature or bugfix, a developer with admin rights will create a new release on GitHub and publish the new version to PyPI.

For more detailed information, please refer to our contribution guidelines. We also adhere to Nike's Code of Conduct and Nike's Individual Contributor License Agreement.

"},{"location":"index.html#additional-resources","title":"Additional Resources","text":"
  • General GitHub documentation
  • GitHub pull request documentation
  • Nike OSS
"},{"location":"api_reference/index.html","title":"API Reference","text":""},{"location":"api_reference/index.html#koheesio.ABOUT","title":"koheesio.ABOUT module-attribute","text":"
ABOUT = _about()\n
"},{"location":"api_reference/index.html#koheesio.VERSION","title":"koheesio.VERSION module-attribute","text":"
VERSION = __version__\n
"},{"location":"api_reference/index.html#koheesio.BaseModel","title":"koheesio.BaseModel","text":"

Base model for all models.

Extends pydantic BaseModel with some additional configuration. To be used as a base class for all models in Koheesio instead of pydantic.BaseModel.

Additional methods and properties: Different Modes

This BaseModel class supports lazy mode. This means that validation of the items stored in the class can be called at will instead of being forced to run it upfront.

  • Normal mode: you need to know the values ahead of time

    normal_mode = YourOwnModel(a=\"foo\", b=42)\n

  • Lazy mode: being able to defer the validation until later

    lazy_mode = YourOwnModel.lazy()\nlazy_mode.a = \"foo\"\nlazy_mode.b = 42\nlazy_mode.validate_output()\n
    The prime advantage of using lazy mode is that you don't have to know all your outputs up front and can add them as they become available, while still being able to validate at the end that you have collected all your output.

  • With statements: With statements are also allowed. The validate_output method from the earlier example will run upon exit of the with-statement.

    with YourOwnModel.lazy() as with_output:\n    with_output.a = \"foo\"\n    with_output.b = 42\n
    Note: a lazy-mode BaseModel object is required to work with a with-statement.

Examples:

from koheesio.models import BaseModel\n\n\nclass Person(BaseModel):\n    name: str\n    age: int\n\n\n# Using the lazy method to create an instance without immediate validation\nperson = Person.lazy()\n\n# Setting attributes\nperson.name = \"John Doe\"\nperson.age = 30\n\n# Now we validate the instance\nperson.validate_output()\n\nprint(person)\n

In this example, the Person instance is created without immediate validation. The attributes name and age are set afterward. The validate_output method is then called to validate the instance.

Koheesio specific configuration:

Koheesio models are configured differently from Pydantic defaults. The following configuration is used:

  1. extra=\"allow\"

    This setting allows for extra fields that are not specified in the model definition. If a field is present in the data but not in the model, it will not raise an error. Pydantic default is \"ignore\", which means that extra attributes are ignored.

  2. arbitrary_types_allowed=True

    This setting allows for fields in the model to be of any type. This is useful when you want to include fields in your model that are not standard Python types. Pydantic default is False, which means that fields must be of a standard Python type.

  3. populate_by_name=True

    This setting allows an aliased field to be populated by its name as given by the model attribute, as well as the alias. This was known as allow_population_by_field_name in pydantic v1. Pydantic default is False, which means that fields can only be populated by their alias.

  4. validate_assignment=False

    This setting determines whether the model should be revalidated when the data is changed. If set to True, every time a field is assigned a new value, the entire model is validated again.

    Pydantic default is (also) False, which means that the model is not revalidated when the data is changed. The default behavior of Pydantic is to validate the data when the model is created. In case the user changes the data after the model is created, the model is not revalidated.

  5. revalidate_instances=\"subclass-instances\"

    This setting determines whether to revalidate models during validation if the instance is a subclass of the model. This is important as inheritance is used a lot in Koheesio. Pydantic default is never, which means that the model and dataclass instances are not revalidated during validation.

  6. validate_default=True

    This setting determines whether to validate default values during validation. When set to True, default values are checked during the validation process. We opt to set this to True, as we are attempting to make sure that the data is valid prior to running or executing any Step. Pydantic default is False, which means that default values are not validated during validation.

  7. frozen=False

    This setting determines whether the model is immutable. If set to True, once a model is created, its fields cannot be changed. Pydantic default is also False, which means that the model is mutable.

  8. coerce_numbers_to_str=True

    This setting determines whether to convert number fields to strings. When set to True, it enables automatic coercion of any Number type to str. Pydantic doesn't allow number types (int, float, Decimal) to be coerced as type str by default.

  9. use_enum_values=True

    This setting determines whether to use the values of Enum fields. If set to True, the actual value of the Enum is used instead of the reference. Pydantic default is False, which means that the reference to the Enum is used.
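
To make a couple of these settings concrete, here is a minimal, illustrative sketch (using hypothetical Person and Label models) of what extra=\"allow\" and coerce_numbers_to_str=True mean in practice:

from koheesio.models import BaseModel\n\n\nclass Person(BaseModel):\n    age: int\n\n\n# extra=\"allow\": the undeclared field 'nickname' is kept instead of raising a validation error\nperson = Person(age=30, nickname=\"Jo\")\nprint(person.nickname)  # prints 'Jo'\n\n\nclass Label(BaseModel):\n    text: str\n\n\n# coerce_numbers_to_str=True: the number 42 is coerced to the string '42'\nprint(Label(text=42).text)  # prints '42'\n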

"},{"location":"api_reference/index.html#koheesio.BaseModel--fields","title":"Fields","text":"

Every Koheesio BaseModel has two fields: name and description. These fields are used to provide a name and a description to the model.

  • name: This is the name of the Model. If not provided, it defaults to the class name.

  • description: This is the description of the Model. It has several default behaviors:

    • If not provided, it defaults to the docstring of the class.
    • If the docstring is not provided, it defaults to the name of the class.
    • For multi-line descriptions, it has the following behaviors:
      • Only the first non-empty line is used.
      • Empty lines are removed.
      • Only the first 3 lines are considered.
      • Only the first 120 characters are considered.
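
As an illustrative sketch (using a hypothetical MyModel class), these defaults work roughly as follows:

from koheesio.models import BaseModel\n\n\nclass MyModel(BaseModel):\n    \"\"\"My very first model.\"\"\"\n\n\nmodel = MyModel()\nprint(model.name)  # prints 'MyModel' (defaults to the class name)\nprint(model.description)  # prints 'My very first model.' (defaults to the docstring)\n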
"},{"location":"api_reference/index.html#koheesio.BaseModel--validators","title":"Validators","text":"
  • _set_name_and_description: Set the name and description of the Model as per the rules mentioned above.
"},{"location":"api_reference/index.html#koheesio.BaseModel--properties","title":"Properties","text":"
  • log: Returns a logger with the name of the class.
"},{"location":"api_reference/index.html#koheesio.BaseModel--class-methods","title":"Class Methods","text":"
  • from_basemodel: Returns a new BaseModel instance based on the data of another BaseModel.
  • from_context: Creates BaseModel instance from a given Context.
  • from_dict: Creates BaseModel instance from a given dictionary.
  • from_json: Creates BaseModel instance from a given JSON string.
  • from_toml: Creates BaseModel object from a given toml file.
  • from_yaml: Creates BaseModel object from a given yaml file.
  • lazy: Constructs the model without doing validation.
"},{"location":"api_reference/index.html#koheesio.BaseModel--dunder-methods","title":"Dunder Methods","text":"
  • __add__: Allows to add two BaseModel instances together.
  • __enter__: Allows for using the model in a with-statement.
  • __exit__: Allows for using the model in a with-statement.
  • __setitem__: Set Item dunder method for BaseModel.
  • __getitem__: Get Item dunder method for BaseModel.
"},{"location":"api_reference/index.html#koheesio.BaseModel--instance-methods","title":"Instance Methods","text":"
  • hasattr: Check if given key is present in the model.
  • get: Get an attribute of the model, but don't fail if not present.
  • merge: Merge key,value map with self.
  • set: Allows for subscribing / assigning to class[key].
  • to_context: Converts the BaseModel instance to a Context object.
  • to_dict: Converts the BaseModel instance to a dictionary.
  • to_json: Converts the BaseModel instance to a JSON string.
  • to_yaml: Converts the BaseModel instance to a YAML string.
"},{"location":"api_reference/index.html#koheesio.BaseModel.description","title":"description class-attribute instance-attribute","text":"
description: Optional[str] = Field(default=None, description='Description of the Model')\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.log","title":"log property","text":"
log: Logger\n

Returns a logger with the name of the class

"},{"location":"api_reference/index.html#koheesio.BaseModel.model_config","title":"model_config class-attribute instance-attribute","text":"
model_config = ConfigDict(extra='allow', arbitrary_types_allowed=True, populate_by_name=True, validate_assignment=False, revalidate_instances='subclass-instances', validate_default=True, frozen=False, coerce_numbers_to_str=True, use_enum_values=True)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.name","title":"name class-attribute instance-attribute","text":"
name: Optional[str] = Field(default=None, description='Name of the Model')\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_basemodel","title":"from_basemodel classmethod","text":"
from_basemodel(basemodel: BaseModel, **kwargs)\n

Returns a new BaseModel instance based on the data of another BaseModel

Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_basemodel(cls, basemodel: BaseModel, **kwargs):\n    \"\"\"Returns a new BaseModel instance based on the data of another BaseModel\"\"\"\n    kwargs = {**basemodel.model_dump(), **kwargs}\n    return cls(**kwargs)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_context","title":"from_context classmethod","text":"
from_context(context: Context) -> BaseModel\n

Creates BaseModel instance from a given Context

You have to make sure that the Context object has the necessary attributes to create the model.

Examples:

class SomeStep(BaseModel):\n    foo: str\n\n\ncontext = Context(foo=\"bar\")\nsome_step = SomeStep.from_context(context)\nprint(some_step.foo)  # prints 'bar'\n

Parameters:

Name Type Description Default context Context required

Returns:

Type Description BaseModel Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_context(cls, context: Context) -> BaseModel:\n    \"\"\"Creates BaseModel instance from a given Context\n\n    You have to make sure that the Context object has the necessary attributes to create the model.\n\n    Examples\n    --------\n    ```python\n    class SomeStep(BaseModel):\n        foo: str\n\n\n    context = Context(foo=\"bar\")\n    some_step = SomeStep.from_context(context)\n    print(some_step.foo)  # prints 'bar'\n    ```\n\n    Parameters\n    ----------\n    context: Context\n\n    Returns\n    -------\n    BaseModel\n    \"\"\"\n    return cls(**context)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_dict","title":"from_dict classmethod","text":"
from_dict(data: Dict[str, Any]) -> BaseModel\n

Creates BaseModel instance from a given dictionary

Parameters:

Name Type Description Default data Dict[str, Any] required

Returns:

Type Description BaseModel Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_dict(cls, data: Dict[str, Any]) -> BaseModel:\n    \"\"\"Creates BaseModel instance from a given dictionary\n\n    Parameters\n    ----------\n    data: Dict[str, Any]\n\n    Returns\n    -------\n    BaseModel\n    \"\"\"\n    return cls(**data)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_json","title":"from_json classmethod","text":"
from_json(json_file_or_str: Union[str, Path]) -> BaseModel\n

Creates BaseModel instance from a given JSON string

BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored in the BaseModel object, which is not possible with the standard json library.

See Also

Context.from_json : Deserializes a JSON string to a Context object

Parameters:

Name Type Description Default json_file_or_str Union[str, Path]

Pathlike string or Path that points to the json file or string containing json

required

Returns:

Type Description BaseModel Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_json(cls, json_file_or_str: Union[str, Path]) -> BaseModel:\n    \"\"\"Creates BaseModel instance from a given JSON string\n\n    BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses\n    jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored\n    in the BaseModel object, which is not possible with the standard json library.\n\n    See Also\n    --------\n    Context.from_json : Deserializes a JSON string to a Context object\n\n    Parameters\n    ----------\n    json_file_or_str : Union[str, Path]\n        Pathlike string or Path that points to the json file or string containing json\n\n    Returns\n    -------\n    BaseModel\n    \"\"\"\n    _context = Context.from_json(json_file_or_str)\n    return cls.from_context(_context)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_toml","title":"from_toml classmethod","text":"
from_toml(toml_file_or_str: Union[str, Path]) -> BaseModel\n

Creates BaseModel object from a given toml file

Note: BaseModel offloads the serialization and deserialization of the TOML string to Context class.

Parameters:

Name Type Description Default toml_file_or_str Union[str, Path]

Pathlike string or Path that points to the toml file, or string containing toml

required

Returns:

Type Description BaseModel Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_toml(cls, toml_file_or_str: Union[str, Path]) -> BaseModel:\n    \"\"\"Creates BaseModel object from a given toml file\n\n    Note: BaseModel offloads the serialization and deserialization of the TOML string to Context class.\n\n    Parameters\n    ----------\n    toml_file_or_str: str or Path\n        Pathlike string or Path that points to the toml file, or string containing toml\n\n    Returns\n    -------\n    BaseModel\n    \"\"\"\n    _context = Context.from_toml(toml_file_or_str)\n    return cls.from_context(_context)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_yaml","title":"from_yaml classmethod","text":"
from_yaml(yaml_file_or_str: str) -> BaseModel\n

Creates BaseModel object from a given yaml file

Note: BaseModel offloads the serialization and deserialization of the YAML string to Context class.

Parameters:

Name Type Description Default yaml_file_or_str str

Pathlike string or Path that points to the yaml file, or string containing yaml

required

Returns:

Type Description BaseModel Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_yaml(cls, yaml_file_or_str: str) -> BaseModel:\n    \"\"\"Creates BaseModel object from a given yaml file\n\n    Note: BaseModel offloads the serialization and deserialization of the YAML string to Context class.\n\n    Parameters\n    ----------\n    yaml_file_or_str: str or Path\n        Pathlike string or Path that points to the yaml file, or string containing yaml\n\n    Returns\n    -------\n    BaseModel\n    \"\"\"\n    _context = Context.from_yaml(yaml_file_or_str)\n    return cls.from_context(_context)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.get","title":"get","text":"
get(key: str, default: Optional[Any] = None)\n

Get an attribute of the model, but don't fail if not present

Similar to dict.get()

Examples:

step_output = StepOutput(foo=\"bar\")\nstep_output.get(\"foo\")  # returns 'bar'\nstep_output.get(\"non_existent_key\", \"oops\")  # returns 'oops'\n

Parameters:

Name Type Description Default key str

name of the key to get

required default Optional[Any]

Default value in case the attribute does not exist

None

Returns:

Type Description Any

The value of the attribute

Source code in src/koheesio/models/__init__.py
def get(self, key: str, default: Optional[Any] = None):\n    \"\"\"Get an attribute of the model, but don't fail if not present\n\n    Similar to dict.get()\n\n    Examples\n    --------\n    ```python\n    step_output = StepOutput(foo=\"bar\")\n    step_output.get(\"foo\")  # returns 'bar'\n    step_output.get(\"non_existent_key\", \"oops\")  # returns 'oops'\n    ```\n\n    Parameters\n    ----------\n    key: str\n        name of the key to get\n    default: Optional[Any]\n        Default value in case the attribute does not exist\n\n    Returns\n    -------\n    Any\n        The value of the attribute\n    \"\"\"\n    if self.hasattr(key):\n        return self.__getitem__(key)\n    return default\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.hasattr","title":"hasattr","text":"
hasattr(key: str) -> bool\n

Check if given key is present in the model

Parameters:

Name Type Description Default key str required

Returns:

Type Description bool Source code in src/koheesio/models/__init__.py
def hasattr(self, key: str) -> bool:\n    \"\"\"Check if given key is present in the model\n\n    Parameters\n    ----------\n    key: str\n\n    Returns\n    -------\n    bool\n    \"\"\"\n    return hasattr(self, key)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.lazy","title":"lazy classmethod","text":"
lazy()\n

Constructs the model without doing validation

Essentially an alias to BaseModel.construct()

Source code in src/koheesio/models/__init__.py
@classmethod\ndef lazy(cls):\n    \"\"\"Constructs the model without doing validation\n\n    Essentially an alias to BaseModel.construct()\n    \"\"\"\n    return cls.model_construct()\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.merge","title":"merge","text":"
merge(other: Union[Dict, BaseModel])\n

Merge key,value map with self

Functionally similar to adding two dicts together; like running {**dict_a, **dict_b}.

Examples:

step_output = StepOutput(foo=\"bar\")\nstep_output.merge({\"lorem\": \"ipsum\"})  # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n

Parameters:

Name Type Description Default other Union[Dict, BaseModel]

Dict or another instance of a BaseModel class that will be added to self

required Source code in src/koheesio/models/__init__.py
def merge(self, other: Union[Dict, BaseModel]):\n    \"\"\"Merge key,value map with self\n\n    Functionally similar to adding two dicts together; like running `{**dict_a, **dict_b}`.\n\n    Examples\n    --------\n    ```python\n    step_output = StepOutput(foo=\"bar\")\n    step_output.merge({\"lorem\": \"ipsum\"})  # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n    ```\n\n    Parameters\n    ----------\n    other: Union[Dict, BaseModel]\n        Dict or another instance of a BaseModel class that will be added to self\n    \"\"\"\n    if isinstance(other, BaseModel):\n        other = other.model_dump()  # ensures we really have a dict\n\n    for k, v in other.items():\n        self.set(k, v)\n\n    return self\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.set","title":"set","text":"
set(key: str, value: Any)\n

Allows for subscribing / assigning to class[key].

Examples:

step_output = StepOutput(foo=\"bar\")\nstep_output.set(\"foo\", \"baz\")  # overwrites 'foo' to be 'baz'\n

Parameters:

Name Type Description Default key str

The key of the attribute to assign to

required value Any

Value that should be assigned to the given key

required Source code in src/koheesio/models/__init__.py
def set(self, key: str, value: Any):\n    \"\"\"Allows for subscribing / assigning to `class[key]`.\n\n    Examples\n    --------\n    ```python\n    step_output = StepOutput(foo=\"bar\")\n    step_output.set(\"foo\", \"baz\")  # overwrites 'foo' to be 'baz'\n    ```\n\n    Parameters\n    ----------\n    key: str\n        The key of the attribute to assign to\n    value: Any\n        Value that should be assigned to the given key\n    \"\"\"\n    self.__setitem__(key, value)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.to_context","title":"to_context","text":"
to_context() -> Context\n

Converts the BaseModel instance to a Context object

Returns:

Type Description Context Source code in src/koheesio/models/__init__.py
def to_context(self) -> Context:\n    \"\"\"Converts the BaseModel instance to a Context object\n\n    Returns\n    -------\n    Context\n    \"\"\"\n    return Context(**self.to_dict())\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.to_dict","title":"to_dict","text":"
to_dict() -> Dict[str, Any]\n

Converts the BaseModel instance to a dictionary

Returns:

Type Description Dict[str, Any] Source code in src/koheesio/models/__init__.py
def to_dict(self) -> Dict[str, Any]:\n    \"\"\"Converts the BaseModel instance to a dictionary\n\n    Returns\n    -------\n    Dict[str, Any]\n    \"\"\"\n    return self.model_dump()\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.to_json","title":"to_json","text":"
to_json(pretty: bool = False)\n

Converts the BaseModel instance to a JSON string

BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored in the BaseModel object, which is not possible with the standard json library.

See Also

Context.to_json : Serializes a Context object to a JSON string

Parameters:

Name Type Description Default pretty bool

Toggles whether to return a pretty json string or not

False

Returns:

Type Description str

containing all parameters of the BaseModel instance

Source code in src/koheesio/models/__init__.py
def to_json(self, pretty: bool = False):\n    \"\"\"Converts the BaseModel instance to a JSON string\n\n    BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses\n    jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored\n    in the BaseModel object, which is not possible with the standard json library.\n\n    See Also\n    --------\n    Context.to_json : Serializes a Context object to a JSON string\n\n    Parameters\n    ----------\n    pretty : bool, optional, default=False\n        Toggles whether to return a pretty json string or not\n\n    Returns\n    -------\n    str\n        containing all parameters of the BaseModel instance\n    \"\"\"\n    _context = self.to_context()\n    return _context.to_json(pretty=pretty)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.to_yaml","title":"to_yaml","text":"
to_yaml(clean: bool = False) -> str\n

Converts the BaseModel instance to a YAML string

BaseModel offloads the serialization and deserialization of the YAML string to Context class.

Parameters:

Name Type Description Default clean bool

Toggles whether to remove !!python/object:... from yaml or not. Default: False

False

Returns:

Type Description str

containing all parameters of the BaseModel instance

Source code in src/koheesio/models/__init__.py
def to_yaml(self, clean: bool = False) -> str:\n    \"\"\"Converts the BaseModel instance to a YAML string\n\n    BaseModel offloads the serialization and deserialization of the YAML string to Context class.\n\n    Parameters\n    ----------\n    clean: bool\n        Toggles whether to remove `!!python/object:...` from yaml or not.\n        Default: False\n\n    Returns\n    -------\n    str\n        containing all parameters of the BaseModel instance\n    \"\"\"\n    _context = self.to_context()\n    return _context.to_yaml(clean=clean)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.validate","title":"validate","text":"
validate() -> BaseModel\n

Validate the BaseModel instance

This method is used to validate the BaseModel instance. It is used in conjunction with the lazy method to validate the instance after all the attributes have been set.

This method is intended to be used with the lazy method. The lazy method is used to create an instance of the BaseModel without immediate validation. The validate method is then used to validate the instance afterwards.

Note: in the Pydantic BaseModel, the validate method throws a deprecated warning. This is because Pydantic recommends using the validate_model method instead. However, we are using the validate method here in a different context and a slightly different way.

Examples:

class FooModel(BaseModel):\n    foo: str\n    lorem: str\n\n\nfoo_model = FooModel.lazy()\nfoo_model.foo = \"bar\"\nfoo_model.lorem = \"ipsum\"\nfoo_model.validate()\n
In this example, the foo_model instance is created without immediate validation. The attributes foo and lorem are set afterward. The validate method is then called to validate the instance.

Returns:

Type Description BaseModel

The BaseModel instance

Source code in src/koheesio/models/__init__.py
def validate(self) -> BaseModel:\n    \"\"\"Validate the BaseModel instance\n\n    This method is used to validate the BaseModel instance. It is used in conjunction with the lazy method to\n    validate the instance after all the attributes have been set.\n\n    This method is intended to be used with the `lazy` method. The `lazy` method is used to create an instance of\n    the BaseModel without immediate validation. The `validate` method is then used to validate the instance after.\n\n    > Note: in the Pydantic BaseModel, the `validate` method throws a deprecated warning. This is because Pydantic\n    recommends using the `validate_model` method instead. However, we are using the `validate` method here in a\n    different context and a slightly different way.\n\n    Examples\n    --------\n    ```python\n    class FooModel(BaseModel):\n        foo: str\n        lorem: str\n\n\n    foo_model = FooModel.lazy()\n    foo_model.foo = \"bar\"\n    foo_model.lorem = \"ipsum\"\n    foo_model.validate()\n    ```\n    In this example, the `foo_model` instance is created without immediate validation. The attributes foo and lorem\n    are set afterward. The `validate` method is then called to validate the instance.\n\n    Returns\n    -------\n    BaseModel\n        The BaseModel instance\n    \"\"\"\n    return self.model_validate(self.model_dump())\n
"},{"location":"api_reference/index.html#koheesio.Context","title":"koheesio.Context","text":"
Context(*args, **kwargs)\n

The Context class is a key component of the Koheesio framework, designed to manage configuration data and shared variables across tasks and steps in your application. It behaves much like a dictionary, but with added functionalities.

Key Features
  • Nested keys: Supports accessing and adding nested keys similar to dictionary keys.
  • Recursive merging: Merges two Contexts together, with the incoming Context having priority.
  • Serialization/Deserialization: Easily created from a yaml, toml, or json file, or a dictionary, and can be converted back to a dictionary.
  • Handling complex Python objects: Uses jsonpickle for serialization and deserialization of complex Python objects to and from JSON.

For a comprehensive guide on the usage, examples, and additional features of the Context class, please refer to the reference/concepts/context section of the Koheesio documentation.
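
A minimal, illustrative sketch of the dictionary-like behavior described above (nested keys via dotted notation and recursive merging):

from koheesio import Context\n\ncontext = Context({\"a\": {\"b\": \"c\"}, \"f\": \"g\"})\nprint(context.get(\"a.b\"))  # prints 'c' (nested key access)\n\nother = Context({\"a\": {\"d\": \"e\"}})\nmerged = context.merge(other, recursive=True)\nprint(merged.get(\"a.d\"))  # prints 'e' (recursively merged)\n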

Methods:

  • add: Add a key/value pair to the context.
  • get: Get value of a given key.
  • get_item: Acts just like .get, except that it returns the key also.
  • contains: Check if the context contains a given key.
  • merge: Merge this context with the context of another, where the incoming context has priority.
  • to_dict: Returns all parameters of the context as a dict.
  • from_dict: Creates Context object from the given dict.
  • from_yaml: Creates Context object from a given yaml file.
  • from_json: Creates Context object from a given json file.

Dunder methods
  • __iter__(): Allows for iteration across a Context.
  • __len__(): Returns the length of the Context.
  • __getitem__(item): Makes class subscriptable.
Inherited from Mapping
  • items(): Returns all items of the Context.
  • keys(): Returns all keys of the Context.
  • values(): Returns all values of the Context.
Source code in src/koheesio/context.py
def __init__(self, *args, **kwargs):\n    \"\"\"Initializes the Context object with given arguments.\"\"\"\n    for arg in args:\n        if isinstance(arg, dict):\n            kwargs.update(arg)\n        if isinstance(arg, Context):\n            kwargs.update(arg.to_dict())\n\n    for key, value in kwargs.items():\n        self.__dict__[key] = self.process_value(value)\n
"},{"location":"api_reference/index.html#koheesio.Context.add","title":"add","text":"
add(key: str, value: Any) -> Context\n

Add a key/value pair to the context

Source code in src/koheesio/context.py
def add(self, key: str, value: Any) -> Context:\n    \"\"\"Add a key/value pair to the context\"\"\"\n    self.__dict__[key] = value\n    return self\n
"},{"location":"api_reference/index.html#koheesio.Context.contains","title":"contains","text":"
contains(key: str) -> bool\n

Check if the context contains a given key

Parameters:

Name Type Description Default key str required

Returns:

Type Description bool Source code in src/koheesio/context.py
def contains(self, key: str) -> bool:\n    \"\"\"Check if the context contains a given key\n\n    Parameters\n    ----------\n    key: str\n\n    Returns\n    -------\n    bool\n    \"\"\"\n    try:\n        self.get(key, safe=False)\n        return True\n    except KeyError:\n        return False\n
"},{"location":"api_reference/index.html#koheesio.Context.from_dict","title":"from_dict classmethod","text":"
from_dict(kwargs: dict) -> Context\n

Creates Context object from the given dict

Parameters:

Name Type Description Default kwargs dict required

Returns:

Type Description Context Source code in src/koheesio/context.py
@classmethod\ndef from_dict(cls, kwargs: dict) -> Context:\n    \"\"\"Creates Context object from the given dict\n\n    Parameters\n    ----------\n    kwargs: dict\n\n    Returns\n    -------\n    Context\n    \"\"\"\n    return cls(kwargs)\n
"},{"location":"api_reference/index.html#koheesio.Context.from_json","title":"from_json classmethod","text":"
from_json(json_file_or_str: Union[str, Path]) -> Context\n

Creates Context object from a given json file

Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be stored in the Context object, which is not possible with the standard json library.

Why jsonpickle?

(from https://jsonpickle.github.io/)

Data serialized with python\u2019s pickle (or cPickle or dill) is not easily readable outside of python. Using the json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports json.

Security

(from https://jsonpickle.github.io/)

jsonpickle should be treated the same as the Python stdlib pickle module from a security perspective.

Parameters:

Name Type Description Default json_file_or_str Union[str, Path]

Pathlike string or Path that points to the json file or string containing json

required

Returns:

Type Description Context Source code in src/koheesio/context.py
@classmethod\ndef from_json(cls, json_file_or_str: Union[str, Path]) -> Context:\n    \"\"\"Creates Context object from a given json file\n\n    Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be\n    stored in the Context object, which is not possible with the standard json library.\n\n    Why jsonpickle?\n    ---------------\n    (from https://jsonpickle.github.io/)\n\n    > Data serialized with python\u2019s pickle (or cPickle or dill) is not easily readable outside of python. Using the\n    json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex\n    data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports\n    json.\n\n    Security\n    --------\n    (from https://jsonpickle.github.io/)\n\n    > jsonpickle should be treated the same as the Python stdlib pickle module from a security perspective.\n\n    ### ! Warning !\n    > The jsonpickle module is not secure. Only unpickle data you trust.\n    It is possible to construct malicious pickle data which will execute arbitrary code during unpickling.\n    Never unpickle data that could have come from an untrusted source, or that could have been tampered with.\n    Consider signing data with an HMAC if you need to ensure that it has not been tampered with.\n    Safer deserialization approaches, such as reading JSON directly, may be more appropriate if you are processing\n    untrusted data.\n\n    Parameters\n    ----------\n    json_file_or_str : Union[str, Path]\n        Pathlike string or Path that points to the json file or string containing json\n\n    Returns\n    -------\n    Context\n    \"\"\"\n    json_str = json_file_or_str\n\n    # check if json_str is pathlike\n    if (json_file := Path(json_file_or_str)).exists():\n        json_str = json_file.read_text(encoding=\"utf-8\")\n\n    json_dict = jsonpickle.loads(json_str)\n    return cls.from_dict(json_dict)\n
"},{"location":"api_reference/index.html#koheesio.Context.from_json--warning","title":"! Warning !","text":"

The jsonpickle module is not secure. Only unpickle data you trust. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with. Consider signing data with an HMAC if you need to ensure that it has not been tampered with. Safer deserialization approaches, such as reading JSON directly, may be more appropriate if you are processing untrusted data.

"},{"location":"api_reference/index.html#koheesio.Context.from_toml","title":"from_toml classmethod","text":"
from_toml(toml_file_or_str: Union[str, Path]) -> Context\n

Creates Context object from a given toml file

Parameters:

Name Type Description Default toml_file_or_str Union[str, Path]

Pathlike string or Path that points to the toml file or string containing toml

required

Returns:

Type Description Context Source code in src/koheesio/context.py
@classmethod\ndef from_toml(cls, toml_file_or_str: Union[str, Path]) -> Context:\n    \"\"\"Creates Context object from a given toml file\n\n    Parameters\n    ----------\n    toml_file_or_str: Union[str, Path]\n        Pathlike string or Path that points to the toml file or string containing toml\n\n    Returns\n    -------\n    Context\n    \"\"\"\n    toml_str = toml_file_or_str\n\n    # check if toml_str is pathlike\n    if (toml_file := Path(toml_file_or_str)).exists():\n        toml_str = toml_file.read_text(encoding=\"utf-8\")\n\n    toml_dict = tomli.loads(toml_str)\n    return cls.from_dict(toml_dict)\n
"},{"location":"api_reference/index.html#koheesio.Context.from_yaml","title":"from_yaml classmethod","text":"
from_yaml(yaml_file_or_str: str) -> Context\n

Creates Context object from a given yaml file

Parameters:

Name Type Description Default yaml_file_or_str str

Pathlike string or Path that points to the yaml file, or string containing yaml

required

Returns:

Type Description Context Source code in src/koheesio/context.py
@classmethod\ndef from_yaml(cls, yaml_file_or_str: str) -> Context:\n    \"\"\"Creates Context object from a given yaml file\n\n    Parameters\n    ----------\n    yaml_file_or_str: str or Path\n        Pathlike string or Path that points to the yaml file, or string containing yaml\n\n    Returns\n    -------\n    Context\n    \"\"\"\n    yaml_str = yaml_file_or_str\n\n    # check if yaml_str is pathlike\n    if (yaml_file := Path(yaml_file_or_str)).exists():\n        yaml_str = yaml_file.read_text(encoding=\"utf-8\")\n\n    # Bandit: disable yaml.load warning\n    yaml_dict = yaml.load(yaml_str, Loader=yaml.Loader)  # nosec B506: yaml_load\n\n    return cls.from_dict(yaml_dict)\n
"},{"location":"api_reference/index.html#koheesio.Context.get","title":"get","text":"
get(key: str, default: Any = None, safe: bool = True) -> Any\n

Get value of a given key

The key can either be an actual key (top level) or the key of a nested value. Behaves a lot like a dict's .get() method otherwise.

Parameters:

Name Type Description Default key str

Can be a real key, or can be a dotted notation of a nested key

required default Any

Default value to return

None safe bool

Toggles whether to fail or not when item cannot be found

True

Returns:

Type Description Any

Value of the requested item

Example

Example of a nested call:

context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\ncontext.get(\"a.b\")\n

Returns c

Source code in src/koheesio/context.py
def get(self, key: str, default: Any = None, safe: bool = True) -> Any:\n    \"\"\"Get value of a given key\n\n    The key can either be an actual key (top level) or the key of a nested value.\n    Behaves a lot like a dict's `.get()` method otherwise.\n\n    Parameters\n    ----------\n    key:\n        Can be a real key, or can be a dotted notation of a nested key\n    default:\n        Default value to return\n    safe:\n        Toggles whether to fail or not when item cannot be found\n\n    Returns\n    -------\n    Any\n        Value of the requested item\n\n    Example\n    -------\n    Example of a nested call:\n\n    ```python\n    context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\n    context.get(\"a.b\")\n    ```\n\n    Returns `c`\n    \"\"\"\n    try:\n        if \".\" not in key:\n            return self.__dict__[key]\n\n        # handle nested keys\n        nested_keys = key.split(\".\")\n        value = self  # parent object\n        for k in nested_keys:\n            value = value[k]  # iterate through nested values\n        return value\n\n    except (AttributeError, KeyError, TypeError) as e:\n        if not safe:\n            raise KeyError(f\"requested key '{key}' does not exist in {self}\") from e\n        return default\n
"},{"location":"api_reference/index.html#koheesio.Context.get_all","title":"get_all","text":"
get_all() -> dict\n

alias to to_dict()

Source code in src/koheesio/context.py
def get_all(self) -> dict:\n    \"\"\"alias to to_dict()\"\"\"\n    return self.to_dict()\n
"},{"location":"api_reference/index.html#koheesio.Context.get_item","title":"get_item","text":"
get_item(key: str, default: Any = None, safe: bool = True) -> Dict[str, Any]\n

Acts just like .get, except that it returns the key also

Returns:

Type Description Dict[str, Any]

key/value-pair of the requested item

Example

Example of a nested call:

context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\ncontext.get_item(\"a.b\")\n

Returns {'a.b': 'c'}

Source code in src/koheesio/context.py
def get_item(self, key: str, default: Any = None, safe: bool = True) -> Dict[str, Any]:\n    \"\"\"Acts just like `.get`, except that it returns the key also\n\n    Returns\n    -------\n    Dict[str, Any]\n        key/value-pair of the requested item\n\n    Example\n    -------\n    Example of a nested call:\n\n    ```python\n    context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\n    context.get_item(\"a.b\")\n    ```\n\n    Returns `{'a.b': 'c'}`\n    \"\"\"\n    value = self.get(key, default, safe)\n    return {key: value}\n
"},{"location":"api_reference/index.html#koheesio.Context.merge","title":"merge","text":"
merge(context: Context, recursive: bool = False) -> Context\n

Merge this context with the context of another, where the incoming context has priority.

Parameters:

Name Type Description Default context Context

Another Context class

required recursive bool

Recursively merge two dictionaries to an arbitrary depth

False

Returns:

Type Description Context

updated context
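
Since the incoming context has priority, an overlapping key takes the incoming value. An illustrative sketch:

base = Context({\"a\": 1, \"b\": 2})\nupdated = base.merge(Context({\"a\": 10}))\nprint(updated.to_dict())  # {'a': 10, 'b': 2}\n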

Source code in src/koheesio/context.py
def merge(self, context: Context, recursive: bool = False) -> Context:\n    \"\"\"Merge this context with the context of another, where the incoming context has priority.\n\n    Parameters\n    ----------\n    context: Context\n        Another Context class\n    recursive: bool\n        Recursively merge two dictionaries to an arbitrary depth\n\n    Returns\n    -------\n    Context\n        updated context\n    \"\"\"\n    if recursive:\n        return Context.from_dict(self._recursive_merge(target_context=self, merge_context=context).to_dict())\n\n    # just merge on the top level keys\n    return Context.from_dict({**self.to_dict(), **context.to_dict()})\n
"},{"location":"api_reference/index.html#koheesio.Context.process_value","title":"process_value","text":"
process_value(value: Any) -> Any\n

Processes the given value, converting dictionaries to Context objects as needed.

Source code in src/koheesio/context.py
def process_value(self, value: Any) -> Any:\n    \"\"\"Processes the given value, converting dictionaries to Context objects as needed.\"\"\"\n    if isinstance(value, dict):\n        return self.from_dict(value)\n\n    if isinstance(value, (list, set)):\n        return [self.from_dict(v) if isinstance(v, dict) else v for v in value]\n\n    return value\n
"},{"location":"api_reference/index.html#koheesio.Context.to_dict","title":"to_dict","text":"
to_dict() -> Dict[str, Any]\n

Returns all parameters of the context as a dict

Returns:

Type Description dict

containing all parameters of the context

Source code in src/koheesio/context.py
def to_dict(self) -> Dict[str, Any]:\n    \"\"\"Returns all parameters of the context as a dict\n\n    Returns\n    -------\n    dict\n        containing all parameters of the context\n    \"\"\"\n    result = {}\n\n    for key, value in self.__dict__.items():\n        if isinstance(value, Context):\n            result[key] = value.to_dict()\n        elif isinstance(value, list):\n            result[key] = [e.to_dict() if isinstance(e, Context) else e for e in value]\n        else:\n            result[key] = value\n\n    return result\n
"},{"location":"api_reference/index.html#koheesio.Context.to_json","title":"to_json","text":"
to_json(pretty: bool = False) -> str\n

Returns all parameters of the context as a json string

Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be stored in the Context object, which is not possible with the standard json library.

Why jsonpickle?

(from https://jsonpickle.github.io/)

Data serialized with python's pickle (or cPickle or dill) is not easily readable outside of python. Using the json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports json.

Parameters:

Name Type Description Default pretty bool

Toggles whether to return a pretty json string or not

False

Returns:

Type Description str

containing all parameters of the context

Source code in src/koheesio/context.py
def to_json(self, pretty: bool = False) -> str:\n    \"\"\"Returns all parameters of the context as a json string\n\n    Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be\n    stored in the Context object, which is not possible with the standard json library.\n\n    Why jsonpickle?\n    ---------------\n    (from https://jsonpickle.github.io/)\n\n    > Data serialized with python's pickle (or cPickle or dill) is not easily readable outside of python. Using the\n    json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex\n    data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports\n    json.\n\n    Parameters\n    ----------\n    pretty : bool, optional, default=False\n        Toggles whether to return a pretty json string or not\n\n    Returns\n    -------\n    str\n        containing all parameters of the context\n    \"\"\"\n    d = self.to_dict()\n    return jsonpickle.dumps(d, indent=4) if pretty else jsonpickle.dumps(d)\n
"},{"location":"api_reference/index.html#koheesio.Context.to_yaml","title":"to_yaml","text":"
to_yaml(clean: bool = False) -> str\n

Returns all parameters of the context as a yaml string

Parameters:

Name Type Description Default clean bool

Toggles whether to remove !!python/object:... from yaml or not. Default: False

False

Returns:

Type Description str

containing all parameters of the context

Source code in src/koheesio/context.py
def to_yaml(self, clean: bool = False) -> str:\n    \"\"\"Returns all parameters of the context as a yaml string\n\n    Parameters\n    ----------\n    clean: bool\n        Toggles whether to remove `!!python/object:...` from yaml or not.\n        Default: False\n\n    Returns\n    -------\n    str\n        containing all parameters of the context\n    \"\"\"\n    # sort_keys=False to preserve order of keys\n    yaml_str = yaml.dump(self.to_dict(), sort_keys=False)\n\n    # remove `!!python/object:...` from yaml\n    if clean:\n        remove_pattern = re.compile(r\"!!python/object:.*?\\n\")\n        yaml_str = re.sub(remove_pattern, \"\\n\", yaml_str)\n\n    return yaml_str\n
"},{"location":"api_reference/index.html#koheesio.ExtraParamsMixin","title":"koheesio.ExtraParamsMixin","text":"

Mixin class that adds support for arbitrary keyword arguments to Pydantic models.

The keyword arguments are extracted from the model's values and moved to a params dictionary.

"},{"location":"api_reference/index.html#koheesio.ExtraParamsMixin.extra_params","title":"extra_params cached property","text":"
extra_params: Dict[str, Any]\n

Extract params (passed as arbitrary kwargs) from values and move them to params dict

"},{"location":"api_reference/index.html#koheesio.ExtraParamsMixin.params","title":"params class-attribute instance-attribute","text":"
params: Dict[str, Any] = Field(default_factory=dict)\n
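
A minimal sketch of how this mixin is intended to be used. MyConfig and custom_flag are hypothetical names, and the example assumes the model is configured to allow extra fields so that they can be collected into extra_params:

from pydantic import BaseModel, ConfigDict\n\nfrom koheesio import ExtraParamsMixin\n\n\nclass MyConfig(ExtraParamsMixin, BaseModel):\n    \"\"\"Hypothetical model for illustration purposes.\"\"\"\n\n    model_config = ConfigDict(extra=\"allow\")  # assumption: extras must be allowed to reach extra_params\n    name: str\n\n\nconfig = MyConfig(name=\"job\", custom_flag=True)  # custom_flag is not a declared field\nprint(config.extra_params)                       # expected (assumption): {'custom_flag': True}\n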
"},{"location":"api_reference/index.html#koheesio.LoggingFactory","title":"koheesio.LoggingFactory","text":"
LoggingFactory(name: Optional[str] = None, env: Optional[str] = None, level: Optional[str] = None, logger_id: Optional[str] = None)\n

Logging factory to be used to generate logger instances.

Parameters:

Name Type Description Default name Optional[str] None env Optional[str] None level Optional[str] None logger_id Optional[str] None Source code in src/koheesio/logger.py
def __init__(\n    self,\n    name: Optional[str] = None,\n    env: Optional[str] = None,\n    level: Optional[str] = None,\n    logger_id: Optional[str] = None,\n):\n    \"\"\"Logging factory to be used in pipeline. Prepare logger instance.\n\n    Parameters\n    ----------\n    name logger name.\n    env environment (\"local\", \"qa\", \"prod\").\n    logger_id unique identifier for the logger.\n    \"\"\"\n\n    LoggingFactory.LOGGER_NAME = name or LoggingFactory.LOGGER_NAME\n    LoggerIDFilter.LOGGER_ID = logger_id or LoggerIDFilter.LOGGER_ID\n    LoggingFactory.LOGGER_FILTER = LoggingFactory.LOGGER_FILTER or LoggerIDFilter()\n    LoggingFactory.ENV = env or LoggingFactory.ENV\n\n    console_handler = logging.StreamHandler(sys.stdout if LoggingFactory.ENV == \"local\" else sys.stderr)\n    console_handler.setFormatter(LoggingFactory.LOGGER_FORMATTER)\n    console_handler.addFilter(LoggingFactory.LOGGER_FILTER)\n    # WARNING is default level for root logger in python\n    logging.basicConfig(level=logging.WARNING, handlers=[console_handler], force=True)\n\n    LoggingFactory.CONSOLE_HANDLER = console_handler\n\n    logger = getLogger(LoggingFactory.LOGGER_NAME)\n    logger.setLevel(level or LoggingFactory.LOGGER_LEVEL)\n    LoggingFactory.LOGGER = logger\n
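
A brief usage sketch (the logger name, level and message are illustrative):

from koheesio.logger import LoggingFactory\n\nLoggingFactory(name=\"my_pipeline\", env=\"local\", level=\"DEBUG\")  # configure the koheesio logger\nlogger = LoggingFactory.get_logger(\"ingest\", inherit_from_koheesio=True)\nlogger.debug(\"pipeline initialised\")\n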
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.CONSOLE_HANDLER","title":"CONSOLE_HANDLER class-attribute instance-attribute","text":"
CONSOLE_HANDLER: Optional[Handler] = None\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.ENV","title":"ENV class-attribute instance-attribute","text":"
ENV: Optional[str] = None\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER","title":"LOGGER class-attribute instance-attribute","text":"
LOGGER: Optional[Logger] = None\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_ENV","title":"LOGGER_ENV class-attribute instance-attribute","text":"
LOGGER_ENV: str = 'local'\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_FILTER","title":"LOGGER_FILTER class-attribute instance-attribute","text":"
LOGGER_FILTER: Optional[Filter] = None\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_FORMAT","title":"LOGGER_FORMAT class-attribute instance-attribute","text":"
LOGGER_FORMAT: str = '[%(logger_id)s] [%(asctime)s] [%(levelname)s] [%(name)s] {%(module)s.py:%(funcName)s:%(lineno)d} - %(message)s'\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_FORMATTER","title":"LOGGER_FORMATTER class-attribute instance-attribute","text":"
LOGGER_FORMATTER: Formatter = Formatter(LOGGER_FORMAT)\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_LEVEL","title":"LOGGER_LEVEL class-attribute instance-attribute","text":"
LOGGER_LEVEL: str = get('KOHEESIO_LOGGING_LEVEL', 'WARNING')\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_NAME","title":"LOGGER_NAME class-attribute instance-attribute","text":"
LOGGER_NAME: str = 'koheesio'\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.add_handlers","title":"add_handlers staticmethod","text":"
add_handlers(handlers: List[Tuple[str, Dict]]) -> None\n

Add handlers to existing root logger.

Parameters:

Name Type Description Default handler_class required handlers_config required Source code in src/koheesio/logger.py
@staticmethod\ndef add_handlers(handlers: List[Tuple[str, Dict]]) -> None:\n    \"\"\"Add handlers to existing root logger.\n\n    Parameters\n    ----------\n    handler_class handler module and class for importing.\n    handlers_config configuration for handler.\n\n    \"\"\"\n    for handler_module_class, handler_conf in handlers:\n        handler_class: logging.Handler = import_class(handler_module_class)\n        handler_level = handler_conf.pop(\"level\") if \"level\" in handler_conf else \"WARNING\"\n        # noinspection PyCallingNonCallable\n        handler = handler_class(**handler_conf)\n        handler.setLevel(handler_level)\n        handler.addFilter(LoggingFactory.LOGGER_FILTER)\n        handler.setFormatter(LoggingFactory.LOGGER_FORMATTER)\n        LoggingFactory.LOGGER.addHandler(handler)\n
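
A hedged sketch of adding a file handler (the file name and level are illustrative; any importable logging.Handler subclass is expected to work the same way):

LoggingFactory(name=\"my_pipeline\", env=\"local\")  # the factory must be initialised first\nLoggingFactory.add_handlers(\n    [(\"logging.FileHandler\", {\"filename\": \"pipeline.log\", \"level\": \"INFO\"})]\n)\n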
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.get_logger","title":"get_logger staticmethod","text":"
get_logger(name: str, inherit_from_koheesio: bool = False) -> Logger\n

Provide a logger. If inherit_from_koheesio is set, the logger name is prefixed with LoggingFactory.LOGGER_NAME so that it inherits from the koheesio logger.

Parameters:

Name Type Description Default name str required inherit_from_koheesio bool False

Returns:

Name Type Description logger Logger Source code in src/koheesio/logger.py
@staticmethod\ndef get_logger(name: str, inherit_from_koheesio: bool = False) -> Logger:\n    \"\"\"Provide logger. If inherit_from_koheesio then inherit from LoggingFactory.PIPELINE_LOGGER_NAME.\n\n    Parameters\n    ----------\n    name: Name of logger.\n    inherit_from_koheesio: Inherit logger from koheesio\n\n    Returns\n    -------\n    logger: Logger\n\n    \"\"\"\n    if inherit_from_koheesio:\n        LoggingFactory.__check_koheesio_logger_initialized()\n        name = f\"{LoggingFactory.LOGGER_NAME}.{name}\"\n\n    return getLogger(name)\n
"},{"location":"api_reference/index.html#koheesio.Step","title":"koheesio.Step","text":"

Base class for a step

A custom unit of logic that can be executed.

The Step class is designed to be subclassed. To create a new step, one would subclass Step and implement the def execute(self) method, specifying the expected inputs and outputs.

Note: since the Step class is meta-classed, the execute method is wrapped with the do_execute function, making it always return the Step's output. Hence, an explicit return is not needed when implementing execute.

Methods and Attributes

The Step class has several attributes and methods.

Background

A Step is an atomic operation and serves as the building block of data pipelines built with the framework. Tasks typically consist of a series of Steps.

A step can be seen as an operation on a set of inputs that returns a set of outputs. This does not, however, imply that steps are stateless (e.g. data writes)!

The diagram serves to illustrate the concept of a Step:

\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 1 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502                  \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 1 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2502                  \u2502        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n                   \u2502                  \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u2502                  \u2502        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 2 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502       Step       \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 2 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2502                  \u2502        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n                   \u2502                  \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u2502                  \u2502        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 3 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502                  \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 3 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n

Steps are built on top of Pydantic, which is a data validation and settings management using python type annotations. This allows for the automatic validation of the inputs and outputs of a Step.

  • Step inherits from BaseModel, which is a Pydantic class used to define data models. This allows Step to automatically validate data against the defined fields and their types.
  • Step is metaclassed by StepMetaClass, which is a custom metaclass that wraps the execute method of the Step class with the _execute_wrapper function. This ensures that the execute method always returns the output of the Step along with providing logging and validation of the output.
  • Step has an Output class, which is a subclass of StepOutput. This class is used to validate the output of the Step. The Output class is defined as an inner class of the Step class. The Output class can be accessed through the Step.Output attribute.
  • The Output class can be extended to add additional fields to the output of the Step.

Examples:

class MyStep(Step):\n    a: str  # input\n\n    class Output(StepOutput):  # output\n        b: str\n\n    def execute(self) -> MyStep.Output:\n        self.output.b = f\"{self.a}-some-suffix\"\n
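
Building on MyStep above, a short sketch of running a step and reading its output:

step = MyStep(a=\"foo\")\nstep.execute()        # the wrapped execute validates and returns the Output\nprint(step.output.b)  # foo-some-suffix\n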
"},{"location":"api_reference/index.html#koheesio.Step--input","title":"INPUT","text":"

The following fields are available by default on the Step class:
  • name: Name of the Step. If not set, the name of the class will be used.
  • description: Description of the Step. If not set, the docstring of the class will be used. If the docstring contains multiple lines, only the first line will be used.

When subclassing a Step, any additional pydantic field will be treated as input to the Step. See also the explanation on the .execute() method below.

"},{"location":"api_reference/index.html#koheesio.Step--output","title":"OUTPUT","text":"

Every Step has an Output class, which is a subclass of StepOutput. This class is used to validate the output of the Step. The Output class is defined as an inner class of the Step class. The Output class can be accessed through the Step.Output attribute. The Output class can be extended to add additional fields to the output of the Step. See also the explanation on the .execute().

  • Output: A nested class representing the output of the Step used to validate the output of the Step and based on the StepOutput class.
  • output: Allows you to interact with the Output of the Step lazily (see above and StepOutput)

When subclassing a Step, any additional pydantic field added to the nested Output class will be treated as output of the Step. See also the description of StepOutput for more information.

"},{"location":"api_reference/index.html#koheesio.Step--methods","title":"Methods:","text":"
  • execute: Abstract method to implement for new steps.
    • The Inputs of the step can be accessed, using self.input_name.
    • The output of the step can be accessed, using self.output.output_name.
  • run: Alias to .execute() method. You can use this to run the step, but execute is preferred.
  • to_yaml: YAML dump the step
  • get_description: Get the description of the Step

When subclassing a Step, execute is the only method that needs to be implemented. Any additional method added to the class will be treated as a method of the Step.

Note: since the Step class is meta-classed, the execute method is automatically wrapped with the do_execute function making it always return a StepOutput. See also the explanation on the do_execute function.

"},{"location":"api_reference/index.html#koheesio.Step--class-methods","title":"class methods:","text":"
  • from_step: Returns a new Step instance based on the data of another Step instance. For example: MyStep.from_step(other_step, a=\"foo\")
  • get_description: Get the description of the Step
"},{"location":"api_reference/index.html#koheesio.Step--dunder-methods","title":"dunder methods:","text":"
  • __getattr__: Allows input to be accessed through self.input_name
  • __repr__ and __str__: String representation of a step
"},{"location":"api_reference/index.html#koheesio.Step.output","title":"output property writable","text":"
output: Output\n

Interact with the output of the Step

"},{"location":"api_reference/index.html#koheesio.Step.Output","title":"Output","text":"

Output class for Step

"},{"location":"api_reference/index.html#koheesio.Step.execute","title":"execute abstractmethod","text":"
execute()\n

Abstract method to implement for new steps.

The Inputs of the step can be accessed, using self.input_name

Note: since the Step class is meta-classed, the execute method is wrapped with the do_execute function making it always return the Steps output

Source code in src/koheesio/steps/__init__.py
@abstractmethod\ndef execute(self):\n    \"\"\"Abstract method to implement for new steps.\n\n    The Inputs of the step can be accessed, using `self.input_name`\n\n    Note: since the Step class is meta-classed, the execute method is wrapped with the `do_execute` function making\n      it always return the Steps output\n    \"\"\"\n    raise NotImplementedError\n
"},{"location":"api_reference/index.html#koheesio.Step.from_step","title":"from_step classmethod","text":"
from_step(step: Step, **kwargs)\n

Returns a new Step instance based on the data of another Step or BaseModel instance

Source code in src/koheesio/steps/__init__.py
@classmethod\ndef from_step(cls, step: Step, **kwargs):\n    \"\"\"Returns a new Step instance based on the data of another Step or BaseModel instance\"\"\"\n    return cls.from_basemodel(step, **kwargs)\n
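
Building on the MyStep example above, a brief sketch:

step = MyStep(a=\"foo\")\ncopy = MyStep.from_step(step, a=\"bar\")  # new instance, with 'a' overridden\nprint(copy.a)                           # bar\n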
"},{"location":"api_reference/index.html#koheesio.Step.repr_json","title":"repr_json","text":"
repr_json(simple=False) -> str\n

dump the step to json, meant for representation

Note: use to_json if you want to dump the step to json for serialization. This method is meant for representation purposes only!

Examples:

>>> step = MyStep(a=\"foo\")\n>>> print(step.repr_json())\n{\"input\": {\"a\": \"foo\"}}\n

Parameters:

Name Type Description Default simple

When toggled to True, a briefer output will be produced. This is friendlier for logging purposes

False

Returns:

Type Description str

A string, which is valid json

Source code in src/koheesio/steps/__init__.py
def repr_json(self, simple=False) -> str:\n    \"\"\"dump the step to json, meant for representation\n\n    Note: use to_json if you want to dump the step to json for serialization\n    This method is meant for representation purposes only!\n\n    Examples\n    --------\n    ```python\n    >>> step = MyStep(a=\"foo\")\n    >>> print(step.repr_json())\n    {\"input\": {\"a\": \"foo\"}}\n    ```\n\n    Parameters\n    ----------\n    simple: bool\n        When toggled to True, a briefer output will be produced. This is friendlier for logging purposes\n\n    Returns\n    -------\n    str\n        A string, which is valid json\n    \"\"\"\n    model_dump_options = dict(warnings=\"none\", exclude_unset=True)\n\n    _result = {}\n\n    # extract input\n    _input = self.model_dump(**model_dump_options)\n\n    # remove name and description from input and add to result if simple is not set\n    name = _input.pop(\"name\", None)\n    description = _input.pop(\"description\", None)\n    if not simple:\n        if name:\n            _result[\"name\"] = name\n        if description:\n            _result[\"description\"] = description\n    else:\n        model_dump_options[\"exclude\"] = {\"name\", \"description\"}\n\n    # extract output\n    _output = self.output.model_dump(**model_dump_options)\n\n    # add output to result\n    if _output:\n        _result[\"output\"] = _output\n\n    # add input to result\n    _result[\"input\"] = _input\n\n    class MyEncoder(json.JSONEncoder):\n        \"\"\"Custom JSON Encoder to handle non-serializable types\"\"\"\n\n        def default(self, o: Any) -> Any:\n            try:\n                return super().default(o)\n            except TypeError:\n                return o.__class__.__name__\n\n    # Use MyEncoder when converting the dictionary to a JSON string\n    json_str = json.dumps(_result, cls=MyEncoder)\n\n    return json_str\n
"},{"location":"api_reference/index.html#koheesio.Step.repr_yaml","title":"repr_yaml","text":"
repr_yaml(simple=False) -> str\n

dump the step to yaml, meant for representation

Note: use to_yaml if you want to dump the step to yaml for serialization. This method is meant for representation purposes only!

Examples:

>>> step = MyStep(a=\"foo\")\n>>> print(step.repr_yaml())\ninput:\n  a: foo\n

Parameters:

Name Type Description Default simple

When toggled to True, a briefer output will be produced. This is friendlier for logging purposes

False

Returns:

Type Description str

A string, which is valid yaml

Source code in src/koheesio/steps/__init__.py
def repr_yaml(self, simple=False) -> str:\n    \"\"\"dump the step to yaml, meant for representation\n\n    Note: use to_yaml if you want to dump the step to yaml for serialization\n    This method is meant for representation purposes only!\n\n    Examples\n    --------\n    ```python\n    >>> step = MyStep(a=\"foo\")\n    >>> print(step.repr_yaml())\n    input:\n      a: foo\n    ```\n\n    Parameters\n    ----------\n    simple: bool\n        When toggled to True, a briefer output will be produced. This is friendlier for logging purposes\n\n    Returns\n    -------\n    str\n        A string, which is valid yaml\n    \"\"\"\n    json_str = self.repr_json(simple=simple)\n\n    # Parse the JSON string back into a dictionary\n    _result = json.loads(json_str)\n\n    return yaml.dump(_result)\n
"},{"location":"api_reference/index.html#koheesio.Step.run","title":"run","text":"
run()\n

Alias to .execute()

Source code in src/koheesio/steps/__init__.py
def run(self):\n    \"\"\"Alias to .execute()\"\"\"\n    return self.execute()\n
"},{"location":"api_reference/index.html#koheesio.StepOutput","title":"koheesio.StepOutput","text":"

Class for the StepOutput model

Usage

Setting up a StepOutput subclass is done like this:

class YourOwnOutput(StepOutput):\n    a: str\n    b: int\n

"},{"location":"api_reference/index.html#koheesio.StepOutput.model_config","title":"model_config class-attribute instance-attribute","text":"
model_config = ConfigDict(validate_default=False, defer_build=True)\n
"},{"location":"api_reference/index.html#koheesio.StepOutput.validate_output","title":"validate_output","text":"
validate_output() -> StepOutput\n

Validate the output of the Step

Essentially, this method is a wrapper around the validate method of the BaseModel class

Source code in src/koheesio/steps/__init__.py
def validate_output(self) -> StepOutput:\n    \"\"\"Validate the output of the Step\n\n    Essentially, this method is a wrapper around the validate method of the BaseModel class\n    \"\"\"\n    validated_model = self.validate()\n    return StepOutput.from_basemodel(validated_model)\n
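
Building on the YourOwnOutput example above, a minimal sketch:

output = YourOwnOutput(a=\"foo\", b=42)\nvalidated = output.validate_output()  # re-validates the data and returns a StepOutput-based instance\n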
"},{"location":"api_reference/index.html#koheesio.print_logo","title":"koheesio.print_logo","text":"
print_logo()\n
Source code in src/koheesio/__init__.py
def print_logo():\n    global _logo_printed\n    global _koheesio_print_logo\n\n    if not _logo_printed and _koheesio_print_logo:\n        print(ABOUT)\n        _logo_printed = True\n
"},{"location":"api_reference/context.html","title":"Context","text":"

The Context module is a part of the Koheesio framework and is primarily used for managing the environment configuration where a Task or Step runs. It helps in adapting the behavior of a Task/Step based on the environment it operates in, thereby avoiding the repetition of configuration values across different tasks.

The Context class, which is a key component of this module, functions similarly to a dictionary but with additional features. It supports operations like handling nested keys, recursive merging of contexts, and serialization/deserialization to and from various formats like JSON, YAML, and TOML.

For a comprehensive guide on the usage, examples, and additional features of the Context class, please refer to the reference/concepts/context section of the Koheesio documentation.

"},{"location":"api_reference/context.html#koheesio.context.Context","title":"koheesio.context.Context","text":"
Context(*args, **kwargs)\n

The Context class is a key component of the Koheesio framework, designed to manage configuration data and shared variables across tasks and steps in your application. It behaves much like a dictionary, but with added functionalities.

Key Features
  • Nested keys: Supports accessing and adding nested keys similar to dictionary keys.
  • Recursive merging: Merges two Contexts together, with the incoming Context having priority.
  • Serialization/Deserialization: Easily created from a yaml, toml, or json file, or a dictionary, and can be converted back to a dictionary.
  • Handling complex Python objects: Uses jsonpickle for serialization and deserialization of complex Python objects to and from JSON.

For a comprehensive guide on the usage, examples, and additional features of the Context class, please refer to the reference/concepts/context section of the Koheesio documentation.

Methods:

Name Description add

Add a key/value pair to the context.

get

Get value of a given key.

get_item

Acts just like .get, except that it returns the key also.

contains

Check if the context contains a given key.

merge

Merge this context with the context of another, where the incoming context has priority.

to_dict

Returns all parameters of the context as a dict.

from_dict

Creates Context object from the given dict.

from_yaml

Creates Context object from a given yaml file.

from_json

Creates Context object from a given json file.

Dunder methods
  • __iter__(): Allows for iteration across a Context.
  • __len__(): Returns the length of the Context.
  • __getitem__(item): Makes class subscriptable.
Inherited from Mapping
  • items(): Returns all items of the Context.
  • keys(): Returns all keys of the Context.
  • values(): Returns all values of the Context.
Source code in src/koheesio/context.py
def __init__(self, *args, **kwargs):\n    \"\"\"Initializes the Context object with given arguments.\"\"\"\n    for arg in args:\n        if isinstance(arg, dict):\n            kwargs.update(arg)\n        if isinstance(arg, Context):\n            kwargs = kwargs.update(arg.to_dict())\n\n    for key, value in kwargs.items():\n        self.__dict__[key] = self.process_value(value)\n
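
A small construction sketch (keys and values are illustrative):

from koheesio.context import Context\n\ncontext = Context({\"db\": {\"host\": \"localhost\", \"port\": 5432}}, env=\"dev\")\nprint(context.env)      # positional dicts and keyword arguments are combined\nprint(context.db.port)  # nested dicts are converted to Context objects\n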
"},{"location":"api_reference/context.html#koheesio.context.Context.add","title":"add","text":"
add(key: str, value: Any) -> Context\n

Add a key/value pair to the context

Source code in src/koheesio/context.py
def add(self, key: str, value: Any) -> Context:\n    \"\"\"Add a key/value pair to the context\"\"\"\n    self.__dict__[key] = value\n    return self\n
"},{"location":"api_reference/context.html#koheesio.context.Context.contains","title":"contains","text":"
contains(key: str) -> bool\n

Check if the context contains a given key

Parameters:

Name Type Description Default key str required

Returns:

Type Description bool Source code in src/koheesio/context.py
def contains(self, key: str) -> bool:\n    \"\"\"Check if the context contains a given key\n\n    Parameters\n    ----------\n    key: str\n\n    Returns\n    -------\n    bool\n    \"\"\"\n    try:\n        self.get(key, safe=False)\n        return True\n    except KeyError:\n        return False\n
"},{"location":"api_reference/context.html#koheesio.context.Context.from_dict","title":"from_dict classmethod","text":"
from_dict(kwargs: dict) -> Context\n

Creates Context object from the given dict

Parameters:

Name Type Description Default kwargs dict required

Returns:

Type Description Context Source code in src/koheesio/context.py
@classmethod\ndef from_dict(cls, kwargs: dict) -> Context:\n    \"\"\"Creates Context object from the given dict\n\n    Parameters\n    ----------\n    kwargs: dict\n\n    Returns\n    -------\n    Context\n    \"\"\"\n    return cls(kwargs)\n
"},{"location":"api_reference/context.html#koheesio.context.Context.from_json","title":"from_json classmethod","text":"
from_json(json_file_or_str: Union[str, Path]) -> Context\n

Creates Context object from a given json file

Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be stored in the Context object, which is not possible with the standard json library.

Why jsonpickle?

(from https://jsonpickle.github.io/)

Data serialized with python\u2019s pickle (or cPickle or dill) is not easily readable outside of python. Using the json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports json.

Security

(from https://jsonpickle.github.io/)

jsonpickle should be treated the same as the Python stdlib pickle module from a security perspective.

Parameters:

Name Type Description Default json_file_or_str Union[str, Path]

Pathlike string or Path that points to the json file or string containing json

required

Returns:

Type Description Context Source code in src/koheesio/context.py
@classmethod\ndef from_json(cls, json_file_or_str: Union[str, Path]) -> Context:\n    \"\"\"Creates Context object from a given json file\n\n    Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be\n    stored in the Context object, which is not possible with the standard json library.\n\n    Why jsonpickle?\n    ---------------\n    (from https://jsonpickle.github.io/)\n\n    > Data serialized with python\u2019s pickle (or cPickle or dill) is not easily readable outside of python. Using the\n    json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex\n    data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports\n    json.\n\n    Security\n    --------\n    (from https://jsonpickle.github.io/)\n\n    > jsonpickle should be treated the same as the Python stdlib pickle module from a security perspective.\n\n    ### ! Warning !\n    > The jsonpickle module is not secure. Only unpickle data you trust.\n    It is possible to construct malicious pickle data which will execute arbitrary code during unpickling.\n    Never unpickle data that could have come from an untrusted source, or that could have been tampered with.\n    Consider signing data with an HMAC if you need to ensure that it has not been tampered with.\n    Safer deserialization approaches, such as reading JSON directly, may be more appropriate if you are processing\n    untrusted data.\n\n    Parameters\n    ----------\n    json_file_or_str : Union[str, Path]\n        Pathlike string or Path that points to the json file or string containing json\n\n    Returns\n    -------\n    Context\n    \"\"\"\n    json_str = json_file_or_str\n\n    # check if json_str is pathlike\n    if (json_file := Path(json_file_or_str)).exists():\n        json_str = json_file.read_text(encoding=\"utf-8\")\n\n    json_dict = jsonpickle.loads(json_str)\n    return cls.from_dict(json_dict)\n
"},{"location":"api_reference/context.html#koheesio.context.Context.from_json--warning","title":"! Warning !","text":"

The jsonpickle module is not secure. Only unpickle data you trust. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with. Consider signing data with an HMAC if you need to ensure that it has not been tampered with. Safer deserialization approaches, such as reading JSON directly, may be more appropriate if you are processing untrusted data.
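
A minimal sketch using an inline json string (values are illustrative):

from koheesio.context import Context\n\ncontext = Context.from_json('{\"a\": 1, \"b\": {\"c\": 2}}')\nprint(context.get(\"b.c\"))  # 2\n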

"},{"location":"api_reference/context.html#koheesio.context.Context.from_toml","title":"from_toml classmethod","text":"
from_toml(toml_file_or_str: Union[str, Path]) -> Context\n

Creates Context object from a given toml file

Parameters:

Name Type Description Default toml_file_or_str Union[str, Path]

Pathlike string or Path that points to the toml file or string containing toml

required

Returns:

Type Description Context Source code in src/koheesio/context.py
@classmethod\ndef from_toml(cls, toml_file_or_str: Union[str, Path]) -> Context:\n    \"\"\"Creates Context object from a given toml file\n\n    Parameters\n    ----------\n    toml_file_or_str: Union[str, Path]\n        Pathlike string or Path that points to the toml file or string containing toml\n\n    Returns\n    -------\n    Context\n    \"\"\"\n    toml_str = toml_file_or_str\n\n    # check if toml_str is pathlike\n    if (toml_file := Path(toml_file_or_str)).exists():\n        toml_str = toml_file.read_text(encoding=\"utf-8\")\n\n    toml_dict = tomli.loads(toml_str)\n    return cls.from_dict(toml_dict)\n
"},{"location":"api_reference/context.html#koheesio.context.Context.from_yaml","title":"from_yaml classmethod","text":"
from_yaml(yaml_file_or_str: str) -> Context\n

Creates Context object from a given yaml file

Parameters:

Name Type Description Default yaml_file_or_str str

Pathlike string or Path that points to the yaml file, or string containing yaml

required

Returns:

Type Description Context Source code in src/koheesio/context.py
@classmethod\ndef from_yaml(cls, yaml_file_or_str: str) -> Context:\n    \"\"\"Creates Context object from a given yaml file\n\n    Parameters\n    ----------\n    yaml_file_or_str: str or Path\n        Pathlike string or Path that points to the yaml file, or string containing yaml\n\n    Returns\n    -------\n    Context\n    \"\"\"\n    yaml_str = yaml_file_or_str\n\n    # check if yaml_str is pathlike\n    if (yaml_file := Path(yaml_file_or_str)).exists():\n        yaml_str = yaml_file.read_text(encoding=\"utf-8\")\n\n    # Bandit: disable yaml.load warning\n    yaml_dict = yaml.load(yaml_str, Loader=yaml.Loader)  # nosec B506: yaml_load\n\n    return cls.from_dict(yaml_dict)\n
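
A minimal sketch using an inline yaml string (values are illustrative):

from koheesio.context import Context\n\nyaml_str = \"\"\"\ndatabase:\n  host: localhost\n  port: 5432\n\"\"\"\ncontext = Context.from_yaml(yaml_str)\nprint(context.get(\"database.port\"))  # 5432\n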
"},{"location":"api_reference/context.html#koheesio.context.Context.get","title":"get","text":"
get(key: str, default: Any = None, safe: bool = True) -> Any\n

Get value of a given key

The key can either be an actual key (top level) or the key of a nested value. Behaves a lot like a dict's .get() method otherwise.

Parameters:

Name Type Description Default key str

Can be a real key, or can be a dotted notation of a nested key

required default Any

Default value to return

None safe bool

Toggles whether to fail or not when item cannot be found

True

Returns:

Type Description Any

Value of the requested item

Example

Example of a nested call:

context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\ncontext.get(\"a.b\")\n

Returns c

Source code in src/koheesio/context.py
def get(self, key: str, default: Any = None, safe: bool = True) -> Any:\n    \"\"\"Get value of a given key\n\n    The key can either be an actual key (top level) or the key of a nested value.\n    Behaves a lot like a dict's `.get()` method otherwise.\n\n    Parameters\n    ----------\n    key:\n        Can be a real key, or can be a dotted notation of a nested key\n    default:\n        Default value to return\n    safe:\n        Toggles whether to fail or not when item cannot be found\n\n    Returns\n    -------\n    Any\n        Value of the requested item\n\n    Example\n    -------\n    Example of a nested call:\n\n    ```python\n    context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\n    context.get(\"a.b\")\n    ```\n\n    Returns `c`\n    \"\"\"\n    try:\n        if \".\" not in key:\n            return self.__dict__[key]\n\n        # handle nested keys\n        nested_keys = key.split(\".\")\n        value = self  # parent object\n        for k in nested_keys:\n            value = value[k]  # iterate through nested values\n        return value\n\n    except (AttributeError, KeyError, TypeError) as e:\n        if not safe:\n            raise KeyError(f\"requested key '{key}' does not exist in {self}\") from e\n        return default\n
"},{"location":"api_reference/context.html#koheesio.context.Context.get_all","title":"get_all","text":"
get_all() -> dict\n

alias to to_dict()

Source code in src/koheesio/context.py
def get_all(self) -> dict:\n    \"\"\"alias to to_dict()\"\"\"\n    return self.to_dict()\n
"},{"location":"api_reference/context.html#koheesio.context.Context.get_item","title":"get_item","text":"
get_item(key: str, default: Any = None, safe: bool = True) -> Dict[str, Any]\n

Acts just like .get, except that it returns the key also

Returns:

Type Description Dict[str, Any]

key/value-pair of the requested item

Example

Example of a nested call:

context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\ncontext.get_item(\"a.b\")\n

Returns {'a.b': 'c'}

Source code in src/koheesio/context.py
def get_item(self, key: str, default: Any = None, safe: bool = True) -> Dict[str, Any]:\n    \"\"\"Acts just like `.get`, except that it returns the key also\n\n    Returns\n    -------\n    Dict[str, Any]\n        key/value-pair of the requested item\n\n    Example\n    -------\n    Example of a nested call:\n\n    ```python\n    context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\n    context.get_item(\"a.b\")\n    ```\n\n    Returns `{'a.b': 'c'}`\n    \"\"\"\n    value = self.get(key, default, safe)\n    return {key: value}\n
"},{"location":"api_reference/context.html#koheesio.context.Context.merge","title":"merge","text":"
merge(context: Context, recursive: bool = False) -> Context\n

Merge this context with the context of another, where the incoming context has priority.

Parameters:

Name Type Description Default context Context

Another Context class

required recursive bool

Recursively merge two dictionaries to an arbitrary depth

False

Returns:

Type Description Context

updated context

Source code in src/koheesio/context.py
def merge(self, context: Context, recursive: bool = False) -> Context:\n    \"\"\"Merge this context with the context of another, where the incoming context has priority.\n\n    Parameters\n    ----------\n    context: Context\n        Another Context class\n    recursive: bool\n        Recursively merge two dictionaries to an arbitrary depth\n\n    Returns\n    -------\n    Context\n        updated context\n    \"\"\"\n    if recursive:\n        return Context.from_dict(self._recursive_merge(target_context=self, merge_context=context).to_dict())\n\n    # just merge on the top level keys\n    return Context.from_dict({**self.to_dict(), **context.to_dict()})\n
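
A short sketch of the recursive merge (keys and values are illustrative):

from koheesio.context import Context\n\nbase = Context({\"db\": {\"host\": \"localhost\", \"port\": 5432}})\noverride = Context({\"db\": {\"port\": 6543}})\n\nmerged = base.merge(override, recursive=True)\nprint(merged.get(\"db.port\"))  # 6543 - the incoming context wins\nprint(merged.get(\"db.host\"))  # localhost - preserved because the merge is recursive\n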
"},{"location":"api_reference/context.html#koheesio.context.Context.process_value","title":"process_value","text":"
process_value(value: Any) -> Any\n

Processes the given value, converting dictionaries to Context objects as needed.

Source code in src/koheesio/context.py
def process_value(self, value: Any) -> Any:\n    \"\"\"Processes the given value, converting dictionaries to Context objects as needed.\"\"\"\n    if isinstance(value, dict):\n        return self.from_dict(value)\n\n    if isinstance(value, (list, set)):\n        return [self.from_dict(v) if isinstance(v, dict) else v for v in value]\n\n    return value\n
"},{"location":"api_reference/context.html#koheesio.context.Context.to_dict","title":"to_dict","text":"
to_dict() -> Dict[str, Any]\n

Returns all parameters of the context as a dict

Returns:

Type Description dict

containing all parameters of the context

Source code in src/koheesio/context.py
def to_dict(self) -> Dict[str, Any]:\n    \"\"\"Returns all parameters of the context as a dict\n\n    Returns\n    -------\n    dict\n        containing all parameters of the context\n    \"\"\"\n    result = {}\n\n    for key, value in self.__dict__.items():\n        if isinstance(value, Context):\n            result[key] = value.to_dict()\n        elif isinstance(value, list):\n            result[key] = [e.to_dict() if isinstance(e, Context) else e for e in value]\n        else:\n            result[key] = value\n\n    return result\n
"},{"location":"api_reference/context.html#koheesio.context.Context.to_json","title":"to_json","text":"
to_json(pretty: bool = False) -> str\n

Returns all parameters of the context as a json string

Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be stored in the Context object, which is not possible with the standard json library.

Why jsonpickle?

(from https://jsonpickle.github.io/)

Data serialized with python's pickle (or cPickle or dill) is not easily readable outside of python. Using the json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports json.

Parameters:

Name Type Description Default pretty bool

Toggles whether to return a pretty json string or not

False

Returns:

Type Description str

containing all parameters of the context

Source code in src/koheesio/context.py
def to_json(self, pretty: bool = False) -> str:\n    \"\"\"Returns all parameters of the context as a json string\n\n    Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be\n    stored in the Context object, which is not possible with the standard json library.\n\n    Why jsonpickle?\n    ---------------\n    (from https://jsonpickle.github.io/)\n\n    > Data serialized with python's pickle (or cPickle or dill) is not easily readable outside of python. Using the\n    json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex\n    data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports\n    json.\n\n    Parameters\n    ----------\n    pretty : bool, optional, default=False\n        Toggles whether to return a pretty json string or not\n\n    Returns\n    -------\n    str\n        containing all parameters of the context\n    \"\"\"\n    d = self.to_dict()\n    return jsonpickle.dumps(d, indent=4) if pretty else jsonpickle.dumps(d)\n
"},{"location":"api_reference/context.html#koheesio.context.Context.to_yaml","title":"to_yaml","text":"
to_yaml(clean: bool = False) -> str\n

Returns all parameters of the context as a yaml string

Parameters:

Name Type Description Default clean bool

Toggles whether to remove !!python/object:... from yaml or not. Default: False

False

Returns:

Type Description str

containing all parameters of the context

Source code in src/koheesio/context.py
def to_yaml(self, clean: bool = False) -> str:\n    \"\"\"Returns all parameters of the context as a yaml string\n\n    Parameters\n    ----------\n    clean: bool\n        Toggles whether to remove `!!python/object:...` from yaml or not.\n        Default: False\n\n    Returns\n    -------\n    str\n        containing all parameters of the context\n    \"\"\"\n    # sort_keys=False to preserve order of keys\n    yaml_str = yaml.dump(self.to_dict(), sort_keys=False)\n\n    # remove `!!python/object:...` from yaml\n    if clean:\n        remove_pattern = re.compile(r\"!!python/object:.*?\\n\")\n        yaml_str = re.sub(remove_pattern, \"\\n\", yaml_str)\n\n    return yaml_str\n
"},{"location":"api_reference/intro_api.html","title":"Intro api","text":""},{"location":"api_reference/intro_api.html#api-reference","title":"API Reference","text":"

You can navigate the API by clicking on the modules listed on the left to access the documentation.

"},{"location":"api_reference/logger.html","title":"Logger","text":"

Loggers are used to log messages from your application.

For a comprehensive guide on the usage, examples, and additional features of the logging classes, please refer to the reference/concepts/logging section of the Koheesio documentation.

Classes:

Name Description LoggingFactory

Logging factory to be used to generate logger instances.

Masked

Represents a masked value.

MaskedString

Represents a masked string value.

MaskedInt

Represents a masked integer value.

MaskedFloat

Represents a masked float value.

MaskedDict

Represents a masked dictionary value.

LoggerIDFilter

Filter which injects run_id information into the log.

Functions:

Name Description warn

Issue a warning.

"},{"location":"api_reference/logger.html#koheesio.logger.T","title":"koheesio.logger.T module-attribute","text":"
T = TypeVar('T')\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggerIDFilter","title":"koheesio.logger.LoggerIDFilter","text":"

Filter which injects run_id information into the log.

"},{"location":"api_reference/logger.html#koheesio.logger.LoggerIDFilter.LOGGER_ID","title":"LOGGER_ID class-attribute instance-attribute","text":"
LOGGER_ID: str = str(uuid4())\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggerIDFilter.filter","title":"filter","text":"
filter(record)\n
Source code in src/koheesio/logger.py
def filter(self, record):\n    record.logger_id = LoggerIDFilter.LOGGER_ID\n\n    return True\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory","title":"koheesio.logger.LoggingFactory","text":"
LoggingFactory(name: Optional[str] = None, env: Optional[str] = None, level: Optional[str] = None, logger_id: Optional[str] = None)\n

Logging factory to be used to generate logger instances.

Parameters:

Name Type Description Default name Optional[str] None env Optional[str] None level Optional[str] None logger_id Optional[str] None Source code in src/koheesio/logger.py
def __init__(\n    self,\n    name: Optional[str] = None,\n    env: Optional[str] = None,\n    level: Optional[str] = None,\n    logger_id: Optional[str] = None,\n):\n    \"\"\"Logging factory to be used in pipeline. Prepare logger instance.\n\n    Parameters\n    ----------\n    name logger name.\n    env environment (\"local\", \"qa\", \"prod\").\n    logger_id unique identifier for the logger.\n    \"\"\"\n\n    LoggingFactory.LOGGER_NAME = name or LoggingFactory.LOGGER_NAME\n    LoggerIDFilter.LOGGER_ID = logger_id or LoggerIDFilter.LOGGER_ID\n    LoggingFactory.LOGGER_FILTER = LoggingFactory.LOGGER_FILTER or LoggerIDFilter()\n    LoggingFactory.ENV = env or LoggingFactory.ENV\n\n    console_handler = logging.StreamHandler(sys.stdout if LoggingFactory.ENV == \"local\" else sys.stderr)\n    console_handler.setFormatter(LoggingFactory.LOGGER_FORMATTER)\n    console_handler.addFilter(LoggingFactory.LOGGER_FILTER)\n    # WARNING is default level for root logger in python\n    logging.basicConfig(level=logging.WARNING, handlers=[console_handler], force=True)\n\n    LoggingFactory.CONSOLE_HANDLER = console_handler\n\n    logger = getLogger(LoggingFactory.LOGGER_NAME)\n    logger.setLevel(level or LoggingFactory.LOGGER_LEVEL)\n    LoggingFactory.LOGGER = logger\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.CONSOLE_HANDLER","title":"CONSOLE_HANDLER class-attribute instance-attribute","text":"
CONSOLE_HANDLER: Optional[Handler] = None\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.ENV","title":"ENV class-attribute instance-attribute","text":"
ENV: Optional[str] = None\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER","title":"LOGGER class-attribute instance-attribute","text":"
LOGGER: Optional[Logger] = None\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_ENV","title":"LOGGER_ENV class-attribute instance-attribute","text":"
LOGGER_ENV: str = 'local'\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_FILTER","title":"LOGGER_FILTER class-attribute instance-attribute","text":"
LOGGER_FILTER: Optional[Filter] = None\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_FORMAT","title":"LOGGER_FORMAT class-attribute instance-attribute","text":"
LOGGER_FORMAT: str = '[%(logger_id)s] [%(asctime)s] [%(levelname)s] [%(name)s] {%(module)s.py:%(funcName)s:%(lineno)d} - %(message)s'\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_FORMATTER","title":"LOGGER_FORMATTER class-attribute instance-attribute","text":"
LOGGER_FORMATTER: Formatter = Formatter(LOGGER_FORMAT)\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_LEVEL","title":"LOGGER_LEVEL class-attribute instance-attribute","text":"
LOGGER_LEVEL: str = get('KOHEESIO_LOGGING_LEVEL', 'WARNING')\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_NAME","title":"LOGGER_NAME class-attribute instance-attribute","text":"
LOGGER_NAME: str = 'koheesio'\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.add_handlers","title":"add_handlers staticmethod","text":"
add_handlers(handlers: List[Tuple[str, Dict]]) -> None\n

Add handlers to existing root logger.

Parameters:

Name Type Description Default handler_class required handlers_config required Source code in src/koheesio/logger.py
@staticmethod\ndef add_handlers(handlers: List[Tuple[str, Dict]]) -> None:\n    \"\"\"Add handlers to existing root logger.\n\n    Parameters\n    ----------\n    handler_class handler module and class for importing.\n    handlers_config configuration for handler.\n\n    \"\"\"\n    for handler_module_class, handler_conf in handlers:\n        handler_class: logging.Handler = import_class(handler_module_class)\n        handler_level = handler_conf.pop(\"level\") if \"level\" in handler_conf else \"WARNING\"\n        # noinspection PyCallingNonCallable\n        handler = handler_class(**handler_conf)\n        handler.setLevel(handler_level)\n        handler.addFilter(LoggingFactory.LOGGER_FILTER)\n        handler.setFormatter(LoggingFactory.LOGGER_FORMATTER)\n        LoggingFactory.LOGGER.addHandler(handler)\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.get_logger","title":"get_logger staticmethod","text":"
get_logger(name: str, inherit_from_koheesio: bool = False) -> Logger\n

Provide a logger. If inherit_from_koheesio is set, the logger name is prefixed with LoggingFactory.LOGGER_NAME so that it inherits from the koheesio logger.

Parameters:

Name Type Description Default name str required inherit_from_koheesio bool False

Returns:

Name Type Description logger Logger Source code in src/koheesio/logger.py
@staticmethod\ndef get_logger(name: str, inherit_from_koheesio: bool = False) -> Logger:\n    \"\"\"Provide logger. If inherit_from_koheesio then inherit from LoggingFactory.PIPELINE_LOGGER_NAME.\n\n    Parameters\n    ----------\n    name: Name of logger.\n    inherit_from_koheesio: Inherit logger from koheesio\n\n    Returns\n    -------\n    logger: Logger\n\n    \"\"\"\n    if inherit_from_koheesio:\n        LoggingFactory.__check_koheesio_logger_initialized()\n        name = f\"{LoggingFactory.LOGGER_NAME}.{name}\"\n\n    return getLogger(name)\n
"},{"location":"api_reference/logger.html#koheesio.logger.Masked","title":"koheesio.logger.Masked","text":"
Masked(value: T)\n

Represents a masked value.

Parameters:

Name Type Description Default value T

The value to be masked.

required

Attributes:

Name Type Description _value T

The original value.

Methods:

Name Description __repr__

Returns a string representation of the masked value.

__str__

Returns a string representation of the masked value.

__get_validators__

Returns a generator of validators for the masked value.

validate

Validates the masked value.

Source code in src/koheesio/logger.py
def __init__(self, value: T):\n    self._value = value\n
"},{"location":"api_reference/logger.html#koheesio.logger.Masked.validate","title":"validate classmethod","text":"
validate(v: Any, _values)\n

Validate the input value and return an instance of the class.

Parameters:

Name Type Description Default v Any

The input value to validate.

required _values Any

Additional values used for validation.

required

Returns:

Name Type Description instance cls

An instance of the class.

Source code in src/koheesio/logger.py
@classmethod\ndef validate(cls, v: Any, _values):\n    \"\"\"\n    Validate the input value and return an instance of the class.\n\n    Parameters\n    ----------\n    v : Any\n        The input value to validate.\n    _values : Any\n        Additional values used for validation.\n\n    Returns\n    -------\n    instance : cls\n        An instance of the class.\n\n    \"\"\"\n    return cls(v)\n
"},{"location":"api_reference/logger.html#koheesio.logger.MaskedDict","title":"koheesio.logger.MaskedDict","text":"
MaskedDict(value: T)\n

Represents a masked dictionary value.

Source code in src/koheesio/logger.py
def __init__(self, value: T):\n    self._value = value\n
"},{"location":"api_reference/logger.html#koheesio.logger.MaskedFloat","title":"koheesio.logger.MaskedFloat","text":"
MaskedFloat(value: T)\n

Represents a masked float value.

Source code in src/koheesio/logger.py
def __init__(self, value: T):\n    self._value = value\n
"},{"location":"api_reference/logger.html#koheesio.logger.MaskedInt","title":"koheesio.logger.MaskedInt","text":"
MaskedInt(value: T)\n

Represents a masked integer value.

Source code in src/koheesio/logger.py
def __init__(self, value: T):\n    self._value = value\n
"},{"location":"api_reference/logger.html#koheesio.logger.MaskedString","title":"koheesio.logger.MaskedString","text":"
MaskedString(value: T)\n

Represents a masked string value.

Source code in src/koheesio/logger.py
def __init__(self, value: T):\n    self._value = value\n
"},{"location":"api_reference/utils.html","title":"Utils","text":"

Utility functions

"},{"location":"api_reference/utils.html#koheesio.utils.convert_str_to_bool","title":"koheesio.utils.convert_str_to_bool","text":"
convert_str_to_bool(value) -> Any\n

Converts a string to a boolean if the string is either 'true' or 'false'

Source code in src/koheesio/utils.py
def convert_str_to_bool(value) -> Any:\n    \"\"\"Converts a string to a boolean if the string is either 'true' or 'false'\"\"\"\n    if isinstance(value, str) and (v := value.lower()) in [\"true\", \"false\"]:\n        value = v == \"true\"\n    return value\n
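
For illustration:

from koheesio.utils import convert_str_to_bool\n\nconvert_str_to_bool(\"True\")   # True (matching is case-insensitive)\nconvert_str_to_bool(\"false\")  # False\nconvert_str_to_bool(\"yes\")    # returned unchanged: 'yes'\n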
"},{"location":"api_reference/utils.html#koheesio.utils.get_args_for_func","title":"koheesio.utils.get_args_for_func","text":"
get_args_for_func(func: Callable, params: Dict) -> Tuple[Callable, Dict[str, Any]]\n

Helper function that matches keyword arguments (params) on a given function

This function uses inspect to extract the signature of the passed Callable, and then uses functools.partial to construct a new Callable (partial) function onto which the matching input is mapped.

Example
input_dict = {\"a\": \"foo\", \"b\": \"bar\"}\n\n\ndef example_func(a: str):\n    return a\n\n\nfunc, kwargs = get_args_for_func(example_func, input_dict)\n

In this example:
  • func would be a callable with the input mapped toward it (i.e. it can be called like any normal function)
  • kwargs would be a dict holding just the keyword arguments needed to run the function (e.g. {\"a\": \"foo\"})

Parameters:

Name Type Description Default func Callable

The function to inspect

required params Dict

Dictionary with keyword values that will be mapped on the 'func'

required

Returns:

Type Description Tuple[Callable, Dict[str, Any]]
  • Callable a partial() func with the found keyword values mapped toward it
  • Dict[str, Any] the keyword args that match the func
Source code in src/koheesio/utils.py
def get_args_for_func(func: Callable, params: Dict) -> Tuple[Callable, Dict[str, Any]]:\n    \"\"\"Helper function that matches keyword arguments (params) on a given function\n\n    This function uses inspect to extract the signature on the passed Callable, and then uses functools.partial to\n     construct a new Callable (partial) function on which the input was mapped.\n\n    Example\n    -------\n    ```python\n    input_dict = {\"a\": \"foo\", \"b\": \"bar\"}\n\n\n    def example_func(a: str):\n        return a\n\n\n    func, kwargs = get_args_for_func(example_func, input_dict)\n    ```\n\n    In this example,\n    - `func` would be a callable with the input mapped toward it (i.e. can be called like any normal function)\n    - `kwargs` would be a dict holding just the output needed to be able to run the function (e.g. {\"a\": \"foo\"})\n\n    Parameters\n    ----------\n    func: Callable\n        The function to inspect\n    params: Dict\n        Dictionary with keyword values that will be mapped on the 'func'\n\n    Returns\n    -------\n    Tuple[Callable, Dict[str, Any]]\n        - Callable\n            a partial() func with the found keyword values mapped toward it\n        - Dict[str, Any]\n            the keyword args that match the func\n    \"\"\"\n    _kwargs = {k: v for k, v in params.items() if k in inspect.getfullargspec(func).args}\n    return (\n        partial(func, **_kwargs),\n        _kwargs,\n    )\n
"},{"location":"api_reference/utils.html#koheesio.utils.get_project_root","title":"koheesio.utils.get_project_root","text":"
get_project_root() -> Path\n

Returns project root path.

Source code in src/koheesio/utils.py
def get_project_root() -> Path:\n    \"\"\"Returns project root path.\"\"\"\n    cmd = Path(__file__)\n    return Path([i for i in cmd.parents if i.as_uri().endswith(\"src\")][0]).parent\n
"},{"location":"api_reference/utils.html#koheesio.utils.get_random_string","title":"koheesio.utils.get_random_string","text":"
get_random_string(length: int = 64, prefix: Optional[str] = None) -> str\n

Generate a random string of specified length

Source code in src/koheesio/utils.py
def get_random_string(length: int = 64, prefix: Optional[str] = None) -> str:\n    \"\"\"Generate a random string of specified length\"\"\"\n    if prefix:\n        return f\"{prefix}_{uuid.uuid4().hex}\"[0:length]\n    return f\"{uuid.uuid4().hex}\"[0:length]\n
"},{"location":"api_reference/utils.html#koheesio.utils.import_class","title":"koheesio.utils.import_class","text":"
import_class(module_class: str) -> Any\n

Import class and module based on provided string.

Parameters:

Name Type Description Default module_class str required

Returns:

Type Description object Class from specified input string. Source code in src/koheesio/utils.py
def import_class(module_class: str) -> Any:\n    \"\"\"Import class and module based on provided string.\n\n    Parameters\n    ----------\n    module_class module+class to be imported.\n\n    Returns\n    -------\n    object  Class from specified input string.\n\n    \"\"\"\n    module_path, class_name = module_class.rsplit(\".\", 1)\n    module = import_module(module_path)\n\n    return getattr(module, class_name)\n
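
A brief sketch (logging.StreamHandler is just an example target):

from koheesio.utils import import_class\n\nhandler_class = import_class(\"logging.StreamHandler\")\nhandler = handler_class()  # equivalent to logging.StreamHandler()\n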
"},{"location":"api_reference/asyncio/index.html","title":"Asyncio","text":"

This module provides classes for asynchronous steps in the koheesio package.

"},{"location":"api_reference/asyncio/index.html#koheesio.asyncio.AsyncStep","title":"koheesio.asyncio.AsyncStep","text":"

Asynchronous step class that inherits from Step and uses the AsyncStepMetaClass metaclass.

Attributes:

Name Type Description Output AsyncStepOutput

The output class for the asynchronous step.

"},{"location":"api_reference/asyncio/index.html#koheesio.asyncio.AsyncStep.Output","title":"Output","text":"

Output class for asyncio step.

This class represents the output of the asyncio step. It inherits from the AsyncStepOutput class.

"},{"location":"api_reference/asyncio/index.html#koheesio.asyncio.AsyncStepMetaClass","title":"koheesio.asyncio.AsyncStepMetaClass","text":"

Metaclass for asynchronous steps.

This metaclass is used to define asynchronous steps in the Koheesio framework. It inherits from the StepMetaClass and provides additional functionality for executing asynchronous steps.

Attributes: None

Methods: _execute_wrapper: Wrapper method for executing asynchronous steps.

"},{"location":"api_reference/asyncio/index.html#koheesio.asyncio.AsyncStepOutput","title":"koheesio.asyncio.AsyncStepOutput","text":"

Represents the output of an asynchronous step.

This class extends the base Step.Output class and provides additional functionality for merging key-value maps.

Attributes:

Name Type Description ...

Methods:

Name Description merge

Merge key-value map with self.

"},{"location":"api_reference/asyncio/index.html#koheesio.asyncio.AsyncStepOutput.merge","title":"merge","text":"
merge(other: Union[Dict, StepOutput])\n

Merge a key-value map with self

Examples:

step_output = StepOutput(foo=\"bar\")\nstep_output.merge({\"lorem\": \"ipsum\"})  # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n

Functionally similar to adding two dicts together; like running {**dict_a, **dict_b}.

Parameters:

Name Type Description Default other Union[Dict, StepOutput]

Dict or another instance of the StepOutput class that will be added to self

required Source code in src/koheesio/asyncio/__init__.py
def merge(self, other: Union[Dict, StepOutput]):\n    \"\"\"Merge key,value map with self\n\n    Examples\n    --------\n    ```python\n    step_output = StepOutput(foo=\"bar\")\n    step_output.merge({\"lorem\": \"ipsum\"})  # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n    ```\n\n    Functionally similar to adding two dicts together; like running `{**dict_a, **dict_b}`.\n\n    Parameters\n    ----------\n    other: Union[Dict, StepOutput]\n        Dict or another instance of a StepOutputs class that will be added to self\n    \"\"\"\n    if isinstance(other, StepOutput):\n        other = other.model_dump()  # ensures we really have a dict\n\n    if not iscoroutine(other):\n        for k, v in other.items():\n            self.set(k, v)\n\n    return self\n
"},{"location":"api_reference/asyncio/http.html","title":"Http","text":"

This module contains the async implementation of the HTTP step.

"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpGetStep","title":"koheesio.asyncio.http.AsyncHttpGetStep","text":"

Represents an asynchronous HTTP GET step.

This class inherits from the AsyncHttpStep class and specifies the HTTP method as GET.

Attributes: method (HttpMethod): The HTTP method for the step, set to HttpMethod.GET.

"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpGetStep.method","title":"method class-attribute instance-attribute","text":"
method: HttpMethod = GET\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep","title":"koheesio.asyncio.http.AsyncHttpStep","text":"

Asynchronous HTTP step for making HTTP requests using aiohttp.

Parameters:

Name Type Description Default client_session Optional[ClientSession]

Aiohttp ClientSession.

required url List[URL]

List of yarl.URL.

required retry_options Optional[RetryOptionsBase]

Retry options for the request.

required connector Optional[BaseConnector]

Connector for the aiohttp request.

required headers Optional[Dict[str, Union[str, SecretStr]]]

Request headers.

required Output

responses_urls : Optional[List[Tuple[Dict[str, Any], yarl.URL]]] List of responses from the API and request URL.

Examples:

>>> import asyncio\n>>> from aiohttp import ClientSession\n>>> from aiohttp.connector import TCPConnector\n>>> from aiohttp_retry import ExponentialRetry\n>>> from koheesio.asyncio.http import AsyncHttpStep\n>>> from yarl import URL\n>>> from typing import Dict, Any, Union, List, Tuple\n>>>\n>>> # Initialize the AsyncHttpStep\n>>> async def main():\n>>>     session = ClientSession()\n>>>     urls = [URL('https://example.com/api/1'), URL('https://example.com/api/2')]\n>>>     retry_options = ExponentialRetry()\n>>>     connector = TCPConnector(limit=10)\n>>>     headers = {'Content-Type': 'application/json'}\n>>>     step = AsyncHttpStep(\n>>>         client_session=session,\n>>>         url=urls,\n>>>         retry_options=retry_options,\n>>>         connector=connector,\n>>>         headers=headers\n>>>     )\n>>>\n>>>     # Execute the step\n>>>     responses_urls = await step.get()\n>>>\n>>>     return responses_urls\n>>>\n>>> # Run the main function\n>>> responses_urls = asyncio.run(main())\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.client_session","title":"client_session class-attribute instance-attribute","text":"
client_session: Optional[ClientSession] = Field(default=None, description='Aiohttp ClientSession', exclude=True)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.connector","title":"connector class-attribute instance-attribute","text":"
connector: Optional[BaseConnector] = Field(default=None, description='Connector for the aiohttp request', exclude=True)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.headers","title":"headers class-attribute instance-attribute","text":"
headers: Dict[str, Union[str, SecretStr]] = Field(default_factory=dict, description='Request headers', alias='header', exclude=True)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.method","title":"method class-attribute instance-attribute","text":"
method: Union[str, HttpMethod] = Field(default=GET, description=\"What type of Http call to perform. One of 'get', 'post', 'put', 'delete'. Defaults to 'get'.\")\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.retry_options","title":"retry_options class-attribute instance-attribute","text":"
retry_options: Optional[RetryOptionsBase] = Field(default=None, description='Retry options for the request', exclude=True)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.timeout","title":"timeout class-attribute instance-attribute","text":"
timeout: None = Field(default=None, description='[Optional] Request timeout')\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.url","title":"url class-attribute instance-attribute","text":"
url: List[URL] = Field(default=None, alias='urls', description='Expecting list, as there is no value in executing async request for one value.\\n        yarl.URL is preferable, because params/data can be injected into URL instance', exclude=True)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.Output","title":"Output","text":"

Output class for Step

"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.Output.responses_urls","title":"responses_urls class-attribute instance-attribute","text":"
responses_urls: Optional[List[Tuple[Dict[str, Any], URL]]] = Field(default=None, description='List of responses from the API and request URL', repr=False)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.delete","title":"delete async","text":"
delete() -> List[Tuple[Dict[str, Any], URL]]\n

Make DELETE requests.

Returns:

Type Description List[Tuple[Dict[str, Any], URL]]

A list of response data and corresponding request URLs.

Source code in src/koheesio/asyncio/http.py
async def delete(self) -> List[Tuple[Dict[str, Any], yarl.URL]]:\n    \"\"\"\n    Make DELETE requests.\n\n    Returns\n    -------\n    List[Tuple[Dict[str, Any], yarl.URL]]\n        A list of response data and corresponding request URLs.\n    \"\"\"\n    tasks = self.__tasks_generator(method=HttpMethod.DELETE)\n    responses_urls = await self._execute(tasks=tasks)\n\n    return responses_urls\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.execute","title":"execute","text":"
execute() -> Output\n

Execute the step.

Raises:

Type Description ValueError

If the specified HTTP method is not implemented in AsyncHttpStep.

Source code in src/koheesio/asyncio/http.py
def execute(self) -> AsyncHttpStep.Output:\n    \"\"\"\n    Execute the step.\n\n    Raises\n    ------\n    ValueError\n        If the specified HTTP method is not implemented in AsyncHttpStep.\n    \"\"\"\n    # By design asyncio does not allow its event loop to be nested. This presents a practical problem:\n    #   When in an environment where the event loop is already running\n    #   it\u2019s impossible to run tasks and wait for the result.\n    #   Trying to do so will give the error \u201cRuntimeError: This event loop is already running\u201d.\n    #   The issue pops up in various environments, such as web servers, GUI applications and in\n    #   Jupyter/DataBricks notebooks.\n    nest_asyncio.apply()\n\n    map_method_func = {\n        HttpMethod.GET: self.get,\n        HttpMethod.POST: self.post,\n        HttpMethod.PUT: self.put,\n        HttpMethod.DELETE: self.delete,\n    }\n\n    if self.method not in map_method_func:\n        raise ValueError(f\"Method {self.method} not implemented in AsyncHttpStep.\")\n\n    self.output.responses_urls = asyncio.run(map_method_func[self.method]())\n\n    return self.output\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.get","title":"get async","text":"
get() -> List[Tuple[Dict[str, Any], URL]]\n

Make GET requests.

Returns:

Type Description List[Tuple[Dict[str, Any], URL]]

A list of response data and corresponding request URLs.

Source code in src/koheesio/asyncio/http.py
async def get(self) -> List[Tuple[Dict[str, Any], yarl.URL]]:\n    \"\"\"\n    Make GET requests.\n\n    Returns\n    -------\n    List[Tuple[Dict[str, Any], yarl.URL]]\n        A list of response data and corresponding request URLs.\n    \"\"\"\n    tasks = self.__tasks_generator(method=HttpMethod.GET)\n    responses_urls = await self._execute(tasks=tasks)\n\n    return responses_urls\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.get_headers","title":"get_headers","text":"
get_headers()\n

Get the request headers.

Returns:

Type Description Optional[Dict[str, Union[str, SecretStr]]]

The request headers.

Source code in src/koheesio/asyncio/http.py
def get_headers(self):\n    \"\"\"\n    Get the request headers.\n\n    Returns\n    -------\n    Optional[Dict[str, Union[str, SecretStr]]]\n        The request headers.\n    \"\"\"\n    _headers = None\n\n    if self.headers:\n        _headers = {k: v.get_secret_value() if isinstance(v, SecretStr) else v for k, v in self.headers.items()}\n\n        for k, v in self.headers.items():\n            if isinstance(v, SecretStr):\n                self.headers[k] = v.get_secret_value()\n\n    return _headers or self.headers\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.get_options","title":"get_options","text":"
get_options()\n

Get the options of the step.

Source code in src/koheesio/asyncio/http.py
def get_options(self):\n    \"\"\"\n    Get the options of the step.\n    \"\"\"\n    warnings.warn(\"get_options is not implemented in AsyncHttpStep.\")\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.post","title":"post async","text":"
post() -> List[Tuple[Dict[str, Any], URL]]\n

Make POST requests.

Returns:

Type Description List[Tuple[Dict[str, Any], URL]]

A list of response data and corresponding request URLs.

Source code in src/koheesio/asyncio/http.py
async def post(self) -> List[Tuple[Dict[str, Any], yarl.URL]]:\n    \"\"\"\n    Make POST requests.\n\n    Returns\n    -------\n    List[Tuple[Dict[str, Any], yarl.URL]]\n        A list of response data and corresponding request URLs.\n    \"\"\"\n    tasks = self.__tasks_generator(method=HttpMethod.POST)\n    responses_urls = await self._execute(tasks=tasks)\n\n    return responses_urls\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.put","title":"put async","text":"
put() -> List[Tuple[Dict[str, Any], URL]]\n

Make PUT requests.

Returns:

Type Description List[Tuple[Dict[str, Any], URL]]

A list of response data and corresponding request URLs.

Source code in src/koheesio/asyncio/http.py
async def put(self) -> List[Tuple[Dict[str, Any], yarl.URL]]:\n    \"\"\"\n    Make PUT requests.\n\n    Returns\n    -------\n    List[Tuple[Dict[str, Any], yarl.URL]]\n        A list of response data and corresponding request URLs.\n    \"\"\"\n    tasks = self.__tasks_generator(method=HttpMethod.PUT)\n    responses_urls = await self._execute(tasks=tasks)\n\n    return responses_urls\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.request","title":"request async","text":"
request(method: HttpMethod, url: URL, **kwargs) -> Tuple[Dict[str, Any], URL]\n

Make an HTTP request.

Parameters:

Name Type Description Default method HttpMethod

The HTTP method to use for the request.

required url URL

The URL to make the request to.

required kwargs Any

Additional keyword arguments to pass to the request.

{}

Returns:

Type Description Tuple[Dict[str, Any], URL]

A tuple containing the response data and the request URL.

Source code in src/koheesio/asyncio/http.py
async def request(\n    self,\n    method: HttpMethod,\n    url: yarl.URL,\n    **kwargs,\n) -> Tuple[Dict[str, Any], yarl.URL]:\n    \"\"\"\n    Make an HTTP request.\n\n    Parameters\n    ----------\n    method : HttpMethod\n        The HTTP method to use for the request.\n    url : yarl.URL\n        The URL to make the request to.\n    kwargs : Any\n        Additional keyword arguments to pass to the request.\n\n    Returns\n    -------\n    Tuple[Dict[str, Any], yarl.URL]\n        A tuple containing the response data and the request URL.\n    \"\"\"\n    async with self.__retry_client.request(method=method, url=url, **kwargs) as response:\n        res = await response.json()\n\n    return (res, response.request_info.url)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.set_outputs","title":"set_outputs","text":"
set_outputs(response)\n

Set the outputs of the step.

Parameters:

Name Type Description Default response Any

The response data.

required Source code in src/koheesio/asyncio/http.py
def set_outputs(self, response):\n    \"\"\"\n    Set the outputs of the step.\n\n    Parameters\n    ----------\n    response : Any\n        The response data.\n    \"\"\"\n    warnings.warn(\"set outputs is not implemented in AsyncHttpStep.\")\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.validate_timeout","title":"validate_timeout","text":"
validate_timeout(timeout)\n

Validate the 'timeout' field.

Parameters:

Name Type Description Default timeout Any

The value of the 'timeout' field.

required

Raises:

Type Description ValueError

If 'timeout' is provided; timeouts are not allowed in AsyncHttpStep and should be set through retry_options instead.

Source code in src/koheesio/asyncio/http.py
@field_validator(\"timeout\")\ndef validate_timeout(cls, timeout):\n    \"\"\"\n    Validate the 'data' field.\n\n    Parameters\n    ----------\n    data : Any\n        The value of the 'timeout' field.\n\n    Raises\n    ------\n    ValueError\n        If 'data' is not allowed in AsyncHttpStep.\n    \"\"\"\n    if timeout:\n        raise ValueError(\"timeout is not allowed in AsyncHttpStep. Provide timeout through retry_options.\")\n
"},{"location":"api_reference/integrations/index.html","title":"Integrations","text":"

Nothing to see here, move along.

"},{"location":"api_reference/integrations/box.html","title":"Box","text":"

Box Module

This module facilitates various interactions with the Box service. The implementation is based on the functionality available in the Box Python SDK: https://github.com/box/box-python-sdk

Prerequisites
  • Box Application is created in the developer portal using the JWT auth method (Developer Portal - My Apps - Create)
  • Application is authorized for the enterprise (Developer Portal - MyApp - Authorization)
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box","title":"koheesio.integrations.box.Box","text":"
Box(**data)\n

Configuration details required for the authentication can be obtained in the Box Developer Portal by generating the Public / Private key pair in \"Application Name -> Configuration -> Add and Manage Public Keys\".

The downloaded JSON file will look like this:

{\n  \"boxAppSettings\": {\n    \"clientID\": \"client_id\",\n    \"clientSecret\": \"client_secret\",\n    \"appAuth\": {\n      \"publicKeyID\": \"public_key_id\",\n      \"privateKey\": \"private_key\",\n      \"passphrase\": \"pass_phrase\"\n    }\n  },\n  \"enterpriseID\": \"123456\"\n}\n
This class is used as a base for the rest of Box integrations, however it can also be used separately to obtain the Box client which is created at class initialization.

Examples:

b = Box(\n    client_id=\"client_id\",\n    client_secret=\"client_secret\",\n    enterprise_id=\"enterprise_id\",\n    jwt_key_id=\"jwt_key_id\",\n    rsa_private_key_data=\"rsa_private_key_data\",\n    rsa_private_key_passphrase=\"rsa_private_key_passphrase\",\n)\nb.client\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n    super().__init__(**data)\n    self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.auth_options","title":"auth_options property","text":"
auth_options\n

Get a dictionary of authentication options, that can be handily used in the child classes

"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.client","title":"client class-attribute instance-attribute","text":"
client: SkipValidation[Client] = None\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.client_id","title":"client_id class-attribute instance-attribute","text":"
client_id: Union[SecretStr, SecretBytes] = Field(default=..., alias='clientID', description='Client ID from the Box Developer console.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.client_secret","title":"client_secret class-attribute instance-attribute","text":"
client_secret: Union[SecretStr, SecretBytes] = Field(default=..., alias='clientSecret', description='Client Secret from the Box Developer console.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.enterprise_id","title":"enterprise_id class-attribute instance-attribute","text":"
enterprise_id: Union[SecretStr, SecretBytes] = Field(default=..., alias='enterpriseID', description='Enterprise ID from the Box Developer console.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.jwt_key_id","title":"jwt_key_id class-attribute instance-attribute","text":"
jwt_key_id: Union[SecretStr, SecretBytes] = Field(default=..., alias='publicKeyID', description='PublicKeyID for the public/private generated key pair.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.rsa_private_key_data","title":"rsa_private_key_data class-attribute instance-attribute","text":"
rsa_private_key_data: Union[SecretStr, SecretBytes] = Field(default=..., alias='privateKey', description='Private key generated in the app management console.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.rsa_private_key_passphrase","title":"rsa_private_key_passphrase class-attribute instance-attribute","text":"
rsa_private_key_passphrase: Union[SecretStr, SecretBytes] = Field(default=..., alias='passphrase', description='Private key passphrase generated in the app management console.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/integrations/box.py
def execute(self):\n    # Plug to be able to unit test ABC\n    pass\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.init_client","title":"init_client","text":"
init_client()\n

Set up the Box client.

Source code in src/koheesio/integrations/box.py
def init_client(self):\n    \"\"\"Set up the Box client.\"\"\"\n    if not self.client:\n        self.client = Client(JWTAuth(**self.auth_options))\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvFileReader","title":"koheesio.integrations.box.BoxCsvFileReader","text":"
BoxCsvFileReader(**data)\n

This class facilitates reading one or multiple CSV files with the same structure directly from Box and producing a Spark DataFrame.

Notes

To manually identify the ID of a file in Box, open the file through the Web UI and copy the ID from the page URL, e.g. https://foo.ent.box.com/file/1234567890, where 1234567890 is the ID.

Examples:

from koheesio.steps.integrations.box import BoxCsvFileReader\nfrom pyspark.sql.types import StructType\n\nschema = StructType(...)\nb = BoxCsvFileReader(\n    client_id=\"\",\n    client_secret=\"\",\n    enterprise_id=\"\",\n    jwt_key_id=\"\",\n    rsa_private_key_data=\"\",\n    rsa_private_key_passphrase=\"\",\n    file=[\"1\", \"2\"],\n    schema=schema,\n).execute()\nb.df.show()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n    super().__init__(**data)\n    self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvFileReader.file","title":"file class-attribute instance-attribute","text":"
file: Union[str, list[str]] = Field(default=..., description='ID or list of IDs for the files to read.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvFileReader.execute","title":"execute","text":"
execute()\n

Loop through the list of provided file identifiers and load data into dataframe. For traceability purposes the following columns will be added to the dataframe: * meta_file_id: the identifier of the file on Box * meta_file_name: name of the file

Returns:

Type Description DataFrame Source code in src/koheesio/integrations/box.py
def execute(self):\n    \"\"\"\n    Loop through the list of provided file identifiers and load data into dataframe.\n    For traceability purposes the following columns will be added to the dataframe:\n        * meta_file_id: the identifier of the file on Box\n        * meta_file_name: name of the file\n\n    Returns\n    -------\n    DataFrame\n    \"\"\"\n    df = None\n    for f in self.file:\n        self.log.debug(f\"Reading contents of file with the ID '{f}' into Spark DataFrame\")\n        file = self.client.file(file_id=f)\n        data = file.content().decode(\"utf-8\").splitlines()\n        rdd = self.spark.sparkContext.parallelize(data)\n        temp_df = self.spark.read.csv(rdd, header=True, schema=self.schema_, **self.params)\n        temp_df = (\n            temp_df\n            # fmt: off\n            .withColumn(\"meta_file_id\", lit(file.object_id))\n            .withColumn(\"meta_file_name\", lit(file.get().name))\n            .withColumn(\"meta_load_timestamp\", expr(\"to_utc_timestamp(current_timestamp(), current_timezone())\"))\n            # fmt: on\n        )\n\n        df = temp_df if not df else df.union(temp_df)\n\n    self.output.df = df\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvPathReader","title":"koheesio.integrations.box.BoxCsvPathReader","text":"
BoxCsvPathReader(**data)\n

Read all CSV files from the specified path into the dataframe. Files can be filtered using the regular expression in the 'filter' parameter. The default behavior is to read all CSV / TXT files from the specified path.

Notes

The class does not contain archival capability as it is presumed that the user wants to make sure that the full pipeline is successful (for example, the source data was transformed and saved) prior to moving the source files. Use BoxToBoxFileMove class instead and provide the list of IDs from 'file_id' output.

Examples:

from koheesio.steps.integrations.box import BoxCsvPathReader\n\nauth_params = {...}\nb = BoxCsvPathReader(**auth_params, path=\"foo/bar/\").execute()\nb.df  # Spark Dataframe\n...  # do something with the dataframe\nfrom koheesio.steps.integrations.box import BoxToBoxFileMove\n\nbm = BoxToBoxFileMove(**auth_params, file=b.file_id, path=\"/foo/bar/archive\")\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n    super().__init__(**data)\n    self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvPathReader.filter","title":"filter class-attribute instance-attribute","text":"
filter: Optional[str] = Field(default='.csv|.txt$', description='[Optional] Regexp to filter folder contents')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvPathReader.path","title":"path class-attribute instance-attribute","text":"
path: str = Field(default=..., description='Box path')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvPathReader.execute","title":"execute","text":"
execute()\n

Identify the list of files from the source Box path that match desired filter and load them into Dataframe

Source code in src/koheesio/integrations/box.py
def execute(self):\n    \"\"\"\n    Identify the list of files from the source Box path that match desired filter and load them into Dataframe\n    \"\"\"\n    folder = BoxFolderGet.from_step(self).execute().folder\n\n    # Identify the list of files that should be processed\n    files = [item for item in folder.get_items() if item.type == \"file\" and re.search(self.filter, item.name)]\n\n    if len(files) > 0:\n        self.log.info(\n            f\"A total of {len(files)} files, that match the mask '{self.mask}' has been detected in {self.path}.\"\n            f\" They will be loaded into Spark Dataframe: {files}\"\n        )\n    else:\n        raise BoxPathIsEmptyError(f\"Path '{self.path}' is empty or none of files match the mask '{self.filter}'\")\n\n    file = [file_id.object_id for file_id in files]\n    self.output.df = BoxCsvFileReader.from_step(self, file=file).read()\n    self.output.file = file  # e.g. if files should be archived after pipeline is successful\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase","title":"koheesio.integrations.box.BoxFileBase","text":"
BoxFileBase(**data)\n

Generic class to facilitate interactions with Box folders.

The Box SDK provides a File class that has various properties and methods to interact with Box files. The object can be obtained in multiple ways: * provide a Box file identifier to the file parameter (the identifier can be obtained, for example, from the URL) * provide an existing object to the file parameter (boxsdk.object.file.File)

Notes

Refer to BoxFolderBase for more info about the folder and path parameters

See Also

boxsdk.object.file.File

Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n    super().__init__(**data)\n    self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase.files","title":"files class-attribute instance-attribute","text":"
files: conlist(Union[File, str], min_length=1) = Field(default=..., alias='file', description='List of Box file objects or identifiers')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase.folder","title":"folder class-attribute instance-attribute","text":"
folder: Optional[Union[Folder, str]] = Field(default=None, description='Existing folder object or folder identifier')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase.path","title":"path class-attribute instance-attribute","text":"
path: Optional[str] = Field(default=None, description='Path to the Box folder, for example: `folder/sub-folder/lz')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase.action","title":"action","text":"
action(file: File, folder: Folder)\n

Abstract class for File level actions.

Source code in src/koheesio/integrations/box.py
def action(self, file: File, folder: Folder):\n    \"\"\"\n    Abstract class for File level actions.\n    \"\"\"\n    raise NotImplementedError\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase.execute","title":"execute","text":"
execute()\n

Generic execute method for all BoxToBox interactions. Deals with getting the correct folder and file objects from various parameter inputs

Source code in src/koheesio/integrations/box.py
def execute(self):\n    \"\"\"\n    Generic execute method for all BoxToBox interactions. Deals with getting the correct folder and file objects\n    from various parameter inputs\n    \"\"\"\n    if self.path:\n        _folder = BoxFolderGet.from_step(self).execute().folder\n    else:\n        _folder = self.client.folder(folder_id=self.folder) if isinstance(self.folder, str) else self.folder\n\n    for _file in self.files:\n        _file = self.client.file(file_id=_file) if isinstance(_file, str) else _file\n        self.action(file=_file, folder=_folder)\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter","title":"koheesio.integrations.box.BoxFileWriter","text":"
BoxFileWriter(**data)\n

Write file or a file-like object to Box.

Examples:

from koheesio.steps.integrations.box import BoxFileWriter\n\nauth_params = {...}\nf1 = BoxFileWriter(**auth_params, path=\"/foo/bar\", file=\"path/to/my/file.ext\").execute()\n# or\nimport io\n\nb = io.BytesIO(b\"my-sample-data\")\nf2 = BoxFileWriter(**auth_params, path=\"/foo/bar\", file=b, name=\"file.ext\").execute()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n    super().__init__(**data)\n    self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.description","title":"description class-attribute instance-attribute","text":"
description: Optional[str] = Field(None, description='Optional description to add to the file in Box')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.file","title":"file class-attribute instance-attribute","text":"
file: Union[str, BytesIO] = Field(default=..., description='Path to file or a file-like object')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.file_name","title":"file_name class-attribute instance-attribute","text":"
file_name: Optional[str] = Field(default=None, description=\"When file path or name is provided to 'file' parameter, this will override the original name.When binary stream is provided, the 'name' should be used to set the desired name for the Box file.\")\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.Output","title":"Output","text":"

Output class for BoxFileWriter.

"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.Output.file","title":"file class-attribute instance-attribute","text":"
file: File = Field(default=..., description='File object in Box')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.Output.shared_link","title":"shared_link class-attribute instance-attribute","text":"
shared_link: str = Field(default=..., description='Shared link for the Box file')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.action","title":"action","text":"
action()\n
Source code in src/koheesio/integrations/box.py
def action(self):\n    _file = self.file\n    _name = self.file_name\n\n    if isinstance(_file, str):\n        _name = _name if _name else PurePath(_file).name\n        with open(_file, \"rb\") as f:\n            _file = BytesIO(f.read())\n\n    folder: Folder = BoxFolderGet.from_step(self, create_sub_folders=True).execute().folder\n    folder.preflight_check(size=0, name=_name)\n\n    self.log.info(f\"Uploading file '{_name}' to Box folder '{folder.get().name}'...\")\n    _box_file: File = folder.upload_stream(file_stream=_file, file_name=_name, file_description=self.description)\n\n    self.output.file = _box_file\n    self.output.shared_link = _box_file.get_shared_link()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.execute","title":"execute","text":"
execute() -> Output\n
Source code in src/koheesio/integrations/box.py
def execute(self) -> Output:\n    self.action()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.validate_name_for_binary_data","title":"validate_name_for_binary_data","text":"
validate_name_for_binary_data(values)\n

Validate 'file_name' parameter when providing a binary input for 'file'.

Source code in src/koheesio/integrations/box.py
@model_validator(mode=\"before\")\ndef validate_name_for_binary_data(cls, values):\n    \"\"\"Validate 'file_name' parameter when providing a binary input for 'file'.\"\"\"\n    file, file_name = values.get(\"file\"), values.get(\"file_name\")\n    if not isinstance(file, str) and not file_name:\n        raise AttributeError(\"The parameter 'file_name' is mandatory when providing a binary input for 'file'.\")\n\n    return values\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase","title":"koheesio.integrations.box.BoxFolderBase","text":"
BoxFolderBase(**data)\n

Generic class to facilitate interactions with Box folders.

The Box SDK provides a Folder class that has various properties and methods to interact with Box folders. The object can be obtained in multiple ways: * provide a Box folder identifier to the folder parameter (the identifier can be obtained, for example, from the URL) * provide an existing object to the folder parameter (boxsdk.object.folder.Folder) * provide a filesystem-like path to the path parameter

See Also

boxsdk.object.folder.Folder

Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n    super().__init__(**data)\n    self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.folder","title":"folder class-attribute instance-attribute","text":"
folder: Optional[Union[Folder, str]] = Field(default=None, description='Existing folder object or folder identifier')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.path","title":"path class-attribute instance-attribute","text":"
path: Optional[str] = Field(default=None, description='Path to the Box folder, for example: `folder/sub-folder/lz')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.root","title":"root class-attribute instance-attribute","text":"
root: Optional[Union[Folder, str]] = Field(default='0', description='Folder object or identifier of the folder that should be used as root')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.Output","title":"Output","text":"

Define outputs for the BoxFolderBase class

"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.Output.folder","title":"folder class-attribute instance-attribute","text":"
folder: Optional[Folder] = Field(default=None, description='Box folder object')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.action","title":"action","text":"
action()\n

Placeholder for the 'action' method, which should be implemented in the child classes

Returns:

Type Description Folder or None Source code in src/koheesio/integrations/box.py
def action(self):\n    \"\"\"\n    Placeholder for 'action' method, that should be implemented in the child classes\n\n    Returns\n    -------\n        Folder or None\n    \"\"\"\n    raise NotImplementedError\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.execute","title":"execute","text":"
execute() -> Output\n
Source code in src/koheesio/integrations/box.py
def execute(self) -> Output:\n    self.output.folder = self.action()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.validate_folder_or_path","title":"validate_folder_or_path","text":"
validate_folder_or_path()\n

Validations for 'folder' and 'path' parameter usage

Source code in src/koheesio/integrations/box.py
@model_validator(mode=\"after\")\ndef validate_folder_or_path(self):\n    \"\"\"\n    Validations for 'folder' and 'path' parameter usage\n    \"\"\"\n    folder_value = self.folder\n    path_value = self.path\n\n    if folder_value and path_value:\n        raise AttributeError(\"Cannot user 'folder' and 'path' parameter at the same time\")\n\n    if not folder_value and not path_value:\n        raise AttributeError(\"Neither 'folder' nor 'path' parameters are set\")\n\n    return self\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderCreate","title":"koheesio.integrations.box.BoxFolderCreate","text":"
BoxFolderCreate(**data)\n

Explicitly create the new Box folder object and parent directories.

Examples:

from koheesio.steps.integrations.box import BoxFolderCreate\n\nauth_params = {...}\nfolder = BoxFolderCreate(**auth_params, path=\"/foo/bar\").execute()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n    super().__init__(**data)\n    self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderCreate.create_sub_folders","title":"create_sub_folders class-attribute instance-attribute","text":"
create_sub_folders: bool = Field(default=True, description='Create sub-folders recursively if the path does not exist.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderCreate.validate_folder","title":"validate_folder","text":"
validate_folder(folder)\n

Validate 'folder' parameter

Source code in src/koheesio/integrations/box.py
@field_validator(\"folder\")\ndef validate_folder(cls, folder):\n    \"\"\"\n    Validate 'folder' parameter\n    \"\"\"\n    if folder:\n        raise AttributeError(\"Only 'path' parameter is allowed in the context of folder creation.\")\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderDelete","title":"koheesio.integrations.box.BoxFolderDelete","text":"
BoxFolderDelete(**data)\n

Delete existing Box folder based on object, identifier or path.

Examples:

from koheesio.steps.integrations.box import BoxFolderDelete\n\nauth_params = {...}\nBoxFolderDelete(**auth_params, path=\"/foo/bar\").execute()\n# or\nBoxFolderDelete(**auth_params, folder=\"1\").execute()\n# or\nfolder = BoxFolderGet(**auth_params, path=\"/foo/bar\").execute().folder\nBoxFolderDelete(**auth_params, folder=folder).execute()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n    super().__init__(**data)\n    self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderDelete.action","title":"action","text":"
action()\n

Delete folder action

Returns:

Type Description None Source code in src/koheesio/integrations/box.py
def action(self):\n    \"\"\"\n    Delete folder action\n\n    Returns\n    -------\n        None\n    \"\"\"\n    if self.folder:\n        folder = self._obj_from_id\n    else:  # path\n        folder = BoxFolderGet.from_step(self).action()\n\n    self.log.info(f\"Deleting Box folder '{folder}'...\")\n    folder.delete()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderGet","title":"koheesio.integrations.box.BoxFolderGet","text":"
BoxFolderGet(**data)\n

Get the Box folder object for an existing folder or create a new folder and parent directories.

Examples:

from koheesio.steps.integrations.box import BoxFolderGet\n\nauth_params = {...}\nfolder = BoxFolderGet(**auth_params, path=\"/foo/bar\").execute().folder\n# or\nfolder = BoxFolderGet(**auth_params, path=\"1\").execute().folder\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n    super().__init__(**data)\n    self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderGet.create_sub_folders","title":"create_sub_folders class-attribute instance-attribute","text":"
create_sub_folders: Optional[bool] = Field(False, description='Create sub-folders recursively if the path does not exist.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderGet.action","title":"action","text":"
action()\n

Get folder action

Returns:

Name Type Description folder Folder

Box Folder object as specified in Box SDK

Source code in src/koheesio/integrations/box.py
def action(self):\n    \"\"\"\n    Get folder action\n\n    Returns\n    -------\n    folder: Folder\n        Box Folder object as specified in Box SDK\n    \"\"\"\n    current_folder_object = None\n\n    if self.folder:\n        current_folder_object = self._obj_from_id\n\n    if self.path:\n        cleaned_path_parts = [p for p in PurePath(self.path).parts if p.strip() not in [None, \"\", \" \", \"/\"]]\n        current_folder_object = self.client.folder(folder_id=self.root) if isinstance(self.root, str) else self.root\n\n        for next_folder_name in cleaned_path_parts:\n            current_folder_object = self._get_or_create_folder(current_folder_object, next_folder_name)\n\n    self.log.info(f\"Folder identified or created: {current_folder_object}\")\n    return current_folder_object\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderNotFoundError","title":"koheesio.integrations.box.BoxFolderNotFoundError","text":"

Error when a provided box path does not exist.

"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxPathIsEmptyError","title":"koheesio.integrations.box.BoxPathIsEmptyError","text":"

Exception when provided Box path is empty or no files matched the mask.

"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase","title":"koheesio.integrations.box.BoxReaderBase","text":"
BoxReaderBase(**data)\n

Base class for Box readers.

Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n    super().__init__(**data)\n    self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase.params","title":"params class-attribute instance-attribute","text":"
params: Optional[Dict[str, Any]] = Field(default_factory=dict, description='[Optional] Set of extra parameters that should be passed to the Spark reader.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase.schema_","title":"schema_ class-attribute instance-attribute","text":"
schema_: Optional[StructType] = Field(None, alias='schema', description='[Optional] Schema that will be applied during the creation of Spark DataFrame')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase.Output","title":"Output","text":"

Make the default reader output optional to gracefully handle 'no files in folder' cases.

"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase.Output.df","title":"df class-attribute instance-attribute","text":"
df: Optional[DataFrame] = Field(default=None, description='The Spark DataFrame')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase.execute","title":"execute abstractmethod","text":"
execute() -> Output\n
Source code in src/koheesio/integrations/box.py
@abstractmethod\ndef execute(self) -> Output:\n    raise NotImplementedError\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxToBoxFileCopy","title":"koheesio.integrations.box.BoxToBoxFileCopy","text":"
BoxToBoxFileCopy(**data)\n

Copy one or multiple files to the target Box path.

Examples:

from koheesio.steps.integrations.box import BoxToBoxFileCopy\n\nauth_params = {...}\nBoxToBoxFileCopy(**auth_params, file=[\"1\", \"2\"], path=\"/foo/bar\").execute()\n# or\nBoxToBoxFileCopy(**auth_params, file=[\"1\", \"2\"], folder=\"1\").execute()\n# or\nfolder = BoxFolderGet(**auth_params, path=\"/foo/bar\").execute().folder\nBoxToBoxFileCopy(**auth_params, file=[File(), File()], folder=folder).execute()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n    super().__init__(**data)\n    self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxToBoxFileCopy.action","title":"action","text":"
action(file: File, folder: Folder)\n

Copy file to the desired destination and extend file description with the processing info

Parameters:

Name Type Description Default file File

File object as specified in Box SDK

required folder Folder

Folder object as specified in Box SDK

required Source code in src/koheesio/integrations/box.py
def action(self, file: File, folder: Folder):\n    \"\"\"\n    Copy file to the desired destination and extend file description with the processing info\n\n    Parameters\n    ----------\n    file: File\n        File object as specified in Box SDK\n    folder: Folder\n        Folder object as specified in Box SDK\n    \"\"\"\n    self.log.info(f\"Copying '{file.get()}' to '{folder.get()}'...\")\n    file.copy(parent_folder=folder).update_info(\n        data={\"description\": \"\\n\".join([f\"File processed on {datetime.utcnow()}\", file.get()[\"description\"]])}\n    )\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxToBoxFileMove","title":"koheesio.integrations.box.BoxToBoxFileMove","text":"
BoxToBoxFileMove(**data)\n

Move one or multiple files to the target Box path

Examples:

from koheesio.steps.integrations.box import BoxToBoxFileMove\n\nauth_params = {...}\nBoxToBoxFileMove(**auth_params, file=[\"1\", \"2\"], path=\"/foo/bar\").execute()\n# or\nBoxToBoxFileMove(**auth_params, file=[\"1\", \"2\"], folder=\"1\").execute()\n# or\nfolder = BoxFolderGet(**auth_params, path=\"/foo/bar\").execute().folder\nBoxToBoxFileMove(**auth_params, file=[File(), File()], folder=folder).execute()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n    super().__init__(**data)\n    self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxToBoxFileMove.action","title":"action","text":"
action(file: File, folder: Folder)\n

Move file to the desired destination and extend file description with the processing info

Parameters:

Name Type Description Default file File

File object as specified in Box SDK

required folder Folder

Folder object as specified in Box SDK

required Source code in src/koheesio/integrations/box.py
def action(self, file: File, folder: Folder):\n    \"\"\"\n    Move file to the desired destination and extend file description with the processing info\n\n    Parameters\n    ----------\n    file: File\n        File object as specified in Box SDK\n    folder: Folder\n        Folder object as specified in Box SDK\n    \"\"\"\n    self.log.info(f\"Moving '{file.get()}' to '{folder.get()}'...\")\n    file.move(parent_folder=folder).update_info(\n        data={\"description\": \"\\n\".join([f\"File processed on {datetime.utcnow()}\", file.get()[\"description\"]])}\n    )\n
"},{"location":"api_reference/integrations/spark/index.html","title":"Spark","text":""},{"location":"api_reference/integrations/spark/sftp.html","title":"Sftp","text":"

This module contains the SFTPWriter class and the SFTPWriteMode enum.

The SFTPWriter class is used to write data to a file on an SFTP server. It uses the Paramiko library to establish an SFTP connection and write data to the server. The data to be written is provided by a BufferWriter, which generates the data in a buffer. See the docstring of the SFTPWriter class for more details. Refer to koheesio.spark.writers.buffer for more details on the BufferWriter interface.

The SFTPWriteMode enum defines the different write modes that the SFTPWriter can use. These modes determine how the SFTPWriter behaves when the file it is trying to write to already exists on the server. For more details on each mode, see the docstring of the SFTPWriteMode enum.

"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode","title":"koheesio.integrations.spark.sftp.SFTPWriteMode","text":"

The different write modes for the SFTPWriter.

"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--overwrite","title":"OVERWRITE:","text":"
  • If the file exists, it will be overwritten.
  • If it does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--append","title":"APPEND:","text":"
  • If the file exists, the new data will be appended to it.
  • If it does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--ignore","title":"IGNORE:","text":"
  • If the file exists, the method will return without writing anything.
  • If it does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--exclusive","title":"EXCLUSIVE:","text":"
  • If the file exists, an error will be raised.
  • If it does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--backup","title":"BACKUP:","text":"
  • If the file exists and the new data is different from the existing data, a backup will be created and the file will be overwritten.
  • If it does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--update","title":"UPDATE:","text":"
  • If the file exists and the new data is different from the existing data, the file will be overwritten.
  • If the file exists and the new data is the same as the existing data, the method will return without writing anything.
  • If the file does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.APPEND","title":"APPEND class-attribute instance-attribute","text":"
APPEND = 'append'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.BACKUP","title":"BACKUP class-attribute instance-attribute","text":"
BACKUP = 'backup'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.EXCLUSIVE","title":"EXCLUSIVE class-attribute instance-attribute","text":"
EXCLUSIVE = 'exclusive'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.IGNORE","title":"IGNORE class-attribute instance-attribute","text":"
IGNORE = 'ignore'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.OVERWRITE","title":"OVERWRITE class-attribute instance-attribute","text":"
OVERWRITE = 'overwrite'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.UPDATE","title":"UPDATE class-attribute instance-attribute","text":"
UPDATE = 'update'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.write_mode","title":"write_mode property","text":"
write_mode\n

Return the write mode for the given SFTPWriteMode.

"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.from_string","title":"from_string classmethod","text":"
from_string(mode: str)\n

Return the SFTPWriteMode for the given string.

Source code in src/koheesio/integrations/spark/sftp.py
@classmethod\ndef from_string(cls, mode: str):\n    \"\"\"Return the SFTPWriteMode for the given string.\"\"\"\n    return cls[mode.upper()]\n
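A quick sketch of resolving a mode from a string:

```python
from koheesio.integrations.spark.sftp import SFTPWriteMode

# Lookup is case-insensitive: the string is upper-cased and used as the member name
mode = SFTPWriteMode.from_string("overwrite")
assert mode is SFTPWriteMode.OVERWRITE

# mode.write_mode exposes the corresponding file-open mode used when writing over SFTP
```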
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter","title":"koheesio.integrations.spark.sftp.SFTPWriter","text":"

Write a Dataframe to SFTP through a BufferWriter

Concept
  • This class uses Paramiko to connect to an SFTP server and write the contents of a buffer to a file on the server.
  • This implementation takes inspiration from https://github.com/springml/spark-sftp

Parameters:

Name Type Description Default path Union[str, Path]

Path to the folder to write to

required file_name Optional[str]

Name of the file. If not provided, the file name is expected to be part of the path. Make sure to add the desired file extension.

None host str

SFTP Host

required port int

SFTP Port

required username SecretStr

SFTP Server Username

None password SecretStr

SFTP Server Password

None buffer_writer BufferWriter

This is the writer that will generate the body of the file that will be written to the specified file through SFTP. Details on how the DataFrame is written to the buffer should be implemented in the implementation of the BufferWriter class. Any BufferWriter can be used here, as long as it implements the BufferWriter interface.

required mode

Write mode: overwrite, append, ignore, exclusive, backup, or update. See the docstring of SFTPWriteMode for more details.

required"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.buffer_writer","title":"buffer_writer class-attribute instance-attribute","text":"
buffer_writer: InstanceOf[BufferWriter] = Field(default=..., description='This is the writer that will generate the body of the file that will be written to the specified file through SFTP. Details on how the DataFrame is written to the buffer should be implemented in the implementation of the BufferWriter class. Any BufferWriter can be used here, as long as it implements the BufferWriter interface.')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.client","title":"client property","text":"
client: SFTPClient\n

Return the SFTP client. If it doesn't exist, create it.

"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.file_name","title":"file_name class-attribute instance-attribute","text":"
file_name: Optional[str] = Field(default=None, description='Name of the file. If not provided, the file name is expected to be part of the path. Make sure to add the desired file extension!', alias='filename')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.host","title":"host class-attribute instance-attribute","text":"
host: str = Field(default=..., description='SFTP Host')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.mode","title":"mode class-attribute instance-attribute","text":"
mode: SFTPWriteMode = Field(default=OVERWRITE, description='Write mode: overwrite, append, ignore, exclusive, backup, or update.' + __doc__)\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.password","title":"password class-attribute instance-attribute","text":"
password: Optional[SecretStr] = Field(default=None, description='SFTP Server Password')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.path","title":"path class-attribute instance-attribute","text":"
path: Union[str, Path] = Field(default=..., description='Path to the folder to write to', alias='prefix')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.port","title":"port class-attribute instance-attribute","text":"
port: int = Field(default=..., description='SFTP Port')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.transport","title":"transport property","text":"
transport\n

Return the transport for the SFTP connection. If it doesn't exist, create it.

If the username and password are provided, use them to connect to the SFTP server.

"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.username","title":"username class-attribute instance-attribute","text":"
username: Optional[SecretStr] = Field(default=None, description='SFTP Server Username')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.write_mode","title":"write_mode property","text":"
write_mode\n

Return the write mode for the given SFTPWriteMode.

"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.check_file_exists","title":"check_file_exists","text":"
check_file_exists(file_path: str) -> bool\n

Check if a file exists on the SFTP server.

Source code in src/koheesio/integrations/spark/sftp.py
def check_file_exists(self, file_path: str) -> bool:\n    \"\"\"\n    Check if a file exists on the SFTP server.\n    \"\"\"\n    try:\n        self.client.stat(file_path)\n        return True\n    except IOError:\n        return False\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/integrations/spark/sftp.py
def execute(self):\n    buffer_output: InstanceOf[BufferWriter.Output] = self.buffer_writer.write(self.df)\n\n    # write buffer to the SFTP server\n    try:\n        self._handle_write_mode(self.path.as_posix(), buffer_output)\n    finally:\n        self._close_client()\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.validate_path_and_file_name","title":"validate_path_and_file_name","text":"
validate_path_and_file_name(data: dict) -> dict\n

Validate the path, make sure path and file_name are Path objects.

Source code in src/koheesio/integrations/spark/sftp.py
@model_validator(mode=\"before\")\ndef validate_path_and_file_name(cls, data: dict) -> dict:\n    \"\"\"Validate the path, make sure path and file_name are Path objects.\"\"\"\n    path_or_str = data.get(\"path\")\n\n    if isinstance(path_or_str, str):\n        # make sure the path is a Path object\n        path_or_str = Path(path_or_str)\n\n    if not isinstance(path_or_str, Path):\n        raise ValueError(f\"Invalid path: {path_or_str}\")\n\n    if file_name := data.get(\"file_name\", data.get(\"filename\")):\n        path_or_str = path_or_str / file_name\n        try:\n            del data[\"filename\"]\n        except KeyError:\n            pass\n        data[\"file_name\"] = file_name\n\n    data[\"path\"] = path_or_str\n    return data\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.validate_sftp_host","title":"validate_sftp_host","text":"
validate_sftp_host(v) -> str\n

Validate the host

Source code in src/koheesio/integrations/spark/sftp.py
@field_validator(\"host\")\ndef validate_sftp_host(cls, v) -> str:\n    \"\"\"Validate the host\"\"\"\n    # remove the sftp:// prefix if present\n    if v.startswith(\"sftp://\"):\n        v = v.replace(\"sftp://\", \"\")\n\n    # remove the trailing slash if present\n    if v.endswith(\"/\"):\n        v = v[:-1]\n\n    return v\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.write_file","title":"write_file","text":"
write_file(file_path: str, buffer_output: InstanceOf[Output])\n

Using Paramiko, write the data in the buffer to SFTP.

Source code in src/koheesio/integrations/spark/sftp.py
def write_file(self, file_path: str, buffer_output: InstanceOf[BufferWriter.Output]):\n    \"\"\"\n    Using Paramiko, write the data in the buffer to SFTP.\n    \"\"\"\n    with self.client.open(file_path, self.write_mode) as file:\n        self.log.debug(f\"Writing file {file_path} to SFTP...\")\n        file.write(buffer_output.read())\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendCsvToSftp","title":"koheesio.integrations.spark.sftp.SendCsvToSftp","text":"

Write a DataFrame to an SFTP server as a CSV file.

This class uses the PandasCsvBufferWriter to generate the CSV data and the SFTPWriter to write the data to the SFTP server.

Example
import os\n\nfrom koheesio.spark.writers import SendCsvToSftp\n\nwriter = SendCsvToSftp(\n    # SFTP Parameters\n    host=\"sftp.example.com\",\n    port=22,\n    username=\"user\",\n    password=\"password\",\n    path=\"/path/to/folder\",\n    file_name=\"file.tsv.gz\",\n    # CSV Parameters\n    header=True,\n    sep=\"\\t\",\n    quote='\"',\n    timestampFormat=\"%Y-%m-%d\",\n    lineSep=os.linesep,\n    compression=\"gzip\",\n    index=False,\n)\n\nwriter.write(df)\n

In this example, the DataFrame df is written to the file file.tsv.gz in the folder /path/to/folder on the SFTP server. The file is written as a CSV file with a tab delimiter (TSV), double quotes as the quote character, and gzip compression.

Parameters:

Name Type Description Default path Union[str, Path]

Path to the folder to write to.

required file_name Optional[str]

Name of the file. If not provided, it's expected to be part of the path.

required host str

SFTP Host.

required port int

SFTP Port.

required username SecretStr

SFTP Server Username.

required password SecretStr

SFTP Server Password.

required mode

Write mode: overwrite, append, ignore, exclusive, backup, or update.

required header

Whether to write column names as the first line. Default is True.

required sep

Field delimiter for the output file. Default is ','.

required quote

Character used to quote fields. Default is '\"'.

required quoteAll

Whether all values should be enclosed in quotes. Default is False.

required escape

Character used to escape sep and quote when needed. Default is '\\'.

required timestampFormat

Date format for datetime objects. Default is '%Y-%m-%dT%H:%M:%S.%f'.

required lineSep

Character used as line separator. Default is os.linesep.

required compression

Compression to use for the output data. Default is None.

required

For more details on the CSV parameters, refer to the PandasCsvBufferWriter class documentation.

"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendCsvToSftp.buffer_writer","title":"buffer_writer class-attribute instance-attribute","text":"
buffer_writer: PandasCsvBufferWriter = Field(default=None, validate_default=False)\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendCsvToSftp.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/integrations/spark/sftp.py
def execute(self):\n    SFTPWriter.execute(self)\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendCsvToSftp.set_up_buffer_writer","title":"set_up_buffer_writer","text":"
set_up_buffer_writer() -> SendCsvToSftp\n

Set up the buffer writer, passing all CSV related options to it.

Source code in src/koheesio/integrations/spark/sftp.py
@model_validator(mode=\"after\")\ndef set_up_buffer_writer(self) -> \"SendCsvToSftp\":\n    \"\"\"Set up the buffer writer, passing all CSV related options to it.\"\"\"\n    self.buffer_writer = PandasCsvBufferWriter(**self.get_options(options_type=\"kohesio_pandas_buffer_writer\"))\n    return self\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendJsonToSftp","title":"koheesio.integrations.spark.sftp.SendJsonToSftp","text":"

Write a DataFrame to an SFTP server as a JSON file.

This class uses the PandasJsonBufferWriter to generate the JSON data and the SFTPWriter to write the data to the SFTP server.

Example
from koheesio.spark.writers import SendJsonToSftp\n\nwriter = SendJsonToSftp(\n    # SFTP Parameters (Inherited from SFTPWriter)\n    host=\"sftp.example.com\",\n    port=22,\n    username=\"user\",\n    password=\"password\",\n    path=\"/path/to/folder\",\n    file_name=\"file.json.gz\",\n    # JSON Parameters (Inherited from PandasJsonBufferWriter)\n    orient=\"records\",\n    date_format=\"iso\",\n    double_precision=2,\n    date_unit=\"ms\",\n    lines=False,\n    compression=\"gzip\",\n    index=False,\n)\n\nwriter.write(df)\n

In this example, the DataFrame df is written to the file file.json.gz in the folder /path/to/folder on the SFTP server. The file is written as a JSON file with gzip compression.

Parameters:

Name Type Description Default path Union[str, Path]

Path to the folder on the SFTP server.

required file_name Optional[str]

Name of the file, including extension. If not provided, expected to be part of the path.

required host str

SFTP Host.

required port int

SFTP Port.

required username SecretStr

SFTP Server Username.

required password SecretStr

SFTP Server Password.

required mode

Write mode: overwrite, append, ignore, exclusive, backup, or update.

required orient

Format of the JSON string. Default is 'records'.

required lines

If True, output is one JSON object per line. Only used when orient='records'. Default is True.

required date_format

Type of date conversion. Default is 'iso'.

required double_precision

Decimal places for encoding floating point values. Default is 10.

required force_ascii

If True, encoded string is ASCII. Default is True.

required compression

Compression to use for output data. Default is None.

required See Also

For more details on the JSON parameters, refer to the PandasJsonBufferWriter class documentation.

"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendJsonToSftp.buffer_writer","title":"buffer_writer class-attribute instance-attribute","text":"
buffer_writer: PandasJsonBufferWriter = Field(default=None, validate_default=False)\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendJsonToSftp.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/integrations/spark/sftp.py
def execute(self):\n    SFTPWriter.execute(self)\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendJsonToSftp.set_up_buffer_writer","title":"set_up_buffer_writer","text":"
set_up_buffer_writer() -> SendJsonToSftp\n

Set up the buffer writer, passing all JSON related options to it.

Source code in src/koheesio/integrations/spark/sftp.py
@model_validator(mode=\"after\")\ndef set_up_buffer_writer(self) -> \"SendJsonToSftp\":\n    \"\"\"Set up the buffer writer, passing all JSON related options to it.\"\"\"\n    self.buffer_writer = PandasJsonBufferWriter(\n        **self.get_options(), compression=self.compression, columns=self.columns\n    )\n    return self\n
"},{"location":"api_reference/integrations/spark/dq/index.html","title":"Dq","text":""},{"location":"api_reference/integrations/spark/dq/spark_expectations.html","title":"Spark expectations","text":"

Koheesio step for running data quality rules with the Spark Expectations engine.

"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation","title":"koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation","text":"

Run DQ rules for an input dataframe with the Spark Expectations engine.

References

Spark Expectations: https://engineering.nike.com/spark-expectations/1.0.0/
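
To give a sense of how this step is typically wired up, here is a minimal usage sketch. It is illustrative only: the product identifier, table names, and input DataFrame are hypothetical, and it assumes the standard Koheesio Transformation interface where transform(df) executes the step and returns the resulting DataFrame.

from koheesio.integrations.spark.dq.spark_expectations import SparkExpectationsTransformation\n\n# hypothetical identifiers and table names\ndq_step = SparkExpectationsTransformation(\n    product_id=\"my_product\",\n    rules_table=\"catalog.dq.product_rules\",\n    statistics_table=\"catalog.dq.dq_stats\",\n    target_table=\"catalog.schema.target_table\",\n)\n\n# runs the rules via the Spark Expectations decorator and returns the validated DataFrame\nvalidated_df = dq_step.transform(input_df)\n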

"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.drop_meta_column","title":"drop_meta_column class-attribute instance-attribute","text":"
drop_meta_column: bool = Field(default=False, alias='drop_meta_columns', description='Whether to drop meta columns added by spark expectations on the output df')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.enable_debugger","title":"enable_debugger class-attribute instance-attribute","text":"
enable_debugger: bool = Field(default=False, alias='debugger', description='...')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.error_writer_format","title":"error_writer_format class-attribute instance-attribute","text":"
error_writer_format: Optional[str] = Field(default='delta', alias='dataframe_writer_format', description='The format used to write to the err table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.error_writer_mode","title":"error_writer_mode class-attribute instance-attribute","text":"
error_writer_mode: Optional[Union[str, BatchOutputMode]] = Field(default=APPEND, alias='dataframe_writer_mode', description='The write mode that will be used to write to the err table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.error_writing_options","title":"error_writing_options class-attribute instance-attribute","text":"
error_writing_options: Optional[Dict[str, str]] = Field(default_factory=dict, alias='error_writing_options', description='Options for writing to the error table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.format","title":"format class-attribute instance-attribute","text":"
format: str = Field(default='delta', alias='dataframe_writer_format', description='The format used to write to the stats and err table. Separate output formats can be specified for each table using the error_writer_format and stats_writer_format params')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.mode","title":"mode class-attribute instance-attribute","text":"
mode: Union[str, BatchOutputMode] = Field(default=APPEND, alias='dataframe_writer_mode', description='The write mode that will be used to write to the err and stats table. Separate output modes can be specified for each table using the error_writer_mode and stats_writer_mode params')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.product_id","title":"product_id class-attribute instance-attribute","text":"
product_id: str = Field(default=..., description='Spark Expectations product identifier')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.rules_table","title":"rules_table class-attribute instance-attribute","text":"
rules_table: str = Field(default=..., alias='product_rules_table', description='DQ rules table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.se_user_conf","title":"se_user_conf class-attribute instance-attribute","text":"
se_user_conf: Dict[str, Any] = Field(default={se_notifications_enable_email: False, se_notifications_enable_slack: False}, alias='user_conf', description='SE user provided confs', validate_default=False)\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.statistics_streaming","title":"statistics_streaming class-attribute instance-attribute","text":"
statistics_streaming: Dict[str, Any] = Field(default={se_enable_streaming: False}, alias='stats_streaming_options', description='SE stats streaming options ', validate_default=False)\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.statistics_table","title":"statistics_table class-attribute instance-attribute","text":"
statistics_table: str = Field(default=..., alias='dq_stats_table_name', description='DQ stats table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.stats_writer_format","title":"stats_writer_format class-attribute instance-attribute","text":"
stats_writer_format: Optional[str] = Field(default='delta', alias='stats_writer_format', description='The format used to write to the stats table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.stats_writer_mode","title":"stats_writer_mode class-attribute instance-attribute","text":"
stats_writer_mode: Optional[Union[str, BatchOutputMode]] = Field(default=APPEND, alias='stats_writer_mode', description='The write mode that will be used to write to the stats table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.target_table","title":"target_table class-attribute instance-attribute","text":"
target_table: str = Field(default=..., alias='target_table_name', description=\"The table that will contain good records. Won't write to it, but will write to the err table with same name plus _err suffix\")\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.Output","title":"Output","text":"

Output of the SparkExpectationsTransformation step.

"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.Output.error_table_writer","title":"error_table_writer class-attribute instance-attribute","text":"
error_table_writer: WrappedDataFrameWriter = Field(default=..., description='Spark Expectations error table writer')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.Output.rules_df","title":"rules_df class-attribute instance-attribute","text":"
rules_df: DataFrame = Field(default=..., description='Output dataframe')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.Output.se","title":"se class-attribute instance-attribute","text":"
se: SparkExpectations = Field(default=..., description='Spark Expectations object')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.Output.stats_table_writer","title":"stats_table_writer class-attribute instance-attribute","text":"
stats_table_writer: WrappedDataFrameWriter = Field(default=..., description='Spark Expectations stats table writer')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.execute","title":"execute","text":"
execute() -> Output\n

Apply data quality rules to a dataframe using the out-of-the-box SE decorator

Source code in src/koheesio/integrations/spark/dq/spark_expectations.py
def execute(self) -> Output:\n    \"\"\"\n    Apply data quality rules to a dataframe using the out-of-the-box SE decorator\n    \"\"\"\n    # read rules table\n    rules_df = self.spark.read.table(self.rules_table).cache()\n    self.output.rules_df = rules_df\n\n    @self._se.with_expectations(\n        target_table=self.target_table,\n        user_conf=self.se_user_conf,\n        # Below params are `False` by default, however exposing them here for extra visibility\n        # The writes can be handled by downstream Koheesio steps\n        write_to_table=False,\n        write_to_temp_table=False,\n    )\n    def inner(df: DataFrame) -> DataFrame:\n        \"\"\"Just a wrapper to be able to use Spark Expectations decorator\"\"\"\n        return df\n\n    output_df = inner(self.df)\n\n    if self.drop_meta_column:\n        output_df = output_df.drop(\"meta_dq_run_id\", \"meta_dq_run_datetime\")\n\n    self.output.df = output_df\n
"},{"location":"api_reference/models/index.html","title":"Models","text":"

Models package creates models that can be used to base other classes on.

  • Every model should be at least a pydantic BaseModel, but can also be a Step, or a StepOutput.
  • Every model is expected to be an ABC (Abstract Base Class)
  • Optionally, a model can inherit ExtraParamsMixin, which unpacks arbitrary kwargs into the extra_params dict property, removing the need to create a dict before passing kwargs to a model initializer.

A Model class can be exceptionally handy when you need similar Pydantic models in multiple places, for example across Transformation and Reader classes.
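
As a small illustration of that pattern, the sketch below (class names are made up) defines a shared abstract model whose fields are reused by two otherwise unrelated models:

from abc import ABC\n\nfrom koheesio.models import BaseModel\n\n\nclass TableConfig(BaseModel, ABC):\n    \"\"\"Shared, reusable fields for anything that touches a table.\"\"\"\n\n    table: str\n    schema_name: str\n\n\n# hypothetical concrete models reusing the same fields\nclass MyReaderConfig(TableConfig):\n    limit: int = 100\n\n\nclass MyWriterConfig(TableConfig):\n    write_mode: str = \"append\"\n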

"},{"location":"api_reference/models/index.html#koheesio.models.ListOfColumns","title":"koheesio.models.ListOfColumns module-attribute","text":"
ListOfColumns = Annotated[List[str], BeforeValidator(_list_of_columns_validation)]\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel","title":"koheesio.models.BaseModel","text":"

Base model for all models.

Extends pydantic BaseModel with some additional configuration. To be used as a base class for all models in Koheesio instead of pydantic.BaseModel.

Additional methods and properties: Different Modes

This BaseModel class supports lazy mode. This means that validation of the items stored in the class can be called at will instead of being forced to run it upfront.

  • Normal mode: you need to know the values ahead of time

    normal_mode = YourOwnModel(a=\"foo\", b=42)\n

  • Lazy mode: being able to defer the validation until later

    lazy_mode = YourOwnModel.lazy()\nlazy_mode.a = \"foo\"\nlazy_mode.b = 42\nlazy_mode.validate_output()\n
    The prime advantage of using lazy mode is that you don't have to know all your outputs up front, and can add them as they become available. All while still being able to validate that you have collected all your output at the end.

  • With statements: With statements are also allowed. The validate_output method from the earlier example will run upon exit of the with-statement.

    with YourOwnModel.lazy() as with_output:\n    with_output.a = \"foo\"\n    with_output.b = 42\n
    Note that a lazy-mode BaseModel object is required to work with a with-statement.

Examples:

from koheesio.models import BaseModel\n\n\nclass Person(BaseModel):\n    name: str\n    age: int\n\n\n# Using the lazy method to create an instance without immediate validation\nperson = Person.lazy()\n\n# Setting attributes\nperson.name = \"John Doe\"\nperson.age = 30\n\n# Now we validate the instance\nperson.validate_output()\n\nprint(person)\n

In this example, the Person instance is created without immediate validation. The attributes name and age are set afterward. The validate_output method is then called to validate the instance.

Koheesio specific configuration:

Koheesio models are configured differently from Pydantic defaults. The following configuration is used:

  1. extra=\"allow\"

    This setting allows for extra fields that are not specified in the model definition. If a field is present in the data but not in the model, it will not raise an error. Pydantic default is \"ignore\", which means that extra attributes are ignored.

  2. arbitrary_types_allowed=True

    This setting allows for fields in the model to be of any type. This is useful when you want to include fields in your model that are not standard Python types. Pydantic default is False, which means that fields must be of a standard Python type.

  3. populate_by_name=True

    This setting allows an aliased field to be populated by its name as given by the model attribute, as well as the alias. This was known as allow_population_by_field_name in pydantic v1. Pydantic default is False, which means that fields can only be populated by their alias.

  4. validate_assignment=False

    This setting determines whether the model should be revalidated when the data is changed. If set to True, every time a field is assigned a new value, the entire model is validated again.

    Pydantic default is (also) False, which means that the model is not revalidated when the data is changed. The default behavior of Pydantic is to validate the data when the model is created. In case the user changes the data after the model is created, the model is not revalidated.

  5. revalidate_instances=\"subclass-instances\"

    This setting determines whether to revalidate models during validation if the instance is a subclass of the model. This is important as inheritance is used a lot in Koheesio. Pydantic default is never, which means that the model and dataclass instances are not revalidated during validation.

  6. validate_default=True

    This setting determines whether to validate default values during validation. When set to True, default values are checked during the validation process. We opt to set this to True, as we are attempting to make sure that the data is valid prior to running / executing any Step. Pydantic default is False, which means that default values are not validated during validation.

  7. frozen=False

    This setting determines whether the model is immutable. If set to True, once a model is created, its fields cannot be changed. Pydantic default is also False, which means that the model is mutable.

  8. coerce_numbers_to_str=True

    This setting determines whether to convert number fields to strings. When set to True, enables automatic coercion of any Number type to str. Pydantic doesn't allow number types (int, float, Decimal) to be coerced as type str by default.

  9. use_enum_values=True

    This setting determines whether to use the values of Enum fields. If set to True, the actual value of the Enum is used instead of the reference. Pydantic default is False, which means that the reference to the Enum is used.
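
To make a couple of these settings concrete, here is a short illustrative sketch (the field names are made up) showing extra="allow" and coerce_numbers_to_str=True in action:

from koheesio.models import BaseModel\n\n\nclass Person(BaseModel):\n    name: str\n    age: int\n\n\n# extra=\"allow\": the undeclared field `nickname` is kept instead of raising an error\n# coerce_numbers_to_str=True: the int 42 is accepted for the str field `name`\nperson = Person(name=42, age=30, nickname=\"JD\")\nprint(person.name)  # '42'\nprint(person.nickname)  # 'JD'\n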

"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--fields","title":"Fields","text":"

Every Koheesio BaseModel has two fields: name and description. These fields are used to provide a name and a description to the model.

  • name: This is the name of the Model. If not provided, it defaults to the class name.

  • description: This is the description of the Model. It has several default behaviors:

    • If not provided, it defaults to the docstring of the class.
    • If the docstring is not provided, it defaults to the name of the class.
    • For multi-line descriptions, it has the following behaviors:
      • Only the first non-empty line is used.
      • Empty lines are removed.
      • Only the first 3 lines are considered.
      • Only the first 120 characters are considered.
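
A brief sketch of these defaults (illustrative, with a made-up class):

from koheesio.models import BaseModel\n\n\nclass ReadCustomerData(BaseModel):\n    \"\"\"Reads customer data from somewhere.\"\"\"\n\n\nstep = ReadCustomerData()\nprint(step.name)  # 'ReadCustomerData' -> defaults to the class name\nprint(step.description)  # 'Reads customer data from somewhere.' -> defaults to the docstring\n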
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--validators","title":"Validators","text":"
  • _set_name_and_description: Set the name and description of the Model as per the rules mentioned above.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--properties","title":"Properties","text":"
  • log: Returns a logger with the name of the class.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--class-methods","title":"Class Methods","text":"
  • from_basemodel: Returns a new BaseModel instance based on the data of another BaseModel.
  • from_context: Creates BaseModel instance from a given Context.
  • from_dict: Creates BaseModel instance from a given dictionary.
  • from_json: Creates BaseModel instance from a given JSON string.
  • from_toml: Creates BaseModel object from a given toml file.
  • from_yaml: Creates BaseModel object from a given yaml file.
  • lazy: Constructs the model without doing validation.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--dunder-methods","title":"Dunder Methods","text":"
  • __add__: Allows to add two BaseModel instances together.
  • __enter__: Allows for using the model in a with-statement.
  • __exit__: Allows for using the model in a with-statement.
  • __setitem__: Set Item dunder method for BaseModel.
  • __getitem__: Get Item dunder method for BaseModel.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--instance-methods","title":"Instance Methods","text":"
  • hasattr: Check if given key is present in the model.
  • get: Get an attribute of the model, but don't fail if not present.
  • merge: Merge key,value map with self.
  • set: Allows for subscribing / assigning to class[key].
  • to_context: Converts the BaseModel instance to a Context object.
  • to_dict: Converts the BaseModel instance to a dictionary.
  • to_json: Converts the BaseModel instance to a JSON string.
  • to_yaml: Converts the BaseModel instance to a YAML string.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.description","title":"description class-attribute instance-attribute","text":"
description: Optional[str] = Field(default=None, description='Description of the Model')\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.log","title":"log property","text":"
log: Logger\n

Returns a logger with the name of the class

"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.model_config","title":"model_config class-attribute instance-attribute","text":"
model_config = ConfigDict(extra='allow', arbitrary_types_allowed=True, populate_by_name=True, validate_assignment=False, revalidate_instances='subclass-instances', validate_default=True, frozen=False, coerce_numbers_to_str=True, use_enum_values=True)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.name","title":"name class-attribute instance-attribute","text":"
name: Optional[str] = Field(default=None, description='Name of the Model')\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_basemodel","title":"from_basemodel classmethod","text":"
from_basemodel(basemodel: BaseModel, **kwargs)\n

Returns a new BaseModel instance based on the data of another BaseModel

Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_basemodel(cls, basemodel: BaseModel, **kwargs):\n    \"\"\"Returns a new BaseModel instance based on the data of another BaseModel\"\"\"\n    kwargs = {**basemodel.model_dump(), **kwargs}\n    return cls(**kwargs)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_context","title":"from_context classmethod","text":"
from_context(context: Context) -> BaseModel\n

Creates BaseModel instance from a given Context

You have to make sure that the Context object has the necessary attributes to create the model.

Examples:

class SomeStep(BaseModel):\n    foo: str\n\n\ncontext = Context(foo=\"bar\")\nsome_step = SomeStep.from_context(context)\nprint(some_step.foo)  # prints 'bar'\n

Parameters:

Name Type Description Default context Context required

Returns:

Type Description BaseModel Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_context(cls, context: Context) -> BaseModel:\n    \"\"\"Creates BaseModel instance from a given Context\n\n    You have to make sure that the Context object has the necessary attributes to create the model.\n\n    Examples\n    --------\n    ```python\n    class SomeStep(BaseModel):\n        foo: str\n\n\n    context = Context(foo=\"bar\")\n    some_step = SomeStep.from_context(context)\n    print(some_step.foo)  # prints 'bar'\n    ```\n\n    Parameters\n    ----------\n    context: Context\n\n    Returns\n    -------\n    BaseModel\n    \"\"\"\n    return cls(**context)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_dict","title":"from_dict classmethod","text":"
from_dict(data: Dict[str, Any]) -> BaseModel\n

Creates BaseModel instance from a given dictionary

Parameters:

Name Type Description Default data Dict[str, Any] required

Returns:

Type Description BaseModel Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_dict(cls, data: Dict[str, Any]) -> BaseModel:\n    \"\"\"Creates BaseModel instance from a given dictionary\n\n    Parameters\n    ----------\n    data: Dict[str, Any]\n\n    Returns\n    -------\n    BaseModel\n    \"\"\"\n    return cls(**data)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_json","title":"from_json classmethod","text":"
from_json(json_file_or_str: Union[str, Path]) -> BaseModel\n

Creates BaseModel instance from a given JSON string

BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored in the BaseModel object, which is not possible with the standard json library.

See Also

Context.from_json : Deserializes a JSON string to a Context object

Parameters:

Name Type Description Default json_file_or_str Union[str, Path]

Pathlike string or Path that points to the json file or string containing json

required

Returns:

Type Description BaseModel Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_json(cls, json_file_or_str: Union[str, Path]) -> BaseModel:\n    \"\"\"Creates BaseModel instance from a given JSON string\n\n    BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses\n    jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored\n    in the BaseModel object, which is not possible with the standard json library.\n\n    See Also\n    --------\n    Context.from_json : Deserializes a JSON string to a Context object\n\n    Parameters\n    ----------\n    json_file_or_str : Union[str, Path]\n        Pathlike string or Path that points to the json file or string containing json\n\n    Returns\n    -------\n    BaseModel\n    \"\"\"\n    _context = Context.from_json(json_file_or_str)\n    return cls.from_context(_context)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_toml","title":"from_toml classmethod","text":"
from_toml(toml_file_or_str: Union[str, Path]) -> BaseModel\n

Creates BaseModel object from a given toml file

Note: BaseModel offloads the serialization and deserialization of the TOML string to Context class.

Parameters:

Name Type Description Default toml_file_or_str Union[str, Path]

Pathlike string or Path that points to the toml file, or string containing toml

required

Returns:

Type Description BaseModel Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_toml(cls, toml_file_or_str: Union[str, Path]) -> BaseModel:\n    \"\"\"Creates BaseModel object from a given toml file\n\n    Note: BaseModel offloads the serialization and deserialization of the TOML string to Context class.\n\n    Parameters\n    ----------\n    toml_file_or_str: str or Path\n        Pathlike string or Path that points to the toml file, or string containing toml\n\n    Returns\n    -------\n    BaseModel\n    \"\"\"\n    _context = Context.from_toml(toml_file_or_str)\n    return cls.from_context(_context)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_yaml","title":"from_yaml classmethod","text":"
from_yaml(yaml_file_or_str: str) -> BaseModel\n

Creates BaseModel object from a given yaml file

Note: BaseModel offloads the serialization and deserialization of the YAML string to Context class.

Parameters:

Name Type Description Default yaml_file_or_str str

Pathlike string or Path that points to the yaml file, or string containing yaml

required

Returns:

Type Description BaseModel Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_yaml(cls, yaml_file_or_str: str) -> BaseModel:\n    \"\"\"Creates BaseModel object from a given yaml file\n\n    Note: BaseModel offloads the serialization and deserialization of the YAML string to Context class.\n\n    Parameters\n    ----------\n    yaml_file_or_str: str or Path\n        Pathlike string or Path that points to the yaml file, or string containing yaml\n\n    Returns\n    -------\n    BaseModel\n    \"\"\"\n    _context = Context.from_yaml(yaml_file_or_str)\n    return cls.from_context(_context)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.get","title":"get","text":"
get(key: str, default: Optional[Any] = None)\n

Get an attribute of the model, but don't fail if not present

Similar to dict.get()

Examples:

step_output = StepOutput(foo=\"bar\")\nstep_output.get(\"foo\")  # returns 'bar'\nstep_output.get(\"non_existent_key\", \"oops\")  # returns 'oops'\n

Parameters:

Name Type Description Default key str

name of the key to get

required default Optional[Any]

Default value in case the attribute does not exist

None

Returns:

Type Description Any

The value of the attribute

Source code in src/koheesio/models/__init__.py
def get(self, key: str, default: Optional[Any] = None):\n    \"\"\"Get an attribute of the model, but don't fail if not present\n\n    Similar to dict.get()\n\n    Examples\n    --------\n    ```python\n    step_output = StepOutput(foo=\"bar\")\n    step_output.get(\"foo\")  # returns 'bar'\n    step_output.get(\"non_existent_key\", \"oops\")  # returns 'oops'\n    ```\n\n    Parameters\n    ----------\n    key: str\n        name of the key to get\n    default: Optional[Any]\n        Default value in case the attribute does not exist\n\n    Returns\n    -------\n    Any\n        The value of the attribute\n    \"\"\"\n    if self.hasattr(key):\n        return self.__getitem__(key)\n    return default\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.hasattr","title":"hasattr","text":"
hasattr(key: str) -> bool\n

Check if given key is present in the model

Parameters:

Name Type Description Default key str required

Returns:

Type Description bool Source code in src/koheesio/models/__init__.py
def hasattr(self, key: str) -> bool:\n    \"\"\"Check if given key is present in the model\n\n    Parameters\n    ----------\n    key: str\n\n    Returns\n    -------\n    bool\n    \"\"\"\n    return hasattr(self, key)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.lazy","title":"lazy classmethod","text":"
lazy()\n

Constructs the model without doing validation

Essentially an alias to BaseModel.construct()

Source code in src/koheesio/models/__init__.py
@classmethod\ndef lazy(cls):\n    \"\"\"Constructs the model without doing validation\n\n    Essentially an alias to BaseModel.construct()\n    \"\"\"\n    return cls.model_construct()\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.merge","title":"merge","text":"
merge(other: Union[Dict, BaseModel])\n

Merge key,value map with self

Functionally similar to adding two dicts together; like running {**dict_a, **dict_b}.

Examples:

step_output = StepOutput(foo=\"bar\")\nstep_output.merge({\"lorem\": \"ipsum\"})  # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n

Parameters:

Name Type Description Default other Union[Dict, BaseModel]

Dict or another instance of a BaseModel class that will be added to self

required Source code in src/koheesio/models/__init__.py
def merge(self, other: Union[Dict, BaseModel]):\n    \"\"\"Merge key,value map with self\n\n    Functionally similar to adding two dicts together; like running `{**dict_a, **dict_b}`.\n\n    Examples\n    --------\n    ```python\n    step_output = StepOutput(foo=\"bar\")\n    step_output.merge({\"lorem\": \"ipsum\"})  # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n    ```\n\n    Parameters\n    ----------\n    other: Union[Dict, BaseModel]\n        Dict or another instance of a BaseModel class that will be added to self\n    \"\"\"\n    if isinstance(other, BaseModel):\n        other = other.model_dump()  # ensures we really have a dict\n\n    for k, v in other.items():\n        self.set(k, v)\n\n    return self\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.set","title":"set","text":"
set(key: str, value: Any)\n

Allows for subscribing / assigning to class[key].

Examples:

step_output = StepOutput(foo=\"bar\")\nstep_output.set(\"foo\", \"baz\")  # overwrites 'foo' to be 'baz'\n

Parameters:

Name Type Description Default key str

The key of the attribute to assign to

required value Any

Value that should be assigned to the given key

required Source code in src/koheesio/models/__init__.py
def set(self, key: str, value: Any):\n    \"\"\"Allows for subscribing / assigning to `class[key]`.\n\n    Examples\n    --------\n    ```python\n    step_output = StepOutput(foo=\"bar\")\n    step_output.set(\"foo\", \"baz\")  # overwrites 'foo' to be 'baz'\n    ```\n\n    Parameters\n    ----------\n    key: str\n        The key of the attribute to assign to\n    value: Any\n        Value that should be assigned to the given key\n    \"\"\"\n    self.__setitem__(key, value)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.to_context","title":"to_context","text":"
to_context() -> Context\n

Converts the BaseModel instance to a Context object

Returns:

Type Description Context Source code in src/koheesio/models/__init__.py
def to_context(self) -> Context:\n    \"\"\"Converts the BaseModel instance to a Context object\n\n    Returns\n    -------\n    Context\n    \"\"\"\n    return Context(**self.to_dict())\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.to_dict","title":"to_dict","text":"
to_dict() -> Dict[str, Any]\n

Converts the BaseModel instance to a dictionary

Returns:

Type Description Dict[str, Any] Source code in src/koheesio/models/__init__.py
def to_dict(self) -> Dict[str, Any]:\n    \"\"\"Converts the BaseModel instance to a dictionary\n\n    Returns\n    -------\n    Dict[str, Any]\n    \"\"\"\n    return self.model_dump()\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.to_json","title":"to_json","text":"
to_json(pretty: bool = False)\n

Converts the BaseModel instance to a JSON string

BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored in the BaseModel object, which is not possible with the standard json library.

See Also

Context.to_json : Serializes a Context object to a JSON string

Parameters:

Name Type Description Default pretty bool

Toggles whether to return a pretty json string or not

False

Returns:

Type Description str

containing all parameters of the BaseModel instance

Source code in src/koheesio/models/__init__.py
def to_json(self, pretty: bool = False):\n    \"\"\"Converts the BaseModel instance to a JSON string\n\n    BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses\n    jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored\n    in the BaseModel object, which is not possible with the standard json library.\n\n    See Also\n    --------\n    Context.to_json : Serializes a Context object to a JSON string\n\n    Parameters\n    ----------\n    pretty : bool, optional, default=False\n        Toggles whether to return a pretty json string or not\n\n    Returns\n    -------\n    str\n        containing all parameters of the BaseModel instance\n    \"\"\"\n    _context = self.to_context()\n    return _context.to_json(pretty=pretty)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.to_yaml","title":"to_yaml","text":"
to_yaml(clean: bool = False) -> str\n

Converts the BaseModel instance to a YAML string

BaseModel offloads the serialization and deserialization of the YAML string to Context class.

Parameters:

Name Type Description Default clean bool

Toggles whether to remove !!python/object:... from yaml or not. Default: False

False

Returns:

Type Description str

containing all parameters of the BaseModel instance

Source code in src/koheesio/models/__init__.py
def to_yaml(self, clean: bool = False) -> str:\n    \"\"\"Converts the BaseModel instance to a YAML string\n\n    BaseModel offloads the serialization and deserialization of the YAML string to Context class.\n\n    Parameters\n    ----------\n    clean: bool\n        Toggles whether to remove `!!python/object:...` from yaml or not.\n        Default: False\n\n    Returns\n    -------\n    str\n        containing all parameters of the BaseModel instance\n    \"\"\"\n    _context = self.to_context()\n    return _context.to_yaml(clean=clean)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.validate","title":"validate","text":"
validate() -> BaseModel\n

Validate the BaseModel instance

This method is used to validate the BaseModel instance. It is used in conjunction with the lazy method to validate the instance after all the attributes have been set.

This method is intended to be used with the lazy method. The lazy method is used to create an instance of the BaseModel without immediate validation. The validate method is then used to validate the instance after.

Note: in the Pydantic BaseModel, the validate method throws a deprecated warning. This is because Pydantic recommends using the validate_model method instead. However, we are using the validate method here in a different context and a slightly different way.

Examples:

class FooModel(BaseModel):\n    foo: str\n    lorem: str\n\n\nfoo_model = FooModel.lazy()\nfoo_model.foo = \"bar\"\nfoo_model.lorem = \"ipsum\"\nfoo_model.validate()\n
In this example, the foo_model instance is created without immediate validation. The attributes foo and lorem are set afterward. The validate method is then called to validate the instance.

Returns:

Type Description BaseModel

The BaseModel instance

Source code in src/koheesio/models/__init__.py
def validate(self) -> BaseModel:\n    \"\"\"Validate the BaseModel instance\n\n    This method is used to validate the BaseModel instance. It is used in conjunction with the lazy method to\n    validate the instance after all the attributes have been set.\n\n    This method is intended to be used with the `lazy` method. The `lazy` method is used to create an instance of\n    the BaseModel without immediate validation. The `validate` method is then used to validate the instance after.\n\n    > Note: in the Pydantic BaseModel, the `validate` method throws a deprecated warning. This is because Pydantic\n    recommends using the `validate_model` method instead. However, we are using the `validate` method here in a\n    different context and a slightly different way.\n\n    Examples\n    --------\n    ```python\n    class FooModel(BaseModel):\n        foo: str\n        lorem: str\n\n\n    foo_model = FooModel.lazy()\n    foo_model.foo = \"bar\"\n    foo_model.lorem = \"ipsum\"\n    foo_model.validate()\n    ```\n    In this example, the `foo_model` instance is created without immediate validation. The attributes foo and lorem\n    are set afterward. The `validate` method is then called to validate the instance.\n\n    Returns\n    -------\n    BaseModel\n        The BaseModel instance\n    \"\"\"\n    return self.model_validate(self.model_dump())\n
"},{"location":"api_reference/models/index.html#koheesio.models.ExtraParamsMixin","title":"koheesio.models.ExtraParamsMixin","text":"

Mixin class that adds support for arbitrary keyword arguments to Pydantic models.

The keyword arguments are extracted from the model's values and moved to a params dictionary.
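
A minimal sketch of how this can be used (the model and its keyword arguments are hypothetical; ExtraParamsMixin is combined with the Koheesio BaseModel):

from koheesio.models import BaseModel, ExtraParamsMixin\n\n\nclass HttpOptions(ExtraParamsMixin, BaseModel):\n    url: str\n\n\nopts = HttpOptions(url=\"https://example.com\", timeout=30, verify=False)\n\n# keyword arguments that are not declared fields end up in extra_params\nprint(opts.extra_params)  # {'timeout': 30, 'verify': False}\n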

"},{"location":"api_reference/models/index.html#koheesio.models.ExtraParamsMixin.extra_params","title":"extra_params cached property","text":"
extra_params: Dict[str, Any]\n

Extract params (passed as arbitrary kwargs) from values and move them to params dict

"},{"location":"api_reference/models/index.html#koheesio.models.ExtraParamsMixin.params","title":"params class-attribute instance-attribute","text":"
params: Dict[str, Any] = Field(default_factory=dict)\n
"},{"location":"api_reference/models/sql.html","title":"Sql","text":"

This module contains the base class for SQL steps.

"},{"location":"api_reference/models/sql.html#koheesio.models.sql.SqlBaseStep","title":"koheesio.models.sql.SqlBaseStep","text":"

Base class for SQL steps

params are used as placeholders for templating. These are identified with ${placeholder} in the SQL script.

Parameters:

Name Type Description Default sql_path

Path to a SQL file

required sql

SQL script to apply

required params

Placeholders (parameters) for templating. These are identified with ${placeholder} in the SQL script.

Note: any arbitrary kwargs passed to the class will be added to params.

required"},{"location":"api_reference/models/sql.html#koheesio.models.sql.SqlBaseStep.params","title":"params class-attribute instance-attribute","text":"
params: Dict[str, Any] = Field(default_factory=dict, description='Placeholders (parameters) for templating. These are identified with ${placeholder} in the SQL script. Note: any arbitrary kwargs passed to the class will be added to params.')\n
"},{"location":"api_reference/models/sql.html#koheesio.models.sql.SqlBaseStep.query","title":"query property","text":"
query\n

Returns the query while performing params replacement
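
As an illustration of the templating behaviour (a sketch only: MySqlStep stands in for any concrete SqlBaseStep subclass, and the table and date values are made up):

# `MySqlStep` is a hypothetical concrete subclass of SqlBaseStep\nstep = MySqlStep(\n    sql=\"SELECT * FROM ${table_name} WHERE load_date = '${load_date}'\",\n    # arbitrary kwargs are added to `params` and used for ${placeholder} substitution\n    table_name=\"my_schema.my_table\",\n    load_date=\"2024-01-01\",\n)\n\nprint(step.query)\n# SELECT * FROM my_schema.my_table WHERE load_date = '2024-01-01'\n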

"},{"location":"api_reference/models/sql.html#koheesio.models.sql.SqlBaseStep.sql","title":"sql class-attribute instance-attribute","text":"
sql: Optional[str] = Field(default=None, description='SQL script to apply')\n
"},{"location":"api_reference/models/sql.html#koheesio.models.sql.SqlBaseStep.sql_path","title":"sql_path class-attribute instance-attribute","text":"
sql_path: Optional[Union[Path, str]] = Field(default=None, description='Path to a SQL file')\n
"},{"location":"api_reference/notifications/index.html","title":"Notifications","text":"

Notification module for sending messages to notification services (e.g. Slack, Email, etc.)

"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity","title":"koheesio.notifications.NotificationSeverity","text":"

Enumeration of allowed message severities

"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity.ERROR","title":"ERROR class-attribute instance-attribute","text":"
ERROR = 'error'\n
"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity.INFO","title":"INFO class-attribute instance-attribute","text":"
INFO = 'info'\n
"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity.SUCCESS","title":"SUCCESS class-attribute instance-attribute","text":"
SUCCESS = 'success'\n
"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity.WARN","title":"WARN class-attribute instance-attribute","text":"
WARN = 'warn'\n
"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity.alert_icon","title":"alert_icon property","text":"
alert_icon: str\n

Return a colored circle in slack markup

"},{"location":"api_reference/notifications/slack.html","title":"Slack","text":"

Classes to ease interaction with Slack

"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification","title":"koheesio.notifications.slack.SlackNotification","text":"

Generic Slack notification class via the Blocks API

NOTE: the channel parameter is used only with the Slack Web API (https://api.slack.com/messaging/sending). If a webhook is used, the channel specification is not required.

Example:

s = SlackNotification(\n    url=\"slack-webhook-url\",\n    channel=\"channel\",\n    message=\"Some *markdown* compatible text\",\n)\ns.execute()\n

"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification.channel","title":"channel class-attribute instance-attribute","text":"
channel: Optional[str] = Field(default=None, description='Slack channel id')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification.headers","title":"headers class-attribute instance-attribute","text":"
headers: Optional[Dict[str, Any]] = {'Content-type': 'application/json'}\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification.message","title":"message class-attribute instance-attribute","text":"
message: str = Field(default=..., description='The message that gets posted to Slack')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification.execute","title":"execute","text":"
execute()\n

Generate payload and send post request

Source code in src/koheesio/notifications/slack.py
def execute(self):\n    \"\"\"\n    Generate payload and send post request\n    \"\"\"\n    self.data = self.get_payload()\n    HttpPostStep.execute(self)\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification.get_payload","title":"get_payload","text":"
get_payload()\n

Generate payload with Block Kit. More details: https://api.slack.com/block-kit

Source code in src/koheesio/notifications/slack.py
def get_payload(self):\n    \"\"\"\n    Generate payload with `Block Kit`.\n    More details: https://api.slack.com/block-kit\n    \"\"\"\n    payload = {\n        \"attachments\": [\n            {\n                \"blocks\": [\n                    {\n                        \"type\": \"section\",\n                        \"text\": {\n                            \"type\": \"mrkdwn\",\n                            \"text\": self.message,\n                        },\n                    }\n                ],\n            }\n        ]\n    }\n\n    if self.channel:\n        payload[\"channel\"] = self.channel\n\n    return json.dumps(payload)\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity","title":"koheesio.notifications.slack.SlackNotificationWithSeverity","text":"

Slack notification class via the Blocks API with extra severity information and predefined extra fields

Example:

from koheesio.steps.integrations.notifications import NotificationSeverity\n\ns = SlackNotificationWithSeverity(\n    url=\"slack-webhook-url\",\n    channel=\"channel\",\n    message=\"Some *markdown* compatible text\",\n    severity=NotificationSeverity.ERROR,\n    title=\"Title\",\n    environment=\"dev\",\n    application=\"Application\",\n)\ns.execute()\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.application","title":"application class-attribute instance-attribute","text":"
application: str = Field(default=..., description='Pipeline or application name')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.environment","title":"environment class-attribute instance-attribute","text":"
environment: str = Field(default=..., description='Environment description, e.g. dev / qa /prod')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.model_config","title":"model_config class-attribute instance-attribute","text":"
model_config = ConfigDict(use_enum_values=False)\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.severity","title":"severity class-attribute instance-attribute","text":"
severity: NotificationSeverity = Field(default=..., description='Severity of the message')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.timestamp","title":"timestamp class-attribute instance-attribute","text":"
timestamp: datetime = Field(default=utcnow(), alias='execution_timestamp', description='Pipeline or application execution timestamp')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.title","title":"title class-attribute instance-attribute","text":"
title: str = Field(default=..., description='Title of your message')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.execute","title":"execute","text":"
execute()\n

Generate payload and send post request

Source code in src/koheesio/notifications/slack.py
def execute(self):\n    \"\"\"\n    Generate payload and send post request\n    \"\"\"\n    self.message = self.get_payload_message()\n    self.data = self.get_payload()\n    HttpPostStep.execute(self)\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.get_payload_message","title":"get_payload_message","text":"
get_payload_message()\n

Generate payload message based on the predefined set of parameters

Source code in src/koheesio/notifications/slack.py
def get_payload_message(self):\n    \"\"\"\n    Generate payload message based on the predefined set of parameters\n    \"\"\"\n    return dedent(\n        f\"\"\"\n            {self.severity.alert_icon}   *{self.severity.name}:*  {self.title}\n            *Environment:* {self.environment}\n            *Application:* {self.application}\n            *Message:* {self.message}\n            *Timestamp:* {self.timestamp}\n        \"\"\"\n    )\n
"},{"location":"api_reference/secrets/index.html","title":"Secrets","text":"

Module for secret integrations.

Contains the abstract class for various secret integrations, also known as SecretContext.

"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret","title":"koheesio.secrets.Secret","text":"

Abstract class for various secret integrations. All secrets are wrapped into the Context class for easy access. Either an existing context can be provided, or a new context will be created and returned at runtime.

Secrets are wrapped into the pydantic.SecretStr.
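For illustration only, a minimal sketch of a hypothetical concrete integration; it assumes a subclass only has to return a plain dict from _get_secrets() (the hook called by execute() below), after which the framework wraps the values and exposes them via get():
import os\n\nclass EnvSecret(Secret):\n    # hypothetical integration reading one secret from an environment variable\n    def _get_secrets(self) -> dict:\n        return {\"api_token\": os.environ.get(\"API_TOKEN\", \"dummy\")}\n\ncontext = EnvSecret(parent=\"my_app\").get()\ncontext.secrets.my_app.api_token.get_secret_value()\n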

"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.context","title":"context class-attribute instance-attribute","text":"
context: Optional[Context] = Field(Context({}), description='Existing `Context` instance can be used for secrets, otherwise new empty context will be created.')\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.parent","title":"parent class-attribute instance-attribute","text":"
parent: Optional[str] = Field(default=..., description='Group secrets from one secure path under this friendly name', pattern='^[a-zA-Z0-9_]+$')\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.root","title":"root class-attribute instance-attribute","text":"
root: Optional[str] = Field(default='secrets', description='All secrets will be grouped under this root.')\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.Output","title":"Output","text":"

Output class for Secret.

"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.Output.context","title":"context class-attribute instance-attribute","text":"
context: Context = Field(default=..., description='Koheesio context')\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.encode_secret_values","title":"encode_secret_values classmethod","text":"
encode_secret_values(data: dict)\n

Encode secret values in the dictionary.

Ensures that all values in the dictionary are wrapped in SecretStr.

Source code in src/koheesio/secrets/__init__.py
@classmethod\ndef encode_secret_values(cls, data: dict):\n    \"\"\"Encode secret values in the dictionary.\n\n    Ensures that all values in the dictionary are wrapped in SecretStr.\n    \"\"\"\n    encoded_dict = {}\n    for key, value in data.items():\n        if isinstance(value, dict):\n            encoded_dict[key] = cls.encode_secret_values(value)\n        else:\n            encoded_dict[key] = SecretStr(value)\n    return encoded_dict\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.execute","title":"execute","text":"
execute()\n

Main method to handle secrets protection and context creation with \"root-parent-secrets\" structure.

Source code in src/koheesio/secrets/__init__.py
def execute(self):\n    \"\"\"\n    Main method to handle secrets protection and context creation with \"root-parent-secrets\" structure.\n    \"\"\"\n    context = Context(self.encode_secret_values(data={self.root: {self.parent: self._get_secrets()}}))\n    self.output.context = self.context.merge(context=context)\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.get","title":"get","text":"
get() -> Context\n

Convenience method to return context with secrets.

Source code in src/koheesio/secrets/__init__.py
def get(self) -> Context:\n    \"\"\"\n    Convenience method to return context with secrets.\n    \"\"\"\n    self.execute()\n    return self.output.context\n
"},{"location":"api_reference/secrets/cerberus.html","title":"Cerberus","text":"

Module for retrieving secrets from Cerberus.

Secrets are stored as SecretContext and can be accessed accordingly.

See CerberusSecret for more information.

"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret","title":"koheesio.secrets.cerberus.CerberusSecret","text":"

Retrieve secrets from Cerberus and wrap them into the Context class for easy access. All secrets are stored under the \"secrets\" root and a \"parent\". The parent is either derived from the secure data path by replacing \"/\" and \"-\", or provided manually by the user. Secrets are wrapped into pydantic.SecretStr.

Example:

context = {\n    \"secrets\": {\n        \"parent\": {\n            \"webhook\": SecretStr(\"**********\"),\n            \"description\": SecretStr(\"**********\"),\n        }\n    }\n}\n

Values can be decoded like this:

context.secrets.parent.webhook.get_secret_value()\n
or, if working with a dictionary is preferable:
for key, value in context.get_all().items():\n    value.get_secret_value()\n

"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret.aws_session","title":"aws_session class-attribute instance-attribute","text":"
aws_session: Optional[Session] = Field(default=None, description='AWS Session to pass to Cerberus client, can be used for local execution.')\n
"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret.path","title":"path class-attribute instance-attribute","text":"
path: str = Field(default=..., description=\"Secure data path, eg. 'app/my-sdb/my-secrets'\")\n
"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret.token","title":"token class-attribute instance-attribute","text":"
token: Optional[SecretStr] = Field(default=get('CERBERUS_TOKEN', None), description='Cerberus token, can be used for local development without AWS auth mechanism.Note: Token has priority over AWS session.')\n
"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret.url","title":"url class-attribute instance-attribute","text":"
url: str = Field(default=..., description='Cerberus URL, eg. https://cerberus.domain.com')\n
"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret.verbose","title":"verbose class-attribute instance-attribute","text":"
verbose: bool = Field(default=False, description='Enable verbose for Cerberus client')\n
"},{"location":"api_reference/spark/index.html","title":"Spark","text":"

Spark step module

"},{"location":"api_reference/spark/index.html#koheesio.spark.AnalysisException","title":"koheesio.spark.AnalysisException module-attribute","text":"
AnalysisException = AnalysisException\n
"},{"location":"api_reference/spark/index.html#koheesio.spark.DataFrame","title":"koheesio.spark.DataFrame module-attribute","text":"
DataFrame = DataFrame\n
"},{"location":"api_reference/spark/index.html#koheesio.spark.SparkSession","title":"koheesio.spark.SparkSession module-attribute","text":"
SparkSession = SparkSession\n
"},{"location":"api_reference/spark/index.html#koheesio.spark.SparkStep","title":"koheesio.spark.SparkStep","text":"

Base class for a Spark step

Extends the Step class with SparkSession support. Specifically: Spark steps are expected to return a Spark DataFrame as output, and the spark property is available to access the active SparkSession instance.
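A minimal sketch of a custom step (hypothetical class and field name) illustrating both points: the result goes into output.df and the active session is reached through self.spark:
from koheesio.spark import SparkStep\n\nclass RangeStep(SparkStep):\n    # hypothetical step that produces a small DataFrame\n    n: int = 3\n\n    def execute(self):\n        self.output.df = self.spark.range(self.n)\n\ndf = RangeStep(n=5).execute().df\n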

"},{"location":"api_reference/spark/index.html#koheesio.spark.SparkStep.spark","title":"spark property","text":"
spark: Optional[SparkSession]\n

Get active SparkSession instance

"},{"location":"api_reference/spark/index.html#koheesio.spark.SparkStep.Output","title":"Output","text":"

Output class for SparkStep

"},{"location":"api_reference/spark/index.html#koheesio.spark.SparkStep.Output.df","title":"df class-attribute instance-attribute","text":"
df: Optional[DataFrame] = Field(default=None, description='The Spark DataFrame')\n
"},{"location":"api_reference/spark/index.html#koheesio.spark.current_timestamp_utc","title":"koheesio.spark.current_timestamp_utc","text":"
current_timestamp_utc(spark: SparkSession) -> Column\n

Get the current timestamp in UTC

Source code in src/koheesio/spark/__init__.py
def current_timestamp_utc(spark: SparkSession) -> Column:\n    \"\"\"Get the current timestamp in UTC\"\"\"\n    return F.to_utc_timestamp(F.current_timestamp(), spark.conf.get(\"spark.sql.session.timeZone\"))\n
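A short usage sketch (assumes an active SparkSession named spark and an existing DataFrame df):
from koheesio.spark import current_timestamp_utc\n\n# add a UTC timestamp column to an existing DataFrame\ndf = df.withColumn(\"load_ts_utc\", current_timestamp_utc(spark))\n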
"},{"location":"api_reference/spark/delta.html","title":"Delta","text":"

Module for creating and managing Delta tables.

"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep","title":"koheesio.spark.delta.DeltaTableStep","text":"

Class for creating and managing Delta tables.

DeltaTable aims to provide a simple interface to create and manage Delta tables. It is a wrapper around the Spark SQL API for Delta tables.

Example
from koheesio.steps import DeltaTableStep\n\nDeltaTableStep(\n    table=\"my_table\",\n    database=\"my_database\",\n    catalog=\"my_catalog\",\n    create_if_not_exists=True,\n    default_create_properties={\n        \"delta.randomizeFilePrefixes\": \"true\",\n        \"delta.checkpoint.writeStatsAsStruct\": \"true\",\n        \"delta.minReaderVersion\": \"2\",\n        \"delta.minWriterVersion\": \"5\",\n    },\n)\n

Methods:

Name Description get_persisted_properties

Get persisted properties of table.

add_property

Alter table and set table property.

add_properties

Alter table and add properties.

execute

Nothing to execute on a Table.

max_version_ts_of_last_execution

Max version timestamp of last execution. If no timestamp is found, returns 1900-01-01 00:00:00. Note: will raise an error if column VERSION_TIMESTAMP does not exist.

Properties
  • name -> str Deprecated. Use .table_name instead.
  • table_name -> str Table name.
  • dataframe -> DataFrame Returns a DataFrame to be able to interact with this table.
  • columns -> Optional[List[str]] Returns all column names as a list.
  • has_change_type -> bool Checks if a column named _change_type is present in the table.
  • exists -> bool Check if table exists.

Parameters:

Name Type Description Default table str

Table name.

required database str

Database or Schema name.

None catalog str

Catalog name.

None create_if_not_exists bool

Force table creation if it doesn't exist. Note: Default properties will be applied to the table during CREATION.

False default_create_properties Dict[str, str]

Default table properties to be applied during CREATION if force_creation is True.

{\"delta.randomizeFilePrefixes\": \"true\", \"delta.checkpoint.writeStatsAsStruct\": \"true\", \"delta.minReaderVersion\": \"2\", \"delta.minWriterVersion\": \"5\"}"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.catalog","title":"catalog class-attribute instance-attribute","text":"
catalog: Optional[str] = Field(default=None, description='Catalog name. Note: Can be ignored if using a SparkCatalog that does not support catalog notation (e.g. Hive)')\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.columns","title":"columns property","text":"
columns: Optional[List[str]]\n

Returns all column names as a list.

Example

DeltaTableStep(...).columns\n
Would for example return ['age', 'name'] if the table has columns age and name.

"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.create_if_not_exists","title":"create_if_not_exists class-attribute instance-attribute","text":"
create_if_not_exists: bool = Field(default=False, alias='force_creation', description=\"Force table creation if it doesn't exist.Note: Default properties will be applied to the table during CREATION.\")\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.database","title":"database class-attribute instance-attribute","text":"
database: Optional[str] = Field(default=None, description='Database or Schema name.')\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.dataframe","title":"dataframe property","text":"
dataframe: DataFrame\n

Returns a DataFrame to be able to interact with this table

"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.default_create_properties","title":"default_create_properties class-attribute instance-attribute","text":"
default_create_properties: Dict[str, Union[str, bool, int]] = Field(default={'delta.randomizeFilePrefixes': 'true', 'delta.checkpoint.writeStatsAsStruct': 'true', 'delta.minReaderVersion': '2', 'delta.minWriterVersion': '5'}, description='Default table properties to be applied during CREATION if `create_if_not_exists` True')\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.exists","title":"exists property","text":"
exists: bool\n

Check if table exists

"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.has_change_type","title":"has_change_type property","text":"
has_change_type: bool\n

Checks if a column named _change_type is present in the table

"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.is_cdf_active","title":"is_cdf_active property","text":"
is_cdf_active: bool\n

Check if CDF property is set and activated

Returns:

Type Description bool

delta.enableChangeDataFeed property is set to 'true'

"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.table","title":"table instance-attribute","text":"
table: str\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.table_name","title":"table_name property","text":"
table_name: str\n

Fully qualified table name in the form of catalog.database.table

"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.add_properties","title":"add_properties","text":"
add_properties(properties: Dict[str, Union[str, bool, int]], override: bool = False)\n

Alter table and add properties.

Parameters:

Name Type Description Default properties Dict[str, Union[str, int, bool]]

Properties to be added to table.

required override bool

Enable override of existing value for property in table.

False Source code in src/koheesio/spark/delta.py
def add_properties(self, properties: Dict[str, Union[str, bool, int]], override: bool = False):\n    \"\"\"Alter table and add properties.\n\n    Parameters\n    ----------\n    properties : Dict[str, Union[str, int, bool]]\n        Properties to be added to table.\n    override : bool, optional, default=False\n        Enable override of existing value for property in table.\n\n    \"\"\"\n    for k, v in properties.items():\n        v_str = str(v) if not isinstance(v, bool) else str(v).lower()\n        self.add_property(key=k, value=v_str, override=override)\n
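A usage sketch building on the DeltaTableStep example above (the property and its value are illustrative):
dt = DeltaTableStep(table=\"my_table\", database=\"my_database\")\n\n# booleans are lowercased to \"true\"/\"false\" before being applied\ndt.add_properties({\"delta.enableChangeDataFeed\": True}, override=True)\n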
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.add_property","title":"add_property","text":"
add_property(key: str, value: Union[str, int, bool], override: bool = False)\n

Alter table and set table property.

Parameters:

Name Type Description Default key str

Property key(name).

required value Union[str, int, bool]

Property value.

required override bool

Enable override of existing value for property in table.

False Source code in src/koheesio/spark/delta.py
def add_property(self, key: str, value: Union[str, int, bool], override: bool = False):\n    \"\"\"Alter table and set table property.\n\n    Parameters\n    ----------\n    key: str\n        Property key(name).\n    value: Union[str, int, bool]\n        Property value.\n    override: bool\n        Enable override of existing value for property in table.\n\n    \"\"\"\n    persisted_properties = self.get_persisted_properties()\n    v_str = str(value) if not isinstance(value, bool) else str(value).lower()\n\n    def _alter_table() -> None:\n        property_pair = f\"'{key}'='{v_str}'\"\n\n        try:\n            # noinspection SqlNoDataSourceInspection\n            self.spark.sql(f\"ALTER TABLE {self.table_name} SET TBLPROPERTIES ({property_pair})\")\n            self.log.debug(f\"Table `{self.table_name}` has been altered. Property `{property_pair}` added.\")\n        except Py4JJavaError as e:\n            msg = f\"Property `{key}` can not be applied to table `{self.table_name}`. Exception: {e}\"\n            self.log.warning(msg)\n            warnings.warn(msg)\n\n    if self.exists:\n        if key in persisted_properties and persisted_properties[key] != v_str:\n            if override:\n                self.log.debug(\n                    f\"Property `{key}` presents in `{self.table_name}` and has value `{persisted_properties[key]}`.\"\n                    f\"Override is enabled.The value will be changed to `{v_str}`.\"\n                )\n                _alter_table()\n            else:\n                self.log.debug(\n                    f\"Skipping adding property `{key}`, because it is already set \"\n                    f\"for table `{self.table_name}` to `{v_str}`. To override it, provide override=True\"\n                )\n        else:\n            _alter_table()\n    else:\n        self.default_create_properties[key] = v_str\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.execute","title":"execute","text":"
execute()\n

Nothing to execute on a Table

Source code in src/koheesio/spark/delta.py
def execute(self):\n    \"\"\"Nothing to execute on a Table\"\"\"\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.get_column_type","title":"get_column_type","text":"
get_column_type(column: str) -> Optional[DataType]\n

Get the type of a column in the table.

Parameters:

Name Type Description Default column str

Column name.

required

Returns:

Type Description Optional[DataType]

Column type.

Source code in src/koheesio/spark/delta.py
def get_column_type(self, column: str) -> Optional[DataType]:\n    \"\"\"Get the type of a column in the table.\n\n    Parameters\n    ----------\n    column : str\n        Column name.\n\n    Returns\n    -------\n    Optional[DataType]\n        Column type.\n    \"\"\"\n    return self.dataframe.schema[column].dataType if self.columns and column in self.columns else None\n
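A one-line usage sketch, reusing the column names from the columns example above:
DeltaTableStep(table=\"my_table\").get_column_type(\"age\")  # e.g. LongType(); None if the column does not exist\n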
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.get_persisted_properties","title":"get_persisted_properties","text":"
get_persisted_properties() -> Dict[str, str]\n

Get persisted properties of table.

Returns:

Type Description Dict[str, str]

Persisted properties as a dictionary.

Source code in src/koheesio/spark/delta.py
def get_persisted_properties(self) -> Dict[str, str]:\n    \"\"\"Get persisted properties of table.\n\n    Returns\n    -------\n    Dict[str, str]\n        Persisted properties as a dictionary.\n    \"\"\"\n    persisted_properties = {}\n    raw_options = self.spark.sql(f\"SHOW TBLPROPERTIES {self.table_name}\").collect()\n\n    for ro in raw_options:\n        key, value = ro.asDict().values()\n        persisted_properties[key] = value\n\n    return persisted_properties\n
"},{"location":"api_reference/spark/etl_task.html","title":"Etl task","text":"

ETL Task

Extract -> Transform -> Load

"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask","title":"koheesio.spark.etl_task.EtlTask","text":"

ETL Task

Etl stands for: Extract -> Transform -> Load

This task is a composition of a Reader (extract), a series of Transformations (transform) and a Writer (load). In other words, it reads data from a source, applies a series of transformations, and writes the result to a target.

Parameters:

Name Type Description Default name str

Name of the task

required description str

Description of the task

required source Reader

Source to read from [extract]

required transformations list[Transformation]

Series of transformations [transform]. The order of the transformations is important!

required target Writer

Target to write to [load]

required Example
from koheesio.tasks import EtlTask\n\nfrom koheesio.steps.readers import CsvReader\nfrom koheesio.steps.transformations.repartition import Repartition\nfrom koheesio.steps.writers import DummyWriter\n\netl_task = EtlTask(\n    name=\"My ETL Task\",\n    description=\"This is an example ETL task\",\n    source=CsvReader(path=\"path/to/source.csv\"),\n    transformations=[Repartition(num_partitions=2)],\n    target=DummyWriter(),\n)\n\netl_task.execute()\n

This code will read from a CSV file, repartition the DataFrame to 2 partitions, and write the result to the console.

Extending the EtlTask

The EtlTask is designed to be a simple and flexible way to define ETL processes. It is not designed to be a one-size-fits-all solution, but rather a starting point for building more complex ETL processes. If you need more complex functionality, you can extend the EtlTask class and override the extract, transform and load methods. You can also implement your own execute method to define the entire ETL process from scratch should you need more flexibility.
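For example, a minimal sketch of such an extension (hypothetical subclass, assuming only the transform step needs to change):
from koheesio.spark.etl_task import EtlTask\n\nclass DeduplicatingEtlTask(EtlTask):\n    # hypothetical task that drops duplicate rows after the configured transformations\n    def transform(self, df):\n        df = super().transform(df)\n        return df.dropDuplicates()\n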

Advantages of using the EtlTask
  • It is a simple way to define ETL processes
  • It is easy to understand and extend
  • It is easy to test and debug
  • It is easy to maintain and refactor
  • It is easy to integrate with other tools and libraries
  • It is easy to use in a production environment
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.etl_date","title":"etl_date class-attribute instance-attribute","text":"
etl_date: datetime = Field(default=utcnow(), description=\"Date time when this object was created as iso format. Example: '2023-01-24T09:39:23.632374'\")\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.source","title":"source class-attribute instance-attribute","text":"
source: InstanceOf[Reader] = Field(default=..., description='Source to read from [extract]')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.target","title":"target class-attribute instance-attribute","text":"
target: InstanceOf[Writer] = Field(default=..., description='Target to write to [load]')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.transformations","title":"transformations class-attribute instance-attribute","text":"
transformations: conlist(min_length=0, item_type=InstanceOf[Transformation]) = Field(default_factory=list, description='Series of transformations', alias='transforms')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.Output","title":"Output","text":"

Output class for EtlTask

"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.Output.source_df","title":"source_df class-attribute instance-attribute","text":"
source_df: DataFrame = Field(default=..., description='The Spark DataFrame produced by .extract() method')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.Output.target_df","title":"target_df class-attribute instance-attribute","text":"
target_df: DataFrame = Field(default=..., description='The Spark DataFrame used by .load() method')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.Output.transform_df","title":"transform_df class-attribute instance-attribute","text":"
transform_df: DataFrame = Field(default=..., description='The Spark DataFrame produced by .transform() method')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.execute","title":"execute","text":"
execute()\n

Run the ETL process

Source code in src/koheesio/spark/etl_task.py
def execute(self):\n    \"\"\"Run the ETL process\"\"\"\n    self.log.info(f\"Task started at {self.etl_date}\")\n\n    # extract from source\n    self.output.source_df = self.extract()\n\n    # transform\n    self.output.transform_df = self.transform(self.output.source_df)\n\n    # load to target\n    self.output.target_df = self.load(self.output.transform_df)\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.extract","title":"extract","text":"
extract() -> DataFrame\n

Read from Source

logging is handled by the Reader.execute()-method's @do_execute decorator

Source code in src/koheesio/spark/etl_task.py
def extract(self) -> DataFrame:\n    \"\"\"Read from Source\n\n    logging is handled by the Reader.execute()-method's @do_execute decorator\n    \"\"\"\n    reader: Reader = self.source\n    return reader.read()\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.load","title":"load","text":"
load(df: DataFrame) -> DataFrame\n

Write to Target

logging is handled by the Writer.execute()-method's @do_execute decorator

Source code in src/koheesio/spark/etl_task.py
def load(self, df: DataFrame) -> DataFrame:\n    \"\"\"Write to Target\n\n    logging is handled by the Writer.execute()-method's @do_execute decorator\n    \"\"\"\n    writer: Writer = self.target\n    writer.write(df)\n    return df\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.run","title":"run","text":"
run()\n

alias of execute

Source code in src/koheesio/spark/etl_task.py
def run(self):\n    \"\"\"alias of execute\"\"\"\n    return self.execute()\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.transform","title":"transform","text":"
transform(df: DataFrame) -> DataFrame\n

Transform recursively

logging is handled by the Transformation.execute()-method's @do_execute decorator

Source code in src/koheesio/spark/etl_task.py
def transform(self, df: DataFrame) -> DataFrame:\n    \"\"\"Transform recursively\n\n    logging is handled by the Transformation.execute()-method's @do_execute decorator\n    \"\"\"\n    for t in self.transformations:\n        df = t.transform(df)\n    return df\n
"},{"location":"api_reference/spark/snowflake.html","title":"Snowflake","text":"

Snowflake steps and tasks for Koheesio

Every class in this module is a subclass of Step or Task and is used to perform operations on Snowflake.

Notes

Every Step in this module is based on SnowflakeBaseModel. The following parameters are available for every Step.

Parameters:

Name Type Description Default url str

Hostname for the Snowflake account, e.g. <account>.snowflakecomputing.com. Alias for sfURL. required user str

Login name for the Snowflake user. Alias for sfUser.

required password SecretStr

Password for the Snowflake user. Alias for sfPassword.

required database str

The database to use for the session after connecting. Alias for sfDatabase.

required sfSchema str

The schema to use for the session after connecting. Alias for schema (\"schema\" is a reserved name in Pydantic, so we use sfSchema as the main name instead).

required role str

The default security role to use for the session after connecting. Alias for sfRole.

required warehouse str

The default virtual warehouse to use for the session after connecting. Alias for sfWarehouse.

required authenticator Optional[str]

Authenticator for the Snowflake user. Example: \"okta.com\".

None options Optional[Dict[str, Any]]

Extra options to pass to the Snowflake connector.

{\"sfCompress\": \"on\", \"continue_on_error\": \"off\"} format str

The default snowflake format can be used natively in Databricks; use net.snowflake.spark.snowflake in other environments and make sure to install the required JARs.

\"snowflake\""},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn","title":"koheesio.spark.snowflake.AddColumn","text":"

Add an empty column to a Snowflake table with given name and DataType

Example
AddColumn(\n    database=\"MY_DB\",\n    schema_=\"MY_SCHEMA\",\n    warehouse=\"MY_WH\",\n    user=\"gid.account@nike.com\",\n    password=Secret(\"super-secret-password\"),\n    role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n    table=\"MY_TABLE\",\n    col=\"MY_COL\",\n    dataType=StringType(),\n).execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.column","title":"column class-attribute instance-attribute","text":"
column: str = Field(default=..., description='The name of the new column')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.table","title":"table class-attribute instance-attribute","text":"
table: str = Field(default=..., description='The name of the Snowflake table')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.type","title":"type class-attribute instance-attribute","text":"
type: DataType = Field(default=..., description='The DataType represented as a Spark DataType')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.Output","title":"Output","text":"

Output class for AddColumn

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.Output.query","title":"query class-attribute instance-attribute","text":"
query: str = Field(default=..., description='Query that was executed to add the column')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n    query = f\"ALTER TABLE {self.table} ADD COLUMN {self.column} {map_spark_type(self.type)}\".upper()\n    self.output.query = query\n    RunQuery(**self.get_options(), query=query).execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame","title":"koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame","text":"

Create (or Replace) a Snowflake table which has the same schema as a Spark DataFrame

Can be used as any Transformation. The DataFrame is however left unchanged, and only used for determining the schema of the Snowflake Table that is to be created (or replaced).

Example
CreateOrReplaceTableFromDataFrame(\n    database=\"MY_DB\",\n    schema=\"MY_SCHEMA\",\n    warehouse=\"MY_WH\",\n    user=\"gid.account@nike.com\",\n    password=\"super-secret-password\",\n    role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n    table=\"MY_TABLE\",\n    df=df,\n).execute()\n

Or, as a Transformation:

CreateOrReplaceTableFromDataFrame(\n    ...\n    table=\"MY_TABLE\",\n).transform(df)\n

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.table","title":"table class-attribute instance-attribute","text":"
table: str = Field(default=..., alias='table_name', description='The name of the (new) table')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.Output","title":"Output","text":"

Output class for CreateOrReplaceTableFromDataFrame

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.Output.input_schema","title":"input_schema class-attribute instance-attribute","text":"
input_schema: StructType = Field(default=..., description='The original schema from the input DataFrame')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.Output.query","title":"query class-attribute instance-attribute","text":"
query: str = Field(default=..., description='Query that was executed to create the table')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.Output.snowflake_schema","title":"snowflake_schema class-attribute instance-attribute","text":"
snowflake_schema: str = Field(default=..., description='Derived Snowflake table schema based on the input DataFrame')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n    self.output.df = self.df\n\n    input_schema = self.df.schema\n    self.output.input_schema = input_schema\n\n    snowflake_schema = \", \".join([f\"{c.name} {map_spark_type(c.dataType)}\" for c in input_schema])\n    self.output.snowflake_schema = snowflake_schema\n\n    table_name = f\"{self.database}.{self.sfSchema}.{self.table}\"\n    query = f\"CREATE OR REPLACE TABLE {table_name} ({snowflake_schema})\"\n    self.output.query = query\n\n    RunQuery(**self.get_options(), query=query).execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.DbTableQuery","title":"koheesio.spark.snowflake.DbTableQuery","text":"

Read table from Snowflake using the dbtable option instead of query

Example
DbTableQuery(\n    database=\"MY_DB\",\n    schema_=\"MY_SCHEMA\",\n    warehouse=\"MY_WH\",\n    user=\"user\",\n    password=Secret(\"super-secret-password\"),\n    role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n    table=\"db.schema.table\",\n).execute().df\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.DbTableQuery.dbtable","title":"dbtable class-attribute instance-attribute","text":"
dbtable: str = Field(default=..., alias='table', description='The name of the table')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GetTableSchema","title":"koheesio.spark.snowflake.GetTableSchema","text":"

Get the schema from a Snowflake table as a Spark Schema

Notes
  • This Step will execute a SELECT * FROM <table> LIMIT 1 query to get the schema of the table.
  • The schema will be stored in the table_schema attribute of the output.
  • table_schema is used as the attribute name to avoid conflicts with the schema attribute of Pydantic's BaseModel.
Example
schema = (\n    GetTableSchema(\n        database=\"MY_DB\",\n        schema_=\"MY_SCHEMA\",\n        warehouse=\"MY_WH\",\n        user=\"gid.account@nike.com\",\n        password=\"super-secret-password\",\n        role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n        table=\"MY_TABLE\",\n    )\n    .execute()\n    .table_schema\n)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GetTableSchema.table","title":"table class-attribute instance-attribute","text":"
table: str = Field(default=..., description='The Snowflake table name')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GetTableSchema.Output","title":"Output","text":"

Output class for GetTableSchema

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GetTableSchema.Output.table_schema","title":"table_schema class-attribute instance-attribute","text":"
table_schema: StructType = Field(default=..., serialization_alias='schema', description='The Spark Schema')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GetTableSchema.execute","title":"execute","text":"
execute() -> Output\n
Source code in src/koheesio/spark/snowflake.py
def execute(self) -> Output:\n    query = f\"SELECT * FROM {self.table} LIMIT 1\"  # nosec B608: hardcoded_sql_expressions\n    df = Query(**self.get_options(), query=query).execute().df\n    self.output.table_schema = df.schema\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnFullyQualifiedObject","title":"koheesio.spark.snowflake.GrantPrivilegesOnFullyQualifiedObject","text":"

Grant Snowflake privileges to a set of roles on a fully qualified object, i.e. database.schema.object_name

This class is a subclass of GrantPrivilegesOnObject and is used to grant privileges on a fully qualified object. The advantage of using this class is that it sets the object name to be fully qualified, i.e. database.schema.object_name.

In other words, you can set the database, schema and object separately, and the fully qualified name (database.schema.object_name) will be composed for you.

Example
GrantPrivilegesOnFullyQualifiedObject(\n    database=\"MY_DB\",\n    schema=\"MY_SCHEMA\",\n    warehouse=\"MY_WH\",\n    ...\n    object=\"MY_TABLE\",\n    type=\"TABLE\",\n    ...\n)\n

In this example, the object name will be set to be fully qualified, i.e. MY_DB.MY_SCHEMA.MY_TABLE. If you were to use GrantPrivilegesOnObject instead, you would have to set the object name to be fully qualified yourself.

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnFullyQualifiedObject.set_object_name","title":"set_object_name","text":"
set_object_name()\n

Set the object name to be fully qualified, i.e. database.schema.object_name

Source code in src/koheesio/spark/snowflake.py
@model_validator(mode=\"after\")\ndef set_object_name(self):\n    \"\"\"Set the object name to be fully qualified, i.e. database.schema.object_name\"\"\"\n    # database, schema, obj_name\n    db = self.database\n    schema = self.model_dump()[\"sfSchema\"]  # since \"schema\" is a reserved name\n    obj_name = self.object\n\n    self.object = f\"{db}.{schema}.{obj_name}\"\n\n    return self\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject","title":"koheesio.spark.snowflake.GrantPrivilegesOnObject","text":"

A wrapper on Snowflake GRANT privileges

With this Step, you can grant Snowflake privileges to a set of roles on a table, a view, or an object

See Also

https://docs.snowflake.com/en/sql-reference/sql/grant-privilege.html

Parameters:

Name Type Description Default warehouse str

The name of the warehouse. Alias for sfWarehouse

required user str

The username. Alias for sfUser

required password SecretStr

The password. Alias for sfPassword

required role str

The role name

required object str

The name of the object to grant privileges on

required type str

The type of object to grant privileges on, e.g. TABLE, VIEW

required privileges Union[conlist(str, min_length=1), str]

The Privilege/Permission or list of Privileges/Permissions to grant on the given object.

required roles Union[conlist(str, min_length=1), str]

The Role or list of Roles to grant the privileges to

required Example
GrantPrivilegesOnObject(\n    object=\"MY_TABLE\",\n    type=\"TABLE\",\n    warehouse=\"MY_WH\",\n    user=\"gid.account@nike.com\",\n    password=Secret(\"super-secret-password\"),\n    role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n    permissions=[\"SELECT\", \"INSERT\"],\n).execute()\n

In this example, the APPLICATION.SNOWFLAKE.ADMIN role will be granted SELECT and INSERT privileges on the MY_TABLE table using the MY_WH warehouse.

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.object","title":"object class-attribute instance-attribute","text":"
object: str = Field(default=..., description='The name of the object to grant privileges on')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.privileges","title":"privileges class-attribute instance-attribute","text":"
privileges: Union[conlist(str, min_length=1), str] = Field(default=..., alias='permissions', description='The Privilege/Permission or list of Privileges/Permissions to grant on the given object. See https://docs.snowflake.com/en/sql-reference/sql/grant-privilege.html')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.roles","title":"roles class-attribute instance-attribute","text":"
roles: Union[conlist(str, min_length=1), str] = Field(default=..., alias='role', validation_alias='roles', description='The Role or list of Roles to grant the privileges to')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.type","title":"type class-attribute instance-attribute","text":"
type: str = Field(default=..., description='The type of object to grant privileges on, e.g. TABLE, VIEW')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.Output","title":"Output","text":"

Output class for GrantPrivilegesOnObject

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.Output.query","title":"query class-attribute instance-attribute","text":"
query: conlist(str, min_length=1) = Field(default=..., description='Query that was executed to grant privileges', validate_default=False)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n    self.output.query = []\n    roles = self.roles\n\n    for role in roles:\n        query = self.get_query(role)\n        self.output.query.append(query)\n        RunQuery(**self.get_options(), query=query).execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.get_query","title":"get_query","text":"
get_query(role: str)\n

Build the GRANT query

Parameters:

Name Type Description Default role str

The role name

required

Returns:

Name Type Description query str

The Query that performs the grant

Source code in src/koheesio/spark/snowflake.py
def get_query(self, role: str):\n    \"\"\"Build the GRANT query\n\n    Parameters\n    ----------\n    role: str\n        The role name\n\n    Returns\n    -------\n    query : str\n        The Query that performs the grant\n    \"\"\"\n    query = f\"GRANT {','.join(self.privileges)} ON {self.type} {self.object} TO ROLE {role}\".upper()\n    return query\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.set_roles_privileges","title":"set_roles_privileges","text":"
set_roles_privileges(values)\n

Coerce roles and privileges to be lists if they are not already.

Source code in src/koheesio/spark/snowflake.py
@model_validator(mode=\"before\")\ndef set_roles_privileges(cls, values):\n    \"\"\"Coerce roles and privileges to be lists if they are not already.\"\"\"\n    roles_value = values.get(\"roles\") or values.get(\"role\")\n    privileges_value = values.get(\"privileges\")\n\n    if not (roles_value and privileges_value):\n        raise ValueError(\"You have to specify roles AND privileges when using 'GrantPrivilegesOnObject'.\")\n\n    # coerce values to be lists\n    values[\"roles\"] = [roles_value] if isinstance(roles_value, str) else roles_value\n    values[\"role\"] = values[\"roles\"][0]  # hack to keep the validator happy\n    values[\"privileges\"] = [privileges_value] if isinstance(privileges_value, str) else privileges_value\n\n    return values\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.validate_object_and_object_type","title":"validate_object_and_object_type","text":"
validate_object_and_object_type()\n

Validate that the object and type are set.

Source code in src/koheesio/spark/snowflake.py
@model_validator(mode=\"after\")\ndef validate_object_and_object_type(self):\n    \"\"\"Validate that the object and type are set.\"\"\"\n    object_value = self.object\n    if not object_value:\n        raise ValueError(\"You must provide an `object`, this should be the name of the object. \")\n\n    object_type = self.type\n    if not object_type:\n        raise ValueError(\n            \"You must provide a `type`, e.g. TABLE, VIEW, DATABASE. \"\n            \"See https://docs.snowflake.com/en/sql-reference/sql/grant-privilege.html\"\n        )\n\n    return self\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnTable","title":"koheesio.spark.snowflake.GrantPrivilegesOnTable","text":"

Grant Snowflake privileges to a set of roles on a table

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnTable.object","title":"object class-attribute instance-attribute","text":"
object: str = Field(default=..., alias='table', description='The name of the Table to grant Privileges on. This should be just the name of the table; so without Database and Schema, use sfDatabase/database and sfSchema/schema to set those instead.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnTable.type","title":"type class-attribute instance-attribute","text":"
type: str = 'TABLE'\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnView","title":"koheesio.spark.snowflake.GrantPrivilegesOnView","text":"

Grant Snowflake privileges to a set of roles on a view

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnView.object","title":"object class-attribute instance-attribute","text":"
object: str = Field(default=..., alias='view', description='The name of the View to grant Privileges on. This should be just the name of the view; so without Database and Schema, use sfDatabase/database and sfSchema/schema to set those instead.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnView.type","title":"type class-attribute instance-attribute","text":"
type: str = 'VIEW'\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.Query","title":"koheesio.spark.snowflake.Query","text":"

Query data from Snowflake and return the result as a DataFrame

Example
Query(\n    database=\"MY_DB\",\n    schema_=\"MY_SCHEMA\",\n    warehouse=\"MY_WH\",\n    user=\"gid.account@nike.com\",\n    password=Secret(\"super-secret-password\"),\n    role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n    query=\"SELECT * FROM MY_TABLE\",\n).execute().df\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.Query.query","title":"query class-attribute instance-attribute","text":"
query: str = Field(default=..., description='The query to run')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.Query.get_options","title":"get_options","text":"
get_options()\n

add query to options

Source code in src/koheesio/spark/snowflake.py
def get_options(self):\n    \"\"\"add query to options\"\"\"\n    options = super().get_options()\n    options[\"query\"] = self.query\n    return options\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.Query.validate_query","title":"validate_query","text":"
validate_query(query)\n

Replace escape characters

Source code in src/koheesio/spark/snowflake.py
@field_validator(\"query\")\ndef validate_query(cls, query):\n    \"\"\"Replace escape characters\"\"\"\n    query = query.replace(\"\\\\n\", \"\\n\").replace(\"\\\\t\", \"\\t\").strip()\n    return query\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.RunQuery","title":"koheesio.spark.snowflake.RunQuery","text":"

Run a query on Snowflake that does not return a result, e.g. a create table statement

This is a wrapper around 'net.snowflake.spark.snowflake.Utils.runQuery' on the JVM

Example
RunQuery(\n    database=\"MY_DB\",\n    schema=\"MY_SCHEMA\",\n    warehouse=\"MY_WH\",\n    user=\"account\",\n    password=\"***\",\n    role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n    query=\"CREATE TABLE test (col1 string)\",\n).execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.RunQuery.query","title":"query class-attribute instance-attribute","text":"
query: str = Field(default=..., description='The query to run', alias='sql')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.RunQuery.execute","title":"execute","text":"
execute() -> None\n
Source code in src/koheesio/spark/snowflake.py
def execute(self) -> None:\n    if not self.query:\n        self.log.warning(\"Empty string given as query input, skipping execution\")\n        return\n    # noinspection PyProtectedMember\n    self.spark._jvm.net.snowflake.spark.snowflake.Utils.runQuery(self.get_options(), self.query)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.RunQuery.get_options","title":"get_options","text":"
get_options()\n
Source code in src/koheesio/spark/snowflake.py
def get_options(self):\n    # Executing the RunQuery without `host` option in Databricks throws:\n    # An error occurred while calling z:net.snowflake.spark.snowflake.Utils.runQuery.\n    # : java.util.NoSuchElementException: key not found: host\n    options = super().get_options()\n    options[\"host\"] = options[\"sfURL\"]\n    return options\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.RunQuery.validate_query","title":"validate_query","text":"
validate_query(query)\n

Replace escape characters

Source code in src/koheesio/spark/snowflake.py
@field_validator(\"query\")\ndef validate_query(cls, query):\n    \"\"\"Replace escape characters\"\"\"\n    return query.replace(\"\\\\n\", \"\\n\").replace(\"\\\\t\", \"\\t\").strip()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel","title":"koheesio.spark.snowflake.SnowflakeBaseModel","text":"

BaseModel for setting up Snowflake Driver options.

Notes
  • Snowflake is supported natively in Databricks 4.2 and newer: https://docs.snowflake.com/en/user-guide/spark-connector-databricks
  • Refer to Snowflake docs for the installation instructions for non-Databricks environments: https://docs.snowflake.com/en/user-guide/spark-connector-install
  • Refer to Snowflake docs for connection options: https://docs.snowflake.com/en/user-guide/spark-connector-use#setting-configuration-options-for-the-connector

Parameters:

Name Type Description Default url str

Hostname for the Snowflake account, e.g. <account>.snowflakecomputing.com. Alias for sfURL. required user str

Login name for the Snowflake user. Alias for sfUser.

required password SecretStr

Password for the Snowflake user. Alias for sfPassword.

required database str

The database to use for the session after connecting. Alias for sfDatabase.

required sfSchema str

The schema to use for the session after connecting. Alias for schema (\"schema\" is a reserved name in Pydantic, so we use sfSchema as the main name instead).

required role str

The default security role to use for the session after connecting. Alias for sfRole.

required warehouse str

The default virtual warehouse to use for the session after connecting. Alias for sfWarehouse.

required authenticator Optional[str]

Authenticator for the Snowflake user. Example: \"okta.com\".

None options Optional[Dict[str, Any]]

Extra options to pass to the Snowflake connector.

{\"sfCompress\": \"on\", \"continue_on_error\": \"off\"} format str

The default snowflake format can be used natively in Databricks; use net.snowflake.spark.snowflake in other environments and make sure to install the required JARs.

\"snowflake\""},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.authenticator","title":"authenticator class-attribute instance-attribute","text":"
authenticator: Optional[str] = Field(default=None, description='Authenticator for the Snowflake user', examples=['okta.com'])\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.database","title":"database class-attribute instance-attribute","text":"
database: str = Field(default=..., alias='sfDatabase', description='The database to use for the session after connecting')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.format","title":"format class-attribute instance-attribute","text":"
format: str = Field(default='snowflake', description='The default `snowflake` format can be used natively in Databricks, use `net.snowflake.spark.snowflake` in other environments and make sure to install required JARs.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.options","title":"options class-attribute instance-attribute","text":"
options: Optional[Dict[str, Any]] = Field(default={'sfCompress': 'on', 'continue_on_error': 'off'}, description='Extra options to pass to the Snowflake connector')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.password","title":"password class-attribute instance-attribute","text":"
password: SecretStr = Field(default=..., alias='sfPassword', description='Password for the Snowflake user')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.role","title":"role class-attribute instance-attribute","text":"
role: str = Field(default=..., alias='sfRole', description='The default security role to use for the session after connecting')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.sfSchema","title":"sfSchema class-attribute instance-attribute","text":"
sfSchema: str = Field(default=..., alias='schema', description='The schema to use for the session after connecting')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.url","title":"url class-attribute instance-attribute","text":"
url: str = Field(default=..., alias='sfURL', description='Hostname for the Snowflake account, e.g. <account>.snowflakecomputing.com', examples=['example.snowflakecomputing.com'])\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.user","title":"user class-attribute instance-attribute","text":"
user: str = Field(default=..., alias='sfUser', description='Login name for the Snowflake user')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.warehouse","title":"warehouse class-attribute instance-attribute","text":"
warehouse: str = Field(default=..., alias='sfWarehouse', description='The default virtual warehouse to use for the session after connecting')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.get_options","title":"get_options","text":"
get_options()\n

Get the sfOptions as a dictionary.

Source code in src/koheesio/spark/snowflake.py
def get_options(self):\n    \"\"\"Get the sfOptions as a dictionary.\"\"\"\n    return {\n        key: value\n        for key, value in {\n            \"sfURL\": self.url,\n            \"sfUser\": self.user,\n            \"sfPassword\": self.password.get_secret_value(),\n            \"authenticator\": self.authenticator,\n            \"sfDatabase\": self.database,\n            \"sfSchema\": self.sfSchema,\n            \"sfRole\": self.role,\n            \"sfWarehouse\": self.warehouse,\n            **self.options,\n        }.items()\n        if value is not None\n    }\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeReader","title":"koheesio.spark.snowflake.SnowflakeReader","text":"

Wrapper around JdbcReader for Snowflake.

Example
sr = SnowflakeReader(\n    url=\"foo.snowflakecomputing.com\",\n    user=\"YOUR_USERNAME\",\n    password=\"***\",\n    database=\"db\",\n    schema=\"schema\",\n)\ndf = sr.read()\n
Notes
  • Snowflake is supported natively in Databricks 4.2 and newer: https://docs.snowflake.com/en/user-guide/spark-connector-databricks
  • Refer to Snowflake docs for the installation instructions for non-Databricks environments: https://docs.snowflake.com/en/user-guide/spark-connector-install
  • Refer to Snowflake docs for connection options: https://docs.snowflake.com/en/user-guide/spark-connector-use#setting-configuration-options-for-the-connector
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeReader.driver","title":"driver class-attribute instance-attribute","text":"
driver: Optional[str] = None\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeStep","title":"koheesio.spark.snowflake.SnowflakeStep","text":"

Expands the SnowflakeBaseModel so that it can be used as a Step

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeTableStep","title":"koheesio.spark.snowflake.SnowflakeTableStep","text":"

Expands the SnowflakeStep, adding a 'table' parameter

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeTableStep.table","title":"table class-attribute instance-attribute","text":"
table: str = Field(default=..., description='The name of the table')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeTableStep.get_options","title":"get_options","text":"
get_options()\n
Source code in src/koheesio/spark/snowflake.py
def get_options(self):\n    options = super().get_options()\n    options[\"table\"] = self.table\n    return options\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeTransformation","title":"koheesio.spark.snowflake.SnowflakeTransformation","text":"

Adds Snowflake parameters to the Transformation class

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeWriter","title":"koheesio.spark.snowflake.SnowflakeWriter","text":"

Class for writing to Snowflake

See Also
  • koheesio.steps.writers.Writer
  • koheesio.steps.writers.BatchOutputMode
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeWriter.insert_type","title":"insert_type class-attribute instance-attribute","text":"
insert_type: Optional[BatchOutputMode] = Field(APPEND, alias='mode', description='The insertion type, append or overwrite')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeWriter.table","title":"table class-attribute instance-attribute","text":"
table: str = Field(default=..., description='Target table name')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeWriter.execute","title":"execute","text":"
execute()\n

Write to Snowflake

Source code in src/koheesio/spark/snowflake.py
def execute(self):\n    \"\"\"Write to Snowflake\"\"\"\n    self.log.debug(f\"writing to {self.table} with mode {self.insert_type}\")\n    self.df.write.format(self.format).options(**self.get_options()).option(\"dbtable\", self.table).mode(\n        self.insert_type\n    ).save()\n
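A minimal usage sketch with placeholder values; the connection fields are inherited from SnowflakeBaseModel, and input_df is assumed to be an existing Spark DataFrame:
SnowflakeWriter(\n    url=\"foo.snowflakecomputing.com\",\n    user=\"YOUR_USERNAME\",\n    password=\"***\",\n    database=\"db\",\n    schema=\"schema\",\n    warehouse=\"warehouse\",\n    table=\"target_table\",\n    df=input_df,  # insert_type (alias 'mode') defaults to APPEND\n).execute()\n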
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema","title":"koheesio.spark.snowflake.SyncTableAndDataFrameSchema","text":"

Sync the schemas of a Snowflake table and a DataFrame. This will add NULL columns for the columns that are not present in both, and perform type casts where needed.

The Snowflake table will take priority in case of type conflicts.
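A usage sketch with placeholder connection values, assuming input_df is an existing Spark DataFrame:
synced = SyncTableAndDataFrameSchema(\n    url=\"foo.snowflakecomputing.com\",\n    user=\"YOUR_USERNAME\",\n    password=\"***\",\n    database=\"db\",\n    schema=\"schema\",\n    warehouse=\"warehouse\",\n    df=input_df,\n    table=\"my_table\",\n).execute()\n# synced.df has NULL columns added for columns that only exist in Snowflake,\n# with its columns reordered to match the Snowflake table\naligned_df = synced.df\n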

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.df","title":"df class-attribute instance-attribute","text":"
df: DataFrame = Field(default=..., description='The Spark DataFrame')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.dry_run","title":"dry_run class-attribute instance-attribute","text":"
dry_run: Optional[bool] = Field(default=False, description='Only show schema differences, do not apply changes')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.table","title":"table class-attribute instance-attribute","text":"
table: str = Field(default=..., description='The table name')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output","title":"Output","text":"

Output class for SyncTableAndDataFrameSchema

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output.new_df_schema","title":"new_df_schema class-attribute instance-attribute","text":"
new_df_schema: StructType = Field(default=..., description='New DataFrame schema')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output.new_sf_schema","title":"new_sf_schema class-attribute instance-attribute","text":"
new_sf_schema: StructType = Field(default=..., description='New Snowflake schema')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output.original_df_schema","title":"original_df_schema class-attribute instance-attribute","text":"
original_df_schema: StructType = Field(default=..., description='Original DataFrame schema')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output.original_sf_schema","title":"original_sf_schema class-attribute instance-attribute","text":"
original_sf_schema: StructType = Field(default=..., description='Original Snowflake schema')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output.sf_table_altered","title":"sf_table_altered class-attribute instance-attribute","text":"
sf_table_altered: bool = Field(default=False, description='Flag to indicate whether Snowflake schema has been altered')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n    self.log.warning(\"Snowflake table will always take a priority in case of data type conflicts!\")\n\n    # spark side\n    df_schema = self.df.schema\n    self.output.original_df_schema = deepcopy(df_schema)  # using deepcopy to avoid storing in place changes\n    df_cols = [c.name.lower() for c in df_schema]\n\n    # snowflake side\n    sf_schema = GetTableSchema(**self.get_options(), table=self.table).execute().table_schema\n    self.output.original_sf_schema = sf_schema\n    sf_cols = [c.name.lower() for c in sf_schema]\n\n    if self.dry_run:\n        # Display differences between Spark DataFrame and Snowflake schemas\n        # and provide dummy values that are expected as class outputs.\n        self.log.warning(f\"Columns to be added to Snowflake table: {set(df_cols) - set(sf_cols)}\")\n        self.log.warning(f\"Columns to be added to Spark DataFrame: {set(sf_cols) - set(df_cols)}\")\n\n        self.output.new_df_schema = t.StructType()\n        self.output.new_sf_schema = t.StructType()\n        self.output.df = self.df\n        self.output.sf_table_altered = False\n\n    else:\n        # Add columns to SnowFlake table that exist in DataFrame\n        for df_column in df_schema:\n            if df_column.name.lower() not in sf_cols:\n                AddColumn(\n                    **self.get_options(),\n                    table=self.table,\n                    column=df_column.name,\n                    type=df_column.dataType,\n                ).execute()\n                self.output.sf_table_altered = True\n\n        if self.output.sf_table_altered:\n            sf_schema = GetTableSchema(**self.get_options(), table=self.table).execute().table_schema\n            sf_cols = [c.name.lower() for c in sf_schema]\n\n        self.output.new_sf_schema = sf_schema\n\n        # Add NULL columns to the DataFrame if they exist in SnowFlake but not in the df\n        df = self.df\n        for sf_col in self.output.original_sf_schema:\n            sf_col_name = sf_col.name.lower()\n            if sf_col_name not in df_cols:\n                sf_col_type = sf_col.dataType\n                df = df.withColumn(sf_col_name, f.lit(None).cast(sf_col_type))\n\n        # Put DataFrame columns in the same order as the Snowflake table\n        df = df.select(*sf_cols)\n\n        self.output.df = df\n        self.output.new_df_schema = df.schema\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask","title":"koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask","text":"

Synchronize a Delta table to a Snowflake table

Supported synchronisation modes:
  • Overwrite - only in batch mode
  • Append - supports batch and streaming mode
  • Merge - only in streaming mode
Example
SynchronizeDeltaToSnowflakeTask(\n    url=\"acme.snowflakecomputing.com\",\n    user=\"admin\",\n    role=\"ADMIN\",\n    warehouse=\"SF_WAREHOUSE\",\n    database=\"SF_DATABASE\",\n    schema=\"SF_SCHEMA\",\n    source_table=DeltaTableStep(...),\n    target_table=\"my_sf_table\",\n    key_columns=[\n        \"id\",\n    ],\n    streaming=False,\n).run()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.checkpoint_location","title":"checkpoint_location class-attribute instance-attribute","text":"
checkpoint_location: Optional[str] = Field(default=None, description='Checkpoint location to use')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.enable_deletion","title":"enable_deletion class-attribute instance-attribute","text":"
enable_deletion: Optional[bool] = Field(default=False, description='In case of merge synchronisation_mode add deletion statement in merge query.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.key_columns","title":"key_columns class-attribute instance-attribute","text":"
key_columns: Optional[List[str]] = Field(default_factory=list, description='Key columns on which the MERGE statement will be applied.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.non_key_columns","title":"non_key_columns property","text":"
non_key_columns: List[str]\n

Columns of source table that aren't part of the (composite) primary key

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.persist_staging","title":"persist_staging class-attribute instance-attribute","text":"
persist_staging: Optional[bool] = Field(default=False, description='In case of debugging, set `persist_staging` to True to retain the staging table for inspection after synchronization.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.reader","title":"reader property","text":"
reader\n

DeltaTable reader

Returns:
DeltaTableReader that will yield the source Delta table\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.schema_tracking_location","title":"schema_tracking_location class-attribute instance-attribute","text":"
schema_tracking_location: Optional[str] = Field(default=None, description='Schema tracking location to use. Info: https://docs.delta.io/latest/delta-streaming.html#-schema-tracking')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.source_table","title":"source_table class-attribute instance-attribute","text":"
source_table: DeltaTableStep = Field(default=..., description='Source delta table to synchronize')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.staging_table","title":"staging_table property","text":"
staging_table\n

Intermediate table on snowflake where staging results are stored

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.staging_table_name","title":"staging_table_name class-attribute instance-attribute","text":"
staging_table_name: Optional[str] = Field(default=None, alias='staging_table', description='Optional snowflake staging name', validate_default=False)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.streaming","title":"streaming class-attribute instance-attribute","text":"
streaming: Optional[bool] = Field(default=False, description=\"Should synchronisation happen in streaming or in batch mode. Streaming is supported in 'APPEND' and 'MERGE' mode. Batch is supported in 'OVERWRITE' and 'APPEND' mode.\")\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.synchronisation_mode","title":"synchronisation_mode class-attribute instance-attribute","text":"
synchronisation_mode: BatchOutputMode = Field(default=MERGE, description=\"Determines if synchronisation will 'overwrite' any existing table, 'append' new rows or 'merge' with existing rows.\")\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.target_table","title":"target_table class-attribute instance-attribute","text":"
target_table: str = Field(default=..., description='Target table in snowflake to synchronize to')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.writer","title":"writer property","text":"
writer: Union[ForEachBatchStreamWriter, SnowflakeWriter]\n

Writer to persist to snowflake

Depending on the configured options, this returns either a SnowflakeWriter or a ForEachBatchStreamWriter:
  • OVERWRITE/APPEND mode yields SnowflakeWriter
  • MERGE mode yields ForEachBatchStreamWriter

Returns:

Type Description Union[ForEachBatchStreamWriter, SnowflakeWriter]"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.writer_","title":"writer_ class-attribute instance-attribute","text":"
writer_: Optional[Union[ForEachBatchStreamWriter, SnowflakeWriter]] = None\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.drop_table","title":"drop_table","text":"
drop_table(snowflake_table)\n

Drop a given snowflake table

Source code in src/koheesio/spark/snowflake.py
def drop_table(self, snowflake_table):\n    \"\"\"Drop a given snowflake table\"\"\"\n    self.log.warning(f\"Dropping table {snowflake_table} from snowflake\")\n    drop_table_query = f\"\"\"DROP TABLE IF EXISTS {snowflake_table}\"\"\"\n    query_executor = RunQuery(**self.get_options(), query=drop_table_query)\n    query_executor.execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.execute","title":"execute","text":"
execute() -> None\n
Source code in src/koheesio/spark/snowflake.py
def execute(self) -> None:\n    # extract\n    df = self.extract()\n    self.output.source_df = df\n\n    # synchronize\n    self.output.target_df = df\n    self.load(df)\n    if not self.persist_staging:\n        # If it's a streaming job, await for termination before dropping staging table\n        if self.streaming:\n            self.writer.await_termination()\n        self.drop_table(self.staging_table)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.extract","title":"extract","text":"
extract() -> DataFrame\n

Extract source table

Source code in src/koheesio/spark/snowflake.py
def extract(self) -> DataFrame:\n    \"\"\"\n    Extract source table\n    \"\"\"\n    if self.synchronisation_mode == BatchOutputMode.MERGE:\n        if not self.source_table.is_cdf_active:\n            raise RuntimeError(\n                f\"Source table {self.source_table.table_name} does not have CDF enabled. \"\n                f\"Set TBLPROPERTIES ('delta.enableChangeDataFeed' = true) to enable. \"\n                f\"Current properties = {self.source_table_properties}\"\n            )\n\n    df = self.reader.read()\n    self.output.source_df = df\n    return df\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.load","title":"load","text":"
load(df) -> DataFrame\n

Load source table into snowflake

Source code in src/koheesio/spark/snowflake.py
def load(self, df) -> DataFrame:\n    \"\"\"Load source table into snowflake\"\"\"\n    if self.synchronisation_mode == BatchOutputMode.MERGE:\n        self.log.info(f\"Truncating staging table {self.staging_table}\")\n        self.truncate_table(self.staging_table)\n    self.writer.write(df)\n    self.output.target_df = df\n    return df\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.run","title":"run","text":"
run()\n

alias of execute

Source code in src/koheesio/spark/snowflake.py
def run(self):\n    \"\"\"alias of execute\"\"\"\n    return self.execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.truncate_table","title":"truncate_table","text":"
truncate_table(snowflake_table)\n

Truncate a given snowflake table

Source code in src/koheesio/spark/snowflake.py
def truncate_table(self, snowflake_table):\n    \"\"\"Truncate a given snowflake table\"\"\"\n    truncate_query = f\"\"\"TRUNCATE TABLE IF EXISTS {snowflake_table}\"\"\"\n    query_executor = RunQuery(\n        **self.get_options(),\n        query=truncate_query,\n    )\n    query_executor.execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TableExists","title":"koheesio.spark.snowflake.TableExists","text":"

Check if the table exists in Snowflake by using INFORMATION_SCHEMA.

Example
k = TableExists(\n    url=\"foo.snowflakecomputing.com\",\n    user=\"YOUR_USERNAME\",\n    password=\"***\",\n    database=\"db\",\n    schema=\"schema\",\n    table=\"table\",\n)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TableExists.Output","title":"Output","text":"

Output class for TableExists

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TableExists.Output.exists","title":"exists class-attribute instance-attribute","text":"
exists: bool = Field(default=..., description='Whether or not the table exists')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TableExists.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n    query = (\n        dedent(\n            # Force upper case, due to case-sensitivity of where clause\n            f\"\"\"\n        SELECT *\n        FROM INFORMATION_SCHEMA.TABLES\n        WHERE TABLE_CATALOG     = '{self.database}'\n          AND TABLE_SCHEMA      = '{self.sfSchema}'\n          AND TABLE_TYPE        = 'BASE TABLE'\n          AND upper(TABLE_NAME) = '{self.table.upper()}'\n        \"\"\"  # nosec B608: hardcoded_sql_expressions\n        )\n        .upper()\n        .strip()\n    )\n\n    self.log.debug(f\"Query that was executed to check if the table exists:\\n{query}\")\n\n    df = Query(**self.get_options(), query=query).read()\n\n    exists = df.count() > 0\n    self.log.info(f\"Table {self.table} {'exists' if exists else 'does not exist'}\")\n    self.output.exists = exists\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TagSnowflakeQuery","title":"koheesio.spark.snowflake.TagSnowflakeQuery","text":"

Provides a Snowflake query tag pre-action that can be used to easily find queries through the Snowflake history search and to group them for debugging and cost tracking purposes.

Takes in query tag attributes as kwargs, plus an additional Snowflake options dict that can optionally contain another set of pre-actions to be applied to a query. In that case the existing pre-actions aren't dropped; the query tag pre-action is added to them.

The passed Snowflake options dictionary is not modified in place; instead, a new dictionary containing the updated pre-actions is returned.

Notes

See this article for explanation: https://select.dev/posts/snowflake-query-tags

Arbitrary tags can be applied, such as team, dataset names, business capability, etc.

Example
query_tag = AddQueryTag(\n    options={\"preactions\": ...},\n    task_name=\"cleanse_task\",\n    pipeline_name=\"ingestion-pipeline\",\n    etl_date=\"2022-01-01\",\n    pipeline_execution_time=\"2022-01-01T00:00:00\",\n    task_execution_time=\"2022-01-01T01:00:00\",\n    environment=\"dev\",\n    trace_id=\"e0fdec43-a045-46e5-9705-acd4f3f96045\",\n    span_id=\"cb89abea-1c12-471f-8b12-546d2d66f6cb\",\n).execute().options\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TagSnowflakeQuery.options","title":"options class-attribute instance-attribute","text":"
options: Dict = Field(default_factory=dict, description='Additional Snowflake options, optionally containing additional preactions')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TagSnowflakeQuery.Output","title":"Output","text":"

Output class for AddQueryTag

"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TagSnowflakeQuery.Output.options","title":"options class-attribute instance-attribute","text":"
options: Dict = Field(default=..., description='Copy of provided SF options, with added query tag preaction')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TagSnowflakeQuery.execute","title":"execute","text":"
execute()\n

Add query tag preaction to Snowflake options

Source code in src/koheesio/spark/snowflake.py
def execute(self):\n    \"\"\"Add query tag preaction to Snowflake options\"\"\"\n    tag_json = json.dumps(self.extra_params, indent=4, sort_keys=True)\n    tag_preaction = f\"ALTER SESSION SET QUERY_TAG = '{tag_json}';\"\n    preactions = self.options.get(\"preactions\", \"\")\n    preactions = f\"{preactions}\\n{tag_preaction}\".strip()\n    updated_options = dict(self.options)\n    updated_options[\"preactions\"] = preactions\n    self.output.options = updated_options\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.map_spark_type","title":"koheesio.spark.snowflake.map_spark_type","text":"
map_spark_type(spark_type: DataType)\n

Translates Spark DataFrame Schema type to SnowFlake type

| Basic Types | Snowflake Type |
|-------------------|----------------|
| StringType | STRING |
| NullType | STRING |
| BooleanType | BOOLEAN |

| Numeric Types | Snowflake Type |
|-------------------|----------------|
| LongType | BIGINT |
| IntegerType | INT |
| ShortType | SMALLINT |
| DoubleType | DOUBLE |
| FloatType | FLOAT |
| NumericType | FLOAT |
| ByteType | BINARY |

| Date / Time Types | Snowflake Type |
|-------------------|----------------|
| DateType | DATE |
| TimestampType | TIMESTAMP |

| Advanced Types | Snowflake Type |
|-------------------|----------------|
| DecimalType | DECIMAL |
| MapType | VARIANT |
| ArrayType | VARIANT |
| StructType | VARIANT |

References
  • Spark SQL DataTypes: https://spark.apache.org/docs/latest/sql-ref-datatypes.html
  • Snowflake DataTypes: https://docs.snowflake.com/en/sql-reference/data-types.html

Parameters:

Name Type Description Default spark_type DataType

DataType taken out of the StructField

required

Returns:

Type Description str

The Snowflake data type

Source code in src/koheesio/spark/snowflake.py
def map_spark_type(spark_type: t.DataType):\n    \"\"\"\n    Translates Spark DataFrame Schema type to SnowFlake type\n\n    | Basic Types       | Snowflake Type |\n    |-------------------|----------------|\n    | StringType        | STRING         |\n    | NullType          | STRING         |\n    | BooleanType       | BOOLEAN        |\n\n    | Numeric Types     | Snowflake Type |\n    |-------------------|----------------|\n    | LongType          | BIGINT         |\n    | IntegerType       | INT            |\n    | ShortType         | SMALLINT       |\n    | DoubleType        | DOUBLE         |\n    | FloatType         | FLOAT          |\n    | NumericType       | FLOAT          |\n    | ByteType          | BINARY         |\n\n    | Date / Time Types | Snowflake Type |\n    |-------------------|----------------|\n    | DateType          | DATE           |\n    | TimestampType     | TIMESTAMP      |\n\n    | Advanced Types    | Snowflake Type |\n    |-------------------|----------------|\n    | DecimalType       | DECIMAL        |\n    | MapType           | VARIANT        |\n    | ArrayType         | VARIANT        |\n    | StructType        | VARIANT        |\n\n    References\n    ----------\n    - Spark SQL DataTypes: https://spark.apache.org/docs/latest/sql-ref-datatypes.html\n    - Snowflake DataTypes: https://docs.snowflake.com/en/sql-reference/data-types.html\n\n    Parameters\n    ----------\n    spark_type : pyspark.sql.types.DataType\n        DataType taken out of the StructField\n\n    Returns\n    -------\n    str\n        The Snowflake data type\n    \"\"\"\n    # StructField means that the entire Field was passed, we need to extract just the dataType before continuing\n    if isinstance(spark_type, t.StructField):\n        spark_type = spark_type.dataType\n\n    # Check if the type is DayTimeIntervalType\n    if isinstance(spark_type, t.DayTimeIntervalType):\n        warn(\n            \"DayTimeIntervalType is being converted to STRING. \"\n            \"Consider converting to a more supported date/time/timestamp type in Snowflake.\"\n        )\n\n    # fmt: off\n    # noinspection PyUnresolvedReferences\n    data_type_map = {\n        # Basic Types\n        t.StringType: \"STRING\",\n        t.NullType: \"STRING\",\n        t.BooleanType: \"BOOLEAN\",\n\n        # Numeric Types\n        t.LongType: \"BIGINT\",\n        t.IntegerType: \"INT\",\n        t.ShortType: \"SMALLINT\",\n        t.DoubleType: \"DOUBLE\",\n        t.FloatType: \"FLOAT\",\n        t.NumericType: \"FLOAT\",\n        t.ByteType: \"BINARY\",\n        t.BinaryType: \"VARBINARY\",\n\n        # Date / Time Types\n        t.DateType: \"DATE\",\n        t.TimestampType: \"TIMESTAMP\",\n        t.DayTimeIntervalType: \"STRING\",\n\n        # Advanced Types\n        t.DecimalType:\n            f\"DECIMAL({spark_type.precision},{spark_type.scale})\"  # pylint: disable=no-member\n            if isinstance(spark_type, t.DecimalType) else \"DECIMAL(38,0)\",\n        t.MapType: \"VARIANT\",\n        t.ArrayType: \"VARIANT\",\n        t.StructType: \"VARIANT\",\n    }\n    return data_type_map.get(type(spark_type), 'STRING')\n
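A few illustrative calls based on the mapping above:
from pyspark.sql import types as t\n\nmap_spark_type(t.StringType())                      # \"STRING\"\nmap_spark_type(t.DecimalType(10, 2))                # \"DECIMAL(10,2)\"\nmap_spark_type(t.ArrayType(t.StringType()))         # \"VARIANT\"\nmap_spark_type(t.StructField(\"col\", t.LongType()))  # \"BIGINT\" - the dataType is extracted first\n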
"},{"location":"api_reference/spark/utils.html","title":"Utils","text":"

Spark Utility functions

"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.spark_minor_version","title":"koheesio.spark.utils.spark_minor_version module-attribute","text":"
spark_minor_version: float = get_spark_minor_version()\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype","title":"koheesio.spark.utils.SparkDatatype","text":"

Allowed spark datatypes

The following table lists the data types that are supported by Spark SQL.

| Data type | SQL name |
|---------------|--------------------------|
| ByteType | BYTE, TINYINT |
| ShortType | SHORT, SMALLINT |
| IntegerType | INT, INTEGER |
| LongType | LONG, BIGINT |
| FloatType | FLOAT, REAL |
| DoubleType | DOUBLE |
| DecimalType | DECIMAL, DEC, NUMERIC |
| StringType | STRING |
| BinaryType | BINARY |
| BooleanType | BOOLEAN |
| TimestampType | TIMESTAMP, TIMESTAMP_LTZ |
| DateType | DATE |
| ArrayType | ARRAY |
| MapType | MAP |
| NullType | VOID |

Not supported yet
  • TimestampNTZType TIMESTAMP_NTZ
  • YearMonthIntervalType INTERVAL YEAR, INTERVAL YEAR TO MONTH, INTERVAL MONTH
  • DayTimeIntervalType INTERVAL DAY, INTERVAL DAY TO HOUR, INTERVAL DAY TO MINUTE, INTERVAL DAY TO SECOND, INTERVAL HOUR, INTERVAL HOUR TO MINUTE, INTERVAL HOUR TO SECOND, INTERVAL MINUTE, INTERVAL MINUTE TO SECOND, INTERVAL SECOND
See Also

https://spark.apache.org/docs/latest/sql-ref-datatypes.html#supported-data-types

"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.ARRAY","title":"ARRAY class-attribute instance-attribute","text":"
ARRAY = 'array'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.BIGINT","title":"BIGINT class-attribute instance-attribute","text":"
BIGINT = 'long'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.BINARY","title":"BINARY class-attribute instance-attribute","text":"
BINARY = 'binary'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.BOOLEAN","title":"BOOLEAN class-attribute instance-attribute","text":"
BOOLEAN = 'boolean'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.BYTE","title":"BYTE class-attribute instance-attribute","text":"
BYTE = 'byte'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.DATE","title":"DATE class-attribute instance-attribute","text":"
DATE = 'date'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.DEC","title":"DEC class-attribute instance-attribute","text":"
DEC = 'decimal'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.DECIMAL","title":"DECIMAL class-attribute instance-attribute","text":"
DECIMAL = 'decimal'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.DOUBLE","title":"DOUBLE class-attribute instance-attribute","text":"
DOUBLE = 'double'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.FLOAT","title":"FLOAT class-attribute instance-attribute","text":"
FLOAT = 'float'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.INT","title":"INT class-attribute instance-attribute","text":"
INT = 'integer'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.INTEGER","title":"INTEGER class-attribute instance-attribute","text":"
INTEGER = 'integer'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.LONG","title":"LONG class-attribute instance-attribute","text":"
LONG = 'long'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.MAP","title":"MAP class-attribute instance-attribute","text":"
MAP = 'map'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.NUMERIC","title":"NUMERIC class-attribute instance-attribute","text":"
NUMERIC = 'decimal'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.REAL","title":"REAL class-attribute instance-attribute","text":"
REAL = 'float'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.SHORT","title":"SHORT class-attribute instance-attribute","text":"
SHORT = 'short'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.SMALLINT","title":"SMALLINT class-attribute instance-attribute","text":"
SMALLINT = 'short'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.STRING","title":"STRING class-attribute instance-attribute","text":"
STRING = 'string'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.TIMESTAMP","title":"TIMESTAMP class-attribute instance-attribute","text":"
TIMESTAMP = 'timestamp'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.TIMESTAMP_LTZ","title":"TIMESTAMP_LTZ class-attribute instance-attribute","text":"
TIMESTAMP_LTZ = 'timestamp'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.TINYINT","title":"TINYINT class-attribute instance-attribute","text":"
TINYINT = 'byte'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.VOID","title":"VOID class-attribute instance-attribute","text":"
VOID = 'void'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.spark_type","title":"spark_type property","text":"
spark_type: DataType\n

Returns the spark type for the given enum value

"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.from_string","title":"from_string classmethod","text":"
from_string(value: str) -> SparkDatatype\n

Allows for getting the right Enum value by simply passing a string value. This method is not case-sensitive.

Source code in src/koheesio/spark/utils.py
@classmethod\ndef from_string(cls, value: str) -> \"SparkDatatype\":\n    \"\"\"Allows for getting the right Enum value by simply passing a string value\n    This method is not case-sensitive\n    \"\"\"\n    return getattr(cls, value.upper())\n
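For example:
SparkDatatype.from_string(\"bigint\")        # SparkDatatype.BIGINT\nSparkDatatype.from_string(\"BigInt\").value  # \"long\" - the lookup is case-insensitive\n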
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.get_spark_minor_version","title":"koheesio.spark.utils.get_spark_minor_version","text":"
get_spark_minor_version() -> float\n

Returns the minor version of the spark instance.

For example, if the spark version is 3.3.2, this function would return 3.3

Source code in src/koheesio/spark/utils.py
def get_spark_minor_version() -> float:\n    \"\"\"Returns the minor version of the spark instance.\n\n    For example, if the spark version is 3.3.2, this function would return 3.3\n    \"\"\"\n    return float(\".\".join(spark_version.split(\".\")[:2]))\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.on_databricks","title":"koheesio.spark.utils.on_databricks","text":"
on_databricks() -> bool\n

Determine whether we're running on Databricks or elsewhere

Source code in src/koheesio/spark/utils.py
def on_databricks() -> bool:\n    \"\"\"Retrieve if we're running on databricks or elsewhere\"\"\"\n    dbr_version = os.getenv(\"DATABRICKS_RUNTIME_VERSION\", None)\n    return dbr_version is not None and dbr_version != \"\"\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.schema_struct_to_schema_str","title":"koheesio.spark.utils.schema_struct_to_schema_str","text":"
schema_struct_to_schema_str(schema: StructType) -> str\n

Converts a StructType to a schema str

Source code in src/koheesio/spark/utils.py
def schema_struct_to_schema_str(schema: StructType) -> str:\n    \"\"\"Converts a StructType to a schema str\"\"\"\n    if not schema:\n        return \"\"\n    return \",\\n\".join([f\"{field.name} {field.dataType.typeName().upper()}\" for field in schema.fields])\n
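For example:
from pyspark.sql.types import IntegerType, StringType, StructField, StructType\n\nschema = StructType([StructField(\"id\", IntegerType()), StructField(\"name\", StringType())])\nschema_struct_to_schema_str(schema)  # \"id INTEGER\" and \"name STRING\", joined by a comma and a newline\n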
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.spark_data_type_is_array","title":"koheesio.spark.utils.spark_data_type_is_array","text":"
spark_data_type_is_array(data_type: DataType) -> bool\n

Check if the column's dataType is of type ArrayType

Source code in src/koheesio/spark/utils.py
def spark_data_type_is_array(data_type: DataType) -> bool:\n    \"\"\"Check if the column's dataType is of type ArrayType\"\"\"\n    return isinstance(data_type, ArrayType)\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.spark_data_type_is_numeric","title":"koheesio.spark.utils.spark_data_type_is_numeric","text":"
spark_data_type_is_numeric(data_type: DataType) -> bool\n

Check if the column's dataType is a numeric type (IntegerType, LongType, FloatType, DoubleType, or DecimalType)

Source code in src/koheesio/spark/utils.py
def spark_data_type_is_numeric(data_type: DataType) -> bool:\n    \"\"\"Check if the column's dataType is a numeric type\"\"\"\n    return isinstance(data_type, (IntegerType, LongType, FloatType, DoubleType, DecimalType))\n
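For example:
from pyspark.sql.types import ArrayType, IntegerType, StringType\n\nspark_data_type_is_array(ArrayType(StringType()))  # True\nspark_data_type_is_numeric(IntegerType())          # True\nspark_data_type_is_numeric(StringType())           # False\n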
"},{"location":"api_reference/spark/readers/index.html","title":"Readers","text":"

Readers are a type of Step that read data from a source based on the input parameters and store the result in self.output.df.

For a comprehensive guide on the usage, examples, and additional features of Reader classes, please refer to the reference/concepts/steps/readers section of the Koheesio documentation.

"},{"location":"api_reference/spark/readers/index.html#koheesio.spark.readers.Reader","title":"koheesio.spark.readers.Reader","text":"

Base class for all Readers

A Reader is a Step that reads data from a source based on the input parameters and stores the result in self.output.df (DataFrame).

When implementing a Reader, the execute() method should be implemented. The execute() method should read from the source and store the result in self.output.df.

The Reader class implements a standard read() method that calls the execute() method and returns the result. This method can be used to read data from a Reader without having to call the execute() method directly. The read() method does not need to be re-implemented in the child class.

Every Reader has a SparkSession available as self.spark. This is the currently active SparkSession.

The Reader class also implements a shorthand for accessing the output Dataframe through the df-property. If the output.df is None, .execute() will be run first.
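A minimal sketch of a custom Reader; the in-memory dataset is purely illustrative:
from koheesio.spark.readers import Reader\n\nclass MyInMemoryReader(Reader):\n    \"\"\"Hypothetical reader that returns a fixed, in-memory dataset\"\"\"\n\n    def execute(self):\n        self.output.df = self.spark.createDataFrame([(1, \"a\"), (2, \"b\")], [\"id\", \"value\"])\n\ndf = MyInMemoryReader().read()  # runs execute() and returns self.output.df\n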

"},{"location":"api_reference/spark/readers/index.html#koheesio.spark.readers.Reader.df","title":"df property","text":"
df: Optional[DataFrame]\n

Shorthand for accessing self.output.df If the output.df is None, .execute() will be run first

"},{"location":"api_reference/spark/readers/index.html#koheesio.spark.readers.Reader.execute","title":"execute abstractmethod","text":"
execute()\n

Execute on a Reader should handle self.output.df (output) as a minimum: read from whichever source -> store the result in self.output.df

Source code in src/koheesio/spark/readers/__init__.py
@abstractmethod\ndef execute(self):\n    \"\"\"Execute on a Reader should handle self.output.df (output) as a minimum\n    Read from whichever source -> store result in self.output.df\n    \"\"\"\n    # self.output.df  # output dataframe\n    ...\n
"},{"location":"api_reference/spark/readers/index.html#koheesio.spark.readers.Reader.read","title":"read","text":"
read() -> Optional[DataFrame]\n

Read from a Reader without having to call the execute() method directly

Source code in src/koheesio/spark/readers/__init__.py
def read(self) -> Optional[DataFrame]:\n    \"\"\"Read from a Reader without having to call the execute() method directly\"\"\"\n    self.execute()\n    return self.output.df\n
"},{"location":"api_reference/spark/readers/delta.html","title":"Delta","text":"

Read data from a Delta table and return a DataFrame or DataStream

Classes:

Name Description DeltaTableReader

Reads data from a Delta table and returns a DataFrame

DeltaTableStreamReader

Reads data from a Delta table and returns a DataStream

"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.STREAMING_ONLY_OPTIONS","title":"koheesio.spark.readers.delta.STREAMING_ONLY_OPTIONS module-attribute","text":"
STREAMING_ONLY_OPTIONS = ['ignore_deletes', 'ignore_changes', 'starting_version', 'starting_timestamp', 'schema_tracking_location']\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.STREAMING_SCHEMA_WARNING","title":"koheesio.spark.readers.delta.STREAMING_SCHEMA_WARNING module-attribute","text":"
STREAMING_SCHEMA_WARNING = '\\nImportant!\\nAlthough you can start the streaming source from a specified version or timestamp, the schema of the streaming source is always the latest schema of the Delta table. You must ensure there is no incompatible schema change to the Delta table after the specified version or timestamp. Otherwise, the streaming source may return incorrect results when reading the data with an incorrect schema.'\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader","title":"koheesio.spark.readers.delta.DeltaTableReader","text":"

Reads data from a Delta table and returns a DataFrame. The Delta table can be read in batch or streaming mode. It also supports reading the change data feed (CDF) in both batch and streaming mode.
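Two usage sketches with a placeholder table name (a batch read with filtering, and a streaming read of the Change Data Feed):
# Batch read with a filter and column selection\ndf = DeltaTableReader(\n    table=\"my_schema.my_table\",\n    filter_cond=\"state = 'Ohio'\",\n    columns=[\"col1\", \"col2\"],\n).read()\n\n# Streaming read of the Change Data Feed, starting from a given table version\ncdf_stream = DeltaTableReader(\n    table=\"my_schema.my_table\",\n    streaming=True,\n    read_change_feed=True,\n    starting_version=\"1\",\n).read()\n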

Parameters:

Name Type Description Default table Union[DeltaTableStep, str]

The table to read

required filter_cond Optional[Union[Column, str]]

Filter condition to apply to the dataframe. Filters can be provided by using Column or string expressions. For example: f.col('state') == 'Ohio', state = 'Ohio' or (col('col1') > 3) & (col('col2') < 9)

required columns

Columns to select from the table. One or many columns can be provided as strings. For example: ['col1', 'col2'], ['col1'] or 'col1'

required streaming Optional[bool]

Whether to read the table as a Stream or not

required read_change_feed bool

readChangeFeed: Change Data Feed (CDF) feature allows Delta tables to track row-level changes between versions of a Delta table. When enabled on a Delta table, the runtime records 'change events' for all the data written into the table. This includes the row data along with metadata indicating whether the specified row was inserted, deleted, or updated. See: https://docs.databricks.com/delta/delta-change-data-feed.html

required starting_version str

startingVersion: The Delta Lake version to start from. All table changes starting from this version (inclusive) will be read by the streaming source. You can obtain the commit versions from the version column of the DESCRIBE HISTORY command output.

required starting_timestamp str

startingTimestamp: The timestamp to start from. All table changes committed at or after the timestamp (inclusive) will be read by the streaming source. Either provide a timestamp string (e.g. 2019-01-01T00:00:00.000Z) or a date string (e.g. 2019-01-01)

required ignore_deletes bool

ignoreDeletes: Ignore transactions that delete data at partition boundaries. Note: Only supported for streaming tables. For more info see https://docs.databricks.com/structured-streaming/delta-lake.html#ignore-updates-and-deletes

required ignore_changes bool

ignoreChanges: re-process updates if files had to be rewritten in the source table due to a data changing operation such as UPDATE, MERGE INTO, DELETE (within partitions), or OVERWRITE. Unchanged rows may still be emitted, therefore your downstream consumers should be able to handle duplicates. Deletes are not propagated downstream. ignoreChanges subsumes ignoreDeletes. Therefore if you use ignoreChanges, your stream will not be disrupted by either deletions or updates to the source table.

required"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.columns","title":"columns class-attribute instance-attribute","text":"
columns: Optional[ListOfColumns] = Field(default=None, description=\"Columns to select from the table. One or many columns can be provided as strings. For example: `['col1', 'col2']`, `['col1']` or `'col1'` \")\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.filter_cond","title":"filter_cond class-attribute instance-attribute","text":"
filter_cond: Optional[Union[Column, str]] = Field(default=None, alias='filterCondition', description=\"Filter condition to apply to the dataframe. Filters can be provided by using Column or string expressions For example: `f.col('state') == 'Ohio'`, `state = 'Ohio'` or  `(col('col1') > 3) & (col('col2') < 9)`\")\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.ignore_changes","title":"ignore_changes class-attribute instance-attribute","text":"
ignore_changes: bool = Field(default=False, alias='ignoreChanges', description='ignoreChanges: re-process updates if files had to be rewritten in the source table due to a data changing operation such as UPDATE, MERGE INTO, DELETE (within partitions), or OVERWRITE. Unchanged rows may still be emitted, therefore your downstream consumers should be able to handle duplicates. Deletes are not propagated downstream. ignoreChanges subsumes ignoreDeletes. Therefore if you use ignoreChanges, your stream will not be disrupted by either deletions or updates to the source table.')\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.ignore_deletes","title":"ignore_deletes class-attribute instance-attribute","text":"
ignore_deletes: bool = Field(default=False, alias='ignoreDeletes', description='ignoreDeletes: Ignore transactions that delete data at partition boundaries. Note: Only supported for streaming tables. For more info see https://docs.databricks.com/structured-streaming/delta-lake.html#ignore-updates-and-deletes')\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.read_change_feed","title":"read_change_feed class-attribute instance-attribute","text":"
read_change_feed: bool = Field(default=False, alias='readChangeFeed', description=\"Change Data Feed (CDF) feature allows Delta tables to track row-level changes between versions of a Delta table. When enabled on a Delta table, the runtime records 'change events' for all the data written into the table. This includes the row data along with metadata indicating whether the specified row was inserted, deleted, or updated. See: https://docs.databricks.com/delta/delta-change-data-feed.html\")\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.reader","title":"reader property","text":"
reader: Union[DataStreamReader, DataFrameReader]\n

Return the reader for the DeltaTableReader based on the streaming attribute

"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.schema_tracking_location","title":"schema_tracking_location class-attribute instance-attribute","text":"
schema_tracking_location: Optional[str] = Field(default=None, alias='schemaTrackingLocation', description='schemaTrackingLocation: Track the location of source schema. Note: Recommend to enable Delta reader version: 3 and writer version: 7 for this option. For more info see https://docs.delta.io/latest/delta-column-mapping.html' + STREAMING_SCHEMA_WARNING)\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.skip_change_commits","title":"skip_change_commits class-attribute instance-attribute","text":"
skip_change_commits: bool = Field(default=False, alias='skipChangeCommits', description='skipChangeCommits: Skip processing of change commits. Note: Only supported for streaming tables. (not supported in Open Source Delta Implementation). Prefer using skipChangeCommits over ignoreDeletes and ignoreChanges starting DBR12.1 and above. For more info see https://docs.databricks.com/structured-streaming/delta-lake.html#skip-change-commits')\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.starting_timestamp","title":"starting_timestamp class-attribute instance-attribute","text":"
starting_timestamp: Optional[str] = Field(default=None, alias='startingTimestamp', description='startingTimestamp: The timestamp to start from. All table changes committed at or after the timestamp (inclusive) will be read by the streaming source. Either provide a timestamp string (e.g. 2019-01-01T00:00:00.000Z) or a date string (e.g. 2019-01-01)' + STREAMING_SCHEMA_WARNING)\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.starting_version","title":"starting_version class-attribute instance-attribute","text":"
starting_version: Optional[str] = Field(default=None, alias='startingVersion', description='startingVersion: The Delta Lake version to start from. All table changes starting from this version (inclusive) will be read by the streaming source. You can obtain the commit versions from the version column of the DESCRIBE HISTORY command output.' + STREAMING_SCHEMA_WARNING)\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.streaming","title":"streaming class-attribute instance-attribute","text":"
streaming: Optional[bool] = Field(default=False, description='Whether to read the table as a Stream or not')\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.table","title":"table class-attribute instance-attribute","text":"
table: Union[DeltaTableStep, str] = Field(default=..., description='The table to read')\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.temp_view_name","title":"temp_view_name property","text":"
temp_view_name\n

Get the temporary view name for the dataframe for SQL queries

"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.view","title":"view property","text":"
view\n

Create a temporary view of the dataframe for SQL queries

"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/readers/delta.py
def execute(self):\n    df = self.reader.table(self.table.table_name)\n    if self.filter_cond is not None:\n        df = df.filter(f.expr(self.filter_cond) if isinstance(self.filter_cond, str) else self.filter_cond)\n    if self.columns is not None:\n        df = df.select(*self.columns)\n    self.output.df = df\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.get_options","title":"get_options","text":"
get_options() -> Dict[str, Any]\n

Get the options for the DeltaTableReader based on the streaming attribute

Source code in src/koheesio/spark/readers/delta.py
def get_options(self) -> Dict[str, Any]:\n    \"\"\"Get the options for the DeltaTableReader based on the `streaming` attribute\"\"\"\n    options = {\n        # Enable Change Data Feed (CDF) feature\n        \"readChangeFeed\": self.read_change_feed,\n        # Initial position, one of:\n        \"startingVersion\": self.starting_version,\n        \"startingTimestamp\": self.starting_timestamp,\n    }\n\n    # Streaming only options\n    if self.streaming:\n        options = {\n            **options,\n            # Ignore updates and deletes, one of:\n            \"ignoreDeletes\": self.ignore_deletes,\n            \"ignoreChanges\": self.ignore_changes,\n            \"skipChangeCommits\": self.skip_change_commits,\n            \"schemaTrackingLocation\": self.schema_tracking_location,\n        }\n    # Batch only options\n    else:\n        pass  # there are none... for now :)\n\n    def normalize(v: Union[str, bool]):\n        \"\"\"normalize values\"\"\"\n        # True becomes \"true\", False becomes \"false\"\n        v = str(v).lower() if isinstance(v, bool) else v\n        return v\n\n    # Any options with `value == None` are filtered out\n    return {k: normalize(v) for k, v in options.items() if v is not None}\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.set_temp_view_name","title":"set_temp_view_name","text":"
set_temp_view_name()\n

Set a temporary view name for the dataframe for SQL queries

Source code in src/koheesio/spark/readers/delta.py
@model_validator(mode=\"after\")\ndef set_temp_view_name(self):\n    \"\"\"Set a temporary view name for the dataframe for SQL queries\"\"\"\n    table_name = self.table.table\n    vw_name = get_random_string(prefix=f\"tmp_{table_name}\")\n    self.__temp_view_name__ = vw_name\n    return self\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableStreamReader","title":"koheesio.spark.readers.delta.DeltaTableStreamReader","text":"

Reads data from a Delta table and returns a DataStream

"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableStreamReader.streaming","title":"streaming class-attribute instance-attribute","text":"
streaming: bool = True\n
"},{"location":"api_reference/spark/readers/dummy.html","title":"Dummy","text":"

A simple DummyReader that returns a DataFrame with an id-column of the given range

"},{"location":"api_reference/spark/readers/dummy.html#koheesio.spark.readers.dummy.DummyReader","title":"koheesio.spark.readers.dummy.DummyReader","text":"

A simple DummyReader that returns a DataFrame with an id-column of the given range

Can be used in place of any Reader without having to read from a real source.

Wraps SparkSession.range(). The output DataFrame will have a single column named \"id\" of type Long, containing as many rows as the given range.

Parameters:

Name Type Description Default range int

How large to make the Dataframe

required Example
from koheesio.spark.readers.dummy import DummyReader\n\noutput_df = DummyReader(range=100).read()\n

output_df: Output DataFrame will have a single column named \"id\" of type Long containing 100 rows (0-99).

id 0 1 ... 99"},{"location":"api_reference/spark/readers/dummy.html#koheesio.spark.readers.dummy.DummyReader.range","title":"range class-attribute instance-attribute","text":"
range: int = Field(default=100, description='How large to make the Dataframe')\n
"},{"location":"api_reference/spark/readers/dummy.html#koheesio.spark.readers.dummy.DummyReader.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/readers/dummy.py
def execute(self):\n    self.output.df = self.spark.range(self.range)\n
"},{"location":"api_reference/spark/readers/file_loader.html","title":"File loader","text":"

Generic file Readers for different file formats.

Supported file formats:
  • CSV
  • Parquet
  • Avro
  • JSON
  • ORC
  • Text

Examples:

from koheesio.spark.readers import (\n    CsvReader,\n    ParquetReader,\n    AvroReader,\n    JsonReader,\n    OrcReader,\n)\n\ncsv_reader = CsvReader(path=\"path/to/file.csv\", header=True)\nparquet_reader = ParquetReader(path=\"path/to/file.parquet\")\navro_reader = AvroReader(path=\"path/to/file.avro\")\njson_reader = JsonReader(path=\"path/to/file.json\")\norc_reader = OrcReader(path=\"path/to/file.orc\")\n

For more information about the available options, see Spark's official documentation.

"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.AvroReader","title":"koheesio.spark.readers.file_loader.AvroReader","text":"

Reads an Avro file.

This class is a convenience class that sets the format field to FileFormat.avro.

Extra parameters can be passed to the reader using the extra_params attribute or as keyword arguments.

Example:

reader = AvroReader(path=\"path/to/file.avro\", mergeSchema=True)\n

Make sure to have the spark-avro package installed in your environment.

For more information about the available options, see the official documentation.

"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.AvroReader.format","title":"format class-attribute instance-attribute","text":"
format: FileFormat = avro\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.CsvReader","title":"koheesio.spark.readers.file_loader.CsvReader","text":"

Reads a CSV file.

This class is a convenience class that sets the format field to FileFormat.csv.

Extra parameters can be passed to the reader using the extra_params attribute or as keyword arguments.

Example:

reader = CsvReader(path=\"path/to/file.csv\", header=True)\n

For more information about the available options, see the official pyspark documentation and read about CSV data source.

Also see the data sources generic options.

"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.CsvReader.format","title":"format class-attribute instance-attribute","text":"
format: FileFormat = csv\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat","title":"koheesio.spark.readers.file_loader.FileFormat","text":"

Supported file formats.

This enum represents the supported file formats that can be used with the FileLoader class. The available file formats are:
  • csv: Comma-separated values format
  • parquet: Apache Parquet format
  • avro: Apache Avro format
  • json: JavaScript Object Notation format
  • orc: Apache ORC format
  • text: Plain text format

"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.avro","title":"avro class-attribute instance-attribute","text":"
avro = 'avro'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.csv","title":"csv class-attribute instance-attribute","text":"
csv = 'csv'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.json","title":"json class-attribute instance-attribute","text":"
json = 'json'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.orc","title":"orc class-attribute instance-attribute","text":"
orc = 'orc'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.parquet","title":"parquet class-attribute instance-attribute","text":"
parquet = 'parquet'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.text","title":"text class-attribute instance-attribute","text":"
text = 'text'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader","title":"koheesio.spark.readers.file_loader.FileLoader","text":"

Generic file reader.

Available file formats:
  • CSV
  • Parquet
  • Avro
  • JSON
  • ORC
  • Text (default)

Extra parameters can be passed to the reader using the extra_params attribute or as keyword arguments.

Example:

reader = FileLoader(path=\"path/to/textfile.txt\", format=\"text\", header=True, lineSep=\"\\n\")\n

For more information about the available options, see the official pyspark documentation (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.text.html) and read about the text data source (https://spark.apache.org/docs/latest/sql-data-sources-text.html).

Also see the data sources generic options (https://spark.apache.org/docs/3.5.0/sql-data-sources-generic-options.html).
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader.format","title":"format class-attribute instance-attribute","text":"
format: FileFormat = Field(default=text, description='File format to read')\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader.path","title":"path class-attribute instance-attribute","text":"
path: Union[Path, str] = Field(default=..., description='Path to the file to read')\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader.schema_","title":"schema_ class-attribute instance-attribute","text":"
schema_: Optional[Union[StructType, str]] = Field(default=None, description='Schema to use when reading the file', validate_default=False, alias='schema')\n
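
Because the field is aliased to schema, a schema can be provided under that name either as a DDL string or as a StructType; a minimal sketch with illustrative column names:

from pyspark.sql.types import IntegerType, StringType, StructField, StructType\nfrom koheesio.spark.readers.file_loader import FileLoader\n\n# DDL string schema (column names here are illustrative)\nreader = FileLoader(path=\"data.csv\", format=\"csv\", schema=\"name string, age int\", header=True)\n\n# or an explicit StructType\nschema = StructType([StructField(\"name\", StringType()), StructField(\"age\", IntegerType())])\nreader = FileLoader(path=\"data.csv\", format=\"csv\", schema=schema, header=True)\n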
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader.ensure_path_is_str","title":"ensure_path_is_str","text":"
ensure_path_is_str(v)\n

Ensure that the path is a string as required by Spark.

Source code in src/koheesio/spark/readers/file_loader.py
@field_validator(\"path\")\ndef ensure_path_is_str(cls, v):\n    \"\"\"Ensure that the path is a string as required by Spark.\"\"\"\n    if isinstance(v, Path):\n        return str(v.absolute().as_posix())\n    return v\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader.execute","title":"execute","text":"
execute()\n

Reads the file using the specified format, schema, while applying any extra parameters.

Source code in src/koheesio/spark/readers/file_loader.py
def execute(self):\n    \"\"\"Reads the file using the specified format, schema, while applying any extra parameters.\"\"\"\n    reader = self.spark.read.format(self.format)\n\n    if self.schema_:\n        reader.schema(self.schema_)\n\n    if self.extra_params:\n        reader = reader.options(**self.extra_params)\n\n    self.output.df = reader.load(self.path)\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.JsonReader","title":"koheesio.spark.readers.file_loader.JsonReader","text":"

Reads a JSON file.

This class is a convenience class that sets the format field to FileFormat.json.

Extra parameters can be passed to the reader using the extra_params attribute or as keyword arguments.

Example:

reader = JsonReader(path=\"path/to/file.json\", allowComments=True)\n

For more information about the available options, see the official pyspark documentation and read about JSON data source.

Also see the data sources generic options.

"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.JsonReader.format","title":"format class-attribute instance-attribute","text":"
format: FileFormat = json\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.OrcReader","title":"koheesio.spark.readers.file_loader.OrcReader","text":"

Reads an ORC file.

This class is a convenience class that sets the format field to FileFormat.orc.

Extra parameters can be passed to the reader using the extra_params attribute or as keyword arguments.

Example:

reader = OrcReader(path=\"path/to/file.orc\", mergeSchema=True)\n

For more information about the available options, see the official documentation and read about ORC data source.

Also see the data sources generic options.

"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.OrcReader.format","title":"format class-attribute instance-attribute","text":"
format: FileFormat = orc\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.ParquetReader","title":"koheesio.spark.readers.file_loader.ParquetReader","text":"

Reads a Parquet file.

This class is a convenience class that sets the format field to FileFormat.parquet.

Extra parameters can be passed to the reader using the extra_params attribute or as keyword arguments.

Example:

reader = ParquetReader(path=\"path/to/file.parquet\", mergeSchema=True)\n

For more information about the available options, see the official pyspark documentation and read about Parquet data source.

Also see the data sources generic options.

"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.ParquetReader.format","title":"format class-attribute instance-attribute","text":"
format: FileFormat = parquet\n
"},{"location":"api_reference/spark/readers/hana.html","title":"Hana","text":"

HANA reader.

"},{"location":"api_reference/spark/readers/hana.html#koheesio.spark.readers.hana.HanaReader","title":"koheesio.spark.readers.hana.HanaReader","text":"

Wrapper around JdbcReader for SAP HANA

Notes
  • Refer to JdbcReader for the list of all available parameters.
  • Refer to SAP HANA Client Interface Programming Reference docs for the list of all available connection string parameters: https://help.sap.com/docs/SAP_HANA_CLIENT/f1b440ded6144a54ada97ff95dac7adf/109397c2206a4ab2a5386d494f4cf75e.html
Example

Note: jars should be added to the Spark session manually. This class does not take care of that.

This example depends on the SAP HANA ngdbc JAR, e.g. ngdbc-2.5.49.

from koheesio.spark.readers.hana import HanaReader\njdbc_hana = HanaReader(\n    url=\"jdbc:sap://<domain_or_ip>:<port>/?<options>\",\n    user=\"YOUR_USERNAME\",\n    password=\"***\",\n    dbtable=\"schemaname.tablename\"\n)\ndf = jdbc_hana.read()\n
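
Since HanaReader wraps JdbcReader, a query can be supplied instead of dbtable (see the query parameter below); a hedged sketch with placeholder column names and option values:

jdbc_hana = HanaReader(\n    url=\"jdbc:sap://<domain_or_ip>:<port>/?<options>\",\n    user=\"YOUR_USERNAME\",\n    password=\"***\",\n    query=\"SELECT column_a, column_b FROM schemaname.tablename WHERE column_a > 10\",\n    options={\"fetchsize\": 5000, \"numPartitions\": 20},\n)\ndf = jdbc_hana.read()\n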

Parameters:

Name Type Description Default url str

JDBC connection string. Refer to SAP HANA docs for the list of all available connection string parameters. Example: jdbc:sap://<domain_or_ip>:<port>[/?<options>] required user str required password SecretStr required dbtable str

Database table name, also include schema name

required options Optional[Dict[str, Any]]

Extra options to pass to the SAP HANA JDBC driver. Refer to SAP HANA docs for the list of all available connection string parameters. Example: {\"fetchsize\": 2000, \"numPartitions\": 10}

required query Optional[str]

Query

required format str

The type of format to load. Defaults to 'jdbc'. Should not be changed.

required driver str

Driver name. Be aware that the driver jar needs to be passed to the task. Should not be changed.

required"},{"location":"api_reference/spark/readers/hana.html#koheesio.spark.readers.hana.HanaReader.driver","title":"driver class-attribute instance-attribute","text":"
driver: str = Field(default='com.sap.db.jdbc.Driver', description='Make sure that the necessary JARs are available in the cluster: ngdbc-2-x.x.x.x')\n
"},{"location":"api_reference/spark/readers/hana.html#koheesio.spark.readers.hana.HanaReader.options","title":"options class-attribute instance-attribute","text":"
options: Optional[Dict[str, Any]] = Field(default={'fetchsize': 2000, 'numPartitions': 10}, description='Extra options to pass to the SAP HANA JDBC driver')\n
"},{"location":"api_reference/spark/readers/jdbc.html","title":"Jdbc","text":"

Module for reading data from JDBC sources.

Classes:

Name Description JdbcReader

Reader for JDBC tables.

"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader","title":"koheesio.spark.readers.jdbc.JdbcReader","text":"

Reader for JDBC tables.

Wrapper around Spark's jdbc read format

Notes
  • Query has precedence over dbtable. If query and dbtable both are filled in, dbtable will be ignored!
  • Extra options to the spark reader can be passed through the options input. Refer to Spark documentation for details: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
  • Consider using fetchsize as one of the options, as it greatly increases the performance of the reader
  • Consider using numPartitions, partitionColumn, lowerBound, and upperBound together with a real or synthetic partitioning column, as this will improve the reader's performance

When implementing a JDBC reader, the get_options() method should be implemented. It should return a dict of options required for the specific JDBC driver, and it can be overridden in the child class. Additionally, the driver parameter should be set to the name of the JDBC driver. Be aware that the driver jar needs to be included in the Spark session; this class does not (and cannot) take care of that!

Example

Note: jars should be added to the Spark session manually. This class does not take care of that.

This example depends on the jar for MS SQL: https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/9.2.1.jre8/mssql-jdbc-9.2.1.jre8.jar

from koheesio.spark.readers.jdbc import JdbcReader\n\njdbc_mssql = JdbcReader(\n    driver=\"com.microsoft.sqlserver.jdbc.SQLServerDriver\",\n    url=\"jdbc:sqlserver://10.xxx.xxx.xxx:1433;databaseName=YOUR_DATABASE\",\n    user=\"YOUR_USERNAME\",\n    password=\"***\",\n    dbtable=\"schemaname.tablename\",\n    options={\"fetchsize\": 100},\n)\ndf = jdbc_mssql.read()\n
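
Building on the note about partitioned reads above, the partitioning options can be passed through options; the bounds and the partition column below are placeholders:

jdbc_mssql_partitioned = JdbcReader(\n    driver=\"com.microsoft.sqlserver.jdbc.SQLServerDriver\",\n    url=\"jdbc:sqlserver://10.xxx.xxx.xxx:1433;databaseName=YOUR_DATABASE\",\n    user=\"YOUR_USERNAME\",\n    password=\"***\",\n    dbtable=\"schemaname.tablename\",\n    options={\n        \"fetchsize\": 2000,\n        \"numPartitions\": 10,\n        \"partitionColumn\": \"id\",\n        \"lowerBound\": 0,\n        \"upperBound\": 1000000,\n    },\n)\ndf = jdbc_mssql_partitioned.read()\n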
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.dbtable","title":"dbtable class-attribute instance-attribute","text":"
dbtable: Optional[str] = Field(default=None, description='Database table name, also include schema name')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.driver","title":"driver class-attribute instance-attribute","text":"
driver: str = Field(default=..., description='Driver name. Be aware that the driver jar needs to be passed to the task')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.format","title":"format class-attribute instance-attribute","text":"
format: str = Field(default='jdbc', description=\"The type of format to load. Defaults to 'jdbc'.\")\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.options","title":"options class-attribute instance-attribute","text":"
options: Optional[Dict[str, Any]] = Field(default_factory=dict, description='Extra options to pass to spark reader')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.password","title":"password class-attribute instance-attribute","text":"
password: SecretStr = Field(default=..., description='Password belonging to the username')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.query","title":"query class-attribute instance-attribute","text":"
query: Optional[str] = Field(default=None, description='Query')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.url","title":"url class-attribute instance-attribute","text":"
url: str = Field(default=..., description='URL for the JDBC driver. Note, in some environments you need to use the IP Address instead of the hostname of the server.')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.user","title":"user class-attribute instance-attribute","text":"
user: str = Field(default=..., description='User to authenticate to the server')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.execute","title":"execute","text":"
execute()\n

Wrapper around Spark's jdbc read format

Source code in src/koheesio/spark/readers/jdbc.py
def execute(self):\n    \"\"\"Wrapper around Spark's jdbc read format\"\"\"\n\n    # Can't have both dbtable and query empty\n    if not self.dbtable and not self.query:\n        raise ValueError(\"Please do not leave dbtable and query both empty!\")\n\n    if self.query and self.dbtable:\n        self.log.info(\"Both 'query' and 'dbtable' are filled in, 'dbtable' will be ignored!\")\n\n    options = self.get_options()\n\n    if pw := self.password:\n        options[\"password\"] = pw.get_secret_value()\n\n    if query := self.query:\n        options[\"query\"] = query\n        self.log.info(f\"Executing query: {self.query}\")\n    else:\n        options[\"dbtable\"] = self.dbtable\n\n    self.output.df = self.spark.read.format(self.format).options(**options).load()\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.get_options","title":"get_options","text":"
get_options()\n

Dictionary of options required for the specific JDBC driver.

Note: override this method if the driver requires custom option names, e.g. Snowflake: sfUrl, sfUser, etc.

Source code in src/koheesio/spark/readers/jdbc.py
def get_options(self):\n    \"\"\"\n    Dictionary of options required for the specific JDBC driver.\n\n    Note: override this method if driver requires custom names, e.g. Snowflake: `sfUrl`, `sfUser`, etc.\n    \"\"\"\n    return {\n        \"driver\": self.driver,\n        \"url\": self.url,\n        \"user\": self.user,\n        \"password\": self.password,\n        **self.options,\n    }\n
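
As an illustration of the note above, a subclass could override get_options when a driver expects custom option names; this is a hypothetical sketch, not an actual Koheesio class, and the option names are made up:

class MyCustomJdbcReader(JdbcReader):\n    \"\"\"Hypothetical reader for a driver expecting custom option names.\"\"\"\n\n    def get_options(self):\n        return {\n            \"driver\": self.driver,\n            \"customUrl\": self.url,  # made-up option name\n            \"customUser\": self.user,  # made-up option name\n            \"password\": self.password,\n            **self.options,\n        }\n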
"},{"location":"api_reference/spark/readers/kafka.html","title":"Kafka","text":"

Module for KafkaReader and KafkaStreamReader.

"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader","title":"koheesio.spark.readers.kafka.KafkaReader","text":"

Reader for Kafka topics.

Wrapper around Spark's kafka read format. Supports both batch and streaming reads.

Parameters:

Name Type Description Default read_broker str

Kafka brokers to read from. Should be passed as a single string with multiple brokers passed in a comma separated list

required topic str

Kafka topic to consume.

required streaming Optional[bool]

Whether to read the kafka topic as a stream or not.

required params Optional[Dict[str, str]]

Arbitrary options to be applied when creating NSP Reader. If a user provides values for subscribe or kafka.bootstrap.servers, they will be ignored in favor of configuration passed through topic and read_broker respectively. Defaults to an empty dictionary.

required Notes
  • The read_broker and topic parameters are required.
  • The streaming parameter defaults to False.
  • The params parameter defaults to an empty dictionary. This parameter is also aliased as kafka_options.
  • Any extra kafka options can also be passed as key-word arguments; these will be merged with the params parameter
Example
from koheesio.spark.readers.kafka import KafkaReader\n\nkafka_reader = KafkaReader(\n    read_broker=\"kafka-broker-1:9092,kafka-broker-2:9092\",\n    topic=\"my-topic\",\n    streaming=True,\n    # extra kafka options can be passed as key-word arguments\n    startingOffsets=\"earliest\",\n)\n

In the example above, the KafkaReader will read from the my-topic Kafka topic, using the brokers kafka-broker-1:9092 and kafka-broker-2:9092. The reader will read the topic as a stream and will start reading from the earliest available offset.

The stream can be started by calling the read or execute method on the kafka_reader object.

Note: The KafkaStreamReader could be used in the example above to achieve the same result. streaming would default to True in that case and could be omitted from the parameters.
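
For a plain batch read, a hedged sketch along the same lines; leaving streaming at its default of False makes read() return a regular DataFrame, and startingOffsets / endingOffsets are standard Kafka batch options:

batch_reader = KafkaReader(\n    read_broker=\"kafka-broker-1:9092,kafka-broker-2:9092\",\n    topic=\"my-topic\",\n    startingOffsets=\"earliest\",\n    endingOffsets=\"latest\",\n)\ndf = batch_reader.read()\n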

See Also
  • Official Spark Documentation: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.batch_reader","title":"batch_reader property","text":"
batch_reader\n

Returns the Spark read object for batch processing.

"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.logged_option_keys","title":"logged_option_keys property","text":"
logged_option_keys\n

Keys that are allowed to be logged for the options.

"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.options","title":"options property","text":"
options\n

Merge fixed parameters with arbitrary options provided by user.

"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.params","title":"params class-attribute instance-attribute","text":"
params: Optional[Dict[str, str]] = Field(default_factory=dict, alias='kafka_options', description=\"Arbitrary options to be applied when creating NSP Reader. If a user provides values for 'subscribe' or 'kafka.bootstrap.servers', they will be ignored in favor of configuration passed through 'topic' and 'read_broker' respectively.\")\n
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.read_broker","title":"read_broker class-attribute instance-attribute","text":"
read_broker: str = Field(..., description='Kafka brokers to read from, should be passed as a single string with multiple brokers passed in a comma separated list')\n
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.reader","title":"reader property","text":"
reader\n

Returns the appropriate reader based on the streaming flag.

"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.stream_reader","title":"stream_reader property","text":"
stream_reader\n

Returns the Spark readStream object.

"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.streaming","title":"streaming class-attribute instance-attribute","text":"
streaming: Optional[bool] = Field(default=False, description='Whether to read the kafka topic as a stream or not. Defaults to False.')\n
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.topic","title":"topic class-attribute instance-attribute","text":"
topic: str = Field(default=..., description='Kafka topic to consume.')\n
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/readers/kafka.py
def execute(self):\n    applied_options = {k: v for k, v in self.options.items() if k in self.logged_option_keys}\n    self.log.debug(f\"Applying options {applied_options}\")\n\n    self.output.df = self.reader.format(\"kafka\").options(**self.options).load()\n
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaStreamReader","title":"koheesio.spark.readers.kafka.KafkaStreamReader","text":"

KafkaStreamReader is a KafkaReader that reads data as a stream

This class is identical to KafkaReader, with the streaming parameter defaulting to True.

"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaStreamReader.streaming","title":"streaming class-attribute instance-attribute","text":"
streaming: bool = True\n
"},{"location":"api_reference/spark/readers/memory.html","title":"Memory","text":"

Create Spark DataFrame directly from the data stored in a Python variable

"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.DataFormat","title":"koheesio.spark.readers.memory.DataFormat","text":"

Data formats supported by the InMemoryDataReader

"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.DataFormat.CSV","title":"CSV class-attribute instance-attribute","text":"
CSV = 'csv'\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.DataFormat.JSON","title":"JSON class-attribute instance-attribute","text":"
JSON = 'json'\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader","title":"koheesio.spark.readers.memory.InMemoryDataReader","text":"

Directly read data from a Python variable and convert it to a Spark DataFrame.

Read data that is stored in one of the supported formats (see DataFormat) directly from a Python variable and convert it to a Spark DataFrame. Use cases include converting the JSON output of an API into a DataFrame, or reading CSV data received via an API (e.g. the Box API).

The advantage of using this reader is that it allows reading the data directly from a Python variable, without the need to store it on disk. This can be useful when the data is small and does not need to be stored permanently.

Parameters:

Name Type Description Default data Union[str, list, dict, bytes]

Source data

required format DataFormat

File / data format

required schema_ Optional[StructType]

Schema that will be applied during the creation of Spark DataFrame

None params Optional[Dict[str, Any]]

Set of extra parameters that should be passed to the appropriate reader (csv / json). Optionally, the user can pass the parameters that are specific to the reader (e.g. multiLine for JSON reader) as key-word arguments. These will be merged with the params parameter.

dict Example
# Read CSV data from a string\ndf1 = InMemoryDataReader(format=DataFormat.CSV, data='foo,bar\\nA,1\\nB,2')\n\n# Read JSON data from a string\ndf2 = InMemoryDataReader(format=DataFormat.JSON, data='{\"foo\": \"A\", \"bar\": 1}')\n\n# Read JSON data from a list of strings\ndf3 = InMemoryDataReader(format=DataFormat.JSON, data=['{\"foo\": \"A\", \"bar\": 1}', '{\"foo\": \"B\", \"bar\": 2}'])\n
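
A hedged usage sketch building on the example above; as with other readers, read() is assumed to execute the step and return the resulting DataFrame:

df = InMemoryDataReader(format=DataFormat.CSV, data='foo,bar\\nA,1\\nB,2').read()\ndf.show()\n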
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader.data","title":"data class-attribute instance-attribute","text":"
data: Union[str, list, dict, bytes] = Field(default=..., description='Source data')\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader.format","title":"format class-attribute instance-attribute","text":"
format: DataFormat = Field(default=..., description='File / data format')\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader.params","title":"params class-attribute instance-attribute","text":"
params: Optional[Dict[str, Any]] = Field(default_factory=dict, description='[Optional] Set of extra parameters that should be passed to the appropriate reader (csv / json)')\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader.schema_","title":"schema_ class-attribute instance-attribute","text":"
schema_: Optional[StructType] = Field(default=None, alias='schema', description='[Optional] Schema that will be applied during the creation of Spark DataFrame')\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader.execute","title":"execute","text":"
execute()\n

Execute method appropriate to the specific data format

Source code in src/koheesio/spark/readers/memory.py
def execute(self):\n    \"\"\"\n    Execute method appropriate to the specific data format\n    \"\"\"\n    _func = getattr(InMemoryDataReader, f\"_{self.format}\")\n    _df = partial(_func, self, self._rdd)()\n    self.output.df = _df\n
"},{"location":"api_reference/spark/readers/metastore.html","title":"Metastore","text":"

Create Spark DataFrame from table in Metastore

"},{"location":"api_reference/spark/readers/metastore.html#koheesio.spark.readers.metastore.MetastoreReader","title":"koheesio.spark.readers.metastore.MetastoreReader","text":"

Reader for tables/views from Spark Metastore

Parameters:

Name Type Description Default table str

Table name in spark metastore

required"},{"location":"api_reference/spark/readers/metastore.html#koheesio.spark.readers.metastore.MetastoreReader.table","title":"table class-attribute instance-attribute","text":"
table: str = Field(default=..., description='Table name in spark metastore')\n
"},{"location":"api_reference/spark/readers/metastore.html#koheesio.spark.readers.metastore.MetastoreReader.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/readers/metastore.py
def execute(self):\n    self.output.df = self.spark.table(self.table)\n
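
A minimal usage sketch (the table name is a placeholder):

from koheesio.spark.readers.metastore import MetastoreReader\n\ndf = MetastoreReader(table=\"my_database.my_table\").read()\n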
"},{"location":"api_reference/spark/readers/rest_api.html","title":"Rest api","text":"

This module provides the RestApiReader class for interacting with RESTful APIs.

The RestApiReader class is designed to fetch data from RESTful APIs and store the response in a DataFrame. It supports different transports, e.g. Paginated HTTP or Async HTTP. The main entry point is the execute method, which performs the transport.execute() call and provides the data from the API calls.

For more details on how to use this class and its methods, refer to the class docstring.

"},{"location":"api_reference/spark/readers/rest_api.html#koheesio.spark.readers.rest_api.RestApiReader","title":"koheesio.spark.readers.rest_api.RestApiReader","text":"

A reader class that executes an API call and stores the response in a DataFrame.

Parameters:

Name Type Description Default transport Union[InstanceOf[AsyncHttpGetStep], InstanceOf[HttpGetStep]]

The HTTP transport step.

required spark_schema Union[str, StructType, List[str], Tuple[str, ...], AtomicType]

The pyspark schema of the response.

required

Attributes:

Name Type Description transport Union[InstanceOf[AsyncHttpGetStep], InstanceOf[HttpGetStep]]

The HTTP transport step.

spark_schema Union[str, StructType, List[str], Tuple[str, ...], AtomicType]

The pyspark schema of the response.

Returns:

Type Description Output

The output of the reader, which includes the DataFrame.

Examples:

Here are some examples of how to use this class:

Example 1: Paginated Transport

import requests\nfrom requests.adapters import HTTPAdapter\nfrom urllib3 import Retry\n\nfrom koheesio.steps.http import PaginatedHtppGetStep\nfrom koheesio.spark.readers.rest_api import RestApiReader\n\nmax_retries = 3\nsession = requests.Session()\nretry_logic = Retry(total=max_retries, status_forcelist=[503])\nsession.mount(\"https://\", HTTPAdapter(max_retries=retry_logic))\nsession.mount(\"http://\", HTTPAdapter(max_retries=retry_logic))\n\ntransport = PaginatedHtppGetStep(\n    url=\"https://api.example.com/data?page={page}\",\n    paginate=True,\n    pages=3,\n    session=session,\n)\ntask = RestApiReader(transport=transport, spark_schema=\"id: int, page:int, value: string\")\ntask.execute()\nall_data = [row.asDict() for row in task.output.df.collect()]\n

Example 2: Async Transport

from aiohttp import ClientSession, TCPConnector\nfrom aiohttp_retry import ExponentialRetry\nfrom yarl import URL\n\nfrom koheesio.steps.asyncio.http import AsyncHttpGetStep\nfrom koheesio.spark.readers.rest_api import RestApiReader\n\nsession = ClientSession()\nurls = [URL(\"http://httpbin.org/get\"), URL(\"http://httpbin.org/get\")]\nretry_options = ExponentialRetry()\nconnector = TCPConnector(limit=10)\ntransport = AsyncHttpGetStep(\n    client_session=session,\n    url=urls,\n    retry_options=retry_options,\n    connector=connector,\n)\n\ntask = RestApiReader(transport=transport, spark_schema=\"id: int, page:int, value: string\")\ntask.execute()\nall_data = [row.asDict() for row in task.output.df.collect()]\n

"},{"location":"api_reference/spark/readers/rest_api.html#koheesio.spark.readers.rest_api.RestApiReader.spark_schema","title":"spark_schema class-attribute instance-attribute","text":"
spark_schema: Union[str, StructType, List[str], Tuple[str, ...], AtomicType] = Field(..., description='The pyspark schema of the response')\n
"},{"location":"api_reference/spark/readers/rest_api.html#koheesio.spark.readers.rest_api.RestApiReader.transport","title":"transport class-attribute instance-attribute","text":"
transport: Union[InstanceOf[AsyncHttpGetStep], InstanceOf[HttpGetStep]] = Field(..., description='HTTP transport step', exclude=True)\n
"},{"location":"api_reference/spark/readers/rest_api.html#koheesio.spark.readers.rest_api.RestApiReader.execute","title":"execute","text":"
execute() -> Output\n

Executes the API call and stores the response in a DataFrame.

Returns:

Type Description Output

The output of the reader, which includes the DataFrame.

Source code in src/koheesio/spark/readers/rest_api.py
def execute(self) -> Reader.Output:\n    \"\"\"\n    Executes the API call and stores the response in a DataFrame.\n\n    Returns\n    -------\n    Reader.Output\n        The output of the reader, which includes the DataFrame.\n    \"\"\"\n    raw_data = self.transport.execute()\n\n    if isinstance(raw_data, HttpGetStep.Output):\n        data = raw_data.response_json\n    elif isinstance(raw_data, AsyncHttpGetStep.Output):\n        data = [d for d, _ in raw_data.responses_urls]  # type: ignore\n\n    if data:\n        self.output.df = self.spark.createDataFrame(data=data, schema=self.spark_schema)  # type: ignore\n
"},{"location":"api_reference/spark/readers/snowflake.html","title":"Snowflake","text":"

Module containing Snowflake reader classes.

This module contains classes for reading data from Snowflake. The classes are used to create a Spark DataFrame from a Snowflake table or a query.

Classes:

Name Description SnowflakeReader

Reader for Snowflake tables.

Query

Reader for Snowflake queries.

DbTableQuery

Reader for Snowflake queries that return a single row.

Notes

The classes are defined in the koheesio.steps.integrations.snowflake module; this module simply inherits from the classes defined there.

See Also
  • koheesio.spark.readers.Reader Base class for all Readers.
  • koheesio.steps.integrations.snowflake Module containing Snowflake classes.

More detailed class descriptions can be found in the class docstrings.

"},{"location":"api_reference/spark/readers/spark_sql_reader.html","title":"Spark sql reader","text":"

This module contains the SparkSqlReader class, which reads a Spark SQL-compliant query and returns the resulting DataFrame.

"},{"location":"api_reference/spark/readers/spark_sql_reader.html#koheesio.spark.readers.spark_sql_reader.SparkSqlReader","title":"koheesio.spark.readers.spark_sql_reader.SparkSqlReader","text":"

SparkSqlReader reads a Spark SQL-compliant query and returns the resulting DataFrame.

This SQL can originate from a string or a file and may contain placeholders (parameters) for templating. - Placeholders are identified with ${placeholder}. - Placeholders can be passed as explicit params (params) or as implicit params (kwargs).

Example

SQL script (example.sql):

SELECT id, id + 1 AS incremented_id, ${dynamic_column} AS extra_column\nFROM ${table_name}\n

Python code:

from koheesio.spark.readers import SparkSqlReader\n\nreader = SparkSqlReader(\n    sql_path=\"example.sql\",\n    # params can also be passed as kwargs\n    dynamic_column=\"name\",\n    table_name=\"my_table\",\n)\nreader.execute()\n

In this example, the SQL script is read from a file and the placeholders are replaced with the given params. The resulting SQL query is:

SELECT id, id + 1 AS incremented_id, name AS extra_column\nFROM my_table\n

The query is then executed and the resulting DataFrame is stored in the output.df attribute.
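
Since the SQL can also originate from a string (see the sql parameter below), an equivalent hedged sketch that passes the query and the placeholders explicitly:

from koheesio.spark.readers import SparkSqlReader\n\nreader = SparkSqlReader(\n    sql=\"SELECT id, id + 1 AS incremented_id, ${dynamic_column} AS extra_column FROM ${table_name}\",\n    params={\"dynamic_column\": \"name\", \"table_name\": \"my_table\"},\n)\nreader.execute()\ndf = reader.output.df\n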

Parameters:

Name Type Description Default sql_path str or Path

Path to a SQL file

required sql str

SQL query to execute

required params dict

Placeholders (parameters) for templating. These are identified with ${placeholder} in the SQL script.

required Notes

Any arbitrary kwargs passed to the class will be added to params.

"},{"location":"api_reference/spark/readers/spark_sql_reader.html#koheesio.spark.readers.spark_sql_reader.SparkSqlReader.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/readers/spark_sql_reader.py
def execute(self):\n    self.output.df = self.spark.sql(self.query)\n
"},{"location":"api_reference/spark/readers/teradata.html","title":"Teradata","text":"

Teradata reader.

"},{"location":"api_reference/spark/readers/teradata.html#koheesio.spark.readers.teradata.TeradataReader","title":"koheesio.spark.readers.teradata.TeradataReader","text":"

Wrapper around JdbcReader for Teradata.

Notes
  • Consider using synthetic partitioning column when using partitioned read: MOD(HASHBUCKET(HASHROW(<TABLE>.<COLUMN>)), <NUM_PARTITIONS>)
  • Relevant jars should be added to the Spark session manually. This class does not take care of that.
See Also
  • Refer to JdbcReader for the list of all available parameters.
  • Refer to Teradata docs for the list of all available connection string parameters: https://teradata-docs.s3.amazonaws.com/doc/connectivity/jdbc/reference/current/jdbcug_chapter_2.html#BABJIHBJ
Example

This example depends on the Teradata terajdbc4 JAR, e.g. terajdbc4-17.20.00.15. Keep in mind that older versions of the terajdbc4 driver also require the tdgssconfig JAR.

from koheesio.spark.readers.teradata import TeradataReader\n\ntd = TeradataReader(\n    url=\"jdbc:teradata://<domain_or_ip>/logmech=ldap,charset=utf8,database=<db>,type=fastexport, maybenull=on\",\n    user=\"YOUR_USERNAME\",\n    password=\"***\",\n    dbtable=\"schemaname.tablename\",\n)\n
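
Building on the note about synthetic partitioning above, a hedged sketch of a partitioned read; the table, column, and bounds are placeholders, and the synthetic expression is wired in via partitionColumn as suggested in the notes:

td_partitioned = TeradataReader(\n    url=\"jdbc:teradata://<domain_or_ip>/logmech=ldap,charset=utf8,database=<db>,type=fastexport, maybenull=on\",\n    user=\"YOUR_USERNAME\",\n    password=\"***\",\n    dbtable=\"schemaname.tablename\",\n    options={\n        \"fetchsize\": 2000,\n        \"numPartitions\": 10,\n        \"partitionColumn\": \"MOD(HASHBUCKET(HASHROW(schemaname.tablename.some_column)), 10)\",\n        \"lowerBound\": 0,\n        \"upperBound\": 10,\n    },\n)\ndf = td_partitioned.read()\n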

Parameters:

Name Type Description Default url str

JDBC connection string. Refer to Teradata docs for the list of all available connection string parameters. Example: jdbc:teradata://<domain_or_ip>/logmech=ldap,charset=utf8,database=<db>,type=fastexport, maybenull=on

required user str

Username

required password SecretStr

Password

required dbtable str

Database table name, also include schema name

required options Optional[Dict[str, Any]]

Extra options to pass to the Teradata JDBC driver. Refer to Teradata docs for the list of all available connection string parameters.

{\"fetchsize\": 2000, \"numPartitions\": 10} query Optional[str]

Query

None format str

The type of format to load. Defaults to 'jdbc'. Should not be changed.

required driver str

Driver name. Be aware that the driver jar needs to be passed to the task. Should not be changed.

required"},{"location":"api_reference/spark/readers/teradata.html#koheesio.spark.readers.teradata.TeradataReader.driver","title":"driver class-attribute instance-attribute","text":"
driver: str = Field('com.teradata.jdbc.TeraDriver', description='Make sure that the necessary JARs are available in the cluster: terajdbc4-x.x.x.x')\n
"},{"location":"api_reference/spark/readers/teradata.html#koheesio.spark.readers.teradata.TeradataReader.options","title":"options class-attribute instance-attribute","text":"
options: Optional[Dict[str, Any]] = Field({'fetchsize': 2000, 'numPartitions': 10}, description='Extra options to pass to the Teradata JDBC driver')\n
"},{"location":"api_reference/spark/readers/databricks/index.html","title":"Databricks","text":""},{"location":"api_reference/spark/readers/databricks/autoloader.html","title":"Autoloader","text":"

Read from a location using Databricks' autoloader

Autoloader can ingest JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE file formats.

"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader","title":"koheesio.spark.readers.databricks.autoloader.AutoLoader","text":"

Read from a location using Databricks' autoloader

Autoloader can ingest JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE file formats.

Notes

autoloader is a Spark Structured Streaming function!

Although most transformations are compatible with Spark Structured Streaming, not all of them are. As a result, be mindful of your downstream transformations.

Parameters:

Name Type Description Default format Union[str, AutoLoaderFormat]

The file format, used in cloudFiles.format. Autoloader supports JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE file formats.

required location str

The location where the files are located, used in cloudFiles.location

required schema_location str

The location for storing inferred schema and supporting schema evolution, used in cloudFiles.schemaLocation.

required options Optional[Dict[str, str]]

Extra inputs to provide to the autoloader. For a full list of inputs, see https://docs.databricks.com/ingestion/auto-loader/options.html

{} Example
from koheesio.spark.readers.databricks import AutoLoader, AutoLoaderFormat\n\nresult_df = AutoLoader(\n    format=AutoLoaderFormat.JSON,\n    location=\"some_s3_path\",\n    schema_location=\"other_s3_path\",\n    options={\"multiLine\": \"true\"},\n).read()\n
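
Because Autoloader returns a streaming DataFrame, the result is typically written with a streaming sink; a hedged sketch using plain Spark Structured Streaming (the Delta format, target path, and checkpoint location are assumptions):

query = (\n    result_df.writeStream.format(\"delta\")\n    .option(\"checkpointLocation\", \"some_s3_path/_checkpoint\")\n    .outputMode(\"append\")\n    .start(\"target_s3_path\")\n)\nquery.awaitTermination()\n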
See Also

Some other useful documentation:

  • autoloader: https://docs.databricks.com/ingestion/auto-loader/index.html
  • Spark Structured Streaming: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.format","title":"format class-attribute instance-attribute","text":"
format: Union[str, AutoLoaderFormat] = Field(default=..., description=__doc__)\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.location","title":"location class-attribute instance-attribute","text":"
location: str = Field(default=..., description='The location where the files are located, used in `cloudFiles.location`')\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.options","title":"options class-attribute instance-attribute","text":"
options: Optional[Dict[str, str]] = Field(default_factory=dict, description='Extra inputs to provide to the autoloader. For a full list of inputs, see https://docs.databricks.com/ingestion/auto-loader/options.html')\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.schema_location","title":"schema_location class-attribute instance-attribute","text":"
schema_location: str = Field(default=..., alias='schemaLocation', description='The location for storing inferred schema and supporting schema evolution, used in `cloudFiles.schemaLocation`.')\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.execute","title":"execute","text":"
execute()\n

Reads from the given location with the given options using Autoloader

Source code in src/koheesio/spark/readers/databricks/autoloader.py
def execute(self):\n    \"\"\"Reads from the given location with the given options using Autoloader\"\"\"\n    self.output.df = self.reader().load(self.location)\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.get_options","title":"get_options","text":"
get_options()\n

Get the options for the autoloader

Source code in src/koheesio/spark/readers/databricks/autoloader.py
def get_options(self):\n    \"\"\"Get the options for the autoloader\"\"\"\n    self.options.update(\n        {\n            \"cloudFiles.format\": self.format,\n            \"cloudFiles.schemaLocation\": self.schema_location,\n        }\n    )\n    return self.options\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.reader","title":"reader","text":"
reader()\n

Return the reader for the autoloader

Source code in src/koheesio/spark/readers/databricks/autoloader.py
def reader(self):\n    \"\"\"Return the reader for the autoloader\"\"\"\n    return self.spark.readStream.format(\"cloudFiles\").options(**self.get_options())\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.validate_format","title":"validate_format","text":"
validate_format(format_specified)\n

Validate format value

Source code in src/koheesio/spark/readers/databricks/autoloader.py
@field_validator(\"format\")\ndef validate_format(cls, format_specified):\n    \"\"\"Validate `format` value\"\"\"\n    if isinstance(format_specified, str):\n        if format_specified.upper() in [f.value.upper() for f in AutoLoaderFormat]:\n            format_specified = getattr(AutoLoaderFormat, format_specified.upper())\n    return str(format_specified.value)\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat","title":"koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat","text":"

The file format, used in cloudFiles.format. Autoloader supports JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE file formats.

"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.AVRO","title":"AVRO class-attribute instance-attribute","text":"
AVRO = 'avro'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.BINARYFILE","title":"BINARYFILE class-attribute instance-attribute","text":"
BINARYFILE = 'binaryfile'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.CSV","title":"CSV class-attribute instance-attribute","text":"
CSV = 'csv'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.JSON","title":"JSON class-attribute instance-attribute","text":"
JSON = 'json'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.ORC","title":"ORC class-attribute instance-attribute","text":"
ORC = 'orc'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.PARQUET","title":"PARQUET class-attribute instance-attribute","text":"
PARQUET = 'parquet'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.TEXT","title":"TEXT class-attribute instance-attribute","text":"
TEXT = 'text'\n
"},{"location":"api_reference/spark/transformations/index.html","title":"Transformations","text":"

This module contains the base classes for all transformations.

See class docstrings for more information.

References

For a comprehensive guide on the usage, examples, and additional features of Transformation classes, please refer to the reference/concepts/steps/transformations section of the Koheesio documentation.

Classes:

Name Description Transformation

Base class for all transformations

ColumnsTransformation

Extended Transformation class with a preset validator for handling column(s) data

ColumnsTransformationWithTarget

Extended ColumnsTransformation class with an additional target_column field

"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation","title":"koheesio.spark.transformations.ColumnsTransformation","text":"

Extended Transformation class with a preset validator for handling column(s) data with a standardized input for a single column or multiple columns.

Concept

A ColumnsTransformation is a Transformation with a standardized input for a single column or multiple columns:

  • columns are stored as a list
  • either a single string, or a list of strings can be passed to enter the columns
  • column and columns are aliases to one another - internally the name columns should be used though.

If more than one column is passed, the behavior of the class changes this way: - the transformation will be run in a loop against all the given columns

Configuring the ColumnsTransformation

The ColumnsTransformation class has a ColumnConfig class that can be used to configure the behavior of the class. This class has the following fields: - run_for_all_data_type allows running the transformation for all columns of a given type.

  • limit_data_type allows limiting the transformation to a specific data type.

  • data_type_strict_mode Toggles strict mode for data type validation. Will only work if limit_data_type is set.

Note that Data types need to be specified as a SparkDatatype enum.

See the docstrings of the ColumnConfig class for more information. See the SparkDatatype enum for a list of available data types.

Users should not have to interact with the ColumnConfig class directly.
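
A hedged sketch of how a subclass could set this configuration by overriding the inner ColumnConfig class (mirroring the class attributes shown further below); the SparkDatatype import path and member names are assumptions:

from pyspark.sql import functions as f\nfrom koheesio.steps.transformations import ColumnsTransformation\nfrom koheesio.spark.utils import SparkDatatype  # assumed import path\n\n\nclass AddOneToIntegers(ColumnsTransformation):\n    \"\"\"Runs against all integer columns when no columns are given.\"\"\"\n\n    class ColumnConfig(ColumnsTransformation.ColumnConfig):\n        run_for_all_data_type = [SparkDatatype.INTEGER]\n        limit_data_type = [SparkDatatype.INTEGER]\n\n    def execute(self):\n        df = self.df\n        for column in self.get_columns():\n            df = df.withColumn(column, f.col(column) + 1)\n        self.output.df = df\n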

Parameters:

Name Type Description Default columns

The column (or list of columns) to apply the transformation to. Alias: column

required Example
from pyspark.sql import functions as f\nfrom koheesio.steps.transformations import ColumnsTransformation\n\n\nclass AddOne(ColumnsTransformation):\n    def execute(self):\n        for column in self.get_columns():\n            self.output.df = self.df.withColumn(column, f.col(column) + 1)\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.columns","title":"columns class-attribute instance-attribute","text":"
columns: ListOfColumns = Field(default='', alias='column', description='The column (or list of columns) to apply the transformation to. Alias: column')\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.data_type_strict_mode_is_set","title":"data_type_strict_mode_is_set property","text":"
data_type_strict_mode_is_set: bool\n

Returns True if data_type_strict_mode is set

"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.limit_data_type_is_set","title":"limit_data_type_is_set property","text":"
limit_data_type_is_set: bool\n

Returns True if limit_data_type is set

"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.run_for_all_is_set","title":"run_for_all_is_set property","text":"
run_for_all_is_set: bool\n

Returns True if the transformation should be run for all columns of a given type

"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.ColumnConfig","title":"ColumnConfig","text":"

Koheesio ColumnsTransformation specific Config

Parameters:

Name Type Description Default run_for_all_data_type

allows running the transformation for all columns of a given type. A user can trigger this behavior by either omitting the columns parameter or by passing a single * as a column name. In both cases, the run_for_all_data_type will be used to determine the data type. The value should be passed as a SparkDatatype enum. (default: [None])

required limit_data_type

allows limiting the transformation to a specific data type. The value should be passed as a SparkDatatype enum. (default: [None])

required data_type_strict_mode

Toggles strict mode for data type validation. Will only work if limit_data_type is set. - when True, a ValueError will be raised if any column does not adhere to the limit_data_type - when False, a warning will be thrown and the column will be skipped instead (default: False)

required"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.ColumnConfig.data_type_strict_mode","title":"data_type_strict_mode class-attribute instance-attribute","text":"
data_type_strict_mode: bool = False\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type: Optional[List[SparkDatatype]] = [None]\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type: Optional[List[SparkDatatype]] = [None]\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.column_type_of_col","title":"column_type_of_col","text":"
column_type_of_col(col: Union[str, Column], df: Optional[DataFrame] = None, simple_return_mode: bool = True) -> Union[DataType, str]\n

Returns the dataType of a Column object as a string.

The Column object does not have a type attribute, so we have to ask the DataFrame its schema and find the type based on the column name. We retrieve the name of the column from the Column object by calling toString() from the JVM.

Examples:

input_df: | str_column | int_column | |------------|------------| | hello | 1 | | world | 2 |

# using the AddOne transformation from the example above\nadd_one = AddOne(\n    columns=[\"str_column\", \"int_column\"],\n    df=input_df,\n)\nadd_one.column_type_of_col(\"str_column\")  # returns \"string\"\nadd_one.column_type_of_col(\"int_column\")  # returns \"integer\"\n# returns IntegerType\nadd_one.column_type_of_col(\"int_column\", simple_return_mode=False)\n

Parameters:

Name Type Description Default col Union[str, Column]

The column to check the type of

required df Optional[DataFrame]

The DataFrame belonging to the column. If not provided, the DataFrame passed to the constructor will be used.

None simple_return_mode bool

If True, the return value will be a simple string. If False, the return value will be a SparkDatatype enum.

True

Returns:

Name Type Description datatype str

The type of the column as a string

Source code in src/koheesio/spark/transformations/__init__.py
def column_type_of_col(\n    self, col: Union[str, Column], df: Optional[DataFrame] = None, simple_return_mode: bool = True\n) -> Union[DataType, str]:\n    \"\"\"\n    Returns the dataType of a Column object as a string.\n\n    The Column object does not have a type attribute, so we have to ask the DataFrame its schema and find the type\n    based on the column name. We retrieve the name of the column from the Column object by calling toString() from\n    the JVM.\n\n    Examples\n    --------\n    __input_df:__\n    | str_column | int_column |\n    |------------|------------|\n    | hello      | 1          |\n    | world      | 2          |\n\n    ```python\n    # using the AddOne transformation from the example above\n    add_one = AddOne(\n        columns=[\"str_column\", \"int_column\"],\n        df=input_df,\n    )\n    add_one.column_type_of_col(\"str_column\")  # returns \"string\"\n    add_one.column_type_of_col(\"int_column\")  # returns \"integer\"\n    # returns IntegerType\n    add_one.column_type_of_col(\"int_column\", simple_return_mode=False)\n    ```\n\n    Parameters\n    ----------\n    col: Union[str, Column]\n        The column to check the type of\n\n    df: Optional[DataFrame]\n        The DataFrame belonging to the column. If not provided, the DataFrame passed to the constructor\n        will be used.\n\n    simple_return_mode: bool\n        If True, the return value will be a simple string. If False, the return value will be a SparkDatatype enum.\n\n    Returns\n    -------\n    datatype: str\n        The type of the column as a string\n    \"\"\"\n    df = df or self.df\n    if not df:\n        raise RuntimeError(\"No valid Dataframe was passed\")\n\n    if not isinstance(col, Column):\n        col = f.col(col)\n\n    # ask the JVM for the name of the column\n    # noinspection PyProtectedMember\n    col_name = col._jc.toString()\n\n    # In order to check the datatype of the column, we have to ask the DataFrame its schema\n    df_col = [c for c in df.schema if c.name == col_name][0]\n\n    if simple_return_mode:\n        return SparkDatatype(df_col.dataType.typeName()).value\n\n    return df_col.dataType\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.get_all_columns_of_specific_type","title":"get_all_columns_of_specific_type","text":"
get_all_columns_of_specific_type(data_type: Union[str, SparkDatatype]) -> List[str]\n

Get all columns from the dataframe of a given type

A DataFrame needs to be available in order to get the columns. If no DataFrame is available, a ValueError will be raised.

Note: only one data type can be passed to this method. If you want to get columns of multiple data types, you have to call this method multiple times.

Parameters:

Name Type Description Default data_type Union[str, SparkDatatype]

The data type to get the columns for

required

Returns:

Type Description List[str]

A list of column names of the given data type

Source code in src/koheesio/spark/transformations/__init__.py
def get_all_columns_of_specific_type(self, data_type: Union[str, SparkDatatype]) -> List[str]:\n    \"\"\"Get all columns from the dataframe of a given type\n\n    A DataFrame needs to be available in order to get the columns. If no DataFrame is available, a ValueError will\n    be raised.\n\n    Note: only one data type can be passed to this method. If you want to get columns of multiple data types, you\n    have to call this method multiple times.\n\n    Parameters\n    ----------\n    data_type: Union[str, SparkDatatype]\n        The data type to get the columns for\n\n    Returns\n    -------\n    List[str]\n        A list of column names of the given data type\n    \"\"\"\n    if not self.df:\n        raise ValueError(\"No dataframe available - cannot get columns\")\n\n    expected_data_type = (SparkDatatype.from_string(data_type) if isinstance(data_type, str) else data_type).value\n\n    columns_of_given_type: List[str] = [\n        col for col in self.df.columns if self.df.schema[col].dataType.typeName() == expected_data_type\n    ]\n    return columns_of_given_type\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.get_columns","title":"get_columns","text":"
get_columns() -> iter\n

Return an iterator of the columns

Source code in src/koheesio/spark/transformations/__init__.py
def get_columns(self) -> iter:\n    \"\"\"Return an iterator of the columns\"\"\"\n    # If `run_for_all_is_set` is True, we want to run the transformation for all columns of a given type\n    if self.run_for_all_is_set:\n        columns = []\n        for data_type in self.ColumnConfig.run_for_all_data_type:\n            columns += self.get_all_columns_of_specific_type(data_type)\n    else:\n        columns = self.columns\n\n    for column in columns:\n        if self.is_column_type_correct(column):\n            yield column\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.get_limit_data_types","title":"get_limit_data_types","text":"
get_limit_data_types()\n

Get the limit_data_type as a list of strings

Source code in src/koheesio/spark/transformations/__init__.py
def get_limit_data_types(self):\n    \"\"\"Get the limit_data_type as a list of strings\"\"\"\n    return [dt.value for dt in self.ColumnConfig.limit_data_type]\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.is_column_type_correct","title":"is_column_type_correct","text":"
is_column_type_correct(column)\n

Check if column type is correct and handle it if not, when limit_data_type is set

Source code in src/koheesio/spark/transformations/__init__.py
def is_column_type_correct(self, column):\n    \"\"\"Check if column type is correct and handle it if not, when limit_data_type is set\"\"\"\n    if not self.limit_data_type_is_set:\n        return True\n\n    if self.column_type_of_col(column) in (limit_data_types := self.get_limit_data_types()):\n        return True\n\n    # Raises a ValueError if the Column object is not of a given type and data_type_strict_mode is set\n    if self.data_type_strict_mode_is_set:\n        raise ValueError(\n            f\"Critical error: {column} is not of type {limit_data_types}. Exception is raised because \"\n            f\"`data_type_strict_mode` is set to True for {self.name}.\"\n        )\n\n    # Otherwise, throws a warning that the Column object is not of a given type\n    self.log.warning(f\"Column `{column}` is not of type `{limit_data_types}` and will be skipped.\")\n    return False\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.set_columns","title":"set_columns","text":"
set_columns(columns_value)\n

Validate columns through the columns configuration provided

Source code in src/koheesio/spark/transformations/__init__.py
@field_validator(\"columns\", mode=\"before\")\ndef set_columns(cls, columns_value):\n    \"\"\"Validate columns through the columns configuration provided\"\"\"\n    columns = columns_value\n    run_for_all_data_type = cls.ColumnConfig.run_for_all_data_type\n\n    if run_for_all_data_type and len(columns) == 0:\n        columns = [\"*\"]\n\n    if columns[0] == \"*\" and not run_for_all_data_type:\n        raise ValueError(\"Cannot use '*' as a column name when no run_for_all_data_type is set\")\n\n    return columns\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformationWithTarget","title":"koheesio.spark.transformations.ColumnsTransformationWithTarget","text":"

Extended ColumnsTransformation class with an additional target_column field

Using this class makes implementing Transformations significantly easier.

Concept

A ColumnsTransformationWithTarget is a ColumnsTransformation with an additional target_column field. This field can be used to store the result of the transformation in a new column.

If the target_column is not provided, the result will be stored in the source column.

If more than one column is passed, the behavior of the Class changes this way:

  • the transformation will be run in a loop against all the given columns
  • automatically handles the renaming of the columns when more than one column is passed
  • the target_column will be used as a suffix. Leaving this blank will result in the original columns being renamed

The func method should be implemented in the child class. This method should return the transformation that will be applied to the column(s). The execute method (already preset) will use the get_columns_with_target method to loop over all the columns and apply this function to transform the DataFrame.

Parameters:

Name Type Description Default columns ListOfColumns

The column (or list of columns) to apply the transformation to. Alias: column. If not provided, the run_for_all_data_type from the column configuration is used to determine which columns to operate on: the transformation is then run for all columns of the configured data type(s).

* target_column Optional[str]

The name of the column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this input will be used as a suffix instead.

None Example

Writing your own transformation using the ColumnsTransformationWithTarget class:

from pyspark.sql import Column\nfrom koheesio.steps.transformations import ColumnsTransformationWithTarget\n\n\nclass AddOneWithTarget(ColumnsTransformationWithTarget):\n    def func(self, col: Column):\n        return col + 1\n

In the above example, the func method is implemented to add 1 to the values of a given column.

In order to use this transformation, we can call the transform method:

from pyspark.sql import SparkSession\n\n# create a DataFrame with 3 rows\ndf = SparkSession.builder.getOrCreate().range(3)\n\noutput_df = AddOneWithTarget(column=\"id\", target_column=\"new_id\").transform(df)\n

The output_df will now contain the original DataFrame with an additional column called new_id with the values of id + 1.

output_df:

id  new_id
0   1
1   2
2   3

Note: The target_column will be used as a suffix when more than one column is given as source. Leaving this blank will result in the original columns being renamed.
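
To illustrate the suffix behavior, here is a minimal sketch reusing the AddOneWithTarget class from above (the column names a and b are hypothetical):

# df is assumed to have numeric columns \"a\" and \"b\" (hypothetical names)\noutput_df = AddOneWithTarget(columns=[\"a\", \"b\"], target_column=\"plus_one\").transform(df)\n# with more than one source column, \"plus_one\" acts as a suffix:\n# the results are stored in \"a_plus_one\" and \"b_plus_one\"\n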

"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformationWithTarget.target_column","title":"target_column class-attribute instance-attribute","text":"
target_column: Optional[str] = Field(default=None, alias='target_suffix', description='The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix')\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformationWithTarget.execute","title":"execute","text":"
execute()\n

Execute on a ColumnsTransformationWithTarget handles self.df (input) and sets self.output.df (output). This can be left unchanged, and hence should not be implemented in the child class.

Source code in src/koheesio/spark/transformations/__init__.py
def execute(self):\n    \"\"\"Execute on a ColumnsTransformationWithTarget handles self.df (input) and set self.output.df (output)\n    This can be left unchanged, and hence should not be implemented in the child class.\n    \"\"\"\n    df = self.df\n\n    for target_column, column in self.get_columns_with_target():\n        func = self.func  # select the applicable function\n        df = df.withColumn(\n            target_column,\n            func(f.col(column)),\n        )\n\n    self.output.df = df\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformationWithTarget.func","title":"func abstractmethod","text":"
func(column: Column) -> Column\n

The function that will be run on a single Column of the DataFrame

The func method should be implemented in the child class. This method should return the transformation that will be applied to the column(s). The execute method (already preset) will use the get_columns_with_target method to loop over all the columns and apply this function to transform the DataFrame.

Parameters:

Name Type Description Default column Column

The column to apply the transformation to

required

Returns:

Type Description Column

The transformed column

Source code in src/koheesio/spark/transformations/__init__.py
@abstractmethod\ndef func(self, column: Column) -> Column:\n    \"\"\"The function that will be run on a single Column of the DataFrame\n\n    The `func` method should be implemented in the child class. This method should return the transformation that\n    will be applied to the column(s). The execute method (already preset) will use the `get_columns_with_target`\n    method to loop over all the columns and apply this function to transform the DataFrame.\n\n    Parameters\n    ----------\n    column: Column\n        The column to apply the transformation to\n\n    Returns\n    -------\n    Column\n        The transformed column\n    \"\"\"\n    raise NotImplementedError\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformationWithTarget.get_columns_with_target","title":"get_columns_with_target","text":"
get_columns_with_target() -> iter\n

Return an iterator of the columns

Works just like get_columns from the ColumnsTransformation class, except that it also handles the target_column.

If more than one column is passed, the behavior of the Class changes this way:

  • the transformation will be run in a loop against all the given columns
  • the target_column will be used as a suffix. Leaving this blank will result in the original columns being renamed.

Returns:

Type Description iter

An iterator of tuples containing the target column name and the original column name

Source code in src/koheesio/spark/transformations/__init__.py
def get_columns_with_target(self) -> iter:\n    \"\"\"Return an iterator of the columns\n\n    Works just like in get_columns from the  ColumnsTransformation class except that it handles the `target_column`\n    as well.\n\n    If more than one column is passed, the behavior of the Class changes this way:\n    - the transformation will be run in a loop against all the given columns\n    - the target_column will be used as a suffix. Leaving this blank will result in the original columns being\n        renamed.\n\n    Returns\n    -------\n    iter\n        An iterator of tuples containing the target column name and the original column name\n    \"\"\"\n    columns = [*self.get_columns()]\n\n    for column in columns:\n        # ensures that we at least use the original column name\n        target_column = self.target_column or column\n\n        if len(columns) > 1:  # target_column becomes a suffix when more than 1 column is given\n            # dict.fromkeys is used to avoid duplicates in the name while maintaining order\n            _cols = [column, target_column]\n            target_column = \"_\".join(list(dict.fromkeys(_cols)))\n\n        yield target_column, column\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.Transformation","title":"koheesio.spark.transformations.Transformation","text":"

Base class for all transformations

Concept

A Transformation is a Step that takes a DataFrame as input and returns a DataFrame as output. The DataFrame is transformed based on the logic implemented in the execute method. Any additional parameters that are needed for the transformation can be passed to the constructor.

Parameters:

Name Type Description Default df

The DataFrame to apply the transformation to. If not provided, the DataFrame has to be passed to the transform method.

required Example
from koheesio.steps.transformations import Transformation\nfrom pyspark.sql import functions as f\n\n\nclass AddOne(Transformation):\n    def execute(self):\n        self.output.df = self.df.withColumn(\"new_column\", f.col(\"old_column\") + 1)\n

In the example above, the execute method is implemented to add 1 to the values of the old_column and store the result in a new column called new_column.

In order to use this transformation, we can call the transform method:

from pyspark.sql import SparkSession\n\n# create a DataFrame with 3 rows\ndf = SparkSession.builder.getOrCreate().range(3)\n\noutput_df = AddOne().transform(df)\n

The output_df will now contain the original DataFrame with an additional column called new_column with the values of old_column + 1.

output_df:

id  new_column
0   1
1   2
2   3
...

Alternatively, we can pass the DataFrame to the constructor and call the execute or transform method without any arguments:

output_df = AddOne(df).transform()\n# or\noutput_df = AddOne(df).execute().output.df\n

Note that the transform method was not implemented explicitly in the AddOne class. This is because the transform method is already implemented in the Transformation class. This means that all classes that inherit from the Transformation class will have the transform method available. Only the execute method needs to be implemented.
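
Any additional inputs a transformation needs can be declared as Pydantic fields on the class and passed through the constructor. A minimal sketch, assuming a DataFrame df with an old_column (the AddN class and its n field are illustrative, not part of Koheesio):

from pyspark.sql import functions as f\n\nfrom koheesio.steps.transformations import Transformation\n\n\nclass AddN(Transformation):\n    \"\"\"Illustrative transformation with an extra parameter `n`\"\"\"\n\n    n: int = 1\n\n    def execute(self):\n        self.output.df = self.df.withColumn(\"new_column\", f.col(\"old_column\") + self.n)\n\n\noutput_df = AddN(n=5).transform(df)\n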

"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.Transformation.df","title":"df class-attribute instance-attribute","text":"
df: Optional[DataFrame] = Field(default=None, description='The Spark DataFrame')\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.Transformation.execute","title":"execute abstractmethod","text":"
execute() -> Output\n

Execute on a Transformation should handle self.df (input) and set self.output.df (output)

This method should be implemented in the child class. The input DataFrame is available as self.df and the output DataFrame should be stored in self.output.df.

For example:

def execute(self):\n    self.output.df = self.df.withColumn(\"new_column\", f.col(\"old_column\") + 1)\n

The transform method will call this method and return the output DataFrame.

Source code in src/koheesio/spark/transformations/__init__.py
@abstractmethod\ndef execute(self) -> SparkStep.Output:\n    \"\"\"Execute on a Transformation should handle self.df (input) and set self.output.df (output)\n\n    This method should be implemented in the child class. The input DataFrame is available as `self.df` and the\n    output DataFrame should be stored in `self.output.df`.\n\n    For example:\n    ```python\n    def execute(self):\n        self.output.df = self.df.withColumn(\"new_column\", f.col(\"old_column\") + 1)\n    ```\n\n    The transform method will call this method and return the output DataFrame.\n    \"\"\"\n    # self.df  # input dataframe\n    # self.output.df # output dataframe\n    self.output.df = ...  # implement the transformation logic\n    raise NotImplementedError\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.Transformation.transform","title":"transform","text":"
transform(df: Optional[DataFrame] = None) -> DataFrame\n

Execute the transformation and return the output DataFrame

Note: when creating a child from this, don't implement this transform method. Instead, implement execute!

See Also

Transformation.execute

Parameters:

Name Type Description Default df Optional[DataFrame]

The DataFrame to apply the transformation to. If not provided, the DataFrame passed to the constructor will be used.

None

Returns:

Type Description DataFrame

The transformed DataFrame

Source code in src/koheesio/spark/transformations/__init__.py
def transform(self, df: Optional[DataFrame] = None) -> DataFrame:\n    \"\"\"Execute the transformation and return the output DataFrame\n\n    Note: when creating a child from this, don't implement this transform method. Instead, implement execute!\n\n    See Also\n    --------\n    `Transformation.execute`\n\n    Parameters\n    ----------\n    df: Optional[DataFrame]\n        The DataFrame to apply the transformation to. If not provided, the DataFrame passed to the constructor\n        will be used.\n\n    Returns\n    -------\n    DataFrame\n        The transformed DataFrame\n    \"\"\"\n    self.df = df or self.df\n    if not self.df:\n        raise RuntimeError(\"No valid Dataframe was passed\")\n    self.execute()\n    return self.output.df\n
"},{"location":"api_reference/spark/transformations/arrays.html","title":"Arrays","text":"

A collection of classes for performing various transformations on arrays in PySpark.

These transformations include operations such as removing duplicates, exploding arrays into separate rows, reversing the order of elements, sorting elements, removing certain values, and calculating aggregate statistics like minimum, maximum, sum, mean, and median.

Concept
  • Every transformation in this module is implemented as a class that inherits from the ArrayTransformation class.
  • The ArrayTransformation class is a subclass of ColumnsTransformationWithTarget
  • The ArrayTransformation class implements the func method, which is used to define the transformation logic.
  • The func method takes a column as input and returns a Column object.
  • The Column object is a PySpark column that can be used to perform transformations on a DataFrame column.
  • The ArrayTransformation limits the data type of the transformation to array by setting the ColumnConfig class to run_for_all_data_type = [SparkDatatype.ARRAY] and limit_data_type = [SparkDatatype.ARRAY].
See Also
  • koheesio.spark.transformations Module containing all transformation classes.
  • koheesio.spark.transformations.ColumnsTransformationWithTarget Base class for all transformations that operate on columns and have a target column.
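
Putting the concept above together, a minimal sketch of a custom array transformation (the ArrayUpper class is illustrative and not part of this module):

from pyspark.sql import Column\nfrom pyspark.sql import functions as F\n\nfrom koheesio.spark.transformations.arrays import ArrayTransformation\n\n\nclass ArrayUpper(ArrayTransformation):\n    \"\"\"Uppercase every element of a string array (illustrative)\"\"\"\n\n    def func(self, column: Column) -> Column:\n        # `transform` applies the lambda to every element of the array column\n        return F.transform(column, lambda x: F.upper(x))\n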
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySortAsc","title":"koheesio.spark.transformations.arrays.ArraySortAsc module-attribute","text":"
ArraySortAsc = ArraySort\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayDistinct","title":"koheesio.spark.transformations.arrays.ArrayDistinct","text":"

Remove duplicates from array

Example
ArrayDistinct(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayDistinct.filter_empty","title":"filter_empty class-attribute instance-attribute","text":"
filter_empty: bool = Field(default=True, description='Remove null, nan, and empty values from array. Default is True.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayDistinct.func","title":"func","text":"
func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n    _fn = F.array_distinct(column)\n\n    # noinspection PyUnresolvedReferences\n    element_type = self.column_type_of_col(column, None, False).elementType\n    is_numeric = spark_data_type_is_numeric(element_type)\n\n    if self.filter_empty:\n        # Remove null values from array\n        if spark_minor_version >= 3.4:\n            # Run array_compact if spark version is 3.4 or higher\n            # https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.array_compact.html\n            # pylint: disable=E0611\n            from pyspark.sql.functions import array_compact as _array_compact\n\n            _fn = _array_compact(_fn)\n            # pylint: enable=E0611\n        else:\n            # Otherwise, remove null from array using array_except\n            _fn = F.array_except(_fn, F.array(F.lit(None)))\n\n        # Remove nan or empty values from array (depends on the type of the elements in array)\n        if is_numeric:\n            # Remove nan from array (float/int/numbers)\n            _fn = F.array_except(_fn, F.array(F.lit(float(\"nan\")).cast(element_type)))\n        else:\n            # Remove empty values from array (string/text)\n            _fn = F.array_except(_fn, F.array(F.lit(\"\"), F.lit(\" \")))\n\n    return _fn\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMax","title":"koheesio.spark.transformations.arrays.ArrayMax","text":"

Return the maximum value in the array

Example
ArrayMax(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMax.func","title":"func","text":"
func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n    # Call for processing of nan values\n    column = super().func(column)\n\n    return F.array_max(column)\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMean","title":"koheesio.spark.transformations.arrays.ArrayMean","text":"

Return the mean of the values in the array.

Note: Only numeric values are supported for calculating the mean.

Example
ArrayMean(column=\"array_column\", target_column=\"average\")\n
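
A short usage sketch (the DataFrame and the scores column are hypothetical):

from pyspark.sql import SparkSession\n\nfrom koheesio.spark.transformations.arrays import ArrayMean\n\nspark = SparkSession.builder.getOrCreate()\ndf = spark.createDataFrame([([1, 2, 3],)], [\"scores\"])\nout = ArrayMean(column=\"scores\", target_column=\"avg_score\").transform(df)\n# avg_score == 2.0; an empty array would yield 0 instead of a division-by-zero error\n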
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMean.func","title":"func","text":"
func(column: Column) -> Column\n

Calculate the mean of the values in the array

Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n    \"\"\"Calculate the mean of the values in the array\"\"\"\n    # raise an error if the array contains non-numeric elements\n    element_type = self.column_type_of_col(col=column, df=None, simple_return_mode=False).elementType\n\n    if not spark_data_type_is_numeric(element_type):\n        raise ValueError(\n            f\"{column = } contains non-numeric values. The array type is {element_type}. \"\n            f\"Only numeric values are supported for calculating a mean.\"\n        )\n\n    _sum = ArraySum.from_step(self).func(column)\n    # Call for processing of nan values\n    column = super().func(column)\n    _size = F.size(column)\n    # return 0 if the size of the array is 0 to avoid division by zero\n    return F.when(_size == 0, F.lit(0)).otherwise(_sum / _size)\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMedian","title":"koheesio.spark.transformations.arrays.ArrayMedian","text":"

Return the median of the values in the array.

The median is the middle value in a sorted, ascending or descending, list of numbers.

  • If the size of the array is even, the median is the average of the two middle numbers.
  • If the size of the array is odd, the median is the middle number.

Note: Only numeric values are supported for calculating the median.

Example
ArrayMedian(column=\"array_column\", target_column=\"median\")\n
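
For instance (a hedged sketch; the data and column names are illustrative), an odd-sized array yields its middle element, while an even-sized array yields the average of the two middle elements:

from pyspark.sql import SparkSession\n\nfrom koheesio.spark.transformations.arrays import ArrayMedian\n\nspark = SparkSession.builder.getOrCreate()\ndf = spark.createDataFrame([([1, 2, 3],), ([1, 2, 3, 4],)], [\"values\"])\nout = ArrayMedian(column=\"values\", target_column=\"median\").transform(df)\n# [1, 2, 3]    -> median 2.0 (middle element)\n# [1, 2, 3, 4] -> median 2.5 (average of the two middle elements)\n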
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMedian.func","title":"func","text":"
func(column: Column) -> Column\n

Calculate the median of the values in the array

Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n    \"\"\"Calculate the median of the values in the array\"\"\"\n    # Call for processing of nan values\n    column = super().func(column)\n\n    sorted_array = ArraySort.from_step(self).func(column)\n    _size: Column = F.size(sorted_array)\n\n    # Calculate the middle index. If the size is odd, PySpark discards the fractional part.\n    # Use floor function to ensure the result is an integer\n    middle: Column = F.floor((_size + 1) / 2).cast(\"int\")\n\n    # Define conditions\n    is_size_zero: Column = _size == 0\n    is_column_null: Column = column.isNull()\n    is_size_even: Column = _size % 2 == 0\n\n    # Define actions / responses\n    # For even-sized arrays, calculate the average of the two middle elements\n    average_of_middle_elements = (F.element_at(sorted_array, middle) + F.element_at(sorted_array, middle + 1)) / 2\n    # For odd-sized arrays, select the middle element\n    middle_element = F.element_at(sorted_array, middle)\n    # In case the array is empty, return either None or 0\n    none_value = F.lit(None)\n    zero_value = F.lit(0)\n\n    median = (\n        # Check if the size of the array is 0\n        F.when(\n            is_size_zero,\n            # If the size of the array is 0 and the column is null, return None\n            # If the size of the array is 0 and the column is not null, return 0\n            F.when(is_column_null, none_value).otherwise(zero_value),\n        ).otherwise(\n            # If the size of the array is not 0, calculate the median\n            F.when(is_size_even, average_of_middle_elements).otherwise(middle_element)\n        )\n    )\n\n    return median\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMin","title":"koheesio.spark.transformations.arrays.ArrayMin","text":"

Return the minimum value in the array

Example
ArrayMin(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMin.func","title":"func","text":"
func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n    return F.array_min(column)\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayNullNanProcess","title":"koheesio.spark.transformations.arrays.ArrayNullNanProcess","text":"

Process an array by removing NaN and/or NULL values from elements.

Parameters:

Name Type Description Default keep_nan bool

Whether to keep NaN values in the array. If set to True, the NaN values will be kept in the array.

False keep_null bool

Whether to keep NULL values in the array. If set to True, the NULL values will be kept in the array.

False

Returns:

Name Type Description column Column

The processed column with NaN and/or NULL values removed from elements.

Examples:

>>> from pyspark.sql import SparkSession\n>>> from pyspark.sql.types import ArrayType, FloatType, IntegerType, StructField, StructType\n>>> from koheesio.spark.transformations.arrays import ArrayNullNanProcess\n>>> input_data = [(1, [1.1, 2.1, 4.1, float(\"nan\")])]\n>>> input_schema = StructType([\n...     StructField(\"id\", IntegerType(), True),\n...     StructField(\"array_float\", ArrayType(FloatType()), True),\n... ])\n>>> spark = SparkSession.builder.getOrCreate()\n>>> df = spark.createDataFrame(input_data, schema=input_schema)\n>>> transformer = ArrayNullNanProcess(column=\"array_float\", keep_nan=False)\n>>> transformer.transform(df)\n>>> print(transformer.output.df.collect()[0].asDict()[\"array_float\"])\n[1.1, 2.1, 4.1]\n\n>>> input_data = [(1, [1.1, 2.1, 4.1, float(\"nan\")])]\n>>> input_schema = StructType([\n...     StructField(\"id\", IntegerType(), True),\n...     StructField(\"array_float\", ArrayType(FloatType()), True),\n... ])\n>>> spark = SparkSession.builder.getOrCreate()\n>>> df = spark.createDataFrame(input_data, schema=input_schema)\n>>> transformer = ArrayNullNanProcess(column=\"array_float\", keep_nan=True)\n>>> transformer.transform(df)\n>>> print(transformer.output.df.collect()[0].asDict()[\"array_float\"])\n[1.1, 2.1, 4.1, nan]\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayNullNanProcess.keep_nan","title":"keep_nan class-attribute instance-attribute","text":"
keep_nan: bool = Field(False, description='Whether to keep nan values in the array. Default is False. If set to True, the nan values will be kept in the array.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayNullNanProcess.keep_null","title":"keep_null class-attribute instance-attribute","text":"
keep_null: bool = Field(False, description='Whether to keep null values in the array. Default is False. If set to True, the null values will be kept in the array.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayNullNanProcess.func","title":"func","text":"
func(column: Column) -> Column\n

Process the given column by removing NaN and/or NULL values from elements.

Parameters:

column : Column The column to be processed.

Returns:

column : Column The processed column with NaN and/or NULL values removed from elements.

Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n    \"\"\"\n    Process the given column by removing NaN and/or NULL values from elements.\n\n    Parameters:\n    -----------\n    column : Column\n        The column to be processed.\n\n    Returns:\n    --------\n    column : Column\n        The processed column with NaN and/or NULL values removed from elements.\n    \"\"\"\n\n    def apply_logic(x: Column):\n        if self.keep_nan is False and self.keep_null is False:\n            logic = x.isNotNull() & ~F.isnan(x)\n        elif self.keep_nan is False:\n            logic = ~F.isnan(x)\n        elif self.keep_null is False:\n            logic = x.isNotNull()\n\n        return logic\n\n    if self.keep_nan is False or self.keep_null is False:\n        column = F.filter(column, apply_logic)\n\n    return column\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayRemove","title":"koheesio.spark.transformations.arrays.ArrayRemove","text":"

Remove a certain value from the array

Parameters:

Name Type Description Default keep_nan bool

Whether to keep NaN values in the array. If set to True, the NaN values will be kept in the array.

False keep_null bool

Whether to keep NULL values in the array. If set to True, the NULL values will be kept in the array.

False Example
ArrayRemove(column=\"array_column\", value=\"value_to_remove\")\n
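
The value can also be an iterable (list, tuple or set), in which case every listed value is removed. A short sketch (the column name is hypothetical):

# remove both 1 and 2 from every array in the column, then deduplicate the result\nArrayRemove(column=\"array_column\", value=[1, 2], make_distinct=True)\n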
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayRemove.make_distinct","title":"make_distinct class-attribute instance-attribute","text":"
make_distinct: bool = Field(default=False, description='Whether to remove duplicates from the array.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayRemove.value","title":"value class-attribute instance-attribute","text":"
value: Any = Field(default=None, description='The value to remove from the array.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayRemove.func","title":"func","text":"
func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n    value = self.value\n\n    column = super().func(column)\n\n    def filter_logic(x: Column, _val: Any):\n        if self.keep_null and self.keep_nan:\n            logic = (x != F.lit(_val)) | x.isNull() | F.isnan(x)\n        elif self.keep_null:\n            logic = (x != F.lit(_val)) | x.isNull()\n        elif self.keep_nan:\n            logic = (x != F.lit(_val)) | F.isnan(x)\n        else:\n            logic = x != F.lit(_val)\n\n        return logic\n\n    # Check if the value is iterable (i.e., a list, tuple, or set)\n    if isinstance(value, (list, tuple, set)):\n        result = reduce(lambda res, val: F.filter(res, lambda x: filter_logic(x, val)), value, column)\n    else:\n        # If the value is not iterable, simply remove the value from the array\n        result = F.filter(column, lambda x: filter_logic(x, value))\n\n    if self.make_distinct:\n        result = F.array_distinct(result)\n\n    return result\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayReverse","title":"koheesio.spark.transformations.arrays.ArrayReverse","text":"

Reverse the order of elements in the array

Example
ArrayReverse(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayReverse.func","title":"func","text":"
func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n    return F.reverse(column)\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySort","title":"koheesio.spark.transformations.arrays.ArraySort","text":"

Sort the elements in the array

By default, the elements are sorted in ascending order. To sort the elements in descending order, set the reverse parameter to True.

Example
ArraySort(column=\"array_column\")\n
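
To sort in descending order, set reverse to True, which is equivalent to using ArraySortDesc:

ArraySort(column=\"array_column\", reverse=True)\n# or, equivalently\nArraySortDesc(column=\"array_column\")\n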
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySort.reverse","title":"reverse class-attribute instance-attribute","text":"
reverse: bool = Field(default=False, description='Sort the elements in the array in a descending order. Default is False.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySort.func","title":"func","text":"
func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n    column = F.array_sort(column)\n    if self.reverse:\n        # Reverse the order of elements in the array\n        column = ArrayReverse.from_step(self).func(column)\n    return column\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySortDesc","title":"koheesio.spark.transformations.arrays.ArraySortDesc","text":"

Sort the elements in the array in descending order

"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySortDesc.reverse","title":"reverse class-attribute instance-attribute","text":"
reverse: bool = True\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySum","title":"koheesio.spark.transformations.arrays.ArraySum","text":"

Return the sum of the values in the array

Parameters:

Name Type Description Default keep_nan bool

Whether to keep NaN values in the array. If set to True, the NaN values will be kept in the array.

False keep_null bool

Whether to keep NULL values in the array. If set to True, the NULL values will be kept in the array.

False Example
ArraySum(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySum.func","title":"func","text":"
func(column: Column) -> Column\n

Using the aggregate function to sum the values in the array

Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n    \"\"\"Using the `aggregate` function to sum the values in the array\"\"\"\n    # raise an error if the array contains non-numeric elements\n    element_type = self.column_type_of_col(column, None, False).elementType\n    if not spark_data_type_is_numeric(element_type):\n        raise ValueError(\n            f\"{column = } contains non-numeric values. The array type is {element_type}. \"\n            f\"Only numeric values are supported for summing.\"\n        )\n\n    # remove na values from array.\n    column = super().func(column)\n\n    # Using the `aggregate` function to sum the values in the array by providing the initial value as 0.0 and the\n    # lambda function to add the elements together. Pyspark will automatically infer the type of the initial value\n    # making 0.0 valid for both integer and float types.\n    initial_value = F.lit(0.0)\n    return F.aggregate(column, initial_value, lambda accumulator, x: accumulator + x)\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayTransformation","title":"koheesio.spark.transformations.arrays.ArrayTransformation","text":"

Base class for array transformations

"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayTransformation.ColumnConfig","title":"ColumnConfig","text":"

Set the data type of the Transformation to array

"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayTransformation.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type = [ARRAY]\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayTransformation.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type = [ARRAY]\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayTransformation.func","title":"func","text":"
func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n    raise NotImplementedError(\"This is an abstract class\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.Explode","title":"koheesio.spark.transformations.arrays.Explode","text":"

Explode the array into separate rows

Example
Explode(column=\"array_column\")\n
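
A short sketch combining the two fields documented below (the column name is hypothetical):

# drop duplicate elements before exploding, and drop rows whose array is null or empty\nExplode(column=\"array_column\", distinct=True, preserve_nulls=False)\n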
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.Explode.distinct","title":"distinct class-attribute instance-attribute","text":"
distinct: bool = Field(False, description='Remove duplicates from the exploded array. Default is False.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.Explode.preserve_nulls","title":"preserve_nulls class-attribute instance-attribute","text":"
preserve_nulls: bool = Field(True, description='Preserve rows with null values in the exploded array by using explode_outer instead of explode. Default is True.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.Explode.func","title":"func","text":"
func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n    if self.distinct:\n        column = ArrayDistinct.from_step(self).func(column)\n    return F.explode_outer(column) if self.preserve_nulls else F.explode(column)\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ExplodeDistinct","title":"koheesio.spark.transformations.arrays.ExplodeDistinct","text":"

Explode the array into separate rows while removing duplicates and empty values

Example
ExplodeDistinct(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ExplodeDistinct.distinct","title":"distinct class-attribute instance-attribute","text":"
distinct: bool = True\n
"},{"location":"api_reference/spark/transformations/camel_to_snake.html","title":"Camel to snake","text":"

Class for converting DataFrame column names from camel case to snake case.

"},{"location":"api_reference/spark/transformations/camel_to_snake.html#koheesio.spark.transformations.camel_to_snake.camel_to_snake_re","title":"koheesio.spark.transformations.camel_to_snake.camel_to_snake_re module-attribute","text":"
camel_to_snake_re = compile('([a-z0-9])([A-Z])')\n
"},{"location":"api_reference/spark/transformations/camel_to_snake.html#koheesio.spark.transformations.camel_to_snake.CamelToSnakeTransformation","title":"koheesio.spark.transformations.camel_to_snake.CamelToSnakeTransformation","text":"

Converts column names from camel case to snake case

Parameters:

Name Type Description Default columns Optional[ListOfColumns]

The column or columns to convert. If no columns are specified, all columns will be converted. A list of columns or a single column can be specified. For example: [\"column1\", \"column2\"] or \"column1\"

None Example

input_df:

camelCaseColumn  snake_case_column
...              ...
output_df = CamelToSnakeTransformation(column=\"camelCaseColumn\").transform(input_df)\n

output_df:

camel_case_column  snake_case_column
...                ...

In this example, the column camelCaseColumn is converted to camel_case_column.

Note: the data in the columns is not changed, only the column names.

"},{"location":"api_reference/spark/transformations/camel_to_snake.html#koheesio.spark.transformations.camel_to_snake.CamelToSnakeTransformation.columns","title":"columns class-attribute instance-attribute","text":"
columns: Optional[ListOfColumns] = Field(default='', alias='column', description=\"The column or columns to convert. If no columns are specified, all columns will be converted. A list of columns or a single column can be specified. For example: `['column1', 'column2']` or `'column1'` \")\n
"},{"location":"api_reference/spark/transformations/camel_to_snake.html#koheesio.spark.transformations.camel_to_snake.CamelToSnakeTransformation.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/transformations/camel_to_snake.py
def execute(self):\n    _df = self.df\n\n    # Prepare columns input:\n    columns = self.df.columns if self.columns == [\"*\"] else self.columns\n\n    for column in columns:\n        _df = _df.withColumnRenamed(column, convert_camel_to_snake(column))\n\n    self.output.df = _df\n
"},{"location":"api_reference/spark/transformations/camel_to_snake.html#koheesio.spark.transformations.camel_to_snake.convert_camel_to_snake","title":"koheesio.spark.transformations.camel_to_snake.convert_camel_to_snake","text":"
convert_camel_to_snake(name: str)\n

Converts a string from camelCase to snake_case.

Parameters:

name : str The string to be converted.

Returns:

str The converted string in snake_case.

Source code in src/koheesio/spark/transformations/camel_to_snake.py
def convert_camel_to_snake(name: str):\n    \"\"\"\n    Converts a string from camelCase to snake_case.\n\n    Parameters:\n    ----------\n    name : str\n        The string to be converted.\n\n    Returns:\n    --------\n    str\n        The converted string in snake_case.\n    \"\"\"\n    return camel_to_snake_re.sub(r\"\\1_\\2\", name).lower()\n
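
A quick illustration of the regex above (the input strings are arbitrary examples):

convert_camel_to_snake(\"myColumnName\")  # -> \"my_column_name\"\nconvert_camel_to_snake(\"column1Name\")  # -> \"column1_name\"\n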
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html","title":"Cast to datatype","text":"

Transformations to cast a column or set of columns to a given datatype.

Each of these has been vetted to throw warnings when wrong datatypes are passed (so that the job or pipeline does not fail with an error).

Furthermore, detailed tests have been added to ensure that types are actually compatible as prescribed.

Concept
  • One can use the CastToDataType class directly, or use one of the more specific subclasses.
  • Each subclass is compatible with 'run_for_all' - meaning that if no column or columns are specified, it will be run for all compatible data types.
  • Compatible Data types are specified in the ColumnConfig class of each subclass, and are documented in the docstring of each subclass.

See class docstrings for more information
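
A hedged sketch of the 'run_for_all' behavior described above (the DataFrame and column names are hypothetical):

from koheesio.spark.transformations.cast_to_datatype import CastToInteger\n\n# cast two explicitly named columns; with multiple source columns,\n# target_column acts as a suffix, yielding \"col_a_int\" and \"col_b_int\"\nCastToInteger(columns=[\"col_a\", \"col_b\"], target_column=\"int\").transform(df)\n\n# 'run_for_all': with columns set to \"*\", every column whose data type appears in\n# CastToInteger.ColumnConfig.run_for_all_data_type is cast instead\nCastToInteger(column=\"*\", target_column=\"int\").transform(df)\n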

Note

Dates, Arrays and Maps are not supported by this module.

  • for dates, use the koheesio.spark.transformations.date_time module
  • for arrays, use the koheesio.spark.transformations.arrays module

Classes:

  • CastToDatatype: Cast a column or set of columns to a given datatype
  • CastToByte: Cast to Byte (a.k.a. tinyint)
  • CastToShort: Cast to Short (a.k.a. smallint)
  • CastToInteger: Cast to Integer (a.k.a. int)
  • CastToLong: Cast to Long (a.k.a. bigint)
  • CastToFloat: Cast to Float (a.k.a. real)
  • CastToDouble: Cast to Double
  • CastToDecimal: Cast to Decimal (a.k.a. decimal, numeric, dec, BigDecimal)
  • CastToString: Cast to String
  • CastToBinary: Cast to Binary (a.k.a. byte array)
  • CastToBoolean: Cast to Boolean
  • CastToTimestamp: Cast to Timestamp

Note

The following parameters are common to all classes in this module:

Parameters:

Name Type Description Default columns ListOfColumns

Name of the source column(s). Alias: column

required target_column str

Name of the target column or alias if more than one column is specified. Alias: target_alias

required datatype str or SparkDatatype

Datatype to cast to. Choose from SparkDatatype Enum (only needs to be specified in CastToDatatype, all other classes have a fixed datatype)

required"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBinary","title":"koheesio.spark.transformations.cast_to_datatype.CastToBinary","text":"

Cast to Binary (a.k.a. byte array)

Unsupported datatypes:

Following casts are not supported and will raise an error in Spark:

  • float
  • double
  • decimal
  • boolean
  • timestamp
  • date
  • array<...>
  • map<...,...>
Supported datatypes:

Following casts are supported:

  • byte
  • short
  • integer
  • long
  • string

Koheesio will skip the transformation for these unless specific columns of these types are given as input:

  • void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBinary.datatype","title":"datatype class-attribute instance-attribute","text":"
datatype: Union[str, SparkDatatype] = BINARY\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBinary.ColumnConfig","title":"ColumnConfig","text":"

Set the data types that are compatible with the CastToBinary class.

"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBinary.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type = [*run_for_all_data_type, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBinary.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, STRING]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBoolean","title":"koheesio.spark.transformations.cast_to_datatype.CastToBoolean","text":"

Cast to Boolean

Unsupported datatypes:

Following casts are not supported

  • binary
  • array<...>
  • map<...,...>
Supported datatypes:

Following casts are supported:

  • byte
  • short
  • integer
  • long
  • float
  • double
  • decimal
  • timestamp

Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:

  • string converts to null
  • date converts to null
  • void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBoolean.datatype","title":"datatype class-attribute instance-attribute","text":"
datatype: Union[str, SparkDatatype] = BOOLEAN\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBoolean.ColumnConfig","title":"ColumnConfig","text":"

Set the data types that are compatible with the CastToBoolean class.

"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBoolean.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBoolean.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, TIMESTAMP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToByte","title":"koheesio.spark.transformations.cast_to_datatype.CastToByte","text":"

Cast to Byte (a.k.a. tinyint)

Represents 1-byte signed integer numbers. The range of numbers is from -128 to 127.

Unsupported datatypes:

Following casts are not supported and will raise an error in Spark:

  • binary
  • array<...>
  • map<...,...>
Supported datatypes:

Following casts are supported:

  • boolean
  • timestamp
  • decimal
  • double
  • float
  • long
  • integer
  • short

Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:

  • string converts to null
  • timestamp range of values too small for timestamp to have any meaning
  • date converts to null
  • void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToByte.datatype","title":"datatype class-attribute instance-attribute","text":"
datatype: Union[str, SparkDatatype] = BYTE\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToByte.ColumnConfig","title":"ColumnConfig","text":"

Set the data types that are compatible with the CastToByte class.

"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToByte.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type = [*run_for_all_data_type, STRING, TIMESTAMP, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToByte.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type = [SHORT, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, BOOLEAN]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDatatype","title":"koheesio.spark.transformations.cast_to_datatype.CastToDatatype","text":"

Cast a column or set of columns to a given datatype

Wrapper around pyspark.sql.Column.cast

Concept

This class acts as the base class for all the other CastTo* classes. It is compatible with 'run_for_all' - meaning that if no column or columns are specified, it will be run for all compatible data types.

Example

input_df:

c1  c2
1   2
3   4
output_df = CastToDatatype(\n    column=\"c1\",\n    datatype=\"string\",\n    target_alias=\"c1\",\n).transform(input_df)\n

output_df:

c1 c2 \"1\" 2 \"3\" 4

In the example above, the column c1 is cast to a string datatype. The column c2 is not affected.

Parameters:

Name Type Description Default columns ListOfColumns

Name of the source column(s). Alias: column

required datatype str or SparkDatatype

Datatype to cast to. Choose from SparkDatatype Enum

required target_column str

Name of the target column or alias if more than one column is specified. Alias: target_alias

required"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDatatype.datatype","title":"datatype class-attribute instance-attribute","text":"
datatype: Union[str, SparkDatatype] = Field(default=..., description='Datatype. Choose from SparkDatatype Enum')\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDatatype.func","title":"func","text":"
func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/cast_to_datatype.py
def func(self, column: Column) -> Column:\n    # This is to let the IDE explicitly know that the datatype is not a string, but a `SparkDatatype` Enum\n    datatype: SparkDatatype = self.datatype\n    return column.cast(datatype.spark_type())\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDatatype.validate_datatype","title":"validate_datatype","text":"
validate_datatype(datatype_value) -> SparkDatatype\n

Validate the datatype.

Source code in src/koheesio/spark/transformations/cast_to_datatype.py
@field_validator(\"datatype\")\ndef validate_datatype(cls, datatype_value) -> SparkDatatype:\n    \"\"\"Validate the datatype.\"\"\"\n    # handle string input\n    try:\n        if isinstance(datatype_value, str):\n            datatype_value: SparkDatatype = SparkDatatype.from_string(datatype_value)\n            return datatype_value\n\n        # and let SparkDatatype handle the rest\n        datatype_value: SparkDatatype = SparkDatatype.from_string(datatype_value.value)\n\n    except AttributeError as e:\n        raise AttributeError(f\"Invalid datatype: {datatype_value}\") from e\n\n    return datatype_value\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal","title":"koheesio.spark.transformations.cast_to_datatype.CastToDecimal","text":"

Cast to Decimal (a.k.a. decimal, numeric, dec, BigDecimal)

Represents arbitrary-precision signed decimal numbers. Backed internally by java.math.BigDecimal. A BigDecimal consists of an arbitrary precision integer unscaled value and a 32-bit integer scale.

The DecimalType must have fixed precision (the maximum total number of digits) and scale (the number of digits on the right of dot). For example, (5, 2) can support the value from [-999.99 to 999.99].

The precision can be up to 38; the scale must be less than or equal to the precision.

Spark sets the default precision and scale to (10, 0) when creating a DecimalType. However, when inferring schema from decimal.Decimal objects, it will be DecimalType(38, 18).

For compatibility reasons, Koheesio sets the default precision and scale to (38, 18) when creating a DecimalType.

Unsupported datatypes:

Following casts are not supported and will raise an error in Spark:

  • binary
  • array<...>
  • map<...,...>
Supported datatypes:

Following casts are supported:

  • byte
  • short
  • integer
  • long
  • float
  • double
  • boolean
  • timestamp
  • date
  • string
  • void
  • decimal Spark will convert existing decimals to null if the precision and scale don't fit the data

Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:

  • string converts to null
  • date converts to null
  • void skipped by default

Parameters:

Name Type Description Default columns ListOfColumns

Name of the source column(s). Alias: column

* target_column str

Name of the target column or alias if more than one column is specified. Alias: target_alias

required precision conint(gt=0, le=38)

the maximum (i.e. total) number of digits (default: 38). Must be > 0.

38 scale conint(ge=0, le=18)

the number of digits on right side of dot. (default: 18). Must be >= 0.

18"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.datatype","title":"datatype class-attribute instance-attribute","text":"
datatype: Union[str, SparkDatatype] = DECIMAL\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.precision","title":"precision class-attribute instance-attribute","text":"
precision: conint(gt=0, le=38) = Field(default=38, description='The maximum total number of digits (precision) of the decimal. Must be > 0. Default is 38')\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.scale","title":"scale class-attribute instance-attribute","text":"
scale: conint(ge=0, le=18) = Field(default=18, description='The number of digits to the right of the decimal point (scale). Must be >= 0. Default is 18')\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.ColumnConfig","title":"ColumnConfig","text":"

Set the data types that are compatible with the CastToDecimal class.

"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, BOOLEAN, TIMESTAMP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.func","title":"func","text":"
func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/cast_to_datatype.py
def func(self, column: Column) -> Column:\n    return column.cast(self.datatype.spark_type(precision=self.precision, scale=self.scale))\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.validate_scale_and_precisions","title":"validate_scale_and_precisions","text":"
validate_scale_and_precisions()\n

Validate the precision and scale values.

Source code in src/koheesio/spark/transformations/cast_to_datatype.py
@model_validator(mode=\"after\")\ndef validate_scale_and_precisions(self):\n    \"\"\"Validate the precision and scale values.\"\"\"\n    precision_value = self.precision\n    scale_value = self.scale\n\n    if scale_value == precision_value:\n        self.log.warning(\"scale and precision are equal, this will result in a null value\")\n    if scale_value > precision_value:\n        raise ValueError(\"scale must be < precision\")\n\n    return self\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDouble","title":"koheesio.spark.transformations.cast_to_datatype.CastToDouble","text":"

Cast to Double

Represents 8-byte double-precision floating point numbers. The range of numbers is from -1.7976931348623157E308 to 1.7976931348623157E308.

Unsupported datatypes:

Following casts are not supported and will raise an error in Spark:

  • binary
  • array<...>
  • map<...,...>
Supported datatypes:

Following casts are supported:

  • byte
  • short
  • integer
  • long
  • float
  • decimal
  • boolean
  • timestamp

Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:

  • string converts to null
  • date converts to null
  • void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDouble.datatype","title":"datatype class-attribute instance-attribute","text":"
datatype: Union[str, SparkDatatype] = DOUBLE\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDouble.ColumnConfig","title":"ColumnConfig","text":"

Set the data types that are compatible with the CastToDouble class.

"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDouble.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDouble.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, FLOAT, DECIMAL, BOOLEAN, TIMESTAMP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToFloat","title":"koheesio.spark.transformations.cast_to_datatype.CastToFloat","text":"

Cast to Float (a.k.a. real)

Represents 4-byte single-precision floating point numbers. The range of numbers is from -3.402823E38 to 3.402823E38.

Unsupported datatypes:

The following casts are not supported and will raise an error in Spark:

  • binary
  • array<...>
  • map<...,...>
Supported datatypes:

The following casts are supported:

  • byte
  • short
  • integer
  • long
  • double
  • decimal
  • boolean

Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:

  • timestamp: precision is lost (use CastToDouble instead)
  • string: converts to null
  • date: converts to null
  • void: skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToFloat.datatype","title":"datatype class-attribute instance-attribute","text":"
datatype: Union[str, SparkDatatype] = FLOAT\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToFloat.ColumnConfig","title":"ColumnConfig","text":"

Set the data types that are compatible with the CastToFloat class.

"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToFloat.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToFloat.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, DOUBLE, DECIMAL, BOOLEAN]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToInteger","title":"koheesio.spark.transformations.cast_to_datatype.CastToInteger","text":"

Cast to Integer (a.k.a. int)

Represents 4-byte signed integer numbers. The range of numbers is from -2147483648 to 2147483647.

Unsupported datatypes:

The following casts are not supported and will raise an error in Spark:

  • binary
  • array<...>
  • map<...,...>
Supported datatypes:

The following casts are supported:

  • byte
  • short
  • long
  • float
  • double
  • decimal
  • boolean
  • timestamp

Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:

  • string: converts to null
  • date: converts to null
  • void: skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToInteger.datatype","title":"datatype class-attribute instance-attribute","text":"
datatype: Union[str, SparkDatatype] = INTEGER\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToInteger.ColumnConfig","title":"ColumnConfig","text":"

Set the data types that are compatible with the CastToInteger class.

"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToInteger.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToInteger.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type = [BYTE, SHORT, LONG, FLOAT, DOUBLE, DECIMAL, BOOLEAN, TIMESTAMP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToLong","title":"koheesio.spark.transformations.cast_to_datatype.CastToLong","text":"

Cast to Long (a.k.a. bigint)

Represents 8-byte signed integer numbers. The range of numbers is from -9223372036854775808 to 9223372036854775807.

Unsupported datatypes:

The following casts are not supported and will raise an error in Spark:

  • binary
  • array<...>
  • map<...,...>
Supported datatypes:

The following casts are supported:

  • byte
  • short
  • long
  • float
  • double
  • decimal
  • boolean
  • timestamp

Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:

  • string: converts to null
  • date: converts to null
  • void: skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToLong.datatype","title":"datatype class-attribute instance-attribute","text":"
datatype: Union[str, SparkDatatype] = LONG\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToLong.ColumnConfig","title":"ColumnConfig","text":"

Set the data types that are compatible with the CastToLong class.

"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToLong.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToLong.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type = [BYTE, SHORT, INTEGER, FLOAT, DOUBLE, DECIMAL, BOOLEAN, TIMESTAMP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToShort","title":"koheesio.spark.transformations.cast_to_datatype.CastToShort","text":"

Cast to Short (a.k.a. smallint)

Represents 2-byte signed integer numbers. The range of numbers is from -32768 to 32767.

Unsupported datatypes:

The following casts are not supported and will raise an error in Spark:

  • binary
  • array<...>
  • map<...,...>
Supported datatypes:

The following casts are supported:

  • byte
  • integer
  • long
  • float
  • double
  • decimal
  • boolean

Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:

  • string: converts to null
  • timestamp: range of values too small for timestamp to have any meaning
  • date: converts to null
  • void: skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToShort.datatype","title":"datatype class-attribute instance-attribute","text":"
datatype: Union[str, SparkDatatype] = SHORT\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToShort.ColumnConfig","title":"ColumnConfig","text":"

Set the data types that are compatible with the CastToShort class.

"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToShort.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type = [*run_for_all_data_type, STRING, TIMESTAMP, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToShort.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type = [BYTE, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, BOOLEAN]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToString","title":"koheesio.spark.transformations.cast_to_datatype.CastToString","text":"

Cast to String

Supported datatypes:

The following casts are supported:

  • byte
  • short
  • integer
  • long
  • float
  • double
  • decimal
  • binary
  • boolean
  • timestamp
  • date
  • array
  • map
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToString.datatype","title":"datatype class-attribute instance-attribute","text":"
datatype: Union[str, SparkDatatype] = STRING\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToString.ColumnConfig","title":"ColumnConfig","text":"

Set the data types that are compatible with the CastToString class.

"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToString.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type = [*run_for_all_data_type, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToString.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, BINARY, BOOLEAN, TIMESTAMP, DATE, ARRAY, MAP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToTimestamp","title":"koheesio.spark.transformations.cast_to_datatype.CastToTimestamp","text":"

Cast to Timestamp

A numeric timestamp is the number of seconds since January 1, 1970 00:00:00.000000000 UTC. It is not advised to use this cast on small integer types, as their range of values is too small for a timestamp to have any meaning.

For more fine-grained control over the timestamp format, use the date_time module. This allows for parsing strings to timestamps and vice versa.

See Also
  • koheesio.spark.transformations.date_time
  • https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#timestamp-pattern
Unsupported datatypes:

The following casts are not supported:

  • binary
  • array<...>
  • map<...,...>
Supported datatypes:

The following casts are supported:

  • integer
  • long
  • float
  • double
  • decimal
  • date

Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:

  • boolean: range of values too small for timestamp to have any meaning
  • byte: range of values too small for timestamp to have any meaning
  • string: converts to null in most cases, use date_time module instead
  • short: range of values too small for timestamp to have any meaning
  • void: skipped by default
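
A minimal usage sketch, assuming a long column holding epoch seconds (both column names are hypothetical):

# interpret a column of epoch seconds as a timestamp\noutput_df = CastToTimestamp(column=\"epoch_seconds\", target_column=\"event_ts\").transform(input_df)\n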
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToTimestamp.datatype","title":"datatype class-attribute instance-attribute","text":"
datatype: Union[str, SparkDatatype] = TIMESTAMP\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToTimestamp.ColumnConfig","title":"ColumnConfig","text":"

Set the data types that are compatible with the CastToTimestamp class.

"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToTimestamp.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type = [*run_for_all_data_type, BOOLEAN, BYTE, SHORT, STRING, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToTimestamp.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type = [INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, DATE]\n
"},{"location":"api_reference/spark/transformations/drop_column.html","title":"Drop column","text":"

This module defines the DropColumn class, a subclass of ColumnsTransformation.

"},{"location":"api_reference/spark/transformations/drop_column.html#koheesio.spark.transformations.drop_column.DropColumn","title":"koheesio.spark.transformations.drop_column.DropColumn","text":"

Drop one or more columns

The DropColumn class is used to drop one or more columns from a PySpark DataFrame. It wraps the pyspark.sql.DataFrame.drop function and can handle either a single string or a list of strings as input.

If a specified column does not exist in the DataFrame, no error or warning is thrown, and all existing columns will remain.

Expected behavior
  • When the column does not exist, all columns will remain (no error or warning is thrown)
  • Either a single string, or a list of strings can be specified
Example

df:

| product             | amount | country |
|---------------------|--------|---------|
| Banana lemon orange | 1000   | USA     |
| Carrots Blueberries | 1500   | USA     |
| Beans               | 1600   | USA     |
output_df = DropColumn(column=\"product\").transform(df)\n

output_df:

| amount | country |
|--------|---------|
| 1000   | USA     |
| 1500   | USA     |
| 1600   | USA     |

In this example, the product column is dropped from the DataFrame df.

"},{"location":"api_reference/spark/transformations/drop_column.html#koheesio.spark.transformations.drop_column.DropColumn.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/transformations/drop_column.py
def execute(self):\n    self.log.info(f\"{self.column=}\")\n    self.output.df = self.df.drop(*self.columns)\n
"},{"location":"api_reference/spark/transformations/dummy.html","title":"Dummy","text":"

Dummy transformation for testing purposes.

"},{"location":"api_reference/spark/transformations/dummy.html#koheesio.spark.transformations.dummy.DummyTransformation","title":"koheesio.spark.transformations.dummy.DummyTransformation","text":"

Dummy transformation for testing purposes.

This transformation adds a new column hello to the DataFrame with the value world.

It is intended for testing purposes or for use in examples or reference documentation.

Example

input_df:

| id |
|----|
| 1  |
output_df = DummyTransformation().transform(input_df)\n

output_df:

| id | hello |
|----|-------|
| 1  | world |

In this example, the hello column is added to the DataFrame input_df.

"},{"location":"api_reference/spark/transformations/dummy.html#koheesio.spark.transformations.dummy.DummyTransformation.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/transformations/dummy.py
def execute(self):\n    self.output.df = self.df.withColumn(\"hello\", lit(\"world\"))\n
"},{"location":"api_reference/spark/transformations/get_item.html","title":"Get item","text":"

Transformation to wrap around the pyspark getItem function

"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem","title":"koheesio.spark.transformations.get_item.GetItem","text":"

Get item from list or map (dictionary)

Wrapper around pyspark.sql.functions.getItem

GetItem is strict about the data type of the column. If the column is not a list or a map, an error will be raised.

Note

Only MapType and ArrayType are supported.

Parameters:

Name Type Description Default columns Union[str, List[str]]

The column (or list of columns) to get the item from. Alias: column

required target_column Optional[str]

The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.

None key Union[int, str]

The key (or index) to get from the list or map. If the column is a list (ArrayType), this should be an integer. If the column is a dict (MapType), this should be a string. Alias: index

required Example"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem--example-with-list-arraytype","title":"Example with list (ArrayType)","text":"

By specifying an integer for the parameter \"key\", getItem knows to get the element at index n of a list (index starts at 0).

input_df:

| id | content   |
|----|-----------|
| 1  | [1, 2, 3] |
| 2  | [4, 5]    |
| 3  | [6]       |
| 4  | []        |
output_df = GetItem(\n    column=\"content\",\n    index=1,  # get the second element of the list\n    target_column=\"item\",\n).transform(input_df)\n

output_df:

| id | content   | item |
|----|-----------|------|
| 1  | [1, 2, 3] | 2    |
| 2  | [4, 5]    | 5    |
| 3  | [6]       | null |
| 4  | []        | null |
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem--example-with-a-dict-maptype","title":"Example with a dict (MapType)","text":"

input_df:

| id | content          |
|----|------------------|
| 1  | {key1 -> value1} |
| 2  | {key1 -> value2} |
| 3  | {key2 -> hello}  |
| 4  | {key2 -> world}  |

output_df = GetItem(\n    column=\"content\",\n    key=\"key2\",\n    target_column=\"item\",\n).transform(input_df)\n
As we request the key to be \"key2\", the first two rows will be null, because they do not have \"key2\".

output_df:

| id | content          | item  |
|----|------------------|-------|
| 1  | {key1 -> value1} | null  |
| 2  | {key1 -> value2} | null  |
| 3  | {key2 -> hello}  | hello |
| 4  | {key2 -> world}  | world |
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.key","title":"key class-attribute instance-attribute","text":"
key: Union[int, str] = Field(default=..., alias='index', description='The key (or index) to get from the list or map. If the column is a list (ArrayType), this should be an integer. If the column is a dict (MapType), this should be a string. Alias: index')\n
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.ColumnConfig","title":"ColumnConfig","text":"

Limit the data types to ArrayType and MapType.

"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.ColumnConfig.data_type_strict_mode","title":"data_type_strict_mode class-attribute instance-attribute","text":"
data_type_strict_mode = True\n
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type = run_for_all_data_type\n
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type = [ARRAY, MAP]\n
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.func","title":"func","text":"
func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/get_item.py
def func(self, column: Column) -> Column:\n    return get_item(column, self.key)\n
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.get_item","title":"koheesio.spark.transformations.get_item.get_item","text":"
get_item(column: Column, key: Union[str, int])\n

Wrapper around pyspark.sql.functions.getItem

Parameters:

Name Type Description Default column Column

The column to get the item from

required key Union[str, int]

The key (or index) to get from the list or map. If the column is a list (ArrayType), this should be an integer. If the column is a dict (MapType), this should be a string.

required

Returns:

Type Description Column

The column with the item

Source code in src/koheesio/spark/transformations/get_item.py
def get_item(column: Column, key: Union[str, int]):\n    \"\"\"\n    Wrapper around pyspark.sql.functions.getItem\n\n    Parameters\n    ----------\n    column : Column\n        The column to get the item from\n    key : Union[str, int]\n        The key (or index) to get from the list or map. If the column is a list (ArrayType), this should be an integer.\n        If the column is a dict (MapType), this should be a string.\n\n    Returns\n    -------\n    Column\n        The column with the item\n    \"\"\"\n    return column.getItem(key)\n
"},{"location":"api_reference/spark/transformations/hash.html","title":"Hash","text":"

Module for hashing data using SHA-2 family of hash functions

See the docstring of the Sha2Hash class for more information.

"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.HASH_ALGORITHM","title":"koheesio.spark.transformations.hash.HASH_ALGORITHM module-attribute","text":"
HASH_ALGORITHM = Literal[224, 256, 384, 512]\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.STRING","title":"koheesio.spark.transformations.hash.STRING module-attribute","text":"
STRING = STRING\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.Sha2Hash","title":"koheesio.spark.transformations.hash.Sha2Hash","text":"

Hash the value of one or more columns using the SHA-2 family of hash functions

Mild wrapper around pyspark.sql.functions.sha2

  • https://spark.apache.org/docs/3.3.2/api/python/reference/api/pyspark.sql.functions.sha2.html

Returns the hex string result of SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512).

Note

This function allows concatenating the values of multiple columns together prior to hashing.
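
For example, a minimal sketch that hashes two columns into a single SHA-256 column (the column names are hypothetical; the parameters are the ones documented below):

hashed_df = Sha2Hash(\n    columns=[\"first_name\", \"last_name\"],  # values are concatenated with the delimiter before hashing\n    delimiter=\"|\",\n    num_bits=256,\n    target_column=\"name_sha256\",\n).transform(df)\n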

Parameters:

Name Type Description Default columns Union[str, List[str]]

The column (or list of columns) to hash. Alias: column

required delimiter Optional[str]

Optional separator for the string that will eventually be hashed. Defaults to '|'

| num_bits Optional[HASH_ALGORITHM]

Algorithm to use for sha2 hash. Defaults to 256. Should be one of 224, 256, 384, 512

256 target_column str

The generated hash will be written to the column name specified here

required"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.Sha2Hash.delimiter","title":"delimiter class-attribute instance-attribute","text":"
delimiter: Optional[str] = Field(default='|', description=\"Optional separator for the string that will eventually be hashed. Defaults to '|'\")\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.Sha2Hash.num_bits","title":"num_bits class-attribute instance-attribute","text":"
num_bits: Optional[HASH_ALGORITHM] = Field(default=256, description='Algorithm to use for sha2 hash. Defaults to 256. Should be one of 224, 256, 384, 512')\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.Sha2Hash.target_column","title":"target_column class-attribute instance-attribute","text":"
target_column: str = Field(default=..., description='The generated hash will be written to the column name specified here')\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.Sha2Hash.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/transformations/hash.py
def execute(self):\n    columns = list(self.get_columns())\n    self.output.df = (\n        self.df.withColumn(\n            self.target_column, sha2_hash(columns=columns, delimiter=self.delimiter, num_bits=self.num_bits)\n        )\n        if columns\n        else self.df\n    )\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.sha2_hash","title":"koheesio.spark.transformations.hash.sha2_hash","text":"
sha2_hash(columns: List[str], delimiter: Optional[str] = '|', num_bits: Optional[HASH_ALGORITHM] = 256)\n

hash the value of 1 or more columns using SHA-2 family of hash functions

Mild wrapper around pyspark.sql.functions.sha2

  • https://spark.apache.org/docs/3.3.2/api/python/reference/api/pyspark.sql.functions.sha2.html

Returns the hex string result of SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512). This function allows concatenating the values of multiple columns together prior to hashing.

If a null is passed, the result will also be null.

Parameters:

Name Type Description Default columns List[str]

The columns to hash

required delimiter Optional[str]

Optional separator for the string that will eventually be hashed. Defaults to '|'

| num_bits Optional[HASH_ALGORITHM]

Algorithm to use for sha2 hash. Defaults to 256. Should be one of 224, 256, 384, 512

256 Source code in src/koheesio/spark/transformations/hash.py
def sha2_hash(columns: List[str], delimiter: Optional[str] = \"|\", num_bits: Optional[HASH_ALGORITHM] = 256):\n    \"\"\"\n    hash the value of 1 or more columns using SHA-2 family of hash functions\n\n    Mild wrapper around pyspark.sql.functions.sha2\n\n    - https://spark.apache.org/docs/3.3.2/api/python/reference/api/pyspark.sql.functions.sha2.html\n\n    Returns the hex string result of SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512).\n    This function allows concatenating the values of multiple columns together prior to hashing.\n\n    If a null is passed, the result will also be null.\n\n    Parameters\n    ----------\n    columns : List[str]\n        The columns to hash\n    delimiter : Optional[str], optional, default=|\n        Optional separator for the string that will eventually be hashed. Defaults to '|'\n    num_bits : Optional[HASH_ALGORITHM], optional, default=256\n        Algorithm to use for sha2 hash. Defaults to 256. Should be one of 224, 256, 384, 512\n    \"\"\"\n    # make sure all columns are of type pyspark.sql.Column and cast to string\n    _columns = []\n    for c in columns:\n        if isinstance(c, str):\n            c: Column = col(c)\n        _columns.append(c.cast(STRING.spark_type()))\n\n    # concatenate columns if more than 1 column is provided\n    if len(_columns) > 1:\n        column = concat_ws(delimiter, *_columns)\n    else:\n        column = _columns[0]\n\n    return sha2(column, num_bits)\n
"},{"location":"api_reference/spark/transformations/lookup.html","title":"Lookup","text":"

Lookup transformation for joining two dataframes together

Classes:

Name Description JoinMapping TargetColumn JoinType JoinHint DataframeLookup"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup","title":"koheesio.spark.transformations.lookup.DataframeLookup","text":"

Lookup transformation for joining two dataframes together

Parameters:

Name Type Description Default df DataFrame

The left Spark DataFrame

required other DataFrame

The right Spark DataFrame

required on List[JoinMapping] | JoinMapping

List of join mappings. If only one mapping is passed, it can be passed as a single object.

required targets List[TargetColumn] | TargetColumn

List of target columns. If only one target is passed, it can be passed as a single object.

required how JoinType

What type of join to perform. Defaults to left. See JoinType for more information.

required hint JoinHint

What type of join hint to use. Defaults to None. See JoinHint for more information.

required Example
from pyspark.sql import SparkSession\nfrom koheesio.spark.transformations.lookup import (\n    DataframeLookup,\n    JoinMapping,\n    TargetColumn,\n    JoinType,\n)\n\nspark = SparkSession.builder.getOrCreate()\n\n# create the dataframes\nleft_df = spark.createDataFrame([(1, \"A\"), (2, \"B\")], [\"id\", \"value\"])\nright_df = spark.createDataFrame([(1, \"A\"), (3, \"C\")], [\"id\", \"value\"])\n\n# perform the lookup\nlookup = DataframeLookup(\n    df=left_df,\n    other=right_df,\n    on=JoinMapping(source_column=\"id\", joined_column=\"id\"),\n    targets=TargetColumn(target_column=\"value\", target_column_alias=\"right_value\"),\n    how=JoinType.LEFT,\n)\n\noutput_df = lookup.transform()\n

output_df:

| id | value | right_value |
|----|-------|-------------|
| 1  | A     | A           |
| 2  | B     | null        |

In this example, the left_df and right_df dataframes are joined together using the id column. The value column from the right_df is aliased as right_value in the output dataframe.

"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.df","title":"df class-attribute instance-attribute","text":"
df: DataFrame = Field(default=None, description='The left Spark DataFrame')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.hint","title":"hint class-attribute instance-attribute","text":"
hint: Optional[JoinHint] = Field(default=None, description='What type of join hint to use. Defaults to None. ' + __doc__)\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.how","title":"how class-attribute instance-attribute","text":"
how: Optional[JoinType] = Field(default=LEFT, description='What type of join to perform. Defaults to left. ' + __doc__)\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.on","title":"on class-attribute instance-attribute","text":"
on: Union[List[JoinMapping], JoinMapping] = Field(default=..., alias='join_mapping', description='List of join mappings. If only one mapping is passed, it can be passed as a single object.')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.other","title":"other class-attribute instance-attribute","text":"
other: DataFrame = Field(default=None, description='The right Spark DataFrame')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.targets","title":"targets class-attribute instance-attribute","text":"
targets: Union[List[TargetColumn], TargetColumn] = Field(default=..., alias='target_columns', description='List of target columns. If only one target is passed, it can be passed as a single object.')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.Output","title":"Output","text":"

Output for the lookup transformation

"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.Output.left_df","title":"left_df class-attribute instance-attribute","text":"
left_df: DataFrame = Field(default=..., description='The left Spark DataFrame')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.Output.right_df","title":"right_df class-attribute instance-attribute","text":"
right_df: DataFrame = Field(default=..., description='The right Spark DataFrame')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.execute","title":"execute","text":"
execute() -> Output\n

Execute the lookup transformation

Source code in src/koheesio/spark/transformations/lookup.py
def execute(self) -> Output:\n    \"\"\"Execute the lookup transformation\"\"\"\n    # prepare the right dataframe\n    prepared_right_df = self.get_right_df().select(\n        *[join_mapping.column for join_mapping in self.on],\n        *[target.column for target in self.targets],\n    )\n    if self.hint:\n        prepared_right_df = prepared_right_df.hint(self.hint)\n\n    # generate the output\n    self.output.left_df = self.df\n    self.output.right_df = prepared_right_df\n    self.output.df = self.df.join(\n        prepared_right_df,\n        on=[join_mapping.source_column for join_mapping in self.on],\n        how=self.how,\n    )\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.get_right_df","title":"get_right_df","text":"
get_right_df() -> DataFrame\n

Get the right side dataframe

Source code in src/koheesio/spark/transformations/lookup.py
def get_right_df(self) -> DataFrame:\n    \"\"\"Get the right side dataframe\"\"\"\n    return self.other\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.set_list","title":"set_list","text":"
set_list(value)\n

Ensure that we can pass either a single object, or a list of objects

Source code in src/koheesio/spark/transformations/lookup.py
@field_validator(\"on\", \"targets\")\ndef set_list(cls, value):\n    \"\"\"Ensure that we can pass either a single object, or a list of objects\"\"\"\n    return [value] if not isinstance(value, list) else value\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinHint","title":"koheesio.spark.transformations.lookup.JoinHint","text":"

Supported join hints

"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinHint.BROADCAST","title":"BROADCAST class-attribute instance-attribute","text":"
BROADCAST = 'broadcast'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinHint.MERGE","title":"MERGE class-attribute instance-attribute","text":"
MERGE = 'merge'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinMapping","title":"koheesio.spark.transformations.lookup.JoinMapping","text":"

Mapping for joining two dataframes together

"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinMapping.column","title":"column property","text":"
column: Column\n

Get the join mapping as a pyspark.sql.Column object

"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinMapping.other_column","title":"other_column instance-attribute","text":"
other_column: str\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinMapping.source_column","title":"source_column instance-attribute","text":"
source_column: str\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType","title":"koheesio.spark.transformations.lookup.JoinType","text":"

Supported join types

"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.ANTI","title":"ANTI class-attribute instance-attribute","text":"
ANTI = 'anti'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.CROSS","title":"CROSS class-attribute instance-attribute","text":"
CROSS = 'cross'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.FULL","title":"FULL class-attribute instance-attribute","text":"
FULL = 'full'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.INNER","title":"INNER class-attribute instance-attribute","text":"
INNER = 'inner'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.LEFT","title":"LEFT class-attribute instance-attribute","text":"
LEFT = 'left'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.RIGHT","title":"RIGHT class-attribute instance-attribute","text":"
RIGHT = 'right'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.SEMI","title":"SEMI class-attribute instance-attribute","text":"
SEMI = 'semi'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.TargetColumn","title":"koheesio.spark.transformations.lookup.TargetColumn","text":"

Target column for the joined dataframe

"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.TargetColumn.column","title":"column property","text":"
column: Column\n

Get the target column as a pyspark.sql.Column object

"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.TargetColumn.target_column","title":"target_column instance-attribute","text":"
target_column: str\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.TargetColumn.target_column_alias","title":"target_column_alias instance-attribute","text":"
target_column_alias: str\n
"},{"location":"api_reference/spark/transformations/repartition.html","title":"Repartition","text":"

Repartition Transformation

"},{"location":"api_reference/spark/transformations/repartition.html#koheesio.spark.transformations.repartition.Repartition","title":"koheesio.spark.transformations.repartition.Repartition","text":"

Wrapper around DataFrame.repartition

With repartition, the number of partitions can be given as an optional value. If this is not provided, a default value is used. The default number of partitions is defined by the spark config 'spark.sql.shuffle.partitions', for which the default value is 200, and will never exceed the number of rows in the DataFrame (whichever value is lower).

If columns are omitted, the entire DataFrame is repartitioned without considering the particular values in the columns.

Parameters:

Name Type Description Default column Optional[Union[str, List[str]]]

Name of the source column(s). If omitted, the entire DataFrame is repartitioned without considering the particular values in the columns. Alias: columns

None num_partitions Optional[int]

The number of partitions to repartition to. If omitted, the default number of partitions is used as defined by the spark config 'spark.sql.shuffle.partitions'.

None Example
Repartition(column=[\"c1\", \"c2\"], num_partitions=3)  # results in 3 partitions\nRepartition(column=\"c1\", num_partitions=2)  # results in 2 partitions\nRepartition(column=[\"c1\", \"c2\"])  # results in <= 200 partitions\nRepartition(num_partitions=5)  # results in 5 partitions\n
"},{"location":"api_reference/spark/transformations/repartition.html#koheesio.spark.transformations.repartition.Repartition.columns","title":"columns class-attribute instance-attribute","text":"
columns: Optional[ListOfColumns] = Field(default='', alias='column', description='Name of the source column(s)')\n
"},{"location":"api_reference/spark/transformations/repartition.html#koheesio.spark.transformations.repartition.Repartition.numPartitions","title":"numPartitions class-attribute instance-attribute","text":"
numPartitions: Optional[int] = Field(default=None, alias='num_partitions', description=\"The number of partitions to repartition to. If omitted, the default number of partitions is used as defined by the spark config 'spark.sql.shuffle.partitions'.\")\n
"},{"location":"api_reference/spark/transformations/repartition.html#koheesio.spark.transformations.repartition.Repartition.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/transformations/repartition.py
def execute(self):\n    # Prepare columns input:\n    columns = self.df.columns if self.columns == [\"*\"] else self.columns\n    # Prepare repartition input:\n    #  num_partitions comes first, but if it is not provided it should not be included as None.\n    repartition_inputs = [i for i in [self.numPartitions, *columns] if i]\n    self.output.df = self.df.repartition(*repartition_inputs)\n
"},{"location":"api_reference/spark/transformations/replace.html","title":"Replace","text":"

Transformation to replace a particular value in a column with another one

"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace","title":"koheesio.spark.transformations.replace.Replace","text":"

Replace a particular value in a column with another one

Can handle empty strings (\"\") as well as NULL / None values.

Unsupported datatypes:

The following casts are not supported and will raise an error in Spark:

  • binary
  • boolean
  • array<...>
  • map<...,...>
Supported datatypes:

The following casts are supported:

  • byte
  • short
  • integer
  • long
  • float
  • double
  • decimal
  • timestamp
  • date
  • string
  • void: skipped by default

Any supported non-string datatype will be cast to string before the replacement is done.

Example

input_df:

| id | string |
|----|--------|
| 1  | hello  |
| 2  | world  |
| 3  |        |
output_df = Replace(\n    column=\"string\",\n    from_value=\"hello\",\n    to_value=\"programmer\",\n).transform(input_df)\n

output_df:

| id | string     |
|----|------------|
| 1  | programmer |
| 2  | world      |
| 3  |            |

In this example, the value \"hello\" in the column \"string\" is replaced with \"programmer\".

"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.from_value","title":"from_value class-attribute instance-attribute","text":"
from_value: Optional[str] = Field(default=None, alias='from', description=\"The original value that needs to be replaced. If no value is given, all 'null' values will be replaced with the to_value\")\n
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.to_value","title":"to_value class-attribute instance-attribute","text":"
to_value: str = Field(default=..., alias='to', description='The new value to replace this with')\n
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.ColumnConfig","title":"ColumnConfig","text":"

Column type configurations for the column to be replaced

"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type = [*run_for_all_data_type, VOID]\n
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, STRING, TIMESTAMP, DATE]\n
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.func","title":"func","text":"
func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/replace.py
def func(self, column: Column) -> Column:\n    return replace(column=column, from_value=self.from_value, to_value=self.to_value)\n
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.replace","title":"koheesio.spark.transformations.replace.replace","text":"
replace(column: Union[Column, str], to_value: str, from_value: Optional[str] = None)\n

Function to replace a particular value in a column with another one

Source code in src/koheesio/spark/transformations/replace.py
def replace(column: Union[Column, str], to_value: str, from_value: Optional[str] = None):\n    \"\"\"Function to replace a particular value in a column with another one\"\"\"\n    # make sure we have a Column object\n    if isinstance(column, str):\n        column = col(column)\n\n    if not from_value:\n        condition = column.isNull()\n    else:\n        condition = column == from_value\n\n    return when(condition, lit(to_value)).otherwise(column)\n
"},{"location":"api_reference/spark/transformations/row_number_dedup.html","title":"Row number dedup","text":"

This module contains the RowNumberDedup class, which performs a row_number deduplication operation on a DataFrame.

See the docstring of the RowNumberDedup class for more information.

"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup","title":"koheesio.spark.transformations.row_number_dedup.RowNumberDedup","text":"

A class used to perform a row_number deduplication operation on a DataFrame.

This class is a specialized transformation that extends the ColumnsTransformation class. It sorts the DataFrame based on the provided sort columns and assigns a row_number to each row. It then filters the DataFrame to keep only the row with row_number 1 for each group of duplicates. The row_number of each row can be stored in a specified target column or a default column named \"meta_row_number_column\". The class also provides an option to preserve meta columns (like the row_number column) in the output DataFrame.

Attributes:

Name Type Description columns list

List of columns to apply the transformation to. If a single '*' is passed as a column name or if the columns parameter is omitted, the transformation will be applied to all columns of the data types specified in run_for_all_data_type of the ColumnConfig. (inherited from ColumnsTransformation)

sort_columns list

List of columns that the DataFrame will be sorted by.

target_column (str, optional)

Column where the row_number of each row will be stored.

preserve_meta (bool, optional)

Flag that determines whether the meta columns should be kept in the output DataFrame.
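
A minimal usage sketch, assuming duplicates per \"id\" that should be resolved by keeping the most recent \"created_at\" (both column names are hypothetical; string sort columns are ordered descending):

from koheesio.spark.transformations.row_number_dedup import RowNumberDedup\n\n# keep one row per \"id\", preferring the highest \"created_at\"\ndeduped_df = RowNumberDedup(\n    column=\"id\",  # the group/partition columns\n    sort_column=\"created_at\",  # the orderBy columns\n).transform(df)\n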

"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.preserve_meta","title":"preserve_meta class-attribute instance-attribute","text":"
preserve_meta: bool = Field(default=False, description=\"If true, meta columns are kept in output dataframe. Defaults to 'False'\")\n
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.sort_columns","title":"sort_columns class-attribute instance-attribute","text":"
sort_columns: conlist(Union[str, Column], min_length=0) = Field(default_factory=list, alias='sort_column', description='List of orderBy columns. If only one column is passed, it can be passed as a single object.')\n
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.target_column","title":"target_column class-attribute instance-attribute","text":"
target_column: Optional[Union[str, Column]] = Field(default='meta_row_number_column', alias='target_suffix', description='The column to store the result in. If not provided, the result will be stored in the sourcecolumn. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix')\n
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.window_spec","title":"window_spec property","text":"
window_spec: WindowSpec\n

Builds a WindowSpec object based on the columns defined in the configuration.

The WindowSpec object is used to define a window frame over which functions are applied in Spark. This method partitions the data by the columns returned by the get_columns method and then orders the partitions by the columns specified in sort_columns.

Notes

The order of the columns in the WindowSpec object is preserved. If a column is passed as a string, it is converted to a Column object with DESC ordering.

Returns:

Type Description WindowSpec

A WindowSpec object that can be used to define a window frame in Spark.
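
For reference, a sketch of the equivalent window spec in plain PySpark, assuming group column \"id\" and sort column \"created_at\" (both hypothetical):

from pyspark.sql import Window\nfrom pyspark.sql.functions import col\n\n# partition by the transformation's columns, order by the sort columns in descending order\nwindow_spec = Window.partitionBy(\"id\").orderBy(col(\"created_at\").desc())\n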

"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.execute","title":"execute","text":"
execute() -> Output\n

Performs the row_number deduplication operation on the DataFrame.

This method sorts the DataFrame based on the provided sort columns, assigns a row_number to each row, and then filters the DataFrame to keep only the top-row_number row for each group of duplicates. The row_number of each row is stored in the target column. If preserve_meta is False, the method also drops the target column from the DataFrame.

Source code in src/koheesio/spark/transformations/row_number_dedup.py
def execute(self) -> RowNumberDedup.Output:\n    \"\"\"\n    Performs the row_number deduplication operation on the DataFrame.\n\n    This method sorts the DataFrame based on the provided sort columns, assigns a row_number to each row,\n    and then filters the DataFrame to keep only the top-row_number row for each group of duplicates.\n    The row_number of each row is stored in the target column. If preserve_meta is False,\n    the method also drops the target column from the DataFrame.\n    \"\"\"\n    df = self.df\n    window_spec = self.window_spec\n\n    # if target_column is a string, convert it to a Column object\n    if isinstance((target_column := self.target_column), str):\n        target_column = col(target_column)\n\n    # dedup the dataframe based on the window spec\n    df = df.withColumn(self.target_column, row_number().over(window_spec)).filter(target_column == 1).select(\"*\")\n\n    if not self.preserve_meta:\n        df = df.drop(target_column)\n\n    self.output.df = df\n
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.set_sort_columns","title":"set_sort_columns","text":"
set_sort_columns(columns_value)\n

Validates and optimizes the sort_columns parameter.

This method ensures that sort_columns is a list (or single object) of unique strings or Column objects. It removes any empty strings or None values from the list and deduplicates the columns.

Parameters:

Name Type Description Default columns_value Union[str, Column, List[Union[str, Column]]]

The value of the sort_columns parameter.

required

Returns:

Type Description List[Union[str, Column]]

The optimized and deduplicated list of sort columns.

Source code in src/koheesio/spark/transformations/row_number_dedup.py
@field_validator(\"sort_columns\", mode=\"before\")\ndef set_sort_columns(cls, columns_value):\n    \"\"\"\n    Validates and optimizes the sort_columns parameter.\n\n    This method ensures that sort_columns is a list (or single object) of unique strings or Column objects.\n    It removes any empty strings or None values from the list and deduplicates the columns.\n\n    Parameters\n    ----------\n    columns_value : Union[str, Column, List[Union[str, Column]]]\n        The value of the sort_columns parameter.\n\n    Returns\n    -------\n    List[Union[str, Column]]\n        The optimized and deduplicated list of sort columns.\n    \"\"\"\n    # Convert single string or Column object to a list\n    columns = [columns_value] if isinstance(columns_value, (str, Column)) else [*columns_value]\n\n    # Remove empty strings, None, etc.\n    columns = [c for c in columns if (isinstance(c, Column) and c is not None) or (isinstance(c, str) and c)]\n\n    dedup_columns = []\n    seen = set()\n\n    # Deduplicate the columns while preserving the order\n    for column in columns:\n        if str(column) not in seen:\n            dedup_columns.append(column)\n            seen.add(str(column))\n\n    return dedup_columns\n
"},{"location":"api_reference/spark/transformations/sql_transform.html","title":"Sql transform","text":"

SQL Transform module

SQL Transform module provides an easy interface to transform a dataframe using SQL. This SQL can originate from a string or a file and may contain placeholders for templating.

"},{"location":"api_reference/spark/transformations/sql_transform.html#koheesio.spark.transformations.sql_transform.SqlTransform","title":"koheesio.spark.transformations.sql_transform.SqlTransform","text":"

SQL Transform module provides an easy interface to transform a dataframe using SQL.

This SQL can originate from a string or a file and may contain placeholders (parameters) for templating.

  • Placeholders are identified with ${placeholder}.
  • Placeholders can be passed as explicit params (params) or as implicit params (kwargs).

Example sql script:

SELECT id, id + 1 AS incremented_id, ${dynamic_column} AS extra_column\nFROM ${table_name}\n
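
A minimal usage sketch of running the script above. The sql parameter name is an assumption based on the SQL base step; ${table_name} is filled in automatically with the temporary view that SqlTransform creates for the input DataFrame:

output_df = SqlTransform(\n    sql=\"SELECT id, id + 1 AS incremented_id, ${dynamic_column} AS extra_column FROM ${table_name}\",  # 'sql' field name is an assumption\n    dynamic_column=\"'hello'\",  # implicit param substituted into ${dynamic_column}\n).transform(input_df)\n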
"},{"location":"api_reference/spark/transformations/sql_transform.html#koheesio.spark.transformations.sql_transform.SqlTransform.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/transformations/sql_transform.py
def execute(self):\n    table_name = get_random_string(prefix=\"sql_transform\")\n    self.params = {**self.params, \"table_name\": table_name}\n\n    df = self.df\n    df.createOrReplaceTempView(table_name)\n    query = self.query\n\n    self.output.df = self.spark.sql(query)\n
"},{"location":"api_reference/spark/transformations/transform.html","title":"Transform","text":"

Transform module

Transform aims to provide an easy interface for calling transformations on a Spark DataFrame, where the transformation is a function that accepts a DataFrame (df) and any number of keyword args.

"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform","title":"koheesio.spark.transformations.transform.Transform","text":"
Transform(func: Callable, params: Dict = None, df: DataFrame = None, **kwargs)\n

Transform aims to provide an easy interface for calling transformations on a Spark DataFrame, where the transformation is a function that accepts a DataFrame (df) and any number of keyword args.

The implementation is inspired by and based upon: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.transform.html

Parameters:

Name Type Description Default func Callable

The function to be called on the DataFrame.

required params Dict

The keyword arguments to be passed to the function. Defaults to None. Alternatively, keyword arguments can be passed directly as keyword arguments - they will be merged with the params dictionary.

None Example Source code in src/koheesio/spark/transformations/transform.py
def __init__(self, func: Callable, params: Dict = None, df: DataFrame = None, **kwargs):\n    params = {**(params or {}), **kwargs}\n    super().__init__(func=func, params=params, df=df)\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform--a-function-compatible-with-transform","title":"a function compatible with Transform:","text":"
def some_func(df, a: str, b: str):\n    return df.withColumn(a, f.lit(b))\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform--verbose-style-input-in-transform","title":"verbose style input in Transform","text":"
Transform(func=some_func, params={\"a\": \"foo\", \"b\": \"bar\"})\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform--shortened-style-notation-easier-to-read","title":"shortened style notation (easier to read)","text":"
Transform(some_func, a=\"foo\", b=\"bar\")\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform--when-too-much-input-is-given-transform-will-ignore-extra-input","title":"when too much input is given, Transform will ignore extra input","text":"
Transform(\n    some_func,\n    a=\"foo\",\n    # ignored input\n    c=\"baz\",\n    title=42,\n    author=\"Adams\",\n    # order of params input should not matter\n    b=\"bar\",\n)\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform--using-the-from_func-classmethod","title":"using the from_func classmethod","text":"
SomeFunc = Transform.from_func(some_func, a=\"foo\")\nsome_func = SomeFunc(b=\"bar\")\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform.func","title":"func class-attribute instance-attribute","text":"
func: Callable = Field(default=None, description='The function to be called on the DataFrame.')\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform.execute","title":"execute","text":"
execute()\n

Call the function on the DataFrame with the given keyword arguments.

Source code in src/koheesio/spark/transformations/transform.py
def execute(self):\n    \"\"\"Call the function on the DataFrame with the given keyword arguments.\"\"\"\n    func, kwargs = get_args_for_func(self.func, self.params)\n    self.output.df = self.df.transform(func=func, **kwargs)\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform.from_func","title":"from_func classmethod","text":"
from_func(func: Callable, **kwargs) -> Callable[..., Transform]\n

Create a Transform class from a function. Useful for creating a new class with a different name.

This method uses the functools.partial function to create a new class with the given function and keyword arguments. This way you can pre-define some of the keyword arguments for the function that might be needed for the specific use case.

Example
CustomTransform = Transform.from_func(some_func, a=\"foo\")\nsome_func = CustomTransform(b=\"bar\")\n

In this example, CustomTransform is a Transform class with the function some_func and the keyword argument a set to \"foo\". When calling some_func(b=\"bar\"), the function some_func will be called with the keyword arguments a=\"foo\" and b=\"bar\".

Source code in src/koheesio/spark/transformations/transform.py
@classmethod\ndef from_func(cls, func: Callable, **kwargs) -> Callable[..., Transform]:\n    \"\"\"Create a Transform class from a function. Useful for creating a new class with a different name.\n\n    This method uses the `functools.partial` function to create a new class with the given function and keyword\n    arguments. This way you can pre-define some of the keyword arguments for the function that might be needed for\n    the specific use case.\n\n    Example\n    -------\n    ```python\n    CustomTransform = Transform.from_func(some_func, a=\"foo\")\n    some_func = CustomTransform(b=\"bar\")\n    ```\n\n    In this example, `CustomTransform` is a Transform class with the function `some_func` and the keyword argument\n    `a` set to \"foo\". When calling `some_func(b=\"bar\")`, the function `some_func` will be called with the keyword\n    arguments `a=\"foo\"` and `b=\"bar\"`.\n    \"\"\"\n    return partial(cls, func=func, **kwargs)\n
"},{"location":"api_reference/spark/transformations/uuid5.html","title":"Uuid5","text":"

Ability to generate UUID5 using native pyspark (no udf)

"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5","title":"koheesio.spark.transformations.uuid5.HashUUID5","text":"

Generate a UUID with the UUID5 algorithm

Spark does not provide an inbuilt API to generate a version 5 UUID, hence we have to use a custom implementation to provide this capability.

Prerequisites: this function has no side effects, but be aware that in most cases the expectation is that your data is clean (e.g. trimmed of leading and trailing spaces).

Concept

UUID5 is based on the SHA-1 hash of a namespace identifier (which is a UUID) and a name (which is a string). https://docs.python.org/3/library/uuid.html#uuid.uuid5

Based on https://github.com/MrPowers/quinn/pull/96, with the difference that since Spark 3.0.0 the OVERLAY function from ANSI SQL 2016 is available, which saves coding space and string allocation(s) compared to CONCAT + SUBSTRING.

For more info on OVERLAY, see: https://docs.databricks.com/sql/language-manual/functions/overlay.html

Example

Input is a DataFrame with two columns:

id string 1 hello 2 world 3

Input parameters:

  • source_columns = [\"id\", \"string\"]
  • target_column = \"uuid5\"

Result:

id string uuid5 1 hello f3e99bbd-85ae-5dc3-bf6e-cd0022a0ebe6 2 world b48e880f-c289-5c94-b51f-b9d21f9616c0 3 2193a99d-222e-5a0c-a7d6-48fbe78d2708

In code:

HashUUID5(source_columns=[\"id\", \"string\"], target_column=\"uuid5\").transform(input_df)\n

In this example, the id and string columns are concatenated and hashed using the UUID5 algorithm. The result is stored in the uuid5 column.
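
For illustration, the same value should be reproducible outside of Spark with Python's uuid module (a minimal sketch assuming the default namespace and the default '|' delimiter):

import uuid\n\n# \"1\" and \"hello\" joined with the default '|' delimiter, hashed against the default DNS namespace\nprint(uuid.uuid5(uuid.NAMESPACE_DNS, \"1|hello\"))  # expected to match the uuid5 value of the first row above\n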

"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.delimiter","title":"delimiter class-attribute instance-attribute","text":"
delimiter: Optional[str] = Field(default='|', description='Separator for the string that will eventually be hashed')\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.description","title":"description class-attribute instance-attribute","text":"
description: str = 'Generate a UUID with the UUID5 algorithm'\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.extra_string","title":"extra_string class-attribute instance-attribute","text":"
extra_string: Optional[str] = Field(default='', description='In case of collisions, one can pass an extra string to hash on.')\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.namespace","title":"namespace class-attribute instance-attribute","text":"
namespace: Optional[Union[str, UUID]] = Field(default='', description='Namespace DNS')\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.source_columns","title":"source_columns class-attribute instance-attribute","text":"
source_columns: ListOfColumns = Field(default=..., description=\"List of columns that should be hashed. Should contain the name of at least 1 column. A list of columns or a single column can be specified. For example: `['column1', 'column2']` or `'column1'`\")\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.target_column","title":"target_column class-attribute instance-attribute","text":"
target_column: str = Field(default=..., description='The generated UUID will be written to the column name specified here')\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.execute","title":"execute","text":"
execute() -> None\n
Source code in src/koheesio/spark/transformations/uuid5.py
def execute(self) -> None:\n    ns = f.lit(uuid5_namespace(self.namespace).bytes)\n    self.log.info(f\"UUID5 namespace '{ns}' derived from '{self.namespace}'\")\n    cols_to_hash = f.concat_ws(self.delimiter, *self.source_columns)\n    cols_to_hash = f.concat(f.lit(self.extra_string), cols_to_hash)\n    cols_to_hash = f.encode(cols_to_hash, \"utf-8\")\n    cols_to_hash = f.concat(ns, cols_to_hash)\n    source_columns_sha1 = f.sha1(cols_to_hash)\n    variant_part = f.substring(source_columns_sha1, 17, 4)\n    variant_part = f.conv(variant_part, 16, 2)\n    variant_part = f.lpad(variant_part, 16, \"0\")\n    variant_part = f.overlay(variant_part, f.lit(\"10\"), 1, 2)  # RFC 4122 variant.\n    variant_part = f.lower(f.conv(variant_part, 2, 16))\n    target_col_uuid = f.concat_ws(\n        \"-\",\n        f.substring(source_columns_sha1, 1, 8),\n        f.substring(source_columns_sha1, 9, 4),\n        f.concat(f.lit(\"5\"), f.substring(source_columns_sha1, 14, 3)),  # Set version.\n        variant_part,\n        f.substring(source_columns_sha1, 21, 12),\n    )\n    # Applying the transformation to the input df, storing the result in the column specified in `target_column`.\n    self.output.df = self.df.withColumn(self.target_column, target_col_uuid)\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.hash_uuid5","title":"koheesio.spark.transformations.uuid5.hash_uuid5","text":"
hash_uuid5(input_value: str, namespace: Optional[Union[str, UUID]] = '', extra_string: Optional[str] = '')\n

pure python implementation of HashUUID5

See: https://docs.python.org/3/library/uuid.html#uuid.uuid5

Parameters:

Name Type Description Default input_value str

value that will be hashed

required namespace Optional[str | UUID]

namespace DNS

'' extra_string Optional[str]

optional extra string that will be prepended to the input_value

''

Returns:

Type Description str

uuid.UUID (uuid5) cast to string

Source code in src/koheesio/spark/transformations/uuid5.py
def hash_uuid5(\n    input_value: str,\n    namespace: Optional[Union[str, uuid.UUID]] = \"\",\n    extra_string: Optional[str] = \"\",\n):\n    \"\"\"pure python implementation of HashUUID5\n\n    See: https://docs.python.org/3/library/uuid.html#uuid.uuid5\n\n    Parameters\n    ----------\n    input_value : str\n        value that will be hashed\n    namespace : Optional[str | uuid.UUID]\n        namespace DNS\n    extra_string : Optional[str]\n        optional extra string that will be prepended to the input_value\n\n    Returns\n    -------\n    str\n        uuid.UUID (uuid5) cast to string\n    \"\"\"\n    if not isinstance(namespace, uuid.UUID):\n        hashed_namespace = uuid5_namespace(namespace)\n    else:\n        hashed_namespace = namespace\n    return str(uuid.uuid5(hashed_namespace, (extra_string + input_value)))\n
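
A hedged usage sketch (the input values are illustrative, not taken from the example above):

# the namespace string is hashed into a namespace UUID first, then the value is hashed within it\nhash_uuid5(\"some value\", namespace=\"example.com\", extra_string=\"v1\")  # deterministic: same inputs always yield the same uuid5 string\n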
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.uuid5_namespace","title":"koheesio.spark.transformations.uuid5.uuid5_namespace","text":"
uuid5_namespace(ns: Optional[Union[str, UUID]]) -> UUID\n

Helper function used to provide a UUID5 hashed namespace based on the passed str

Parameters:

Name Type Description Default ns Optional[Union[str, UUID]]

A str, an empty string (or None), or an existing UUID can be passed

required

Returns:

Type Description UUID

UUID5 hashed namespace

Source code in src/koheesio/spark/transformations/uuid5.py
def uuid5_namespace(ns: Optional[Union[str, uuid.UUID]]) -> uuid.UUID:\n    \"\"\"Helper function used to provide a UUID5 hashed namespace based on the passed str\n\n    Parameters\n    ----------\n    ns : Optional[Union[str, uuid.UUID]]\n        A str, an empty string (or None), or an existing UUID can be passed\n\n    Returns\n    -------\n    uuid.UUID\n        UUID5 hashed namespace\n    \"\"\"\n    # if we already have a UUID, we just return it\n    if isinstance(ns, uuid.UUID):\n        return ns\n\n    # if ns is empty or none, we simply return the default NAMESPACE_DNS\n    if not ns:\n        ns = uuid.NAMESPACE_DNS\n        return ns\n\n    # else we hash the string against the NAMESPACE_DNS\n    ns = uuid.uuid5(uuid.NAMESPACE_DNS, ns)\n    return ns\n
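
A short usage sketch (the namespace string is illustrative):

uuid5_namespace(\"example.com\")  # a UUID derived by hashing the string against NAMESPACE_DNS\nuuid5_namespace(None)  # falls back to uuid.NAMESPACE_DNS\n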
"},{"location":"api_reference/spark/transformations/date_time/index.html","title":"Date time","text":"

Module that holds the transformations that can be used for date and time related operations.

"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone","title":"koheesio.spark.transformations.date_time.ChangeTimeZone","text":"

Allows for the value of a column to be changed from one timezone to another

Adding useful metadata

When add_target_timezone is enabled (default), an additional column is created documenting which timezone a field has been converted to. Additionally, the suffix added to this column can be customized (default value is _timezone).

Example

Input:

target_column = \"some_column_name\"\ntarget_timezone = \"EST\"\nadd_target_timezone = True  # default value\ntimezone_column_suffix = \"_timezone\"  # default value\n

Output:

column name  = \"some_column_name_timezone\"  # notice the suffix\ncolumn value = \"EST\"\n

"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.add_target_timezone","title":"add_target_timezone class-attribute instance-attribute","text":"
add_target_timezone: bool = Field(default=True, description='Toggles whether the target timezone is added as a column. True by default.')\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.from_timezone","title":"from_timezone class-attribute instance-attribute","text":"
from_timezone: str = Field(default=..., alias='source_timezone', description='Timezone of the source_column value. Timezone fields are validated against the `TZ database name` column in this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones')\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.target_timezone_column_suffix","title":"target_timezone_column_suffix class-attribute instance-attribute","text":"
target_timezone_column_suffix: Optional[str] = Field(default='_timezone', alias='suffix', description=\"Allows to customize the suffix that is added to the target_timezone column. Defaults to '_timezone'. Note: this will be ignored if 'add_target_timezone' is set to False\")\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.to_timezone","title":"to_timezone class-attribute instance-attribute","text":"
to_timezone: str = Field(default=..., alias='target_timezone', description='Target timezone. Timezone fields are validated against the `TZ database name` column in this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones')\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/transformations/date_time/__init__.py
def execute(self):\n    df = self.df\n\n    for target_column, column in self.get_columns_with_target():\n        func = self.func  # select the applicable function\n        df = df.withColumn(\n            target_column,\n            func(f.col(column)),\n        )\n\n        # document which timezone a field has been converted to\n        if self.add_target_timezone:\n            df = df.withColumn(f\"{target_column}{self.target_timezone_column_suffix}\", f.lit(self.to_timezone))\n\n    self.output.df = df\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.func","title":"func","text":"
func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/date_time/__init__.py
def func(self, column: Column) -> Column:\n    return change_timezone(column=column, source_timezone=self.from_timezone, target_timezone=self.to_timezone)\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.validate_no_duplicate_timezones","title":"validate_no_duplicate_timezones","text":"
validate_no_duplicate_timezones(values)\n

Validate that source and target timezone are not the same

Source code in src/koheesio/spark/transformations/date_time/__init__.py
@model_validator(mode=\"before\")\ndef validate_no_duplicate_timezones(cls, values):\n    \"\"\"Validate that source and target timezone are not the same\"\"\"\n    from_timezone_value = values.get(\"from_timezone\")\n    to_timezone_value = values.get(\"to_timezone\")\n\n    if from_timezone_value == to_timezone_value:\n        raise ValueError(\"Timezone conversions from and to the same timezones are not valid.\")\n\n    return values\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.validate_timezone","title":"validate_timezone","text":"
validate_timezone(timezone_value)\n

Validate that the timezone is a valid timezone.

Source code in src/koheesio/spark/transformations/date_time/__init__.py
@field_validator(\"from_timezone\", \"to_timezone\")\ndef validate_timezone(cls, timezone_value):\n    \"\"\"Validate that the timezone is a valid timezone.\"\"\"\n    if timezone_value not in all_timezones_set:\n        raise ValueError(\n            \"Not a valid timezone. Refer to the `TZ database name` column here: \"\n            \"https://en.wikipedia.org/wiki/List_of_tz_database_time_zones\"\n        )\n    return timezone_value\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.DateFormat","title":"koheesio.spark.transformations.date_time.DateFormat","text":"

wrapper around pyspark.sql.functions.date_format

See Also
  • https://spark.apache.org/docs/3.3.2/api/python/reference/pyspark.sql/api/pyspark.sql.functions.date_format.html
  • https://spark.apache.org/docs/3.3.2/sql-ref-datetime-pattern.html
Concept

This Transformation allows you to convert a date/timestamp/string to a string value in the format specified by the given date format.

A pattern could be for instance dd.MM.yyyy and could return a string like \u201818.03.1993\u2019. All pattern letters of datetime pattern can be used, see: https://spark.apache.org/docs/3.3.2/sql-ref-datetime-pattern.html

How to use

If more than one column is passed, the behavior of the class changes as follows:

  • the transformation will be run in a loop against all the given columns
  • the target_column will be used as a suffix. Leaving this blank will result in the original columns being renamed.
Example
source_column value: datetime.date(2020, 1, 1)\ntarget: \"yyyyMMdd HH:mm\"\noutput: \"20200101 00:00\"\n
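
Expressed as code, a minimal sketch of the above (the input_df and its date column name are assumed):

DateFormat(\n    column=\"my_date\",\n    target_column=\"my_date_formatted\",\n    format=\"yyyyMMdd HH:mm\",\n).transform(input_df)\n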
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.DateFormat.format","title":"format class-attribute instance-attribute","text":"
format: str = Field(..., description='The format for the resulting string. See https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html')\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.DateFormat.func","title":"func","text":"
func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/date_time/__init__.py
def func(self, column: Column) -> Column:\n    return date_format(column, self.format)\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ToTimestamp","title":"koheesio.spark.transformations.date_time.ToTimestamp","text":"

wrapper around pyspark.sql.functions.to_timestamp

Converts a Column (or set of Columns) into pyspark.sql.types.TimestampType using the specified format. Specify formats according to datetime pattern https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html_.

Functionally equivalent to col.cast(\"timestamp\").

See Also

Related Koheesio classes:

  • koheesio.spark.transformations.ColumnsTransformation : Base class for ColumnsTransformation. Defines column / columns field + recursive logic
  • koheesio.spark.transformations.ColumnsTransformationWithTarget : Defines target_column / target_suffix field

pyspark.sql.functions:

  • datetime pattern : https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
Example"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ToTimestamp--basic-usage-example","title":"Basic usage example:","text":"

input_df:

t \"1997-02-28 10:30:00\"

t is a string

tts = ToTimestamp(\n    # since the source column is the same as the target in this example, 't' will be overwritten\n    column=\"t\",\n    target_column=\"t\",\n    format=\"yyyy-MM-dd HH:mm:ss\",\n)\noutput_df = tts.transform(input_df)\n

output_df:

t datetime.datetime(1997, 2, 28, 10, 30)

Now t is a timestamp

"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ToTimestamp--multiple-columns-at-once","title":"Multiple columns at once:","text":"

input_df:

t1 t2 \"1997-02-28 10:30:00\" \"2007-03-31 11:40:10\"

t1 and t2 are strings

tts = ToTimestamp(\n    columns=[\"t1\", \"t2\"],\n    # 'target_suffix' is synonymous with 'target_column'\n    target_suffix=\"new\",\n    format=\"yyyy-MM-dd HH:mm:ss\",\n)\noutput_df = tts.transform(input_df).select(\"t1_new\", \"t2_new\")\n

output_df:

t1_new t2_new datetime.datetime(1997, 2, 28, 10, 30) datetime.datetime(2007, 3, 31, 11, 40)

Now t1_new and t2_new are both timestamps

"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ToTimestamp.format","title":"format class-attribute instance-attribute","text":"
format: str = Field(default=..., description='The date format for of the timestamp field. See https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html')\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ToTimestamp.func","title":"func","text":"
func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/date_time/__init__.py
def func(self, column: Column) -> Column:\n    # convert string to timestamp\n    converted_col = to_timestamp(column, self.format)\n    return when(column.isNull(), lit(None)).otherwise(converted_col)\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.change_timezone","title":"koheesio.spark.transformations.date_time.change_timezone","text":"
change_timezone(column: Union[str, Column], source_timezone: str, target_timezone: str)\n

Helper function to change from one timezone to another

wrapper around pyspark.sql.functions.from_utc_timestamp and to_utc_timestamp

Parameters:

Name Type Description Default column Union[str, Column]

The column to change the timezone of

required source_timezone str

The timezone of the source_column value. Timezone fields are validated against the TZ database name column in this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones

required target_timezone str

The target timezone. Timezone fields are validated against the TZ database name column in this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones

required Source code in src/koheesio/spark/transformations/date_time/__init__.py
def change_timezone(column: Union[str, Column], source_timezone: str, target_timezone: str):\n    \"\"\"Helper function to change from one timezone to another\n\n    wrapper around `pyspark.sql.functions.from_utc_timestamp` and `to_utc_timestamp`\n\n    Parameters\n    ----------\n    column : Union[str, Column]\n        The column to change the timezone of\n    source_timezone : str\n        The timezone of the source_column value. Timezone fields are validated against the `TZ database name` column in\n        this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones\n    target_timezone : str\n        The target timezone. Timezone fields are validated against the `TZ database name` column in this list:\n        https://en.wikipedia.org/wiki/List_of_tz_database_time_zones\n\n    \"\"\"\n    column = col(column) if isinstance(column, str) else column\n    return from_utc_timestamp((to_utc_timestamp(column, source_timezone)), target_timezone)\n
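
A brief usage sketch (the DataFrame, column name, and timezones are illustrative):

df.withColumn(\"ts_in_amsterdam\", change_timezone(\"ts\", \"UTC\", \"Europe/Amsterdam\"))\n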
"},{"location":"api_reference/spark/transformations/date_time/interval.html","title":"Interval","text":"

This module provides a DateTimeColumn class that extends the Column class from PySpark. It allows for adding or subtracting an interval value from a datetime column.

This can be used to reflect a change in a given date / time column in a more human-readable way.

Please refer to the Spark SQL documentation for a list of valid interval values: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#interval-literal

Background

The aim is to easily add or subtract an 'interval' value to a datetime column. An interval value is a string that represents a time interval. For example, '1 day', '1 month', '5 years', '1 minute 30 seconds', '10 milliseconds', etc. These can be used to reflect a change in a given date / time column in a more human-readable way.

Typically, this can be done using the date_add() and date_sub() functions in Spark SQL. However, these functions only support adding or subtracting a single unit of time measured in days. Using an interval gives us much more flexibility; however, Spark SQL does not expose a function for adding or subtracting an interval value from a datetime column directly through the Python API, so we have to use the expr() function to express the operation in SQL.

This module provides a DateTimeColumn class that extends the Column class from PySpark. It allows for adding or subtracting an interval value from a datetime column using the + and - operators.

Additionally, this module provides two transformation classes that can be used as a transformation step in a pipeline:

  • DateTimeAddInterval: adds an interval value to a datetime column
  • DateTimeSubtractInterval: subtracts an interval value from a datetime column

These classes are subclasses of ColumnsTransformationWithTarget and hence can be used to perform transformations on multiple columns at once.

Both of the above transformations use the provided adjust_time() function to perform the actual transformation, as sketched below.
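
Under the hood this boils down to a Spark SQL expression; a hedged sketch of the raw PySpark equivalent (the column name and interval are illustrative):

from pyspark.sql.functions import expr\n\ndf.withColumn(\"one_day_later\", expr(\"try_add(my_column, interval '1 day')\"))\n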

See also:

Related Koheesio classes:

  • koheesio.spark.transformations.ColumnsTransformation : Base class for ColumnsTransformation. Defines column / columns field + recursive logic
  • koheesio.spark.transformations.ColumnsTransformationWithTarget : Defines target_column / target_suffix field

pyspark.sql.functions:

  • https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#interval-literal
  • https://spark.apache.org/docs/latest/api/sql/index.html
  • https://spark.apache.org/docs/latest/api/sql/#try_add
  • https://spark.apache.org/docs/latest/api/sql/#try_subtract

Classes:

Name Description DateTimeColumn

A datetime column that can be adjusted by adding or subtracting an interval value using the + and - operators.

DateTimeAddInterval

A transformation that adds an interval value to a datetime column. This class is a subclass of ColumnsTransformationWithTarget and hence can be used as a transformation step in a pipeline. See ColumnsTransformationWithTarget for more information.

DateTimeSubtractInterval

A transformation that subtracts an interval value from a datetime column. This class is a subclass of ColumnsTransformationWithTarget and hence can be used as a transformation step in a pipeline. See ColumnsTransformationWithTarget for more information.

Note

the DateTimeAddInterval and DateTimeSubtractInterval classes are very similar. The only difference is that one adds an interval value to a datetime column, while the other subtracts an interval value from a datetime column.

Functions:

Name Description dt_column

Converts a column to a DateTimeColumn. This function aims to be a drop-in replacement for pyspark.sql.functions.col that returns a DateTimeColumn instead of a Column.

adjust_time

Adjusts a datetime column by adding or subtracting an interval value.

validate_interval

Validates a given interval string.

Example"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval--various-ways-to-create-and-interact-with-datetimecolumn","title":"Various ways to create and interact with DateTimeColumn:","text":"
  • Create a DateTimeColumn from a string: dt_column(\"my_column\")
  • Create a DateTimeColumn from a Column: dt_column(df.my_column)
  • Use the + and - operators to add or subtract an interval value from a DateTimeColumn:
    • dt_column(\"my_column\") + \"1 day\"
    • dt_column(\"my_column\") - \"1 month\"
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval--functional-examples-using-adjust_time","title":"Functional examples using adjust_time():","text":"
  • Add 1 day to a column: adjust_time(\"my_column\", operation=\"add\", interval=\"1 day\")
  • Subtract 1 month from a column: adjust_time(\"my_column\", operation=\"subtract\", interval=\"1 month\")
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval--as-a-transformation-step","title":"As a transformation step:","text":"

from koheesio.spark.transformations.date_time.interval import (\n    DateTimeAddInterval,\n)\n\ninput_df = spark.createDataFrame([(1, \"2022-01-01 00:00:00\")], [\"id\", \"my_column\"])\n\n# add 1 day to my_column and store the result in a new column called 'one_day_later'\noutput_df = DateTimeAddInterval(column=\"my_column\", target_column=\"one_day_later\", interval=\"1 day\").transform(input_df)\n
output_df:

id my_column one_day_later 1 2022-01-01 00:00:00 2022-01-02 00:00:00

DateTimeSubtractInterval works in a similar way, but subtracts an interval value from a datetime column.

"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.Operations","title":"koheesio.spark.transformations.date_time.interval.Operations module-attribute","text":"
Operations = Literal['add', 'subtract']\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval","title":"koheesio.spark.transformations.date_time.interval.DateTimeAddInterval","text":"

A transformation that adds or subtracts a specified interval from a datetime column.

See also:

pyspark.sql.functions:

  • https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#interval-literal
  • https://spark.apache.org/docs/latest/api/sql/index.html#interval

Parameters:

Name Type Description Default interval str

The interval to add to the datetime column.

required operation Operations

The operation to perform. Must be either 'add' or 'subtract'.

add Example"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval--add-1-day-to-a-column","title":"add 1 day to a column","text":"
DateTimeAddInterval(\n    column=\"my_column\",\n    interval=\"1 day\",\n).transform(df)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval--subtract-1-month-from-my_column-and-store-the-result-in-a-new-column-called-one_month_earlier","title":"subtract 1 month from my_column and store the result in a new column called one_month_earlier","text":"
DateTimeSubtractInterval(\n    column=\"my_column\",\n    target_column=\"one_month_earlier\",\n    interval=\"1 month\",\n)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval.interval","title":"interval class-attribute instance-attribute","text":"
interval: str = Field(default=..., description='The interval to add to the datetime column.', examples=['1 day', '5 years', '3 months'])\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval.operation","title":"operation class-attribute instance-attribute","text":"
operation: Operations = Field(default='add', description=\"The operation to perform. Must be either 'add' or 'subtract'.\")\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval.validate_interval","title":"validate_interval class-attribute instance-attribute","text":"
validate_interval = field_validator('interval')(validate_interval)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval.func","title":"func","text":"
func(column: Column)\n
Source code in src/koheesio/spark/transformations/date_time/interval.py
def func(self, column: Column):\n    return adjust_time(column, operation=self.operation, interval=self.interval)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeColumn","title":"koheesio.spark.transformations.date_time.interval.DateTimeColumn","text":"

A datetime column that can be adjusted by adding or subtracting an interval value using the + and - operators.

"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeColumn.from_column","title":"from_column classmethod","text":"
from_column(column: Column)\n

Create a DateTimeColumn from an existing Column

Source code in src/koheesio/spark/transformations/date_time/interval.py
@classmethod\ndef from_column(cls, column: Column):\n    \"\"\"Create a DateTimeColumn from an existing Column\"\"\"\n    return cls(column._jc)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeSubtractInterval","title":"koheesio.spark.transformations.date_time.interval.DateTimeSubtractInterval","text":"

Subtracts a specified interval from a datetime column.

Works in the same way as DateTimeAddInterval, but subtracts the specified interval from the datetime column. See DateTimeAddInterval for more information.

"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeSubtractInterval.operation","title":"operation class-attribute instance-attribute","text":"
operation: Operations = Field(default='subtract', description=\"The operation to perform. Must be either 'add' or 'subtract'.\")\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.adjust_time","title":"koheesio.spark.transformations.date_time.interval.adjust_time","text":"
adjust_time(column: Column, operation: Operations, interval: str) -> Column\n

Adjusts a datetime column by adding or subtracting an interval value.

This can be used to reflect a change in a given date / time column in a more human-readable way.

See also

Please refer to the Spark SQL documentation for a list of valid interval values: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#interval-literal

Example

Parameters:

Name Type Description Default column Column

The datetime column to adjust.

required operation Operations

The operation to perform. Must be either 'add' or 'subtract'.

required interval str

The value to add or subtract. Must be a valid interval string.

required

Returns:

Type Description Column

The adjusted datetime column.

Source code in src/koheesio/spark/transformations/date_time/interval.py
def adjust_time(column: Column, operation: Operations, interval: str) -> Column:\n    \"\"\"\n    Adjusts a datetime column by adding or subtracting an interval value.\n\n    This can be used to reflect a change in a given date / time column in a more human-readable way.\n\n\n    See also\n    --------\n    Please refer to the Spark SQL documentation for a list of valid interval values:\n    https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#interval-literal\n\n    ### pyspark.sql.functions:\n\n    * https://spark.apache.org/docs/latest/api/sql/index.html#interval\n    * https://spark.apache.org/docs/latest/api/sql/#try_add\n    * https://spark.apache.org/docs/latest/api/sql/#try_subtract\n\n    Example\n    --------\n    ### add 1 day to a column\n    ```python\n    adjust_time(\"my_column\", operation=\"add\", interval=\"1 day\")\n    ```\n\n    ### subtract 1 month from a column\n    ```python\n    adjust_time(\"my_column\", operation=\"subtract\", interval=\"1 month\")\n    ```\n\n    ### or, a much more complicated example\n\n    In this example, we add 5 days, 3 hours, 7 minutes, 30 seconds, and 1 millisecond to a column called `my_column`.\n    ```python\n    adjust_time(\n        \"my_column\",\n        operation=\"add\",\n        interval=\"5 days 3 hours 7 minutes 30 seconds 1 millisecond\",\n    )\n    ```\n\n    Parameters\n    ----------\n    column : Column\n        The datetime column to adjust.\n    operation : Operations\n        The operation to perform. Must be either 'add' or 'subtract'.\n    interval : str\n        The value to add or subtract. Must be a valid interval string.\n\n    Returns\n    -------\n    Column\n        The adjusted datetime column.\n    \"\"\"\n\n    # check that value is a valid interval\n    interval = validate_interval(interval)\n\n    column_name = column._jc.toString()\n\n    # determine the operation to perform\n    try:\n        operation = {\n            \"add\": \"try_add\",\n            \"subtract\": \"try_subtract\",\n        }[operation]\n    except KeyError as e:\n        raise ValueError(f\"Operation '{operation}' is not valid. Must be either 'add' or 'subtract'.\") from e\n\n    # perform the operation\n    _expression = f\"{operation}({column_name}, interval '{interval}')\"\n    column = expr(_expression)\n\n    return column\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.adjust_time--pysparksqlfunctions","title":"pyspark.sql.functions:","text":"
  • https://spark.apache.org/docs/latest/api/sql/index.html#interval
  • https://spark.apache.org/docs/latest/api/sql/#try_add
  • https://spark.apache.org/docs/latest/api/sql/#try_subtract
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.adjust_time--add-1-day-to-a-column","title":"add 1 day to a column","text":"
adjust_time(\"my_column\", operation=\"add\", interval=\"1 day\")\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.adjust_time--subtract-1-month-from-a-column","title":"subtract 1 month from a column","text":"
adjust_time(\"my_column\", operation=\"subtract\", interval=\"1 month\")\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.adjust_time--or-a-much-more-complicated-example","title":"or, a much more complicated example","text":"

In this example, we add 5 days, 3 hours, 7 minutes, 30 seconds, and 1 millisecond to a column called my_column.

adjust_time(\n    \"my_column\",\n    operation=\"add\",\n    interval=\"5 days 3 hours 7 minutes 30 seconds 1 millisecond\",\n)\n

"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.dt_column","title":"koheesio.spark.transformations.date_time.interval.dt_column","text":"
dt_column(column: Union[str, Column]) -> DateTimeColumn\n

Convert a column to a DateTimeColumn

Aims to be a drop-in replacement for pyspark.sql.functions.col that returns a DateTimeColumn instead of a Column.

Example

Parameters:

Name Type Description Default column Union[str, Column]

The column (or name of the column) to convert to a DateTimeColumn

required Source code in src/koheesio/spark/transformations/date_time/interval.py
def dt_column(column: Union[str, Column]) -> DateTimeColumn:\n    \"\"\"Convert a column to a DateTimeColumn\n\n    Aims to be a drop-in replacement for `pyspark.sql.functions.col` that returns a DateTimeColumn instead of a Column.\n\n    Example\n    --------\n    ### create a DateTimeColumn from a string\n    ```python\n    dt_column(\"my_column\")\n    ```\n\n    ### create a DateTimeColumn from a Column\n    ```python\n    dt_column(df.my_column)\n    ```\n\n    Parameters\n    ----------\n    column : Union[str, Column]\n        The column (or name of the column) to convert to a DateTimeColumn\n    \"\"\"\n    if isinstance(column, str):\n        column = col(column)\n    elif not isinstance(column, Column):\n        raise TypeError(f\"Expected column to be of type str or Column, got {type(column)} instead.\")\n    return DateTimeColumn.from_column(column)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.dt_column--create-a-datetimecolumn-from-a-string","title":"create a DateTimeColumn from a string","text":"
dt_column(\"my_column\")\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.dt_column--create-a-datetimecolumn-from-a-column","title":"create a DateTimeColumn from a Column","text":"
dt_column(df.my_column)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.validate_interval","title":"koheesio.spark.transformations.date_time.interval.validate_interval","text":"
validate_interval(interval: str)\n

Validate an interval string

Parameters:

Name Type Description Default interval str

The interval string to validate

required

Raises:

Type Description ValueError

If the interval string is invalid

Source code in src/koheesio/spark/transformations/date_time/interval.py
def validate_interval(interval: str):\n    \"\"\"Validate an interval string\n\n    Parameters\n    ----------\n    interval : str\n        The interval string to validate\n\n    Raises\n    ------\n    ValueError\n        If the interval string is invalid\n    \"\"\"\n    try:\n        expr(f\"interval '{interval}'\")\n    except ParseException as e:\n        raise ValueError(f\"Value '{interval}' is not a valid interval.\") from e\n    return interval\n
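
A quick usage sketch (an active Spark session is assumed, since validation goes through expr()):

validate_interval(\"5 days 3 hours\")  # returns the string unchanged\nvalidate_interval(\"5 potatoes\")  # expected to raise ValueError\n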
"},{"location":"api_reference/spark/transformations/strings/index.html","title":"Strings","text":"

Adds a number of Transformations that are intended to be used with StringType column input. Some will also work with other types, but will output StringType or an array of StringType.

These Transformations take full advantage of Koheesio's ColumnsTransformationWithTarget class, allowing a user to apply column transformations to multiple columns at once (a short chaining sketch follows the list below). See the class docstrings for more information.

The following Transformations are included:

change_case:

  • Lower Converts a string column to lower case.
  • Upper Converts a string column to upper case.
  • TitleCase or InitCap Converts a string column to title case, where each word starts with a capital letter.

concat:

  • Concat Concatenates multiple input columns together into a single column, optionally using the given separator.

pad:

  • Pad Pads the values of source_column with the character up until it reaches the specified length of characters
  • LPad Pad with a character on the left side of the string.
  • RPad Pad with a character on the right side of the string.

regexp:

  • RegexpExtract Extract a specific group matched by a Java regexp from the specified string column.
  • RegexpReplace Searches for the given regexp and replaces all instances with what is in 'replacement'.

replace:

  • Replace Replace all instances of a string in a column with another string.

split:

  • SplitAll Splits the contents of a column on basis of a split_pattern.
  • SplitAtFirstMatch Like SplitAll, but only splits the string once. You can specify whether you want the first or second part.

substring:

  • Substring Extracts a substring from a string column starting at the given position.

trim:

  • Trim Trim whitespace from the beginning and/or end of a string.
  • LTrim Trim whitespace from the beginning of a string.
  • RTrim Trim whitespace from the end of a string.
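
Since these all share the ColumnsTransformationWithTarget interface, they can be chained; a minimal sketch using two of the classes documented below (input_df is assumed):

df = LowerCase(column=\"product\", target_column=\"product_lower\").transform(input_df)\ndf = Pad(column=\"product_lower\", character=\"*\", length=10, direction=\"right\").transform(df)\n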
"},{"location":"api_reference/spark/transformations/strings/change_case.html","title":"Change case","text":"

Convert the case of a string column to upper case, lower case, or title case

Classes:

Name Description `Lower`

Converts a string column to lower case.

`Upper`

Converts a string column to upper case.

`TitleCase` or `InitCap`

Converts a string column to title case, where each word starts with a capital letter.

"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.InitCap","title":"koheesio.spark.transformations.strings.change_case.InitCap module-attribute","text":"
InitCap = TitleCase\n
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.LowerCase","title":"koheesio.spark.transformations.strings.change_case.LowerCase","text":"

This function makes the contents of a column lower case.

Wraps the pyspark.sql.functions.lower function.

Warnings

If the type of the column is not string, LowerCase will not be run. A Warning will be thrown indicating this.

Parameters:

Name Type Description Default columns Union[str, List[str]]

The name of the column or columns to convert to lower case. Alias: column. Lower case will be applied to all columns in the list. Column is required to be of string type.

required target_column

The name of the column to store the result in. If None, the result will be stored in the same column as the input.

required Example

input_df:

product amount country Banana lemon orange 1000 USA Carrots Blueberries 1500 USA Beans 1600 USA
output_df = LowerCase(column=\"product\", target_column=\"product_lower\").transform(df)\n

output_df:

product amount country product_lower Banana lemon orange 1000 USA banana lemon orange Carrots Blueberries 1500 USA carrots blueberries Beans 1600 USA beans

In this example, the column product is converted to product_lower and the contents of this column are converted to lower case.

"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.LowerCase.ColumnConfig","title":"ColumnConfig","text":"

Limit data type to string

"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.LowerCase.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type = [STRING]\n
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.LowerCase.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type = [STRING]\n
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.LowerCase.func","title":"func","text":"
func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/change_case.py
def func(self, column: Column):\n    return lower(column)\n
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.TitleCase","title":"koheesio.spark.transformations.strings.change_case.TitleCase","text":"

This function makes the contents of a column title case. This means that every word starts with an upper case.

Wraps the pyspark.sql.functions.initcap function.

Warnings

If the type of the column is not string, TitleCase will not be run. A Warning will be thrown indicating this.

Parameters:

Name Type Description Default columns

The name of the column or columns to convert to title case. Alias: column. Title case will be applied to all columns in the list. Column is required to be of string type.

required target_column

The name of the column to store the result in. If None, the result will be stored in the same column as the input.

required Example

input_df:

product amount country Banana lemon orange 1000 USA Carrots blueberries 1500 USA Beans 1600 USA
output_df = TitleCase(column=\"product\", target_column=\"product_title\").transform(df)\n

output_df:

product amount country product_title Banana lemon orange 1000 USA Banana Lemon Orange Carrots blueberries 1500 USA Carrots Blueberries Beans 1600 USA Beans

In this example, the column product is converted to product_title and the contents of this column are converted to title case (each word now starts with an upper case).

"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.TitleCase.func","title":"func","text":"
func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/change_case.py
def func(self, column: Column):\n    return initcap(column)\n
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.UpperCase","title":"koheesio.spark.transformations.strings.change_case.UpperCase","text":"

This function makes the contents of a column upper case.

Wraps the pyspark.sql.functions.upper function.

Warnings

If the type of the column is not string, UpperCase will not be run. A Warning will be thrown indicating this.

Parameters:

Name Type Description Default columns Union[str, List[str]]

The name of the column or columns to convert to upper case. Alias: column. Upper case will be applied to all columns in the list. Column is required to be of string type.

required target_column

The name of the column to store the result in. If None, the result will be stored in the same column as the input.

required

Examples:

input_df:

product amount country Banana lemon orange 1000 USA Carrots Blueberries 1500 USA Beans 1600 USA
output_df = UpperCase(column=\"product\", target_column=\"product_upper\").transform(df)\n

output_df:

product amount country product_upper Banana lemon orange 1000 USA BANANA LEMON ORANGE Carrots Blueberries 1500 USA CARROTS BLUEBERRIES Beans 1600 USA BEANS

In this example, the column product is converted to product_upper and the contents of this column are converted to upper case.

"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.UpperCase.func","title":"func","text":"
func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/change_case.py
def func(self, column: Column):\n    return upper(column)\n
"},{"location":"api_reference/spark/transformations/strings/concat.html","title":"Concat","text":"

Concatenates multiple input columns together into a single column, optionally using a given separator.

"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat","title":"koheesio.spark.transformations.strings.concat.Concat","text":"

This is a wrapper around PySpark concat() and concat_ws() functions

Concatenates multiple input columns together into a single column, optionally using a given separator. The function works with strings, date/timestamps, binary, and compatible array columns.

Concept

When working with arrays, the function will return the result of the concatenation of the elements in the array.

  • If a spacer is used, the resulting output will be a string with the elements of the array separated by the spacer.
  • If no spacer is used, the resulting output will be an array with the elements of the array concatenated together.

When working with date/timestamps, the function will return the result of the concatenation of the given columns. The timestamp is converted to a string using the default format of yyyy-MM-dd HH:mm:ss.

Parameters:

Name Type Description Default columns Union[str, List[str]]

The column (or list of columns) to concatenate. Alias: column. If at least one of the values is None or null, the resulting string will also be None/null (except when using arrays). Columns can be of any type, but should ideally be of the same type. Different types can be used, but the function will convert them to string values first.

required target_column Optional[str]

Target column name. When this is left blank, a name will be generated by concatenating the names of the source columns with an '_'.

None spacer Optional[str]

Optional spacer / separator symbol. Defaults to None. When left blank, no spacer is used

None Example"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat--example-using-a-string-column-and-a-timestamp-column","title":"Example using a string column and a timestamp column","text":"

input_df:

column_a column_b text 1997-02-28 10:30:00
output_df = Concat(\n    columns=[\"column_a\", \"column_b\"],\n    target_column=\"concatenated_column\",\n    spacer=\"--\",\n).transform(input_df)\n

output_df:

column_a column_b concatenated_column text 1997-02-28 10:30:00 text--1997-02-28 10:30:00

In the example above, the resulting column is a string column.

If we had left out the spacer, the resulting column would have had the value of text1997-02-28 10:30:00 (a string). Note that the timestamp is converted to a string using the default format of yyyy-MM-dd HH:mm:ss.

"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat--example-using-two-array-columns","title":"Example using two array columns","text":"

input_df:

array_col_1 array_col_2 [text1, text2] [text3, None]
output_df = Concat(\n    columns=[\"array_col_1\", \"array_col_2\"],\n    target_column=\"concatenated_column\",\n    spacer=\"--\",\n).transform(input_df)\n

output_df:

array_col_1 array_col_2 concatenated_column [text1, text2] [text3, text4] \"text1--text2--text3\"

Note that the null value in the second array is ignored. If we had left out the spacer, the resulting column would have been an array with the values of [\"text1\", \"text2\", \"text3\"].

Array columns can only be concatenated with another array column. If you want to concatenate an array column with a non-array value, you will have to convert said column to an array first.

"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat.spacer","title":"spacer class-attribute instance-attribute","text":"
spacer: Optional[str] = Field(default=None, description='Optional spacer / separator symbol. Defaults to None. When left blank, no spacer is used', alias='sep')\n
"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat.target_column","title":"target_column class-attribute instance-attribute","text":"
target_column: Optional[str] = Field(default=None, description=\"Target column name. When this is left blank, a name will be generated by concatenating the names of the source columns with an '_'.\")\n
"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat.execute","title":"execute","text":"
execute() -> DataFrame\n
Source code in src/koheesio/spark/transformations/strings/concat.py
def execute(self) -> DataFrame:\n    columns = [col(s) for s in self.get_columns()]\n    self.output.df = self.df.withColumn(\n        self.target_column, concat_ws(self.spacer, *columns) if self.spacer else concat(*columns)\n    )\n
"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat.get_target_column","title":"get_target_column","text":"
get_target_column(target_column_value, values)\n

Get the target column name if it is not provided.

If not provided, a name will be generated by concatenating the names of the source columns with an '_'.

Source code in src/koheesio/spark/transformations/strings/concat.py
@field_validator(\"target_column\")\ndef get_target_column(cls, target_column_value, values):\n    \"\"\"Get the target column name if it is not provided.\n\n    If not provided, a name will be generated by concatenating the names of the source columns with an '_'.\"\"\"\n    if not target_column_value:\n        columns_value: List = values[\"columns\"]\n        columns = list(dict.fromkeys(columns_value))  # dict.fromkeys is used to dedup while maintaining order\n        return \"_\".join(columns)\n\n    return target_column_value\n
"},{"location":"api_reference/spark/transformations/strings/pad.html","title":"Pad","text":"

Pad the values of a column with a character up until it reaches a certain length.

Classes:

Name Description Pad

Pads the values of source_column with the character up until it reaches the specified length of characters

LPad

Pad with a character on the left side of the string.

RPad

Pad with a character on the right side of the string.

"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.LPad","title":"koheesio.spark.transformations.strings.pad.LPad module-attribute","text":"
LPad = Pad\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.pad_directions","title":"koheesio.spark.transformations.strings.pad.pad_directions module-attribute","text":"
pad_directions = Literal['left', 'right']\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.Pad","title":"koheesio.spark.transformations.strings.pad.Pad","text":"

Pads the values of source_column with the character up until it reaches the specified length of characters. The direction param can be changed to apply either a left or a right pad. Defaults to left pad.

Wraps the lpad and rpad functions from PySpark.

Parameters:

Name Type Description Default columns Union[str, List[str]]

The column (or list of columns) to pad. Alias: column

required target_column Optional[str]

The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.

None character constr(min_length=1)

The character to use for padding

required length PositiveInt

Positive integer to indicate the intended length

required direction Optional[pad_directions]

On which side to add the characters. Either \"left\" or \"right\". Defaults to \"left\"

left Example

input_df:

column hello world
output_df = Pad(\n    column=\"column\",\n    target_column=\"padded_column\",\n    character=\"*\",\n    length=10,\n    direction=\"right\",\n).transform(input_df)\n

output_df:

column padded_column hello hello***** world world*****

Note: in the example above, we could have used the RPad class instead of Pad with direction=\"right\" to achieve the same result.

"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.Pad.character","title":"character class-attribute instance-attribute","text":"
character: constr(min_length=1) = Field(default=..., description='The character to use for padding')\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.Pad.direction","title":"direction class-attribute instance-attribute","text":"
direction: Optional[pad_directions] = Field(default='left', description='On which side to add the characters. Either \"left\" or \"right\". Defaults to \"left\"')\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.Pad.length","title":"length class-attribute instance-attribute","text":"
length: PositiveInt = Field(default=..., description='Positive integer to indicate the intended length')\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.Pad.func","title":"func","text":"
func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/pad.py
def func(self, column: Column):\n    func = lpad if self.direction == \"left\" else rpad\n    return func(column, self.length, self.character)\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.RPad","title":"koheesio.spark.transformations.strings.pad.RPad","text":"

Pad with a character on the right side of the string.

See Pad class docstring for more information.

"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.RPad.direction","title":"direction class-attribute instance-attribute","text":"
direction: Optional[pad_directions] = 'right'\n
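
As a sketch of the note in the Pad docstring (same hypothetical input as above), the right-pad example can also be written with RPad directly, since RPad is Pad with direction preset to \"right\":

output_df = RPad(\n    column=\"column\",\n    target_column=\"padded_column\",\n    character=\"*\",\n    length=10,\n).transform(input_df)  # equivalent to Pad(..., direction=\"right\")\n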
"},{"location":"api_reference/spark/transformations/strings/regexp.html","title":"Regexp","text":"

String transformations using regular expressions.

This module contains transformations that use regular expressions to transform strings.

Classes:

Name Description RegexpExtract

Extract a specific group matched by a Java regexp from the specified string column.

RegexpReplace

Searches for the given regexp and replaces all instances with what is in 'replacement'.

"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract","title":"koheesio.spark.transformations.strings.regexp.RegexpExtract","text":"

Extract a specific group matched by a Java regexp from the specified string column. If the regexp did not match, or the specified group did not match, an empty string is returned.

A wrapper around the pyspark regexp_extract function

Parameters:

Name Type Description Default columns Union[str, List[str]]

The column (or list of columns) to extract from. Alias: column

required target_column Optional[str]

The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.

None regexp str

The Java regular expression to extract

required index Optional[int]

When there are more groups in the match, you can indicate which one you want. 0 means the whole match. 1 and above are groups within that match.

0 Example"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract--extracting-the-year-and-week-number-from-a-string","title":"Extracting the year and week number from a string","text":"

Let's say we have a column containing the year and week in a format like Y## W# and we would like to extract the week numbers.

input_df:

YWK 2020 W1 2021 WK2
output_df = RegexpExtract(\n    column=\"YWK\",\n    target_column=\"week_number\",\n    regexp=\"Y([0-9]+) ?WK?([0-9]+)\",\n    index=2,  # remember that this is 1-indexed! So 2 will get the week number in this example.\n).transform(input_df)\n

output_df:

YWK week_number 2020 W1 1 2021 WK2 2"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract--using-the-same-example-but-extracting-the-year-instead","title":"Using the same example, but extracting the year instead","text":"

If you want to extract the year, you can use index=1.

output_df = RegexpExtract(\n    column=\"YWK\",\n    target_column=\"year\",\n    regexp=\"Y([0-9]+) ?WK?([0-9]+)\",\n    index=1,  # remember that this is 1-indexed! So 1 will get the year in this example.\n).transform(input_df)\n

output_df:

YWK year 2020 W1 2020 2021 WK2 2021"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract.index","title":"index class-attribute instance-attribute","text":"
index: Optional[int] = Field(default=0, description='When there are more groups in the match, you can indicate which one you want. 0 means the whole match. 1 and above are groups within that match.')\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract.regexp","title":"regexp class-attribute instance-attribute","text":"
regexp: str = Field(default=..., description='The Java regular expression to extract')\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract.func","title":"func","text":"
func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/regexp.py
def func(self, column: Column):\n    return regexp_extract(column, self.regexp, self.index)\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpReplace","title":"koheesio.spark.transformations.strings.regexp.RegexpReplace","text":"

Searches for the given regexp and replaces all instances with what is in 'replacement'.

A wrapper around the pyspark regexp_replace function

Parameters:

Name Type Description Default columns

The column (or list of columns) to replace in. Alias: column

required target_column

The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.

required regexp

The regular expression to replace

required replacement

String to replace matched pattern with.

required

Examples:

input_df: | content | |------------| | hello world|

Let's say you want to replace 'hello'.

output_df = RegexpReplace(\n    column=\"content\",\n    target_column=\"replaced\",\n    regexp=\"hello\",\n    replacement=\"gutentag\",\n).transform(input_df)\n

output_df: | content | replaced | |------------|---------------| | hello world| gutentag world|

"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpReplace.regexp","title":"regexp class-attribute instance-attribute","text":"
regexp: str = Field(default=..., description='The regular expression to replace')\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpReplace.replacement","title":"replacement class-attribute instance-attribute","text":"
replacement: str = Field(default=..., description='String to replace matched pattern with.')\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpReplace.func","title":"func","text":"
func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/regexp.py
def func(self, column: Column):\n    return regexp_replace(column, self.regexp, self.replacement)\n
"},{"location":"api_reference/spark/transformations/strings/replace.html","title":"Replace","text":"

String replacements without using regular expressions.

"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace","title":"koheesio.spark.transformations.strings.replace.Replace","text":"

Replace all instances of a string in a column with another string.

This transformation uses PySpark when().otherwise() functions.

Notes
  • If original_value is not set, the transformation will replace all null values with new_value
  • If original_value is set, the transformation will replace all values matching original_value with new_value
  • Numeric values are supported, but will be cast to string in the process
  • Replace is meant for simple string replacements. If more advanced replacements are needed, use the RegexpReplace transformation instead.

Parameters:

Name Type Description Default columns Union[str, List[str]]

The column (or list of columns) to replace values in. Alias: column

required target_column Optional[str]

The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.

None original_value Optional[str]

The original value that needs to be replaced. Alias: from

None new_value str

The new value to replace this with. Alias: to

required

Examples:

input_df:

column hello world None"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace--replace-all-null-values-with-a-new-value","title":"Replace all null values with a new value","text":"
output_df = Replace(\n    column=\"column\",\n    target_column=\"replaced_column\",\n    original_value=None,  # This is the default value, so it can be omitted\n    new_value=\"programmer\",\n).transform(input_df)\n

output_df:

column replaced_column hello hello world world None programmer"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace--replace-all-instances-of-a-string-in-a-column-with-another-string","title":"Replace all instances of a string in a column with another string","text":"
output_df = Replace(\n    column=\"column\",\n    target_column=\"replaced_column\",\n    original_value=\"world\",\n    new_value=\"programmer\",\n).transform(input_df)\n

output_df:

column replaced_column hello hello world programmer None None"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace.new_value","title":"new_value class-attribute instance-attribute","text":"
new_value: str = Field(default=..., alias='to', description='The new value to replace this with')\n
"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace.original_value","title":"original_value class-attribute instance-attribute","text":"
original_value: Optional[str] = Field(default=None, alias='from', description='The original value that needs to be replaced')\n
"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace.cast_values_to_str","title":"cast_values_to_str","text":"
cast_values_to_str(value)\n

Cast values to string if they are not None

Source code in src/koheesio/spark/transformations/strings/replace.py
@field_validator(\"original_value\", \"new_value\", mode=\"before\")\ndef cast_values_to_str(cls, value):\n    \"\"\"Cast values to string if they are not None\"\"\"\n    if value:\n        return str(value)\n
"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace.func","title":"func","text":"
func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/replace.py
def func(self, column: Column):\n    when_statement = (\n        when(column.isNull(), lit(self.new_value))\n        if not self.original_value\n        else when(\n            column == self.original_value,\n            lit(self.new_value),\n        )\n    )\n    return when_statement.otherwise(column)\n
"},{"location":"api_reference/spark/transformations/strings/split.html","title":"Split","text":"

Splits the contents of a column on basis of a split_pattern

Classes:

Name Description SplitAll

Splits the contents of a column on basis of a split_pattern.

SplitAtFirstMatch

Like SplitAll, but only splits the string once. You can specify whether you want the first or second part.

"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAll","title":"koheesio.spark.transformations.strings.split.SplitAll","text":"

This function splits the contents of a column on basis of a split_pattern.

It splits at all the locations where the pattern is found. The new column will be of ArrayType.

Wraps the pyspark.sql.functions.split function.

Parameters:

Name Type Description Default columns Union[str, List[str]]

The column (or list of columns) to split. Alias: column

required target_column Optional[str]

The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.

None split_pattern str

This is the pattern that will be used to split the column contents.

required Example"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAll--splitting-with-a-space-as-a-pattern","title":"Splitting with a space as a pattern:","text":"

input_df:

product amount country Banana lemon orange 1000 USA Carrots Blueberries 1500 USA Beans 1600 USA
output_df = SplitColumn(column=\"product\", target_column=\"split\", split_pattern=\" \").transform(input_df)\n

output_df:

product amount country split Banana lemon orange 1000 USA [\"Banana\", \"lemon\", \"orange\"] Carrots Blueberries 1500 USA [\"Carrots\", \"Blueberries\"] Beans 1600 USA [\"Beans\"]"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAll.split_pattern","title":"split_pattern class-attribute instance-attribute","text":"
split_pattern: str = Field(default=..., description='The pattern to split the column contents.')\n
"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAll.func","title":"func","text":"
func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/split.py
def func(self, column: Column):\n    return split(column, pattern=self.split_pattern)\n
"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAtFirstMatch","title":"koheesio.spark.transformations.strings.split.SplitAtFirstMatch","text":"

Like SplitAll, but only splits the string once. You can specify whether you want the first or second part.

Note
  • SplitAtFirstMatch splits the string only once. To specify whether you want the first part or the second you can toggle the parameter retrieve_first_part.
  • The new column will be of StringType.
  • If you want to split a column more than once, you should call this function multiple times.

Parameters:

Name Type Description Default columns Union[str, List[str]]

The column (or list of columns) to split. Alias: column

required target_column Optional[str]

The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.

None split_pattern str

This is the pattern that will be used to split the column contents.

required retrieve_first_part Optional[bool]

Takes the first part of the split when true, the second part when False. Other parts are ignored. Defaults to True.

True Example"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAtFirstMatch--splitting-with-a-space-as-a-pattern","title":"Splitting with a space as a pattern:","text":"

input_df:

product amount country Banana lemon orange 1000 USA Carrots Blueberries 1500 USA Beans 1600 USA
output_df = SplitColumn(column=\"product\", target_column=\"split_first\", split_pattern=\"an\").transform(input_df)\n

output_df:

product amount country split_first Banana lemon orange 1000 USA B Carrots Blueberries 1500 USA Carrots Blueberries Beans 1600 USA Be"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAtFirstMatch.retrieve_first_part","title":"retrieve_first_part class-attribute instance-attribute","text":"
retrieve_first_part: Optional[bool] = Field(default=True, description='Takes the first part of the split when true, the second part when False. Other parts are ignored.')\n
"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAtFirstMatch.func","title":"func","text":"
func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/split.py
def func(self, column: Column):\n    split_func = split(column, pattern=self.split_pattern)\n\n    # first part\n    if self.retrieve_first_part:\n        return split_func.getItem(0)\n\n    # or, second part\n    return coalesce(split_func.getItem(1), lit(\"\"))\n
"},{"location":"api_reference/spark/transformations/strings/substring.html","title":"Substring","text":"

Extracts a substring from a string column starting at the given position.

"},{"location":"api_reference/spark/transformations/strings/substring.html#koheesio.spark.transformations.strings.substring.Substring","title":"koheesio.spark.transformations.strings.substring.Substring","text":"

Extracts a substring from a string column starting at the given position.

This is a wrapper around PySpark substring() function

Notes
  • Numeric columns will be cast to string
  • start is 1-indexed, not 0-indexed!

Parameters:

Name Type Description Default columns Union[str, List[str]]

The column (or list of columns) to substring. Alias: column

required target_column Optional[str]

The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.

None start PositiveInt

Positive int. Defines where to begin the substring from. The first character of the field has index 1!

required length Optional[int]

Optional. If not provided, the substring will go until end of string.

-1 Example"},{"location":"api_reference/spark/transformations/strings/substring.html#koheesio.spark.transformations.strings.substring.Substring--extract-a-substring-from-a-string-column-starting-at-the-given-position","title":"Extract a substring from a string column starting at the given position.","text":"

input_df:

column skyscraper
output_df = Substring(\n    column=\"column\",\n    target_column=\"substring_column\",\n    start=3,  # 1-indexed! So this will start at the 3rd character\n    length=4,\n).transform(input_df)\n

output_df:

column substring_column skyscraper yscr"},{"location":"api_reference/spark/transformations/strings/substring.html#koheesio.spark.transformations.strings.substring.Substring.length","title":"length class-attribute instance-attribute","text":"
length: Optional[int] = Field(default=-1, description='The target length for the string. use -1 to perform until end')\n
"},{"location":"api_reference/spark/transformations/strings/substring.html#koheesio.spark.transformations.strings.substring.Substring.start","title":"start class-attribute instance-attribute","text":"
start: PositiveInt = Field(default=..., description='The starting position')\n
"},{"location":"api_reference/spark/transformations/strings/substring.html#koheesio.spark.transformations.strings.substring.Substring.func","title":"func","text":"
func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/substring.py
def func(self, column: Column):\n    return when(column.isNull(), None).otherwise(substring(column, self.start, self.length)).cast(StringType())\n
"},{"location":"api_reference/spark/transformations/strings/trim.html","title":"Trim","text":"

Trim whitespace from the beginning and/or end of a string.

Classes:

Name Description - `Trim`

Trim whitespace from the beginning and/or end of a string.

- `LTrim`

Trim whitespace from the beginning of a string.

- `RTrim`

Trim whitespace from the end of a string.

See class docstrings for more information."},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.trim_type","title":"koheesio.spark.transformations.strings.trim.trim_type module-attribute","text":"
trim_type = Literal['left', 'right', 'left-right']\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.LTrim","title":"koheesio.spark.transformations.strings.trim.LTrim","text":"

Trim whitespace from the beginning of a string. Alias: LeftTrim

"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.LTrim.direction","title":"direction class-attribute instance-attribute","text":"
direction: trim_type = 'left'\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.RTrim","title":"koheesio.spark.transformations.strings.trim.RTrim","text":"

Trim whitespace from the end of a string. Alias: RightTrim

"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.RTrim.direction","title":"direction class-attribute instance-attribute","text":"
direction: trim_type = 'right'\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim","title":"koheesio.spark.transformations.strings.trim.Trim","text":"

Trim whitespace from the beginning and/or end of a string.

This is a wrapper around PySpark ltrim() and rtrim() functions

The direction parameter can be changed to apply either a left or a right trim. Defaults to left AND right trim.

Note: If the type of the column is not string, Trim will not be run. A warning will be raised to indicate this.

Parameters:

Name Type Description Default columns

The column (or list of columns) to trim. Alias: column If no columns are provided, all string columns will be trimmed.

required target_column

The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.

required direction

On which side to remove the spaces. Either \"left\", \"right\" or \"left-right\". Defaults to \"left-right\"

required

Examples:

input_df: | column | |-----------| | \" hello \" |

"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim--trim-whitespace-from-the-beginning-of-a-string","title":"Trim whitespace from the beginning of a string","text":"
output_df = Trim(column=\"column\", target_column=\"trimmed_column\", direction=\"left\").transform(input_df)\n

output_df: | column | trimmed_column | |-----------|----------------| | \" hello \" | \"hello \" |

"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim--trim-whitespace-from-both-sides-of-a-string","title":"Trim whitespace from both sides of a string","text":"
output_df = Trim(\n    column=\"column\",\n    target_column=\"trimmed_column\",\n    direction=\"left-right\",  # default value\n).transform(input_df)\n

output_df: | column | trimmed_column | |-----------|----------------| | \" hello \" | \"hello\" |

"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim--trim-whitespace-from-the-end-of-a-string","title":"Trim whitespace from the end of a string","text":"
output_df = Trim(column=\"column\", target_column=\"trimmed_column\", direction=\"right\").transform(input_df)\n

output_df: | column | trimmed_column | |-----------|----------------| | \" hello \" | \" hello\" |

"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.columns","title":"columns class-attribute instance-attribute","text":"
columns: ListOfColumns = Field(default='*', alias='column', description='The column (or list of columns) to trim. Alias: column. If no columns are provided, all string columns will be trimmed.')\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.direction","title":"direction class-attribute instance-attribute","text":"
direction: trim_type = Field(default='left-right', description=\"On which side to remove the spaces. Either 'left', 'right' or 'left-right'\")\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.ColumnConfig","title":"ColumnConfig","text":"

Limit data types to string only.

"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute instance-attribute","text":"
limit_data_type = [STRING]\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute instance-attribute","text":"
run_for_all_data_type = [STRING]\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.func","title":"func","text":"
func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/trim.py
def func(self, column: Column):\n    if self.direction == \"left\":\n        return f.ltrim(column)\n\n    if self.direction == \"right\":\n        return f.rtrim(column)\n\n    # both (left-right)\n    return f.rtrim(f.ltrim(column))\n
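
A short sketch of the default-columns behaviour (assuming input_df contains one or more string columns, and that target_column can be omitted as with the other string transformations): leaving columns unset applies the trim in place to every string column, because the default is '*' and ColumnConfig limits execution to string types:

output_df = Trim().transform(input_df)  # left-right trims every string column in place\n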
"},{"location":"api_reference/spark/writers/index.html","title":"Writers","text":"

The Writer class is used to write the DataFrame to a target.

"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode","title":"koheesio.spark.writers.BatchOutputMode","text":"

For Batch:

  • append: Append the contents of the DataFrame to the output table, default option in Koheesio.
  • overwrite: overwrite the existing data.
  • ignore: ignore the operation (i.e. no-op).
  • error or errorifexists: throw an exception at runtime.
  • merge: update matching data in the table and insert rows that do not exist.
  • merge_all: update matching data in the table and insert rows that do not exist.
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.APPEND","title":"APPEND class-attribute instance-attribute","text":"
APPEND = 'append'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.ERROR","title":"ERROR class-attribute instance-attribute","text":"
ERROR = 'error'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.ERRORIFEXISTS","title":"ERRORIFEXISTS class-attribute instance-attribute","text":"
ERRORIFEXISTS = 'error'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.IGNORE","title":"IGNORE class-attribute instance-attribute","text":"
IGNORE = 'ignore'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.MERGE","title":"MERGE class-attribute instance-attribute","text":"
MERGE = 'merge'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.MERGEALL","title":"MERGEALL class-attribute instance-attribute","text":"
MERGEALL = 'merge_all'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.MERGE_ALL","title":"MERGE_ALL class-attribute instance-attribute","text":"
MERGE_ALL = 'merge_all'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.OVERWRITE","title":"OVERWRITE class-attribute instance-attribute","text":"
OVERWRITE = 'overwrite'\n
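
A minimal sketch (assuming BatchOutputMode is importable from koheesio.spark.writers, as the page location suggests): selecting a mode is a plain attribute access, and MERGEALL and MERGE_ALL refer to the same 'merge_all' mode:

from koheesio.spark.writers import BatchOutputMode\n\nmode = BatchOutputMode.APPEND  # the default batch mode in Koheesio\nassert BatchOutputMode.MERGEALL == BatchOutputMode.MERGE_ALL  # both map to 'merge_all'\n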
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.StreamingOutputMode","title":"koheesio.spark.writers.StreamingOutputMode","text":"

For Streaming:

  • append: only the new rows in the streaming DataFrame will be written to the sink.
  • complete: all the rows in the streaming DataFrame/Dataset will be written to the sink every time there are some updates.
  • update: only the rows that were updated in the streaming DataFrame/Dataset will be written to the sink every time there are some updates. If the query doesn't contain aggregations, it will be equivalent to append mode.
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.StreamingOutputMode.APPEND","title":"APPEND class-attribute instance-attribute","text":"
APPEND = 'append'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.StreamingOutputMode.COMPLETE","title":"COMPLETE class-attribute instance-attribute","text":"
COMPLETE = 'complete'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.StreamingOutputMode.UPDATE","title":"UPDATE class-attribute instance-attribute","text":"
UPDATE = 'update'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer","title":"koheesio.spark.writers.Writer","text":"

The Writer class is used to write the DataFrame to a target.

"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer.df","title":"df class-attribute instance-attribute","text":"
df: Optional[DataFrame] = Field(default=None, description='The Spark DataFrame')\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer.format","title":"format class-attribute instance-attribute","text":"
format: str = Field(default='delta', description='The format of the output')\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer.streaming","title":"streaming property","text":"
streaming: bool\n

Check if the DataFrame is a streaming DataFrame or not.

"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer.execute","title":"execute abstractmethod","text":"
execute()\n

Execute on a Writer should handle writing of the self.df (input) as a minimum

Source code in src/koheesio/spark/writers/__init__.py
@abstractmethod\ndef execute(self):\n    \"\"\"Execute on a Writer should handle writing of the self.df (input) as a minimum\"\"\"\n    # self.df  # input dataframe\n    ...\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer.write","title":"write","text":"
write(df: Optional[DataFrame] = None) -> Output\n

Write the DataFrame to the output using execute() and return the output.

If no DataFrame is passed, the self.df will be used. If no self.df is set, a RuntimeError will be thrown.

Source code in src/koheesio/spark/writers/__init__.py
def write(self, df: Optional[DataFrame] = None) -> SparkStep.Output:\n    \"\"\"Write the DataFrame to the output using execute() and return the output.\n\n    If no DataFrame is passed, the self.df will be used.\n    If no self.df is set, a RuntimeError will be thrown.\n    \"\"\"\n    self.df = df or self.df\n    if not self.df:\n        raise RuntimeError(\"No valid Dataframe was passed\")\n    self.execute()\n    return self.output\n
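
As an illustration of this contract (the ConsoleWriter name is a hypothetical example, not part of Koheesio): a subclass only needs to implement execute() against self.df, after which write() handles the DataFrame plumbing and returns the output:

class ConsoleWriter(Writer):\n    \"\"\"Hypothetical writer that only prints the DataFrame.\"\"\"\n\n    def execute(self):\n        # write() guarantees self.df is set before execute() runs\n        self.df.show()\n\noutput = ConsoleWriter().write(df)  # df is a hypothetical Spark DataFrame\n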
"},{"location":"api_reference/spark/writers/buffer.html","title":"Buffer","text":"

This module contains classes for writing data to a buffer before writing to the final destination.

The BufferWriter class is a base class for writers that write to a buffer first. It provides methods for writing, reading, and resetting the buffer, as well as checking if the buffer is compressed and compressing the buffer.

The PandasCsvBufferWriter class is a subclass of BufferWriter that writes a Spark DataFrame to CSV file(s) using Pandas. It is not meant to be used for writing huge amounts of data, but rather for writing smaller amounts of data to more arbitrary file systems (e.g., SFTP).

The PandasJsonBufferWriter class is a subclass of BufferWriter that writes a Spark DataFrame to JSON file(s) using Pandas. It is not meant to be used for writing huge amounts of data, but rather for writing smaller amounts of data to more arbitrary file systems (e.g., SFTP).

"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter","title":"koheesio.spark.writers.buffer.BufferWriter","text":"

Base class for writers that write to a buffer first, before writing to the final destination.

execute() method should implement how the incoming DataFrame is written to the buffer object (e.g. BytesIO) in the output.

The default implementation uses a SpooledTemporaryFile as the buffer. This is a file-like object that starts off stored in memory and automatically rolls over to a temporary file on disk if it exceeds a certain size. A SpooledTemporaryFile behaves similar to BytesIO, but with the added benefit of being able to handle larger amounts of data.

This approach provides a balance between speed and memory usage, allowing for fast in-memory operations for smaller amounts of data while still being able to handle larger amounts of data that would not otherwise fit in memory.

"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output","title":"Output","text":"

Output class for BufferWriter

"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.buffer","title":"buffer class-attribute instance-attribute","text":"
buffer: InstanceOf[SpooledTemporaryFile] = Field(default_factory=partial(SpooledTemporaryFile, mode='w+b', max_size=0), exclude=True)\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.compress","title":"compress","text":"
compress()\n

Compress the file_buffer in place using GZIP

Source code in src/koheesio/spark/writers/buffer.py
def compress(self):\n    \"\"\"Compress the file_buffer in place using GZIP\"\"\"\n    # check if the buffer is already compressed\n    if self.is_compressed():\n        self.logger.warn(\"Buffer is already compressed. Nothing to compress...\")\n        return self\n\n    # compress the file_buffer\n    file_buffer = self.buffer\n    compressed = gzip.compress(file_buffer.read())\n\n    # write the compressed content back to the buffer\n    self.reset_buffer()\n    self.buffer.write(compressed)\n\n    return self  # to allow for chaining\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.is_compressed","title":"is_compressed","text":"
is_compressed()\n

Check if the buffer is compressed.

Source code in src/koheesio/spark/writers/buffer.py
def is_compressed(self):\n    \"\"\"Check if the buffer is compressed.\"\"\"\n    self.rewind_buffer()\n    magic_number_present = self.buffer.read(2) == b\"\\x1f\\x8b\"\n    self.rewind_buffer()\n    return magic_number_present\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.read","title":"read","text":"
read()\n

Read the buffer

Source code in src/koheesio/spark/writers/buffer.py
def read(self):\n    \"\"\"Read the buffer\"\"\"\n    self.rewind_buffer()\n    data = self.buffer.read()\n    self.rewind_buffer()\n    return data\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.reset_buffer","title":"reset_buffer","text":"
reset_buffer()\n

Reset the buffer

Source code in src/koheesio/spark/writers/buffer.py
def reset_buffer(self):\n    \"\"\"Reset the buffer\"\"\"\n    self.buffer.truncate(0)\n    self.rewind_buffer()\n    return self\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.rewind_buffer","title":"rewind_buffer","text":"
rewind_buffer()\n

Rewind the buffer

Source code in src/koheesio/spark/writers/buffer.py
def rewind_buffer(self):\n    \"\"\"Rewind the buffer\"\"\"\n    self.buffer.seek(0)\n    return self\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.write","title":"write","text":"
write(df=None) -> Output\n

Write the DataFrame to the buffer

Source code in src/koheesio/spark/writers/buffer.py
def write(self, df=None) -> Output:\n    \"\"\"Write the DataFrame to the buffer\"\"\"\n    self.df = df or self.df\n    if not self.df:\n        raise RuntimeError(\"No valid Dataframe was passed\")\n    self.output.reset_buffer()\n    self.execute()\n    return self.output\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter","title":"koheesio.spark.writers.buffer.PandasCsvBufferWriter","text":"

Write a Spark DataFrame to CSV file(s) using Pandas.

Takes inspiration from https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html

See also: https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option

Note

This class is not meant to be used for writing huge amounts of data. It is meant to be used for writing smaller amounts of data to more arbitrary file systems (e.g. SFTP).

Pyspark vs Pandas

The following table shows the mapping between Pyspark, Pandas, and Koheesio properties. Note that the default values are mostly the same as Pyspark's DataFrameWriter implementation, with some exceptions (see below).

This class implements the most commonly used properties. If a property is not explicitly implemented, it can be accessed through params.

PySpark Property Default PySpark Pandas Property Default Pandas Koheesio Property Default Koheesio Notes maxRecordsPerFile ... chunksize None max_records_per_file ... Spark property name: spark.sql.files.maxRecordsPerFile sep , sep , sep , lineSep \\n line_terminator os.linesep lineSep (alias=line_terminator) \\n N/A ... index True index False Determines whether row labels (index) are included in the output header False header True header True quote \" quotechar \" quote (alias=quotechar) \" quoteAll False doublequote True quoteAll (alias=doublequote) False escape \\ escapechar None escapechar (alias=escape) \\ escapeQuotes True N/A N/A N/A ... Not available in Pandas ignoreLeadingWhiteSpace True N/A N/A N/A ... Not available in Pandas ignoreTrailingWhiteSpace True N/A N/A N/A ... Not available in Pandas charToEscapeQuoteEscaping escape or \u0000 N/A N/A N/A ... Not available in Pandas dateFormat yyyy-MM-dd N/A N/A N/A ... Pandas implements Timestamp, not Date timestampFormat yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX] date_format N/A timestampFormat (alias=date_format) yyyy-MM-dd'T'HHss.SSS Follows PySpark defaults timestampNTZFormat yyyy-MM-dd'T'HH:mm:ss[.SSS] N/A N/A N/A ... Pandas implements Timestamp, see above compression None compression infer compression None encoding utf-8 encoding utf-8 N/A ... Not explicitly implemented nullValue \"\" na_rep \"\" N/A \"\" Not explicitly implemented emptyValue \"\" na_rep \"\" N/A \"\" Not explicitly implemented N/A ... float_format N/A N/A ... Not explicitly implemented N/A ... decimal N/A N/A ... Not explicitly implemented N/A ... index_label None N/A ... Not explicitly implemented N/A ... columns N/A N/A ... Not explicitly implemented N/A ... mode N/A N/A ... Not explicitly implemented N/A ... quoting N/A N/A ... Not explicitly implemented N/A ... errors N/A N/A ... Not explicitly implemented N/A ... storage_options N/A N/A ... Not explicitly implemented differences with Pyspark:
  • dateFormat -> Pandas implements Timestamp, not just Date. Hence, Koheesio sets the default to the python equivalent of PySpark's default.
  • compression -> Spark does not compress by default, hence Koheesio does not compress by default. Compression can be provided though.

Parameters:

Name Type Description Default header bool

Whether to write the names of columns as the first line. In Pandas a list of strings can be given assumed to be aliases for the column names - this is not supported in this class. Instead, the column names as used in the dataframe are used as the header. Koheesio sets this default to True to match Pandas' default.

True sep str

Field delimiter for the output file. Default is ','.

, quote str

String of length 1. Character used to quote fields. In PySpark, this is called 'quote', in Pandas this is called 'quotechar'. Default is '\"'.

\" quoteAll bool

A flag indicating whether all values should always be enclosed in quotes in a field. Koheesio sets the default (False) to only escape values containing a quote character - this is Pyspark's default behavior. In Pandas, this is called 'doublequote'. Default is False.

False escape str

String of length 1. Character used to escape sep and quotechar when appropriate. Koheesio sets this default to \\ to match Pyspark's default behavior. In Pandas, this field is called 'escapechar', and defaults to None. Default is '\\'.

\\ timestampFormat str

Sets the string that indicates a date format for datetime objects. Koheesio sets this default to a close equivalent of Pyspark's default (excluding timezone information). In Pandas, this field is called 'date_format'. Note: Pandas does not support Java Timestamps, only Python Timestamps. The Pyspark default is yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX] which mimics the iso8601 format (datetime.isoformat()). Default is '%Y-%m-%dT%H:%M:%S.%f'.

yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX] lineSep str, optional, default=

String of length 1. Defines the character used as line separator that should be used for writing. Default is os.linesep.

required compression Optional[Literal['infer', 'gzip', 'bz2', 'zip', 'xz', 'zstd', 'tar']]

A string representing the compression to use for on-the-fly compression of the output data. Note: Pandas sets this default to 'infer', Koheesio sets this default to 'None' leaving the data uncompressed by default. Can be set to one of 'infer', 'gzip', 'bz2', 'zip', 'xz', 'zstd', or 'tar'. See Pandas documentation for more details.

None"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.compression","title":"compression class-attribute instance-attribute","text":"
compression: Optional[CompressionOptions] = Field(default=None, description=\"A string representing the compression to use for on-the-fly compression of the output data.Note: Pandas sets this default to 'infer', Koheesio sets this default to 'None' leaving the data uncompressed by default. Can be set to one of 'infer', 'gzip', 'bz2', 'zip', 'xz', 'zstd', or 'tar'. See Pandas documentation for more details.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.escape","title":"escape class-attribute instance-attribute","text":"
escape: constr(max_length=1) = Field(default='\\\\', description=\"String of length 1. Character used to escape sep and quotechar when appropriate. Koheesio sets this default to `\\\\` to match Pyspark's default behavior. In Pandas, this is called 'escapechar', and defaults to None.\", alias='escapechar')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.header","title":"header class-attribute instance-attribute","text":"
header: bool = Field(default=True, description=\"Whether to write the names of columns as the first line. In Pandas a list of strings can be given assumed to be aliases for the column names - this is not supported in this class. Instead, the column names as used in the dataframe are used as the header. Koheesio sets this default to True to match Pandas' default.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.index","title":"index class-attribute instance-attribute","text":"
index: bool = Field(default=False, description='Toggles whether to write row names (index). Default False in Koheesio - pandas default is True.')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.lineSep","title":"lineSep class-attribute instance-attribute","text":"
lineSep: Optional[constr(max_length=1)] = Field(default=linesep, description='String of length 1. Defines the character used as line separator that should be used for writing.', alias='line_terminator')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.quote","title":"quote class-attribute instance-attribute","text":"
quote: constr(max_length=1) = Field(default='\"', description=\"String of length 1. Character used to quote fields. In PySpark, this is called 'quote', in Pandas this is called 'quotechar'.\", alias='quotechar')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.quoteAll","title":"quoteAll class-attribute instance-attribute","text":"
quoteAll: bool = Field(default=False, description=\"A flag indicating whether all values should always be enclosed in quotes in a field. Koheesio sets the default (False) to only escape values containing a quote character - this is Pyspark's default behavior. In Pandas, this is called 'doublequote'.\", alias='doublequote')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.sep","title":"sep class-attribute instance-attribute","text":"
sep: constr(max_length=1) = Field(default=',', description='Field delimiter for the output file')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.timestampFormat","title":"timestampFormat class-attribute instance-attribute","text":"
timestampFormat: str = Field(default='%Y-%m-%dT%H:%M:%S.%f', description=\"Sets the string that indicates a date format for datetime objects. Koheesio sets this default to a close equivalent of Pyspark's default (excluding timezone information). In Pandas, this field is called 'date_format'. Note: Pandas does not support Java Timestamps, only Python Timestamps. The Pyspark default is `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]` which mimics the iso8601 format (`datetime.isoformat()`).\", alias='date_format')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.Output","title":"Output","text":"

Output class for PandasCsvBufferWriter

"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.Output.pandas_df","title":"pandas_df class-attribute instance-attribute","text":"
pandas_df: Optional[DataFrame] = Field(None, description='The Pandas DataFrame that was written')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.execute","title":"execute","text":"
execute()\n

Write the DataFrame to the buffer using Pandas to_csv() method. Compression is handled by pandas to_csv() method.

Source code in src/koheesio/spark/writers/buffer.py
def execute(self):\n    \"\"\"Write the DataFrame to the buffer using Pandas to_csv() method.\n    Compression is handled by pandas to_csv() method.\n    \"\"\"\n    # convert the Spark DataFrame to a Pandas DataFrame\n    self.output.pandas_df = self.df.toPandas()\n\n    # create csv file in memory\n    file_buffer = self.output.buffer\n    self.output.pandas_df.to_csv(file_buffer, **self.get_options(options_type=\"spark\"))\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.get_options","title":"get_options","text":"
get_options(options_type: str = 'csv')\n

Returns the options to pass to Pandas' to_csv() method.

Source code in src/koheesio/spark/writers/buffer.py
def get_options(self, options_type: str = \"csv\"):\n    \"\"\"Returns the options to pass to Pandas' to_csv() method.\"\"\"\n    try:\n        import pandas as _pd\n\n        # Get the pandas version as a tuple of integers\n        pandas_version = tuple(int(i) for i in _pd.__version__.split(\".\"))\n    except ImportError:\n        raise ImportError(\"Pandas is required to use this writer\")\n\n    # Use line_separator for pandas 2.0.0 and later\n    line_sep_option_naming = \"line_separator\" if pandas_version >= (2, 0, 0) else \"line_terminator\"\n\n    csv_options = {\n        \"header\": self.header,\n        \"sep\": self.sep,\n        \"quotechar\": self.quote,\n        \"doublequote\": self.quoteAll,\n        \"escapechar\": self.escape,\n        \"na_rep\": self.emptyValue or self.nullValue,\n        line_sep_option_naming: self.lineSep,\n        \"index\": self.index,\n        \"date_format\": self.timestampFormat,\n        \"compression\": self.compression,\n        **self.params,\n    }\n\n    if options_type == \"spark\":\n        csv_options[\"lineterminator\"] = csv_options.pop(line_sep_option_naming)\n    elif options_type == \"kohesio_pandas_buffer_writer\":\n        csv_options[\"line_terminator\"] = csv_options.pop(line_sep_option_naming)\n\n    return csv_options\n
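
A minimal end-to-end sketch (assuming an existing Spark DataFrame df; the separator choice is arbitrary): write() fills the in-memory buffer, and the Output helpers shown earlier read it back:

writer = PandasCsvBufferWriter(df=df, sep=\";\", header=True)\noutput = writer.write()\ncsv_bytes = output.read()  # buffer contents, rewound for you\n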
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter","title":"koheesio.spark.writers.buffer.PandasJsonBufferWriter","text":"

Write a Spark DataFrame to JSON file(s) using Pandas.

Takes inspiration from https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html

Note

This class is not meant to be used for writing huge amounts of data. It is meant to be used for writing smaller amounts of data to more arbitrary file systems (e.g. SFTP).

Parameters:

Name Type Description Default orient

Format of the resulting JSON string. Default is 'records'.

required lines

Format output as one JSON object per line. Only used when orient='records'. Default is True. - If true, the output will be formatted as one JSON object per line. - If false, the output will be written as a single JSON object. Note: this value is only used when orient='records' and will be ignored otherwise.

required date_format

Type of date conversion. Default is 'iso'. See Date and Timestamp Formats for a detailed description and more information.

required double_precision

Number of decimal places for encoding floating point values. Default is 10.

required force_ascii

Force encoded string to be ASCII. Default is True.

required compression

A string representing the compression to use for on-the-fly compression of the output data. Koheesio sets this default to 'None' leaving the data uncompressed. Can be set to 'gzip' optionally. Other compression options are currently not supported by Koheesio for JSON output.

required"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.columns","title":"columns class-attribute instance-attribute","text":"
columns: Optional[list[str]] = Field(default=None, description='The columns to write. If None, all columns will be written.')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.compression","title":"compression class-attribute instance-attribute","text":"
compression: Optional[Literal['gzip']] = Field(default=None, description=\"A string representing the compression to use for on-the-fly compression of the output data.Koheesio sets this default to 'None' leaving the data uncompressed by default. Can be set to 'gzip' optionally.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.date_format","title":"date_format class-attribute instance-attribute","text":"
date_format: Literal['iso', 'epoch'] = Field(default='iso', description=\"Type of date conversion. Default is 'iso'.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.double_precision","title":"double_precision class-attribute instance-attribute","text":"
double_precision: int = Field(default=10, description='Number of decimal places for encoding floating point values. Default is 10.')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.force_ascii","title":"force_ascii class-attribute instance-attribute","text":"
force_ascii: bool = Field(default=True, description='Force encoded string to be ASCII. Default is True.')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.lines","title":"lines class-attribute instance-attribute","text":"
lines: bool = Field(default=True, description=\"Format output as one JSON object per line. Only used when orient='records'. Default is True.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.orient","title":"orient class-attribute instance-attribute","text":"
orient: Literal['split', 'records', 'index', 'columns', 'values', 'table'] = Field(default='records', description=\"Format of the resulting JSON string. Default is 'records'.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.Output","title":"Output","text":"

Output class for PandasJsonBufferWriter

"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.Output.pandas_df","title":"pandas_df class-attribute instance-attribute","text":"
pandas_df: Optional[DataFrame] = Field(None, description='The Pandas DataFrame that was written')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.execute","title":"execute","text":"
execute()\n

Write the DataFrame to the buffer using Pandas to_json() method.

Source code in src/koheesio/spark/writers/buffer.py
def execute(self):\n    \"\"\"Write the DataFrame to the buffer using Pandas to_json() method.\"\"\"\n    df = self.df\n    if self.columns:\n        df = df[self.columns]\n\n    # convert the Spark DataFrame to a Pandas DataFrame\n    self.output.pandas_df = df.toPandas()\n\n    # create json file in memory\n    file_buffer = self.output.buffer\n    self.output.pandas_df.to_json(file_buffer, **self.get_options())\n\n    # compress the buffer if compression is set\n    if self.compression:\n        self.output.compress()\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.get_options","title":"get_options","text":"
get_options()\n

Returns the options to pass to Pandas' to_json() method.

Source code in src/koheesio/spark/writers/buffer.py
def get_options(self):\n    \"\"\"Returns the options to pass to Pandas' to_json() method.\"\"\"\n    json_options = {\n        \"orient\": self.orient,\n        \"date_format\": self.date_format,\n        \"double_precision\": self.double_precision,\n        \"force_ascii\": self.force_ascii,\n        \"lines\": self.lines,\n        **self.params,\n    }\n\n    # ignore the 'lines' parameter if orient is not 'records'\n    if self.orient != \"records\":\n        del json_options[\"lines\"]\n\n    return json_options\n
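
A similar sketch for the JSON writer (again assuming an existing Spark DataFrame df): with the defaults, each row becomes one JSON object per line, and setting compression triggers the gzip step in execute():

writer = PandasJsonBufferWriter(df=df, orient=\"records\", lines=True, compression=\"gzip\")\njson_gz_bytes = writer.write().read()  # gzip-compressed JSON lines from the buffer\n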
"},{"location":"api_reference/spark/writers/dummy.html","title":"Dummy","text":"

Module for the DummyWriter class.

"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter","title":"koheesio.spark.writers.dummy.DummyWriter","text":"

A simple DummyWriter that performs the equivalent of a df.show() on the given DataFrame and returns the first row of data as a dict.

This Writer does not actually write anything to a source/destination, but is useful for debugging or testing purposes.

Parameters:

Name Type Description Default n PositiveInt

Number of rows to show.

20 truncate bool | PositiveInt

If set to True, truncates strings longer than 20 characters by default. If set to a number greater than one, truncates long strings to length truncate and aligns cells right.

True vertical bool

If set to True, print output rows vertically (one line per column value).

False"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.n","title":"n class-attribute instance-attribute","text":"
n: PositiveInt = Field(default=20, description='Number of rows to show.', gt=0)\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.truncate","title":"truncate class-attribute instance-attribute","text":"
truncate: Union[bool, PositiveInt] = Field(default=True, description='If set to ``True``, truncate strings longer than 20 chars by default. If set to a number greater than one, truncates long strings to length ``truncate`` and aligns cells right.')\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.vertical","title":"vertical class-attribute instance-attribute","text":"
vertical: bool = Field(default=False, description='If set to ``True``, print output rows vertically (one line per column value).')\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.Output","title":"Output","text":"

DummyWriter output

"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.Output.df_content","title":"df_content class-attribute instance-attribute","text":"
df_content: str = Field(default=..., description='The content of the DataFrame as a string')\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.Output.head","title":"head class-attribute instance-attribute","text":"
head: Dict[str, Any] = Field(default=..., description='The first row of the DataFrame as a dict')\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.execute","title":"execute","text":"
execute() -> Output\n

Execute the DummyWriter

Source code in src/koheesio/spark/writers/dummy.py
def execute(self) -> Output:\n    \"\"\"Execute the DummyWriter\"\"\"\n    df: DataFrame = self.df\n\n    # noinspection PyProtectedMember\n    df_content = df._jdf.showString(self.n, self.truncate, self.vertical)\n\n    # logs the equivalent of doing df.show()\n    self.log.info(f\"content of df that was passed to DummyWriter:\\n{df_content}\")\n\n    self.output.head = self.df.head().asDict()\n    self.output.df_content = df_content\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.int_truncate","title":"int_truncate","text":"
int_truncate(truncate_value) -> int\n

Truncate is either a bool or an int.

Parameters:

truncate_value : int | bool, optional, default=True If int, specifies the maximum length of the string. If bool and True, defaults to a maximum length of 20 characters.

Returns:

int The maximum length of the string.

Source code in src/koheesio/spark/writers/dummy.py
@field_validator(\"truncate\")\ndef int_truncate(cls, truncate_value) -> int:\n    \"\"\"\n    Truncate is either a bool or an int.\n\n    Parameters:\n    -----------\n    truncate_value : int | bool, optional, default=True\n        If int, specifies the maximum length of the string.\n        If bool and True, defaults to a maximum length of 20 characters.\n\n    Returns:\n    --------\n    int\n        The maximum length of the string.\n\n    \"\"\"\n    # Same logic as what is inside DataFrame.show()\n    if isinstance(truncate_value, bool) and truncate_value is True:\n        return 20  # default is 20 chars\n    return int(truncate_value)  # otherwise 0, or whatever the user specified\n
"},{"location":"api_reference/spark/writers/kafka.html","title":"Kafka","text":"

Kafka writer to write batch or streaming data into kafka topics

"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter","title":"koheesio.spark.writers.kafka.KafkaWriter","text":"

Kafka writer to write batch or streaming data into kafka topics

All kafka specific options can be provided as additional init params

Parameters:

Name Type Description Default broker str

broker url of the kafka cluster

required topic str

full topic name to write the data to

required trigger Optional[Union[Trigger, str, Dict]]

Indicates optionally how to stream the data into kafka, continuous or batch

required checkpoint_location str

In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs.

required Example
KafkaWriter(\n    broker=\"broker.com:9500\",\n    topic=\"test-topic\",\n    trigger=Trigger(continuous=True),\n    checkpoint_location=\"s3://bucket/test-topic\",\n    includeHeaders=\"true\",\n    # dotted Kafka option keys are not valid Python identifiers, so pass them via dict unpacking\n    **{\n        \"key.serializer\": \"org.apache.kafka.common.serialization.StringSerializer\",\n        \"value.serializer\": \"org.apache.kafka.common.serialization.StringSerializer\",\n        \"kafka.group.id\": \"test-group\",\n    },\n)\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.batch_writer","title":"batch_writer property","text":"
batch_writer: DataFrameWriter\n

returns a batch writer

Returns:

Type Description DataFrameWriter"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.broker","title":"broker class-attribute instance-attribute","text":"
broker: str = Field(default=..., description='Kafka brokers to write to')\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.checkpoint_location","title":"checkpoint_location class-attribute instance-attribute","text":"
checkpoint_location: str = Field(default=..., alias='checkpointLocation', description='In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs. You can configure a query with a checkpoint location, and the query will save all the progress information and the running aggregates to the checkpoint location. This checkpoint location has to be a path in an HDFS compatible file system that is accessible by Spark (e.g. Databricks Unity Catalog External Location)')\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.format","title":"format class-attribute instance-attribute","text":"
format: str = 'kafka'\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.logged_option_keys","title":"logged_option_keys property","text":"
logged_option_keys\n

keys to be logged

"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.options","title":"options property","text":"
options\n

Retrieve the Kafka options, including topic and broker.

Returns:

Type Description dict

Dict being the combination of kafka options + topic + broker

"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.stream_writer","title":"stream_writer property","text":"
stream_writer: DataStreamWriter\n

returns a stream writer

Returns:

Type Description DataStreamWriter"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.streaming_query","title":"streaming_query property","text":"
streaming_query: Optional[Union[str, StreamingQuery]]\n

return the streaming query

"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.topic","title":"topic class-attribute instance-attribute","text":"
topic: str = Field(default=..., description='Kafka topic to write to')\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.trigger","title":"trigger class-attribute instance-attribute","text":"
trigger: Optional[Union[Trigger, str, Dict]] = Field(Trigger(available_now=True), description='Set the trigger for the stream query. If not set, data is processed in batch')\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.writer","title":"writer property","text":"
writer: Union[DataStreamWriter, DataFrameWriter]\n

Function to get the writer of the proper type according to whether the data to be written is a stream or not. This function will also set the trigger property in case of a DataStream.

Returns:

Type Description Union[DataStreamWriter, DataFrameWriter]

In case of streaming data -> DataStreamWriter, else -> DataFrameWriter

"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.Output","title":"Output","text":"

Output of the KafkaWriter

"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.Output.streaming_query","title":"streaming_query class-attribute instance-attribute","text":"
streaming_query: Optional[Union[str, StreamingQuery]] = Field(default=None, description='Query ID of the stream query')\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.execute","title":"execute","text":"
execute()\n

Effectively write the data from the dataframe (streaming or batch) to the kafka topic.

Returns:

Type Description Output

The streaming_query attribute can be used to gain insights into the running write.

Source code in src/koheesio/spark/writers/kafka.py
def execute(self):\n    \"\"\"Effectively write the data from the dataframe (streaming of batch) to kafka topic.\n\n    Returns\n    -------\n    KafkaWriter.Output\n        streaming_query function can be used to gain insights on running write.\n    \"\"\"\n    applied_options = {k: v for k, v in self.options.items() if k in self.logged_option_keys}\n    self.log.debug(f\"Applying options {applied_options}\")\n\n    self._validate_dataframe()\n\n    _writer = self.writer.format(self.format).options(**self.options)\n    self.output.streaming_query = _writer.start() if self.streaming else _writer.save()\n
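
A brief sketch of invoking the writer (assuming kafka_writer was constructed as in the example above and df is either a batch or a streaming DataFrame): the same write call covers both cases, and for streams the query handle ends up on the output.

# batch DataFrame -> DataFrameWriter.save(); streaming DataFrame -> DataStreamWriter.start()\nkafka_writer.write(df)\n\n# for streaming data, the handle to the running query is exposed on the output\nquery = kafka_writer.output.streaming_query\n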
"},{"location":"api_reference/spark/writers/snowflake.html","title":"Snowflake","text":"

This module contains the SnowflakeWriter class, which is used to write data to Snowflake.

"},{"location":"api_reference/spark/writers/stream.html","title":"Stream","text":"

Module that holds some classes and functions to be able to write to a stream

Classes:

Name Description Trigger

class to set the trigger for a stream query

StreamWriter

abstract class for stream writers

ForEachBatchStreamWriter

class to run a writer for each batch

Functions:

Name Description writer_to_foreachbatch

function to be used as batch_function for StreamWriter (sub)classes

"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.ForEachBatchStreamWriter","title":"koheesio.spark.writers.stream.ForEachBatchStreamWriter","text":"

Runnable ForEachBatchWriter

"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.ForEachBatchStreamWriter.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/writers/stream.py
def execute(self):\n    self.streaming_query = self.writer.start()\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter","title":"koheesio.spark.writers.stream.StreamWriter","text":"

ABC Stream Writer

"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.batch_function","title":"batch_function class-attribute instance-attribute","text":"
batch_function: Optional[Callable] = Field(default=None, description='allows you to run custom batch functions for each micro batch', alias='batch_function_for_each_df')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.checkpoint_location","title":"checkpoint_location class-attribute instance-attribute","text":"
checkpoint_location: str = Field(default=..., alias='checkpointLocation', description='In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs. You can configure a query with a checkpoint location, and the query will save all the progress information and the running aggregates to the checkpoint location. This checkpoint location has to be a path in an HDFS compatible file system that is accessible by Spark (e.g. Databricks Unity Catalog External Location)')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.output_mode","title":"output_mode class-attribute instance-attribute","text":"
output_mode: StreamingOutputMode = Field(default=APPEND, alias='outputMode', description=__doc__)\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.stream_writer","title":"stream_writer property","text":"
stream_writer: DataStreamWriter\n

Returns the stream writer for the given DataFrame and settings

"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.streaming_query","title":"streaming_query class-attribute instance-attribute","text":"
streaming_query: Optional[Union[str, StreamingQuery]] = Field(default=None, description='Query ID of the stream query')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.trigger","title":"trigger class-attribute instance-attribute","text":"
trigger: Optional[Union[Trigger, str, Dict]] = Field(default=Trigger(available_now=True), description='Set the trigger for the stream query. If this is not set, data is processed as batch')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.writer","title":"writer property","text":"
writer\n

Returns the stream writer since we don't have a batch mode for streams

"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.await_termination","title":"await_termination","text":"
await_termination(timeout: Optional[int] = None)\n

Await termination of the stream query

Source code in src/koheesio/spark/writers/stream.py
def await_termination(self, timeout: Optional[int] = None):\n    \"\"\"Await termination of the stream query\"\"\"\n    self.streaming_query.awaitTermination(timeout=timeout)\n
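
A hedged sketch of a blocking streaming job (assuming a streaming DataFrame named streaming_df and the DeltaTableStreamWriter documented further below; the table name and checkpoint path are illustrative):

from koheesio.spark.writers.delta import DeltaTableStreamWriter\n\nwriter = DeltaTableStreamWriter(\n    table='my_table',\n    checkpointLocation='/tmp/checkpoints/my_table',\n)\nwriter.write(streaming_df)\n\n# block until the query stops, or until the optional timeout (in seconds) elapses\nwriter.await_termination(timeout=600)\n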
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.execute","title":"execute abstractmethod","text":"
execute()\n
Source code in src/koheesio/spark/writers/stream.py
@abstractmethod\ndef execute(self):\n    raise NotImplementedError\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger","title":"koheesio.spark.writers.stream.Trigger","text":"

Trigger types for a stream query.

Only one trigger can be set!

Example
  • processingTime='5 seconds'
  • continuous='5 seconds'
  • availableNow=True
  • once=True
See Also
  • https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.available_now","title":"available_now class-attribute instance-attribute","text":"
available_now: Optional[bool] = Field(default=None, alias='availableNow', description='if set to True, set a trigger that processes all available data in multiple batches then terminates the query.')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.continuous","title":"continuous class-attribute instance-attribute","text":"
continuous: Optional[str] = Field(default=None, description=\"a time interval as a string, e.g. '5 seconds', '1 minute'.Set a trigger that runs a continuous query with a given checkpoint interval.\")\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.model_config","title":"model_config class-attribute instance-attribute","text":"
model_config = ConfigDict(validate_default=False, extra='forbid')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.once","title":"once class-attribute instance-attribute","text":"
once: Optional[bool] = Field(default=None, deprecated=True, description='if set to True, set a trigger that processes only one batch of data in a streaming query then terminates the query. use `available_now` instead of `once`.')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.processing_time","title":"processing_time class-attribute instance-attribute","text":"
processing_time: Optional[str] = Field(default=None, alias='processingTime', description=\"a processing time interval as a string, e.g. '5 seconds', '1 minute'.Set a trigger that runs a microbatch query periodically based on the processing time.\")\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.triggers","title":"triggers property","text":"
triggers\n

Returns a list of tuples with the value for each trigger

"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.value","title":"value property","text":"
value: Dict[str, str]\n

Returns the trigger value as a dictionary

"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.execute","title":"execute","text":"
execute()\n

Returns the trigger value as a dictionary. This method can be skipped, as the value can be accessed directly from the value property.

Source code in src/koheesio/spark/writers/stream.py
def execute(self):\n    \"\"\"Returns the trigger value as a dictionary\n    This method can be skipped, as the value can be accessed directly from the `value` property\n    \"\"\"\n    self.log.warning(\"Trigger.execute is deprecated. Use Trigger.value directly instead\")\n    self.output.value = self.value\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.from_any","title":"from_any classmethod","text":"
from_any(value)\n

Dynamically creates a Trigger class based on either another Trigger class, a passed string value, or a dictionary

This way Trigger.from_any can be used as part of a validator, without needing to worry about supported types

Source code in src/koheesio/spark/writers/stream.py
@classmethod\ndef from_any(cls, value):\n    \"\"\"Dynamically creates a Trigger class based on either another Trigger class, a passed string value, or a\n    dictionary\n\n    This way Trigger.from_any can be used as part of a validator, without needing to worry about supported types\n    \"\"\"\n    if isinstance(value, Trigger):\n        return value\n\n    if isinstance(value, str):\n        return cls.from_string(value)\n\n    if isinstance(value, dict):\n        return cls.from_dict(value)\n\n    raise RuntimeError(f\"Unable to create Trigger based on the given value: {value}\")\n
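
A small illustrative sketch of the accepted input shapes (the trigger values mirror the examples documented for this class):

from koheesio.spark.writers.stream import Trigger\n\n# all three forms resolve to an equivalent Trigger instance\nt1 = Trigger.from_any(Trigger(availableNow=True))\nt2 = Trigger.from_any({'availableNow': True})\nt3 = Trigger.from_any('availableNow=True')\n\n# the trigger value as a dictionary, ready to be applied to a stream query\nprint(t1.value)\n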
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.from_dict","title":"from_dict classmethod","text":"
from_dict(_dict)\n

Creates a Trigger class based on a dictionary

Source code in src/koheesio/spark/writers/stream.py
@classmethod\ndef from_dict(cls, _dict):\n    \"\"\"Creates a Trigger class based on a dictionary\"\"\"\n    return cls(**_dict)\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.from_string","title":"from_string classmethod","text":"
from_string(trigger: str)\n

Creates a Trigger class based on a string

Example Source code in src/koheesio/spark/writers/stream.py
@classmethod\ndef from_string(cls, trigger: str):\n    \"\"\"Creates a Trigger class based on a string\n\n    Example\n    -------\n    ### happy flow\n\n    * processingTime='5 seconds'\n    * processing_time=\"5 hours\"\n    * processingTime=4 minutes\n    * once=True\n    * once=true\n    * available_now=true\n    * continuous='3 hours'\n    * once=TrUe\n    * once=TRUE\n\n    ### unhappy flow\n    valid values, but should fail the validation check of the class\n\n    * availableNow=False\n    * continuous=True\n    * once=false\n    \"\"\"\n    import re\n\n    trigger_from_string = re.compile(r\"(?P<triggerType>\\w+)=[\\'\\\"]?(?P<value>.+)[\\'\\\"]?\")\n    _match = trigger_from_string.match(trigger)\n\n    if _match is None:\n        raise ValueError(\n            f\"Cannot parse value for Trigger: '{trigger}'. \\n\"\n            f\"Valid types are {', '.join(cls._all_triggers_with_alias())}\"\n        )\n\n    trigger_type, value = _match.groups()\n\n    # strip the value of any quotes\n    value = value.strip(\"'\").strip('\"')\n\n    # making value a boolean when given\n    value = convert_str_to_bool(value)\n\n    return cls.from_dict({trigger_type: value})\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.from_string--happy-flow","title":"happy flow","text":"
  • processingTime='5 seconds'
  • processing_time=\"5 hours\"
  • processingTime=4 minutes
  • once=True
  • once=true
  • available_now=true
  • continuous='3 hours'
  • once=TrUe
  • once=TRUE
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.from_string--unhappy-flow","title":"unhappy flow","text":"

valid values, but should fail the validation check of the class

  • availableNow=False
  • continuous=True
  • once=false
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.validate_available_now","title":"validate_available_now","text":"
validate_available_now(available_now)\n

Validate the available_now trigger value

Source code in src/koheesio/spark/writers/stream.py
@field_validator(\"available_now\", mode=\"before\")\ndef validate_available_now(cls, available_now):\n    \"\"\"Validate the available_now trigger value\"\"\"\n    # making value a boolean when given\n    available_now = convert_str_to_bool(available_now)\n\n    # adapted from `pyspark.sql.streaming.readwriter.DataStreamWriter.trigger`\n    if available_now is not True:\n        raise ValueError(f\"Value for availableNow must be True. Got:{available_now}\")\n    return available_now\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.validate_continuous","title":"validate_continuous","text":"
validate_continuous(continuous)\n

Validate the continuous trigger value

Source code in src/koheesio/spark/writers/stream.py
@field_validator(\"continuous\", mode=\"before\")\ndef validate_continuous(cls, continuous):\n    \"\"\"Validate the continuous trigger value\"\"\"\n    # adapted from `pyspark.sql.streaming.readwriter.DataStreamWriter.trigger` except that the if statement is not\n    # split in two parts\n    if not isinstance(continuous, str):\n        raise ValueError(f\"Value for continuous must be a string. Got: {continuous}\")\n\n    if len(continuous.strip()) == 0:\n        raise ValueError(f\"Value for continuous must be a non empty string. Got: {continuous}\")\n    return continuous\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.validate_once","title":"validate_once","text":"
validate_once(once)\n

Validate the once trigger value

Source code in src/koheesio/spark/writers/stream.py
@field_validator(\"once\", mode=\"before\")\ndef validate_once(cls, once):\n    \"\"\"Validate the once trigger value\"\"\"\n    # making value a boolean when given\n    once = convert_str_to_bool(once)\n\n    # adapted from `pyspark.sql.streaming.readwriter.DataStreamWriter.trigger`\n    if once is not True:\n        raise ValueError(f\"Value for once must be True. Got: {once}\")\n    return once\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.validate_processing_time","title":"validate_processing_time","text":"
validate_processing_time(processing_time)\n

Validate the processing time trigger value

Source code in src/koheesio/spark/writers/stream.py
@field_validator(\"processing_time\", mode=\"before\")\ndef validate_processing_time(cls, processing_time):\n    \"\"\"Validate the processing time trigger value\"\"\"\n    # adapted from `pyspark.sql.streaming.readwriter.DataStreamWriter.trigger`\n    if not isinstance(processing_time, str):\n        raise ValueError(f\"Value for processing_time must be a string. Got: {processing_time}\")\n\n    if len(processing_time.strip()) == 0:\n        raise ValueError(f\"Value for processingTime must be a non empty string. Got: {processing_time}\")\n    return processing_time\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.validate_triggers","title":"validate_triggers","text":"
validate_triggers(triggers: Dict)\n

Validate the trigger value

Source code in src/koheesio/spark/writers/stream.py
@model_validator(mode=\"before\")\ndef validate_triggers(cls, triggers: Dict):\n    \"\"\"Validate the trigger value\"\"\"\n    params = [*triggers.values()]\n\n    # adapted from `pyspark.sql.streaming.readwriter.DataStreamWriter.trigger`; modified to work with pydantic v2\n    if not triggers:\n        raise ValueError(\"No trigger provided\")\n    if len(params) > 1:\n        raise ValueError(\"Multiple triggers not allowed.\")\n\n    return triggers\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.writer_to_foreachbatch","title":"koheesio.spark.writers.stream.writer_to_foreachbatch","text":"
writer_to_foreachbatch(writer: Writer)\n

Call writer.execute on each batch

To be passed as batch_function for StreamWriter (sub)classes.

Example Source code in src/koheesio/spark/writers/stream.py
def writer_to_foreachbatch(writer: Writer):\n    \"\"\"Call `writer.execute` on each batch\n\n    To be passed as batch_function for StreamWriter (sub)classes.\n\n    Example\n    -------\n    ### Writing to a Delta table and a Snowflake table\n    ```python\n    DeltaTableStreamWriter(\n        table=\"my_table\",\n        checkpointLocation=\"my_checkpointlocation\",\n        batch_function=writer_to_foreachbatch(\n            SnowflakeWriter(\n                **sfOptions,\n                table=\"snowflake_table\",\n                insert_type=SnowflakeWriter.InsertType.APPEND,\n            )\n        ),\n    )\n    ```\n    \"\"\"\n\n    def inner(df, batch_id: int):\n        \"\"\"Inner method\n\n        As per the Spark documentation:\n        In every micro-batch, the provided function will be called in every micro-batch with (i) the output rows as a\n        DataFrame and (ii) the batch identifier. The batchId can be used deduplicate and transactionally write the\n        output (that is, the provided Dataset) to external systems. The output DataFrame is guaranteed to exactly\n        same for the same batchId (assuming all operations are deterministic in the query).\n        \"\"\"\n        writer.log.debug(f\"Running batch function for batch {batch_id}\")\n        writer.write(df)\n\n    return inner\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.writer_to_foreachbatch--writing-to-a-delta-table-and-a-snowflake-table","title":"Writing to a Delta table and a Snowflake table","text":"
DeltaTableStreamWriter(\n    table=\"my_table\",\n    checkpointLocation=\"my_checkpointlocation\",\n    batch_function=writer_to_foreachbatch(\n        SnowflakeWriter(\n            **sfOptions,\n            table=\"snowflake_table\",\n            insert_type=SnowflakeWriter.InsertType.APPEND,\n        )\n    ),\n)\n
"},{"location":"api_reference/spark/writers/delta/index.html","title":"Delta","text":"

This module is the entry point for the koheesio.spark.writers.delta package.

It imports and exposes the DeltaTableWriter and DeltaTableStreamWriter classes for external use.

Classes: DeltaTableWriter: Class to write data in batch mode to a Delta table. DeltaTableStreamWriter: Class to write data in streaming mode to a Delta table.

"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode","title":"koheesio.spark.writers.delta.BatchOutputMode","text":"

For Batch:

  • append: Append the contents of the DataFrame to the output table, default option in Koheesio.
  • overwrite: overwrite the existing data.
  • ignore: ignore the operation (i.e. no-op).
  • error or errorifexists: throw an exception at runtime.
  • merge: update matching data in the table and insert rows that do not exist.
  • merge_all: update matching data in the table and insert rows that do not exist.
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.APPEND","title":"APPEND class-attribute instance-attribute","text":"
APPEND = 'append'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.ERROR","title":"ERROR class-attribute instance-attribute","text":"
ERROR = 'error'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.ERRORIFEXISTS","title":"ERRORIFEXISTS class-attribute instance-attribute","text":"
ERRORIFEXISTS = 'error'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.IGNORE","title":"IGNORE class-attribute instance-attribute","text":"
IGNORE = 'ignore'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.MERGE","title":"MERGE class-attribute instance-attribute","text":"
MERGE = 'merge'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.MERGEALL","title":"MERGEALL class-attribute instance-attribute","text":"
MERGEALL = 'merge_all'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.MERGE_ALL","title":"MERGE_ALL class-attribute instance-attribute","text":"
MERGE_ALL = 'merge_all'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.OVERWRITE","title":"OVERWRITE class-attribute instance-attribute","text":"
OVERWRITE = 'overwrite'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter","title":"koheesio.spark.writers.delta.DeltaTableStreamWriter","text":"

Delta table stream writer

"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter.Options","title":"Options","text":"

Options for DeltaTableStreamWriter

"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter.Options.allow_population_by_field_name","title":"allow_population_by_field_name class-attribute instance-attribute","text":"
allow_population_by_field_name: bool = Field(default=True, description=' To do convert to Field and pass as .options(**config)')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter.Options.maxBytesPerTrigger","title":"maxBytesPerTrigger class-attribute instance-attribute","text":"
maxBytesPerTrigger: Optional[str] = Field(default=None, description='How much data to be processed per trigger. The default is 1GB')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter.Options.maxFilesPerTrigger","title":"maxFilesPerTrigger class-attribute instance-attribute","text":"
maxFilesPerTrigger: int = Field(default=1000, description='The maximum number of new files to be considered in every trigger (default: 1000).')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/writers/delta/stream.py
def execute(self):\n    if self.batch_function:\n        self.streaming_query = self.writer.start()\n    else:\n        self.streaming_query = self.writer.toTable(tableName=self.table.table_name)\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter","title":"koheesio.spark.writers.delta.DeltaTableWriter","text":"

Delta table Writer for both batch and streaming dataframes.

Example

Parameters:

Name Type Description Default table Union[DeltaTableStep, str]

The table to write to

required output_mode Optional[Union[str, BatchOutputMode, StreamingOutputMode]]

The output mode to use. Default is BatchOutputMode.APPEND. For streaming, use StreamingOutputMode.

required params Optional[dict]

Additional parameters to use for specific mode

required"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter--example-for-mergeall","title":"Example for MERGEALL","text":"
DeltaTableWriter(\n    table=\"test_table\",\n    output_mode=BatchOutputMode.MERGEALL,\n    output_mode_params={\n        \"merge_cond\": \"target.id=source.id\",\n        \"update_cond\": \"target.col1_val>=source.col1_val\",\n        \"insert_cond\": \"source.col_bk IS NOT NULL\",\n        \"target_alias\": \"target\",  # <------ DEFAULT, can be changed by providing custom value\n        \"source_alias\": \"source\",  # <------ DEFAULT, can be changed by providing custom value\n    },\n)\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter--example-for-merge","title":"Example for MERGE","text":"
DeltaTableWriter(\n    table=\"test_table\",\n    output_mode=BatchOutputMode.MERGE,\n    output_mode_params={\n        'merge_builder': (\n            DeltaTable\n            .forName(sparkSession=spark, tableOrViewName=<target_table_name>)\n            .alias(target_alias)\n            .merge(source=df, condition=merge_cond)\n            .whenMatchedUpdateAll(condition=update_cond)\n            .whenNotMatchedInsertAll(condition=insert_cond)\n            )\n        }\n    )\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter--example-for-merge_1","title":"Example for MERGE","text":"

In case the table isn't created yet, the first run will execute an APPEND operation.

DeltaTableWriter(\n    table=\"test_table\",\n    output_mode=BatchOutputMode.MERGE,\n    output_mode_params={\n        \"merge_builder\": [\n            {\n                \"clause\": \"whenMatchedUpdate\",\n                \"set\": {\"value\": \"source.value\"},\n                \"condition\": \"<update_condition>\",\n            },\n            {\n                \"clause\": \"whenNotMatchedInsert\",\n                \"values\": {\"id\": \"source.id\", \"value\": \"source.value\"},\n                \"condition\": \"<insert_condition>\",\n            },\n        ],\n        \"merge_cond\": \"<merge_condition>\",\n    },\n)\n

"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter--example-for-append","title":"Example for APPEND","text":"

DataFrame writer options can be passed as keyword arguments.

DeltaTableWriter(\n    table=\"test_table\",\n    output_mode=BatchOutputMode.APPEND,\n    partitionOverwriteMode=\"dynamic\",\n    mergeSchema=\"false\",\n)\n

"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.format","title":"format class-attribute instance-attribute","text":"
format: str = 'delta'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.output_mode","title":"output_mode class-attribute instance-attribute","text":"
output_mode: Optional[Union[BatchOutputMode, StreamingOutputMode]] = Field(default=APPEND, alias='outputMode', description=f'{__doc__}\n{__doc__}')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.params","title":"params class-attribute instance-attribute","text":"
params: Optional[dict] = Field(default_factory=dict, alias='output_mode_params', description='Additional parameters to use for specific mode')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.partition_by","title":"partition_by class-attribute instance-attribute","text":"
partition_by: Optional[List[str]] = Field(default=None, alias='partitionBy', description='The list of fields to partition the Delta table on')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.table","title":"table class-attribute instance-attribute","text":"
table: Union[DeltaTableStep, str] = Field(default=..., description='The table to write to')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.writer","title":"writer property","text":"
writer: Union[DeltaMergeBuilder, DataFrameWriter]\n

Specify DeltaTableWriter

"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/writers/delta/batch.py
def execute(self):\n    _writer = self.writer\n\n    if self.table.create_if_not_exists and not self.table.exists:\n        _writer = _writer.options(**self.table.default_create_properties)\n\n    if isinstance(_writer, DeltaMergeBuilder):\n        _writer.execute()\n    else:\n        if options := self.params:\n            # should we add options only if mode is not merge?\n            _writer = _writer.options(**options)\n        _writer.saveAsTable(self.table.table_name)\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.get_output_mode","title":"get_output_mode classmethod","text":"
get_output_mode(choice: str, options: Set[Type]) -> Union[BatchOutputMode, StreamingOutputMode]\n

Retrieve an OutputMode by validating choice against a set of option OutputModes.

Currently supported output modes can be found in:

  • BatchOutputMode
  • StreamingOutputMode
Source code in src/koheesio/spark/writers/delta/batch.py
@classmethod\ndef get_output_mode(cls, choice: str, options: Set[Type]) -> Union[BatchOutputMode, StreamingOutputMode]:\n    \"\"\"Retrieve an OutputMode by validating `choice` against a set of option OutputModes.\n\n    Currently supported output modes can be found in:\n\n    - BatchOutputMode\n    - StreamingOutputMode\n    \"\"\"\n    for enum_type in options:\n        if choice.upper() in [om.value.upper() for om in enum_type]:\n            return getattr(enum_type, choice.upper())\n    raise AttributeError(\n        f\"\"\"\n        Invalid outputMode specified '{choice}'. Allowed values are:\n        Batch Mode - {BatchOutputMode.__doc__}\n        Streaming Mode - {StreamingOutputMode.__doc__}\n        \"\"\"\n    )\n
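
A quick sketch of resolving a user-supplied string to an output mode (class names as exposed by the koheesio.spark.writers.delta package documented above):

from koheesio.spark.writers.delta import BatchOutputMode, DeltaTableWriter\n\n# resolve a case-insensitive string against the allowed enums\nmode = DeltaTableWriter.get_output_mode('append', options={BatchOutputMode})\nassert mode is BatchOutputMode.APPEND\n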
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter","title":"koheesio.spark.writers.delta.SCD2DeltaTableWriter","text":"

A class used to write Slowly Changing Dimension (SCD) Type 2 data to a Delta table.

Attributes:

Name Type Description table InstanceOf[DeltaTableStep]

The table to merge to.

merge_key str

The key used for merging data.

include_columns List[str]

Columns to be merged. Will be selected from DataFrame. Default is all columns.

exclude_columns List[str]

Columns to be excluded from DataFrame.

scd2_columns List[str]

List of attributes for SCD2 type (track changes).

scd2_timestamp_col Optional[Column]

Timestamp column for SCD2 type (track changes). Defaults to current_timestamp.

scd1_columns List[str]

List of attributes for SCD1 type (just update).

meta_scd2_struct_col_name str

SCD2 struct name.

meta_scd2_effective_time_col_name str

Effective col name.

meta_scd2_is_current_col_name str

Current col name.

meta_scd2_end_time_col_name str

End time col name.

target_auto_generated_columns List[str]

Auto generated columns from target Delta table. Will be used to exclude from merge logic.

"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.exclude_columns","title":"exclude_columns class-attribute instance-attribute","text":"
exclude_columns: List[str] = Field(default_factory=list, description='Columns to be excluded from DataFrame')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.include_columns","title":"include_columns class-attribute instance-attribute","text":"
include_columns: List[str] = Field(default_factory=list, description='Columns to be merged. Will be selected from DataFrame.Default is all columns')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.merge_key","title":"merge_key class-attribute instance-attribute","text":"
merge_key: str = Field(..., description='Merge key')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.meta_scd2_effective_time_col_name","title":"meta_scd2_effective_time_col_name class-attribute instance-attribute","text":"
meta_scd2_effective_time_col_name: str = Field(default='effective_time', description='Effective col name')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.meta_scd2_end_time_col_name","title":"meta_scd2_end_time_col_name class-attribute instance-attribute","text":"
meta_scd2_end_time_col_name: str = Field(default='end_time', description='End time col name')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.meta_scd2_is_current_col_name","title":"meta_scd2_is_current_col_name class-attribute instance-attribute","text":"
meta_scd2_is_current_col_name: str = Field(default='is_current', description='Current col name')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.meta_scd2_struct_col_name","title":"meta_scd2_struct_col_name class-attribute instance-attribute","text":"
meta_scd2_struct_col_name: str = Field(default='_scd2', description='SCD2 struct name')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.scd1_columns","title":"scd1_columns class-attribute instance-attribute","text":"
scd1_columns: List[str] = Field(default_factory=list, description='List of attributes for scd1 type (just update)')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.scd2_columns","title":"scd2_columns class-attribute instance-attribute","text":"
scd2_columns: List[str] = Field(default_factory=list, description='List of attributes for scd2 type (track changes)')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.scd2_timestamp_col","title":"scd2_timestamp_col class-attribute instance-attribute","text":"
scd2_timestamp_col: Optional[Column] = Field(default=None, description='Timestamp column for SCD2 type (track changes). Default to current_timestamp')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.table","title":"table class-attribute instance-attribute","text":"
table: InstanceOf[DeltaTableStep] = Field(..., description='The table to merge to')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.target_auto_generated_columns","title":"target_auto_generated_columns class-attribute instance-attribute","text":"
target_auto_generated_columns: List[str] = Field(default_factory=list, description='Auto generated columns from target Delta table. Will be used to exclude from merge logic')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.execute","title":"execute","text":"
execute() -> None\n

Execute the SCD Type 2 operation.

This method executes the SCD Type 2 operation on the DataFrame. It validates the existing Delta table, prepares the merge conditions, stages the data, and then performs the merge operation.

Raises:

Type Description TypeError

If the scd2_timestamp_col is not of date or timestamp type. If the source DataFrame is missing any of the required merge columns.

Source code in src/koheesio/spark/writers/delta/scd.py
def execute(self) -> None:\n    \"\"\"\n    Execute the SCD Type 2 operation.\n\n    This method executes the SCD Type 2 operation on the DataFrame.\n    It validates the existing Delta table, prepares the merge conditions, stages the data,\n    and then performs the merge operation.\n\n    Raises\n    ------\n    TypeError\n        If the scd2_timestamp_col is not of date or timestamp type.\n        If the source DataFrame is missing any of the required merge columns.\n\n    \"\"\"\n    self.df: DataFrame\n    self.spark: SparkSession\n    delta_table = DeltaTable.forName(sparkSession=self.spark, tableOrViewName=self.table.table_name)\n    src_alias, cross_alias, dest_alias = \"src\", \"cross\", \"tgt\"\n\n    # Prepare required merge columns\n    required_merge_columns = [self.merge_key]\n\n    if self.scd2_columns:\n        required_merge_columns += self.scd2_columns\n\n    if self.scd1_columns:\n        required_merge_columns += self.scd1_columns\n\n    if not all(c in self.df.columns for c in required_merge_columns):\n        missing_columns = [c for c in required_merge_columns if c not in self.df.columns]\n        raise TypeError(f\"The source DataFrame is missing the columns: {missing_columns!r}\")\n\n    # Check that required columns are present in the source DataFrame\n    if self.scd2_timestamp_col is not None:\n        timestamp_col_type = self.df.select(self.scd2_timestamp_col).schema.fields[0].dataType\n\n        if not isinstance(timestamp_col_type, (DateType, TimestampType)):\n            raise TypeError(\n                f\"The scd2_timestamp_col '{self.scd2_timestamp_col}' must be of date \"\n                f\"or timestamp type.Current type is {timestamp_col_type}\"\n            )\n\n    # Prepare columns to process\n    include_columns = self.include_columns if self.include_columns else self.df.columns\n    exclude_columns = self.exclude_columns\n    columns_to_process = [c for c in include_columns if c not in exclude_columns]\n\n    # Constructing column names for SCD2 attributes\n    meta_scd2_is_current_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_is_current_col_name}\"\n    meta_scd2_effective_time_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_effective_time_col_name}\"\n    meta_scd2_end_time_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_end_time_col_name}\"\n\n    # Constructing system merge action logic\n    system_merge_action = f\"CASE WHEN tgt.{self.merge_key} is NULL THEN 'I' \"\n\n    if updates_attrs_scd2 := self._prepare_attr_clause(\n        attrs=self.scd2_columns, src_alias=src_alias, dest_alias=dest_alias\n    ):\n        system_merge_action += f\" WHEN {updates_attrs_scd2} THEN 'UC' \"\n\n    if updates_attrs_scd1 := self._prepare_attr_clause(\n        attrs=self.scd1_columns, src_alias=src_alias, dest_alias=dest_alias\n    ):\n        system_merge_action += f\" WHEN {updates_attrs_scd1} THEN 'U' \"\n\n    system_merge_action += \" ELSE NULL END\"\n\n    # Prepare the staged DataFrame\n    staged = (\n        self.df.withColumn(\n            \"__meta_scd2_timestamp\",\n            self._scd2_timestamp(scd2_timestamp_col=self.scd2_timestamp_col, spark=self.spark),\n        )\n        .transform(\n            func=self._prepare_staging,\n            delta_table=delta_table,\n            merge_action_logic=F.expr(system_merge_action),\n            meta_scd2_is_current_col=meta_scd2_is_current_col,\n            columns_to_process=columns_to_process,\n            src_alias=src_alias,\n            dest_alias=dest_alias,\n 
           cross_alias=cross_alias,\n        )\n        .transform(\n            func=self._preserve_existing_target_values,\n            meta_scd2_struct_col_name=self.meta_scd2_struct_col_name,\n            target_auto_generated_columns=self.target_auto_generated_columns,\n            src_alias=src_alias,\n            cross_alias=cross_alias,\n            dest_alias=dest_alias,\n            logger=self.log,\n        )\n        .withColumn(\"__meta_scd2_end_time\", self._scd2_end_time(meta_scd2_end_time_col=meta_scd2_end_time_col))\n        .withColumn(\"__meta_scd2_is_current\", self._scd2_is_current())\n        .withColumn(\n            \"__meta_scd2_effective_time\",\n            self._scd2_effective_time(meta_scd2_effective_time_col=meta_scd2_effective_time_col),\n        )\n        .transform(\n            func=self._add_scd2_columns,\n            meta_scd2_struct_col_name=self.meta_scd2_struct_col_name,\n            meta_scd2_effective_time_col_name=self.meta_scd2_effective_time_col_name,\n            meta_scd2_end_time_col_name=self.meta_scd2_end_time_col_name,\n            meta_scd2_is_current_col_name=self.meta_scd2_is_current_col_name,\n        )\n    )\n\n    self._prepare_merge_builder(\n        delta_table=delta_table,\n        dest_alias=dest_alias,\n        staged=staged,\n        merge_key=self.merge_key,\n        columns_to_process=columns_to_process,\n        meta_scd2_effective_time_col=meta_scd2_effective_time_col,\n    ).execute()\n
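
A hedged construction sketch (table and column names are illustrative, and the DeltaTableStep import path is an assumption based on the koheesio package layout):

from pyspark.sql import functions as F\n\nfrom koheesio.spark.delta import DeltaTableStep  # assumed import path\nfrom koheesio.spark.writers.delta.scd import SCD2DeltaTableWriter\n\nwriter = SCD2DeltaTableWriter(\n    table=DeltaTableStep(table='customer_dim'),  # must be a DeltaTableStep instance\n    merge_key='customer_id',\n    scd2_columns=['address', 'segment'],     # changes tracked with history\n    scd1_columns=['email'],                  # changes simply overwritten\n    scd2_timestamp_col=F.col('updated_at'),  # optional; must be a date or timestamp column\n)\nwriter.write(df)\n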
"},{"location":"api_reference/spark/writers/delta/batch.html","title":"Batch","text":"

This module defines the DeltaTableWriter class, which is used to write both batch and streaming dataframes to Delta tables.

DeltaTableWriter supports two output modes: MERGEALL and MERGE.

  • The MERGEALL mode merges all incoming data with existing data in the table based on certain conditions.
  • The MERGE mode allows for more custom merging behavior using the DeltaMergeBuilder class from the delta.tables library.

The output_mode_params dictionary is used to specify conditions for merging, updating, and inserting data. The target_alias and source_alias keys are used to specify the aliases for the target and source dataframes in the merge conditions.

Classes:

Name Description DeltaTableWriter

A class for writing data to Delta tables.

DeltaTableStreamWriter

A class for writing streaming data to Delta tables.

Example
DeltaTableWriter(\n    table=\"test_table\",\n    output_mode=BatchOutputMode.MERGEALL,\n    output_mode_params={\n        \"merge_cond\": \"target.id=source.id\",\n        \"update_cond\": \"target.col1_val>=source.col1_val\",\n        \"insert_cond\": \"source.col_bk IS NOT NULL\",\n    },\n)\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter","title":"koheesio.spark.writers.delta.batch.DeltaTableWriter","text":"

Delta table Writer for both batch and streaming dataframes.

Example

Parameters:

Name Type Description Default table Union[DeltaTableStep, str]

The table to write to

required output_mode Optional[Union[str, BatchOutputMode, StreamingOutputMode]]

The output mode to use. Default is BatchOutputMode.APPEND. For streaming, use StreamingOutputMode.

required params Optional[dict]

Additional parameters to use for specific mode

required"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter--example-for-mergeall","title":"Example for MERGEALL","text":"
DeltaTableWriter(\n    table=\"test_table\",\n    output_mode=BatchOutputMode.MERGEALL,\n    output_mode_params={\n        \"merge_cond\": \"target.id=source.id\",\n        \"update_cond\": \"target.col1_val>=source.col1_val\",\n        \"insert_cond\": \"source.col_bk IS NOT NULL\",\n        \"target_alias\": \"target\",  # <------ DEFAULT, can be changed by providing custom value\n        \"source_alias\": \"source\",  # <------ DEFAULT, can be changed by providing custom value\n    },\n)\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter--example-for-merge","title":"Example for MERGE","text":"
DeltaTableWriter(\n    table=\"test_table\",\n    output_mode=BatchOutputMode.MERGE,\n    output_mode_params={\n        'merge_builder': (\n            DeltaTable\n            .forName(sparkSession=spark, tableOrViewName=<target_table_name>)\n            .alias(target_alias)\n            .merge(source=df, condition=merge_cond)\n            .whenMatchedUpdateAll(condition=update_cond)\n            .whenNotMatchedInsertAll(condition=insert_cond)\n            )\n        }\n    )\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter--example-for-merge_1","title":"Example for MERGE","text":"

In case the table isn't created yet, the first run will execute an APPEND operation.

DeltaTableWriter(\n    table=\"test_table\",\n    output_mode=BatchOutputMode.MERGE,\n    output_mode_params={\n        \"merge_builder\": [\n            {\n                \"clause\": \"whenMatchedUpdate\",\n                \"set\": {\"value\": \"source.value\"},\n                \"condition\": \"<update_condition>\",\n            },\n            {\n                \"clause\": \"whenNotMatchedInsert\",\n                \"values\": {\"id\": \"source.id\", \"value\": \"source.value\"},\n                \"condition\": \"<insert_condition>\",\n            },\n        ],\n        \"merge_cond\": \"<merge_condition>\",\n    },\n)\n

"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter--example-for-append","title":"Example for APPEND","text":"

DataFrame writer options can be passed as keyword arguments.

DeltaTableWriter(\n    table=\"test_table\",\n    output_mode=BatchOutputMode.APPEND,\n    partitionOverwriteMode=\"dynamic\",\n    mergeSchema=\"false\",\n)\n

"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.format","title":"format class-attribute instance-attribute","text":"
format: str = 'delta'\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.output_mode","title":"output_mode class-attribute instance-attribute","text":"
output_mode: Optional[Union[BatchOutputMode, StreamingOutputMode]] = Field(default=APPEND, alias='outputMode', description=f'{__doc__}\n{__doc__}')\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.params","title":"params class-attribute instance-attribute","text":"
params: Optional[dict] = Field(default_factory=dict, alias='output_mode_params', description='Additional parameters to use for specific mode')\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.partition_by","title":"partition_by class-attribute instance-attribute","text":"
partition_by: Optional[List[str]] = Field(default=None, alias='partitionBy', description='The list of fields to partition the Delta table on')\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.table","title":"table class-attribute instance-attribute","text":"
table: Union[DeltaTableStep, str] = Field(default=..., description='The table to write to')\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.writer","title":"writer property","text":"
writer: Union[DeltaMergeBuilder, DataFrameWriter]\n

Specify DeltaTableWriter

"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/writers/delta/batch.py
def execute(self):\n    _writer = self.writer\n\n    if self.table.create_if_not_exists and not self.table.exists:\n        _writer = _writer.options(**self.table.default_create_properties)\n\n    if isinstance(_writer, DeltaMergeBuilder):\n        _writer.execute()\n    else:\n        if options := self.params:\n            # should we add options only if mode is not merge?\n            _writer = _writer.options(**options)\n        _writer.saveAsTable(self.table.table_name)\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.get_output_mode","title":"get_output_mode classmethod","text":"
get_output_mode(choice: str, options: Set[Type]) -> Union[BatchOutputMode, StreamingOutputMode]\n

Retrieve an OutputMode by validating choice against a set of option OutputModes.

Currently supported output modes can be found in:

  • BatchOutputMode
  • StreamingOutputMode
Source code in src/koheesio/spark/writers/delta/batch.py
@classmethod\ndef get_output_mode(cls, choice: str, options: Set[Type]) -> Union[BatchOutputMode, StreamingOutputMode]:\n    \"\"\"Retrieve an OutputMode by validating `choice` against a set of option OutputModes.\n\n    Currently supported output modes can be found in:\n\n    - BatchOutputMode\n    - StreamingOutputMode\n    \"\"\"\n    for enum_type in options:\n        if choice.upper() in [om.value.upper() for om in enum_type]:\n            return getattr(enum_type, choice.upper())\n    raise AttributeError(\n        f\"\"\"\n        Invalid outputMode specified '{choice}'. Allowed values are:\n        Batch Mode - {BatchOutputMode.__doc__}\n        Streaming Mode - {StreamingOutputMode.__doc__}\n        \"\"\"\n    )\n
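
For illustration only (not taken from the source), a usage sketch of the classmethod shown above with an assumed choice of append:

mode = DeltaTableWriter.get_output_mode(\n    choice=\"append\", options={BatchOutputMode, StreamingOutputMode}\n)\n# returns the matching enum member, e.g. BatchOutputMode.APPEND\n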
"},{"location":"api_reference/spark/writers/delta/scd.html","title":"Scd","text":"

This module defines writers to write Slowly Changing Dimension (SCD) Type 2 data to a Delta table.

Slowly Changing Dimension (SCD) is a technique used in data warehousing to handle changes to dimension data over time. SCD Type 2 is one of the most common types of SCD, where historical changes are tracked by creating new records for each change.

Koheesio is a powerful data processing framework that provides advanced capabilities for working with Delta tables in Apache Spark. It offers a convenient and efficient way to handle SCD Type 2 operations on Delta tables.

To learn more about Slowly Changing Dimension and SCD Type 2, you can refer to the following resources:

  • Slowly Changing Dimension (SCD) - Wikipedia

By using Koheesio, you can benefit from its efficient merge logic, support for SCD Type 2 and SCD Type 1 attributes, and seamless integration with Delta tables in Spark.

"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter","title":"koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter","text":"

A class used to write Slowly Changing Dimension (SCD) Type 2 data to a Delta table.

Attributes:

  • table (InstanceOf[DeltaTableStep]): The table to merge to.
  • merge_key (str): The key used for merging data.
  • include_columns (List[str]): Columns to be merged. Will be selected from the DataFrame. Default is all columns.
  • exclude_columns (List[str]): Columns to be excluded from the DataFrame.
  • scd2_columns (List[str]): List of attributes for SCD2 type (track changes).
  • scd2_timestamp_col (Optional[Column]): Timestamp column for SCD2 type (track changes). Defaults to current_timestamp.
  • scd1_columns (List[str]): List of attributes for SCD1 type (just update).
  • meta_scd2_struct_col_name (str): SCD2 struct name.
  • meta_scd2_effective_time_col_name (str): Effective time column name.
  • meta_scd2_is_current_col_name (str): Is-current flag column name.
  • meta_scd2_end_time_col_name (str): End time column name.
  • target_auto_generated_columns (List[str]): Auto-generated columns from the target Delta table. Will be excluded from the merge logic.
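
A minimal construction sketch, not taken from the source: the table name, merge key, and column lists are hypothetical, and the source DataFrame is assumed to be passed via the writer's df field.

SCD2DeltaTableWriter(\n    table=DeltaTableStep(table=\"customer_dim\"),  # hypothetical target table\n    merge_key=\"customer_id\",  # hypothetical business key\n    scd2_columns=[\"address\", \"segment\"],  # changes tracked as new versions\n    scd1_columns=[\"email\"],  # changes updated in place\n    df=source_df,  # assumed: the source DataFrame input of the writer\n).execute()\n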

"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.exclude_columns","title":"exclude_columns class-attribute instance-attribute","text":"
exclude_columns: List[str] = Field(default_factory=list, description='Columns to be excluded from DataFrame')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.include_columns","title":"include_columns class-attribute instance-attribute","text":"
include_columns: List[str] = Field(default_factory=list, description='Columns to be merged. Will be selected from DataFrame.Default is all columns')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.merge_key","title":"merge_key class-attribute instance-attribute","text":"
merge_key: str = Field(..., description='Merge key')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.meta_scd2_effective_time_col_name","title":"meta_scd2_effective_time_col_name class-attribute instance-attribute","text":"
meta_scd2_effective_time_col_name: str = Field(default='effective_time', description='Effective col name')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.meta_scd2_end_time_col_name","title":"meta_scd2_end_time_col_name class-attribute instance-attribute","text":"
meta_scd2_end_time_col_name: str = Field(default='end_time', description='End time col name')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.meta_scd2_is_current_col_name","title":"meta_scd2_is_current_col_name class-attribute instance-attribute","text":"
meta_scd2_is_current_col_name: str = Field(default='is_current', description='Current col name')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.meta_scd2_struct_col_name","title":"meta_scd2_struct_col_name class-attribute instance-attribute","text":"
meta_scd2_struct_col_name: str = Field(default='_scd2', description='SCD2 struct name')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.scd1_columns","title":"scd1_columns class-attribute instance-attribute","text":"
scd1_columns: List[str] = Field(default_factory=list, description='List of attributes for scd1 type (just update)')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.scd2_columns","title":"scd2_columns class-attribute instance-attribute","text":"
scd2_columns: List[str] = Field(default_factory=list, description='List of attributes for scd2 type (track changes)')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.scd2_timestamp_col","title":"scd2_timestamp_col class-attribute instance-attribute","text":"
scd2_timestamp_col: Optional[Column] = Field(default=None, description='Timestamp column for SCD2 type (track changes). Default to current_timestamp')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.table","title":"table class-attribute instance-attribute","text":"
table: InstanceOf[DeltaTableStep] = Field(..., description='The table to merge to')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.target_auto_generated_columns","title":"target_auto_generated_columns class-attribute instance-attribute","text":"
target_auto_generated_columns: List[str] = Field(default_factory=list, description='Auto generated columns from target Delta table. Will be used to exclude from merge logic')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.execute","title":"execute","text":"
execute() -> None\n

Execute the SCD Type 2 operation.

This method executes the SCD Type 2 operation on the DataFrame. It validates the existing Delta table, prepares the merge conditions, stages the data, and then performs the merge operation.

Raises:

  • TypeError: If the scd2_timestamp_col is not of date or timestamp type, or if the source DataFrame is missing any of the required merge columns.

Source code in src/koheesio/spark/writers/delta/scd.py
def execute(self) -> None:\n    \"\"\"\n    Execute the SCD Type 2 operation.\n\n    This method executes the SCD Type 2 operation on the DataFrame.\n    It validates the existing Delta table, prepares the merge conditions, stages the data,\n    and then performs the merge operation.\n\n    Raises\n    ------\n    TypeError\n        If the scd2_timestamp_col is not of date or timestamp type.\n        If the source DataFrame is missing any of the required merge columns.\n\n    \"\"\"\n    self.df: DataFrame\n    self.spark: SparkSession\n    delta_table = DeltaTable.forName(sparkSession=self.spark, tableOrViewName=self.table.table_name)\n    src_alias, cross_alias, dest_alias = \"src\", \"cross\", \"tgt\"\n\n    # Prepare required merge columns\n    required_merge_columns = [self.merge_key]\n\n    if self.scd2_columns:\n        required_merge_columns += self.scd2_columns\n\n    if self.scd1_columns:\n        required_merge_columns += self.scd1_columns\n\n    if not all(c in self.df.columns for c in required_merge_columns):\n        missing_columns = [c for c in required_merge_columns if c not in self.df.columns]\n        raise TypeError(f\"The source DataFrame is missing the columns: {missing_columns!r}\")\n\n    # Check that required columns are present in the source DataFrame\n    if self.scd2_timestamp_col is not None:\n        timestamp_col_type = self.df.select(self.scd2_timestamp_col).schema.fields[0].dataType\n\n        if not isinstance(timestamp_col_type, (DateType, TimestampType)):\n            raise TypeError(\n                f\"The scd2_timestamp_col '{self.scd2_timestamp_col}' must be of date \"\n                f\"or timestamp type.Current type is {timestamp_col_type}\"\n            )\n\n    # Prepare columns to process\n    include_columns = self.include_columns if self.include_columns else self.df.columns\n    exclude_columns = self.exclude_columns\n    columns_to_process = [c for c in include_columns if c not in exclude_columns]\n\n    # Constructing column names for SCD2 attributes\n    meta_scd2_is_current_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_is_current_col_name}\"\n    meta_scd2_effective_time_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_effective_time_col_name}\"\n    meta_scd2_end_time_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_end_time_col_name}\"\n\n    # Constructing system merge action logic\n    system_merge_action = f\"CASE WHEN tgt.{self.merge_key} is NULL THEN 'I' \"\n\n    if updates_attrs_scd2 := self._prepare_attr_clause(\n        attrs=self.scd2_columns, src_alias=src_alias, dest_alias=dest_alias\n    ):\n        system_merge_action += f\" WHEN {updates_attrs_scd2} THEN 'UC' \"\n\n    if updates_attrs_scd1 := self._prepare_attr_clause(\n        attrs=self.scd1_columns, src_alias=src_alias, dest_alias=dest_alias\n    ):\n        system_merge_action += f\" WHEN {updates_attrs_scd1} THEN 'U' \"\n\n    system_merge_action += \" ELSE NULL END\"\n\n    # Prepare the staged DataFrame\n    staged = (\n        self.df.withColumn(\n            \"__meta_scd2_timestamp\",\n            self._scd2_timestamp(scd2_timestamp_col=self.scd2_timestamp_col, spark=self.spark),\n        )\n        .transform(\n            func=self._prepare_staging,\n            delta_table=delta_table,\n            merge_action_logic=F.expr(system_merge_action),\n            meta_scd2_is_current_col=meta_scd2_is_current_col,\n            columns_to_process=columns_to_process,\n            src_alias=src_alias,\n            dest_alias=dest_alias,\n 
           cross_alias=cross_alias,\n        )\n        .transform(\n            func=self._preserve_existing_target_values,\n            meta_scd2_struct_col_name=self.meta_scd2_struct_col_name,\n            target_auto_generated_columns=self.target_auto_generated_columns,\n            src_alias=src_alias,\n            cross_alias=cross_alias,\n            dest_alias=dest_alias,\n            logger=self.log,\n        )\n        .withColumn(\"__meta_scd2_end_time\", self._scd2_end_time(meta_scd2_end_time_col=meta_scd2_end_time_col))\n        .withColumn(\"__meta_scd2_is_current\", self._scd2_is_current())\n        .withColumn(\n            \"__meta_scd2_effective_time\",\n            self._scd2_effective_time(meta_scd2_effective_time_col=meta_scd2_effective_time_col),\n        )\n        .transform(\n            func=self._add_scd2_columns,\n            meta_scd2_struct_col_name=self.meta_scd2_struct_col_name,\n            meta_scd2_effective_time_col_name=self.meta_scd2_effective_time_col_name,\n            meta_scd2_end_time_col_name=self.meta_scd2_end_time_col_name,\n            meta_scd2_is_current_col_name=self.meta_scd2_is_current_col_name,\n        )\n    )\n\n    self._prepare_merge_builder(\n        delta_table=delta_table,\n        dest_alias=dest_alias,\n        staged=staged,\n        merge_key=self.merge_key,\n        columns_to_process=columns_to_process,\n        meta_scd2_effective_time_col=meta_scd2_effective_time_col,\n    ).execute()\n
"},{"location":"api_reference/spark/writers/delta/stream.html","title":"Stream","text":"

This module defines the DeltaTableStreamWriter class, which is used to write streaming dataframes to Delta tables.

"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter","title":"koheesio.spark.writers.delta.stream.DeltaTableStreamWriter","text":"

Delta table stream writer

"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter.Options","title":"Options","text":"

Options for DeltaTableStreamWriter

"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter.Options.allow_population_by_field_name","title":"allow_population_by_field_name class-attribute instance-attribute","text":"
allow_population_by_field_name: bool = Field(default=True, description=' To do convert to Field and pass as .options(**config)')\n
"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter.Options.maxBytesPerTrigger","title":"maxBytesPerTrigger class-attribute instance-attribute","text":"
maxBytesPerTrigger: Optional[str] = Field(default=None, description='How much data to be processed per trigger. The default is 1GB')\n
"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter.Options.maxFilesPerTrigger","title":"maxFilesPerTrigger class-attribute instance-attribute","text":"
maxFilesPerTrigger: int = Field(default=1000, description='The maximum number of new files to be considered in every trigger (default: 1000).')\n
"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter.execute","title":"execute","text":"
execute()\n
Source code in src/koheesio/spark/writers/delta/stream.py
def execute(self):\n    if self.batch_function:\n        self.streaming_query = self.writer.start()\n    else:\n        self.streaming_query = self.writer.toTable(tableName=self.table.table_name)\n
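
A minimal usage sketch, not taken from the source: the streaming DataFrame is assumed to be passed via the writer's df field, and checkpoint/trigger settings are omitted.

DeltaTableStreamWriter(\n    table=\"test_table\",  # target Delta table\n    df=streaming_df,  # assumed: a streaming DataFrame input\n).execute()\n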
"},{"location":"api_reference/spark/writers/delta/utils.html","title":"Utils","text":"

This module provides utility functions for working with the Delta framework.

"},{"location":"api_reference/spark/writers/delta/utils.html#koheesio.spark.writers.delta.utils.log_clauses","title":"koheesio.spark.writers.delta.utils.log_clauses","text":"
log_clauses(clauses: JavaObject, source_alias: str, target_alias: str) -> Optional[str]\n

Prepare log message for clauses of DeltaMergePlan statement.

Parameters:

  • clauses (JavaObject, required): The clauses of the DeltaMergePlan statement.
  • source_alias (str, required): The source alias.
  • target_alias (str, required): The target alias.

Returns:

  • Optional[str]: The log message if there are clauses, otherwise None.

Notes

This function prepares a log message for the clauses of a DeltaMergePlan statement. It iterates over the clauses, processes the conditions, and constructs the log message based on the clause type and columns.

If the condition is a value, it replaces the source and target aliases in the condition string. If the condition is None, it sets the condition_clause to \"No conditions required\".

The log message includes the clauses type, the clause type, the columns, and the condition.

Source code in src/koheesio/spark/writers/delta/utils.py
def log_clauses(clauses: JavaObject, source_alias: str, target_alias: str) -> Optional[str]:\n    \"\"\"\n    Prepare log message for clauses of DeltaMergePlan statement.\n\n    Parameters\n    ----------\n    clauses : JavaObject\n        The clauses of the DeltaMergePlan statement.\n    source_alias : str\n        The source alias.\n    target_alias : str\n        The target alias.\n\n    Returns\n    -------\n    Optional[str]\n        The log message if there are clauses, otherwise None.\n\n    Notes\n    -----\n    This function prepares a log message for the clauses of a DeltaMergePlan statement. It iterates over the clauses,\n    processes the conditions, and constructs the log message based on the clause type and columns.\n\n    If the condition is a value, it replaces the source and target aliases in the condition string. If the condition is\n    None, it sets the condition_clause to \"No conditions required\".\n\n    The log message includes the clauses type, the clause type, the columns, and the condition.\n    \"\"\"\n    log_message = None\n\n    if not clauses.isEmpty():\n        clauses_type = clauses.last().nodeName().replace(\"DeltaMergeInto\", \"\")\n        _processed_clauses = {}\n\n        for i in range(0, clauses.length()):\n            clause = clauses.apply(i)\n            condition = clause.condition()\n\n            if \"value\" in dir(condition):\n                condition_clause = (\n                    condition.value()\n                    .toString()\n                    .replace(f\"'{source_alias}\", source_alias)\n                    .replace(f\"'{target_alias}\", target_alias)\n                )\n            elif condition.toString() == \"None\":\n                condition_clause = \"No conditions required\"\n\n            clause_type: str = clause.clauseType().capitalize()\n            columns = \"ALL\" if clause_type == \"Delete\" else clause.actions().toList().apply(0).toString()\n\n            if clause_type.lower() not in _processed_clauses:\n                _processed_clauses[clause_type.lower()] = []\n\n            log_message = (\n                f\"{clauses_type} will perform action:{clause_type} columns ({columns}) if `{condition_clause}`\"\n            )\n\n    return log_message\n
"},{"location":"api_reference/sso/index.html","title":"Sso","text":""},{"location":"api_reference/sso/okta.html","title":"Okta","text":"

This module contains Okta integration steps.

"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.LoggerOktaTokenFilter","title":"koheesio.sso.okta.LoggerOktaTokenFilter","text":"
LoggerOktaTokenFilter(okta_object: OktaAccessToken, name: str = 'OktaToken')\n

Filter which hides token value from log.

Source code in src/koheesio/sso/okta.py
def __init__(self, okta_object: OktaAccessToken, name: str = \"OktaToken\"):\n    self.__okta_object = okta_object\n    super().__init__(name=name)\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.LoggerOktaTokenFilter.filter","title":"filter","text":"
filter(record)\n
Source code in src/koheesio/sso/okta.py
def filter(self, record):\n    # noinspection PyUnresolvedReferences\n    if token := self.__okta_object.output.token:\n        token_value = token.get_secret_value()\n        record.msg = record.msg.replace(token_value, \"<SECRET_TOKEN>\")\n\n    return True\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.Okta","title":"koheesio.sso.okta.Okta","text":"

Base Okta class

"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.Okta.client_id","title":"client_id class-attribute instance-attribute","text":"
client_id: str = Field(default=..., alias='okta_id', description='Okta account ID')\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.Okta.client_secret","title":"client_secret class-attribute instance-attribute","text":"
client_secret: SecretStr = Field(default=..., alias='okta_secret', description='Okta account secret', repr=False)\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.Okta.data","title":"data class-attribute instance-attribute","text":"
data: Optional[Union[Dict[str, str], str]] = Field(default={'grant_type': 'client_credentials'}, description='Data to be sent along with the token request')\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.OktaAccessToken","title":"koheesio.sso.okta.OktaAccessToken","text":"
OktaAccessToken(**kwargs)\n

Get Okta authorization token

Example:

token = (\n    OktaAccessToken(\n        url=\"https://org.okta.com\",\n        client_id=\"client\",\n        client_secret=SecretStr(\"secret\"),\n        params={\n            \"p1\": \"foo\",\n            \"p2\": \"bar\",\n        },\n    )\n    .execute()\n    .token\n)\n

Source code in src/koheesio/sso/okta.py
def __init__(self, **kwargs):\n    _logger = LoggingFactory.get_logger(name=self.__class__.__name__, inherit_from_koheesio=True)\n    logger_filter = LoggerOktaTokenFilter(okta_object=self)\n    _logger.addFilter(logger_filter)\n    super().__init__(**kwargs)\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.OktaAccessToken.Output","title":"Output","text":"

Output class for OktaAccessToken.

"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.OktaAccessToken.Output.token","title":"token class-attribute instance-attribute","text":"
token: Optional[SecretStr] = Field(default=None, description='Okta authentication token')\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.OktaAccessToken.execute","title":"execute","text":"
execute()\n

Execute an HTTP Post call to Okta service and retrieve the access token.

Source code in src/koheesio/sso/okta.py
def execute(self):\n    \"\"\"\n    Execute an HTTP Post call to Okta service and retrieve the access token.\n    \"\"\"\n    HttpPostStep.execute(self)\n\n    # noinspection PyUnresolvedReferences\n    status_code = self.output.status_code\n    # noinspection PyUnresolvedReferences\n    raw_payload = self.output.raw_payload\n\n    if status_code != 200:\n        raise HTTPError(f\"Request failed with '{status_code}' code. Payload: {raw_payload}\")\n\n    # noinspection PyUnresolvedReferences\n    json_payload = self.output.json_payload\n\n    if token := json_payload.get(\"access_token\"):\n        self.output.token = SecretStr(token)\n    else:\n        raise ValueError(f\"No 'access_token' found in the Okta response: {json_payload}\")\n
"},{"location":"api_reference/steps/index.html","title":"Steps","text":"

Steps Module

This module contains the definition of the Step class, which serves as the base class for custom units of logic that can be executed. It also includes the StepOutput class, which defines the output data model for a Step.

The Step class is designed to be subclassed for creating new steps in a data pipeline. Each subclass should implement the execute method, specifying the expected inputs and outputs.

This module also exports the SparkStep class for steps that interact with Spark.

Classes:
  • Step: Base class for a custom unit of logic that can be executed.
  • StepOutput: Defines the output data model for a Step.
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step","title":"koheesio.steps.Step","text":"

Base class for a step

A custom unit of logic that can be executed.

The Step class is designed to be subclassed. To create a new step, one would subclass Step and implement the def execute(self) method, specifying the expected inputs and outputs.

Note: since the Step class is meta classed, the execute method is wrapped with the do_execute function making it always return the Step's output. Hence, an explicit return is not needed when implementing execute.

Methods and Attributes

The Step class has several attributes and methods.

Background

A Step is an atomic operation and serves as the building block of data pipelines built with the framework. Tasks typically consist of a series of Steps.

A step can be seen as an operation on a set of inputs that returns a set of outputs. This does not, however, imply that steps are stateless (e.g. data writes)!

The diagram serves to illustrate the concept of a Step:

\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 1 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502                  \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 1 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2502                  \u2502        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n                   \u2502                  \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u2502                  \u2502        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 2 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502       Step       \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 2 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2502                  \u2502        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n                   \u2502                  \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u2502                  \u2502        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 3 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502                  \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 3 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n

Steps are built on top of Pydantic, which is a data validation and settings management using python type annotations. This allows for the automatic validation of the inputs and outputs of a Step.

  • Step inherits from BaseModel, which is a Pydantic class used to define data models. This allows Step to automatically validate data against the defined fields and their types.
  • Step is metaclassed by StepMetaClass, which is a custom metaclass that wraps the execute method of the Step class with the _execute_wrapper function. This ensures that the execute method always returns the output of the Step along with providing logging and validation of the output.
  • Step has an Output class, which is a subclass of StepOutput. This class is used to validate the output of the Step. The Output class is defined as an inner class of the Step class. The Output class can be accessed through the Step.Output attribute.
  • The Output class can be extended to add additional fields to the output of the Step.

Examples:

class MyStep(Step):\n    a: str  # input\n\n    class Output(StepOutput):  # output\n        b: str\n\n    def execute(self) -> MyStep.Output:\n        self.output.b = f\"{self.a}-some-suffix\"\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step--input","title":"INPUT","text":"

The following fields are available by default on the Step class:

  • name: Name of the Step. If not set, the name of the class will be used.
  • description: Description of the Step. If not set, the docstring of the class will be used. If the docstring contains multiple lines, only the first line will be used.

When subclassing a Step, any additional pydantic field will be treated as input to the Step. See also the explanation on the .execute() method below.

"},{"location":"api_reference/steps/index.html#koheesio.steps.Step--output","title":"OUTPUT","text":"

Every Step has an Output class, which is a subclass of StepOutput. This class is used to validate the output of the Step. The Output class is defined as an inner class of the Step class. The Output class can be accessed through the Step.Output attribute. The Output class can be extended to add additional fields to the output of the Step. See also the explanation on the .execute().

  • Output: A nested class representing the output of the Step used to validate the output of the Step and based on the StepOutput class.
  • output: Allows you to interact with the Output of the Step lazily (see above and StepOutput)

When subclassing a Step, any additional pydantic field added to the nested Output class will be treated as output of the Step. See also the description of StepOutput for more information.

"},{"location":"api_reference/steps/index.html#koheesio.steps.Step--methods","title":"Methods:","text":"
  • execute: Abstract method to implement for new steps.
    • The Inputs of the step can be accessed, using self.input_name.
    • The output of the step can be accessed, using self.output.output_name.
  • run: Alias to .execute() method. You can use this to run the step, but execute is preferred.
  • to_yaml: YAML dump the step
  • get_description: Get the description of the Step

When subclassing a Step, execute is the only method that needs to be implemented. Any additional method added to the class will be treated as a method of the Step.

Note: since the Step class is meta-classed, the execute method is automatically wrapped with the do_execute function making it always return a StepOutput. See also the explanation on the do_execute function.

"},{"location":"api_reference/steps/index.html#koheesio.steps.Step--class-methods","title":"class methods:","text":"
  • from_step: Returns a new Step instance based on the data of another Step instance. For example: MyStep.from_step(other_step, a=\"foo\")
  • get_description: Get the description of the Step
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step--dunder-methods","title":"dunder methods:","text":"
  • __getattr__: Allows input to be accessed through self.input_name
  • __repr__ and __str__: String representation of a step
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.output","title":"output property writable","text":"
output: Output\n

Interact with the output of the Step

"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.Output","title":"Output","text":"

Output class for Step

"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.execute","title":"execute abstractmethod","text":"
execute()\n

Abstract method to implement for new steps.

The Inputs of the step can be accessed, using self.input_name

Note: since the Step class is meta-classed, the execute method is automatically wrapped with the do_execute function, making it always return the Step's output.

Source code in src/koheesio/steps/__init__.py
@abstractmethod\ndef execute(self):\n    \"\"\"Abstract method to implement for new steps.\n\n    The Inputs of the step can be accessed, using `self.input_name`\n\n    Note: since the Step class is meta-classed, the execute method is wrapped with the `do_execute` function making\n      it always return the Steps output\n    \"\"\"\n    raise NotImplementedError\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.from_step","title":"from_step classmethod","text":"
from_step(step: Step, **kwargs)\n

Returns a new Step instance based on the data of another Step or BaseModel instance

Source code in src/koheesio/steps/__init__.py
@classmethod\ndef from_step(cls, step: Step, **kwargs):\n    \"\"\"Returns a new Step instance based on the data of another Step or BaseModel instance\"\"\"\n    return cls.from_basemodel(step, **kwargs)\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.repr_json","title":"repr_json","text":"
repr_json(simple=False) -> str\n

dump the step to json, meant for representation

Note: use to_json if you want to dump the step to json for serialization. This method is meant for representation purposes only!

Examples:

>>> step = MyStep(a=\"foo\")\n>>> print(step.repr_json())\n{\"input\": {\"a\": \"foo\"}}\n

Parameters:

  • simple (bool, default: False): When toggled to True, a briefer output will be produced. This is friendlier for logging purposes.

Returns:

  • str: A string, which is valid json.

Source code in src/koheesio/steps/__init__.py
def repr_json(self, simple=False) -> str:\n    \"\"\"dump the step to json, meant for representation\n\n    Note: use to_json if you want to dump the step to json for serialization\n    This method is meant for representation purposes only!\n\n    Examples\n    --------\n    ```python\n    >>> step = MyStep(a=\"foo\")\n    >>> print(step.repr_json())\n    {\"input\": {\"a\": \"foo\"}}\n    ```\n\n    Parameters\n    ----------\n    simple: bool\n        When toggled to True, a briefer output will be produced. This is friendlier for logging purposes\n\n    Returns\n    -------\n    str\n        A string, which is valid json\n    \"\"\"\n    model_dump_options = dict(warnings=\"none\", exclude_unset=True)\n\n    _result = {}\n\n    # extract input\n    _input = self.model_dump(**model_dump_options)\n\n    # remove name and description from input and add to result if simple is not set\n    name = _input.pop(\"name\", None)\n    description = _input.pop(\"description\", None)\n    if not simple:\n        if name:\n            _result[\"name\"] = name\n        if description:\n            _result[\"description\"] = description\n    else:\n        model_dump_options[\"exclude\"] = {\"name\", \"description\"}\n\n    # extract output\n    _output = self.output.model_dump(**model_dump_options)\n\n    # add output to result\n    if _output:\n        _result[\"output\"] = _output\n\n    # add input to result\n    _result[\"input\"] = _input\n\n    class MyEncoder(json.JSONEncoder):\n        \"\"\"Custom JSON Encoder to handle non-serializable types\"\"\"\n\n        def default(self, o: Any) -> Any:\n            try:\n                return super().default(o)\n            except TypeError:\n                return o.__class__.__name__\n\n    # Use MyEncoder when converting the dictionary to a JSON string\n    json_str = json.dumps(_result, cls=MyEncoder)\n\n    return json_str\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.repr_yaml","title":"repr_yaml","text":"
repr_yaml(simple=False) -> str\n

dump the step to yaml, meant for representation

Note: use to_yaml if you want to dump the step to yaml for serialization. This method is meant for representation purposes only!

Examples:

>>> step = MyStep(a=\"foo\")\n>>> print(step.repr_yaml())\ninput:\n  a: foo\n

Parameters:

  • simple (bool, default: False): When toggled to True, a briefer output will be produced. This is friendlier for logging purposes.

Returns:

  • str: A string, which is valid yaml.

Source code in src/koheesio/steps/__init__.py
def repr_yaml(self, simple=False) -> str:\n    \"\"\"dump the step to yaml, meant for representation\n\n    Note: use to_yaml if you want to dump the step to yaml for serialization\n    This method is meant for representation purposes only!\n\n    Examples\n    --------\n    ```python\n    >>> step = MyStep(a=\"foo\")\n    >>> print(step.repr_yaml())\n    input:\n      a: foo\n    ```\n\n    Parameters\n    ----------\n    simple: bool\n        When toggled to True, a briefer output will be produced. This is friendlier for logging purposes\n\n    Returns\n    -------\n    str\n        A string, which is valid yaml\n    \"\"\"\n    json_str = self.repr_json(simple=simple)\n\n    # Parse the JSON string back into a dictionary\n    _result = json.loads(json_str)\n\n    return yaml.dump(_result)\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.run","title":"run","text":"
run()\n

Alias to .execute()

Source code in src/koheesio/steps/__init__.py
def run(self):\n    \"\"\"Alias to .execute()\"\"\"\n    return self.execute()\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.StepMetaClass","title":"koheesio.steps.StepMetaClass","text":"

StepMetaClass is set up as a metaclass extending ModelMetaclass so that Pydantic remains unaffected while the execute method is auto-decorated with do_execute.

"},{"location":"api_reference/steps/index.html#koheesio.steps.StepOutput","title":"koheesio.steps.StepOutput","text":"

Class for the StepOutput model

Usage

Setting up a StepOutput subclass is done like this:

class YourOwnOutput(StepOutput):\n    a: str\n    b: int\n

"},{"location":"api_reference/steps/index.html#koheesio.steps.StepOutput.model_config","title":"model_config class-attribute instance-attribute","text":"
model_config = ConfigDict(validate_default=False, defer_build=True)\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.StepOutput.validate_output","title":"validate_output","text":"
validate_output() -> StepOutput\n

Validate the output of the Step

Essentially, this method is a wrapper around the validate method of the BaseModel class

Source code in src/koheesio/steps/__init__.py
def validate_output(self) -> StepOutput:\n    \"\"\"Validate the output of the Step\n\n    Essentially, this method is a wrapper around the validate method of the BaseModel class\n    \"\"\"\n    validated_model = self.validate()\n    return StepOutput.from_basemodel(validated_model)\n
"},{"location":"api_reference/steps/dummy.html","title":"Dummy","text":"

Dummy step for testing purposes.

This module contains a dummy step for testing purposes. It is used to test the Koheesio framework or to provide a simple example of how to create a new step.

Example

s = DummyStep(a=\"a\", b=2)\ns.execute()\n
In this case, s.output will be equivalent to the following dictionary:
{\"a\": \"a\", \"b\": 2, \"c\": \"aa\"}\n

"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyOutput","title":"koheesio.steps.dummy.DummyOutput","text":"

Dummy output for testing purposes.

"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyOutput.a","title":"a instance-attribute","text":"
a: str\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyOutput.b","title":"b instance-attribute","text":"
b: int\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep","title":"koheesio.steps.dummy.DummyStep","text":"

Dummy step for testing purposes.

"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep.a","title":"a instance-attribute","text":"
a: str\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep.b","title":"b instance-attribute","text":"
b: int\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep.Output","title":"Output","text":"

Dummy output for testing purposes.

"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep.Output.c","title":"c instance-attribute","text":"
c: str\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep.execute","title":"execute","text":"
execute()\n

Dummy execute for testing purposes.

Source code in src/koheesio/steps/dummy.py
def execute(self):\n    \"\"\"Dummy execute for testing purposes.\"\"\"\n    self.output.a = self.a\n    self.output.b = self.b\n    self.output.c = self.a * self.b\n
"},{"location":"api_reference/steps/http.html","title":"Http","text":"

This module contains a few simple HTTP Steps that can be used to perform API Calls to HTTP endpoints

Example
from koheesio.steps.http import HttpGetStep\n\nresponse = HttpGetStep(url=\"https://google.com\").execute().json_payload\n

In the above example, the response variable will contain the JSON response from the HTTP request.

"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpDeleteStep","title":"koheesio.steps.http.HttpDeleteStep","text":"

send DELETE requests

"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpDeleteStep.method","title":"method class-attribute instance-attribute","text":"
method: HttpMethod = DELETE\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpGetStep","title":"koheesio.steps.http.HttpGetStep","text":"

send GET requests

Example

response = HttpGetStep(url=\"https://google.com\").execute().json_payload\n
In the above example, the response variable will contain the JSON response from the HTTP request.

"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpGetStep.method","title":"method class-attribute instance-attribute","text":"
method: HttpMethod = GET\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod","title":"koheesio.steps.http.HttpMethod","text":"

Enumeration of allowed http methods

"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod.DELETE","title":"DELETE class-attribute instance-attribute","text":"
DELETE = 'delete'\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod.GET","title":"GET class-attribute instance-attribute","text":"
GET = 'get'\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod.POST","title":"POST class-attribute instance-attribute","text":"
POST = 'post'\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod.PUT","title":"PUT class-attribute instance-attribute","text":"
PUT = 'put'\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod.from_string","title":"from_string classmethod","text":"
from_string(value: str)\n

Allows for getting the right Method Enum by simply passing a string value. This method is not case-sensitive.

Source code in src/koheesio/steps/http.py
@classmethod\ndef from_string(cls, value: str):\n    \"\"\"Allows for getting the right Method Enum by simply passing a string value\n    This method is not case-sensitive\n    \"\"\"\n    return getattr(cls, value.upper())\n
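
For illustration only, a usage sketch of the classmethod shown above:

HttpMethod.from_string(\"get\")  # returns HttpMethod.GET\nHttpMethod.from_string(\"Delete\")  # case-insensitive, returns HttpMethod.DELETE\n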
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpPostStep","title":"koheesio.steps.http.HttpPostStep","text":"

send POST requests

"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpPostStep.method","title":"method class-attribute instance-attribute","text":"
method: HttpMethod = POST\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpPutStep","title":"koheesio.steps.http.HttpPutStep","text":"

send PUT requests

"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpPutStep.method","title":"method class-attribute instance-attribute","text":"
method: HttpMethod = PUT\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep","title":"koheesio.steps.http.HttpStep","text":"

Can be used to perform API Calls to HTTP endpoints

Understanding Retries

This class includes a built-in retry mechanism for handling temporary issues, such as network errors or server downtime, that might cause the HTTP request to fail. The retry mechanism is controlled by three parameters: max_retries, initial_delay, and backoff.

  • max_retries determines the number of retries after the initial request. For example, if max_retries is set to 4, the request will be attempted a total of 5 times (1 initial attempt + 4 retries). If max_retries is set to 0, no retries will be attempted, and the request will be tried only once.

  • initial_delay sets the waiting period before the first retry. If initial_delay is set to 3, the delay before the first retry will be 3 seconds. Changing the initial_delay value directly affects the amount of delay before each retry.

  • backoff controls the rate at which the delay increases for each subsequent retry. If backoff is set to 2 (the default), the delay will double with each retry. If backoff is set to 1, the delay between retries will remain constant. Changing the backoff value affects how quickly the delay increases.

Given the default values of max_retries=3, initial_delay=2, and backoff=2, the delays between retries would be 2 seconds, 4 seconds, and 8 seconds, respectively. This results in a total delay of 14 seconds before all retries are exhausted.

For example, if you set initial_delay=3 and backoff=2, the delays before the retries would be 3 seconds, 6 seconds, and 12 seconds. If you set initial_delay=2 and backoff=3, the delays before the retries would be 2 seconds, 6 seconds, and 18 seconds. If you set initial_delay=2 and backoff=1, the delays before the retries would be 2 seconds, 2 seconds, and 2 seconds.
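
As a sketch (not part of the library), the delay before retry attempt i follows initial_delay * backoff ** i, which reproduces the figures above:

max_retries, initial_delay, backoff = 3, 2, 2\ndelays = [initial_delay * backoff**i for i in range(max_retries)]\n# delays == [2, 4, 8] -> 14 seconds in total before all retries are exhausted\n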

"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.data","title":"data class-attribute instance-attribute","text":"
data: Optional[Union[Dict[str, str], str]] = Field(default_factory=dict, description='[Optional] Data to be sent along with the request', alias='body')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.headers","title":"headers class-attribute instance-attribute","text":"
headers: Optional[Dict[str, Union[str, SecretStr]]] = Field(default_factory=dict, description='Request headers', alias='header')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.method","title":"method class-attribute instance-attribute","text":"
method: Union[str, HttpMethod] = Field(default=GET, description=\"What type of Http call to perform. One of 'get', 'post', 'put', 'delete'. Defaults to 'get'.\")\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.params","title":"params class-attribute instance-attribute","text":"
params: Optional[Dict[str, Any]] = Field(default_factory=dict, description='[Optional] Set of extra parameters that should be passed to HTTP request')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.session","title":"session class-attribute instance-attribute","text":"
session: Session = Field(default_factory=Session, description='Requests session object to be used for making HTTP requests', exclude=True, repr=False)\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.timeout","title":"timeout class-attribute instance-attribute","text":"
timeout: Optional[int] = Field(default=3, description='[Optional] Request timeout')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.url","title":"url class-attribute instance-attribute","text":"
url: str = Field(default=..., description='API endpoint URL', alias='uri')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output","title":"Output","text":"

Output class for HttpStep

"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output.json_payload","title":"json_payload property","text":"
json_payload\n

Alias for response_json

"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output.raw_payload","title":"raw_payload class-attribute instance-attribute","text":"
raw_payload: Optional[str] = Field(default=None, alias='response_text', description='The raw response for the request')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output.response_json","title":"response_json class-attribute instance-attribute","text":"
response_json: Optional[Union[Dict, List]] = Field(default=None, alias='json_payload', description='The JSON response for the request')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output.response_raw","title":"response_raw class-attribute instance-attribute","text":"
response_raw: Optional[Response] = Field(default=None, alias='response', description='The raw requests.Response object returned by the appropriate requests.request() call')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output.status_code","title":"status_code class-attribute instance-attribute","text":"
status_code: Optional[int] = Field(default=None, description='The status return code of the request')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.decode_sensitive_headers","title":"decode_sensitive_headers","text":"
decode_sensitive_headers(headers)\n

Authorization headers are converted into SecretStr under the hood by the encode_sensitive_headers method, to avoid dumping any sensitive content into logs.

However, when calling the get_headers method, the SecretStr should be converted back to a string; otherwise, sensitive info would look like '**********'.

This method decodes values of the headers dictionary that are of type SecretStr into plain text.

Source code in src/koheesio/steps/http.py
@field_serializer(\"headers\", when_used=\"json\")\ndef decode_sensitive_headers(self, headers):\n    \"\"\"\n    Authorization headers are being converted into SecretStr under the hood to avoid dumping any\n    sensitive content into logs by the `encode_sensitive_headers` method.\n\n    However, when calling the `get_headers` method, the SecretStr should be converted back to\n    string, otherwise sensitive info would have looked like '**********'.\n\n    This method decodes values of the `headers` dictionary that are of type SecretStr into plain text.\n    \"\"\"\n    for k, v in headers.items():\n        headers[k] = v.get_secret_value() if isinstance(v, SecretStr) else v\n    return headers\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.delete","title":"delete","text":"
delete() -> Response\n

Execute an HTTP DELETE call

Source code in src/koheesio/steps/http.py
def delete(self) -> requests.Response:\n    \"\"\"Execute an HTTP DELETE call\"\"\"\n    self.method = HttpMethod.DELETE\n    return self.request()\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.encode_sensitive_headers","title":"encode_sensitive_headers","text":"
encode_sensitive_headers(headers)\n

Encode potentially sensitive data into pydantic.SecretStr class to prevent them being displayed as plain text in logs.

Source code in src/koheesio/steps/http.py
@field_validator(\"headers\", mode=\"before\")\ndef encode_sensitive_headers(cls, headers):\n    \"\"\"\n    Encode potentially sensitive data into pydantic.SecretStr class to prevent them\n    being displayed as plain text in logs.\n    \"\"\"\n    if auth := headers.get(\"Authorization\"):\n        headers[\"Authorization\"] = auth if isinstance(auth, SecretStr) else SecretStr(auth)\n    return headers\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.execute","title":"execute","text":"
execute() -> Output\n

Executes the HTTP request.

This method simply calls self.request(), which includes the retry logic. If self.request() raises an exception, it will be propagated to the caller of this method.

Raises:

  • requests.RequestException, requests.HTTPError: The last exception that was caught if self.request() fails after self.max_retries attempts.

Source code in src/koheesio/steps/http.py
def execute(self) -> Output:\n    \"\"\"\n    Executes the HTTP request.\n\n    This method simply calls `self.request()`, which includes the retry logic. If `self.request()` raises an\n    exception, it will be propagated to the caller of this method.\n\n    Raises\n    ------\n    requests.RequestException, requests.HTTPError\n        The last exception that was caught if `self.request()` fails after `self.max_retries` attempts.\n    \"\"\"\n    self.request()\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.get","title":"get","text":"
get() -> Response\n

Execute an HTTP GET call

Source code in src/koheesio/steps/http.py
def get(self) -> requests.Response:\n    \"\"\"Execute an HTTP GET call\"\"\"\n    self.method = HttpMethod.GET\n    return self.request()\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.get_headers","title":"get_headers","text":"
get_headers()\n

Dump headers into JSON without SecretStr masking.

Source code in src/koheesio/steps/http.py
def get_headers(self):\n    \"\"\"\n    Dump headers into JSON without SecretStr masking.\n    \"\"\"\n    return json.loads(self.model_dump_json()).get(\"headers\")\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.get_options","title":"get_options","text":"
get_options()\n

options to be passed to requests.request()

Source code in src/koheesio/steps/http.py
def get_options(self):\n    \"\"\"options to be passed to requests.request()\"\"\"\n    return {\n        \"url\": self.url,\n        \"headers\": self.get_headers(),\n        \"data\": self.data,\n        \"timeout\": self.timeout,\n        **self.params,  # type: ignore\n    }\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.get_proper_http_method_from_str_value","title":"get_proper_http_method_from_str_value","text":"
get_proper_http_method_from_str_value(method_value)\n

Converts string value to HttpMethod enum value

Source code in src/koheesio/steps/http.py
@field_validator(\"method\")\ndef get_proper_http_method_from_str_value(cls, method_value):\n    \"\"\"Converts string value to HttpMethod enum value\"\"\"\n    if isinstance(method_value, str):\n        try:\n            method_value = HttpMethod.from_string(method_value)\n        except AttributeError as e:\n            raise AttributeError(\n                \"Only values from HttpMethod class are allowed! \"\n                f\"Provided value: '{method_value}', allowed values: {', '.join(HttpMethod.__members__.keys())}\"\n            ) from e\n\n    return method_value\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.post","title":"post","text":"
post() -> Response\n

Execute an HTTP POST call

Source code in src/koheesio/steps/http.py
def post(self) -> requests.Response:\n    \"\"\"Execute an HTTP POST call\"\"\"\n    self.method = HttpMethod.POST\n    return self.request()\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.put","title":"put","text":"
put() -> Response\n

Execute an HTTP PUT call

Source code in src/koheesio/steps/http.py
def put(self) -> requests.Response:\n    \"\"\"Execute an HTTP PUT call\"\"\"\n    self.method = HttpMethod.PUT\n    return self.request()\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.request","title":"request","text":"
request(method: Optional[HttpMethod] = None) -> Response\n

Executes the HTTP request with retry logic.

Actual http_method execution is abstracted into this method. This avoids unnecessary code duplication and allows logging, setting outputs, and validation to be handled centrally.

This method will try to execute requests.request up to self.max_retries times. If self.request() raises an exception, it logs a warning message and the error message, then waits for self.initial_delay * (self.backoff ** i) seconds before retrying. The delay increases exponentially after each failed attempt due to the self.backoff ** i term.

If self.request() still fails after self.max_retries attempts, it logs an error message and re-raises the last exception that was caught.

This is a good way to handle temporary issues that might cause self.request() to fail, such as network errors or server downtime. The exponential backoff ensures that you're not constantly bombarding a server with requests if it's struggling to respond.

Parameters:

  • method (HttpMethod, default: None): Optional parameter that allows calls to different HTTP methods and bypassing the class-level method parameter.

Raises:

  • requests.RequestException, requests.HTTPError: The last exception that was caught if requests.request() fails after self.max_retries attempts.

Source code in src/koheesio/steps/http.py
def request(self, method: Optional[HttpMethod] = None) -> requests.Response:\n    \"\"\"\n    Executes the HTTP request with retry logic.\n\n    Actual http_method execution is abstracted into this method.\n    This is to avoid unnecessary code duplication. Allows to centrally log, set outputs, and validated.\n\n    This method will try to execute `requests.request` up to `self.max_retries` times. If `self.request()` raises\n    an exception, it logs a warning message and the error message, then waits for\n    `self.initial_delay * (self.backoff ** i)` seconds before retrying. The delay increases exponentially\n    after each failed attempt due to the `self.backoff ** i` term.\n\n    If `self.request()` still fails after `self.max_retries` attempts, it logs an error message and re-raises the\n    last exception that was caught.\n\n    This is a good way to handle temporary issues that might cause `self.request()` to fail, such as network errors\n    or server downtime. The exponential backoff ensures that you're not constantly bombarding a server with\n    requests if it's struggling to respond.\n\n    Parameters\n    ----------\n    method : HttpMethod\n        Optional parameter that allows calls to different HTTP methods and bypassing class level `method`\n        parameter.\n\n    Raises\n    ------\n    requests.RequestException, requests.HTTPError\n        The last exception that was caught if `requests.request()` fails after `self.max_retries` attempts.\n    \"\"\"\n    _method = (method or self.method).value.upper()\n    options = self.get_options()\n\n    self.log.debug(f\"Making {_method} request to {options['url']} with headers {options['headers']}\")\n\n    response = self.session.request(method=_method, **options)\n    response.raise_for_status()\n\n    self.log.debug(f\"Received response with status code {response.status_code} and body {response.text}\")\n    self.set_outputs(response)\n\n    return response\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.set_outputs","title":"set_outputs","text":"
set_outputs(response)\n

Types of response output

Source code in src/koheesio/steps/http.py
def set_outputs(self, response):\n    \"\"\"\n    Types of response output\n    \"\"\"\n    self.output.response_raw = response\n    self.output.raw_payload = response.text\n    self.output.status_code = response.status_code\n\n    # Only decode non empty payloads to avoid triggering decoding error unnecessarily.\n    if self.output.raw_payload:\n        try:\n            self.output.response_json = response.json()\n\n        except json.decoder.JSONDecodeError as e:\n            self.log.info(f\"An error occurred while processing the JSON payload. Error message:\\n{e.msg}\")\n
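
As a hypothetical usage sketch, assuming an HttpGetStep pointed at an illustrative endpoint and that executing the step issues the request and populates the outputs listed above:

from koheesio.steps.http import HttpGetStep\n\nstep = HttpGetStep(url=\"https://api.example.com/data\")  # illustrative endpoint\nstep.execute()\n\n# Fields populated by set_outputs\nprint(step.output.status_code)\nprint(step.output.raw_payload)\nprint(step.output.response_json)\n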
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep","title":"koheesio.steps.http.PaginatedHtppGetStep","text":"

Represents a paginated HTTP GET step.

Parameters:

Name Type Description Default paginate bool

Whether to paginate the API response. Defaults to False.

required pages int

Number of pages to paginate. Defaults to 1.

required offset int

Offset for paginated API calls. Offset determines the starting page. Defaults to 1.

required limit int

Limit for paginated API calls. Defaults to 100.

required"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.limit","title":"limit class-attribute instance-attribute","text":"
limit: Optional[int] = Field(default=100, description='Limit for paginated API calls. The url should (optionally) contain a named limit parameter, for example: api.example.com/data?limit={limit}')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.offset","title":"offset class-attribute instance-attribute","text":"
offset: Optional[int] = Field(default=1, description=\"Offset for paginated API calls. Offset determines the starting page. Defaults to 1. The url can (optionally) contain a named 'offset' parameter, for example: api.example.com/data?offset={offset}\")\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.pages","title":"pages class-attribute instance-attribute","text":"
pages: Optional[int] = Field(default=1, description='Number of pages to paginate. Defaults to 1')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.paginate","title":"paginate class-attribute instance-attribute","text":"
paginate: Optional[bool] = Field(default=False, description=\"Whether to paginate the API response. Defaults to False. When set to True, the API response will be paginated. The url should contain a named 'page' parameter for example: api.example.com/data?page={page}\")\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.execute","title":"execute","text":"
execute() -> Output\n

Executes the HTTP GET request and handles pagination.

Returns:

Type Description Output

The output of the HTTP GET request.

Source code in src/koheesio/steps/http.py
def execute(self) -> HttpGetStep.Output:\n    \"\"\"\n    Executes the HTTP GET request and handles pagination.\n\n    Returns\n    -------\n    HttpGetStep.Output\n        The output of the HTTP GET request.\n    \"\"\"\n    # Set up pagination parameters\n    offset, pages = (self.offset, self.pages + 1) if self.paginate else (1, 1)  # type: ignore\n    data = []\n    _basic_url = self.url\n\n    for page in range(offset, pages):\n        if self.paginate:\n            self.log.info(f\"Fetching page {page} of {pages - 1}\")\n\n        self.url = self._url(basic_url=_basic_url, page=page)\n        self.request()\n\n        if isinstance(self.output.response_json, list):\n            data += self.output.response_json\n        else:\n            data.append(self.output.response_json)\n\n    self.url = _basic_url\n    self.output.response_json = data\n    self.output.response_raw = None\n    self.output.raw_payload = None\n    self.output.status_code = None\n
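
A hypothetical usage sketch based on the parameters described above; the endpoint is illustrative, and the url contains the named page placeholder that is expected when paginate is set to True:

from koheesio.steps.http import PaginatedHtppGetStep\n\nstep = PaginatedHtppGetStep(\n    url=\"https://api.example.com/data?page={page}\",  # illustrative endpoint with a named 'page' parameter\n    paginate=True,\n    pages=3,    # fetch pages 1 through 3\n    offset=1,   # start at page 1\n)\nstep.execute()\n\n# After execution, the combined results of all pages are available as a list\nprint(step.output.response_json)\n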
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.get_options","title":"get_options","text":"
get_options()\n

Returns the options to be passed to the requests.request() function.

Returns:

Type Description dict

The options.

Source code in src/koheesio/steps/http.py
def get_options(self):\n    \"\"\"\n    Returns the options to be passed to the requests.request() function.\n\n    Returns\n    -------\n    dict\n        The options.\n    \"\"\"\n    options = {\n        \"url\": self.url,\n        \"headers\": self.get_headers(),\n        \"data\": self.data,\n        \"timeout\": self.timeout,\n        **self._adjust_params(),  # type: ignore\n    }\n\n    return options\n
"},{"location":"community/approach-documentation.html","title":"Approach documentation","text":"","tags":["doctype/explanation"]},{"location":"community/approach-documentation.html#scope","title":"Scope","text":"","tags":["doctype/explanation"]},{"location":"community/approach-documentation.html#the-system","title":"The System","text":"

We will be adopting \"The Documentation System\".

From documentation.divio.com:

There is a secret that needs to be understood in order to write good software documentation: there isn\u2019t one thing called documentation, there are four.

They are: tutorials, how-to guides, technical reference and explanation. They represent four different purposes or functions, and require four different approaches to their creation. Understanding the implications of this will help improve most documentation - often immensely.

About the system The documentation system outlined here is a simple, comprehensive and nearly universally-applicable scheme. It is proven in practice across a wide variety of fields and applications.

There are some very simple principles that govern documentation that are very rarely if ever spelled out. They seem to be a secret, though they shouldn\u2019t be.

If you can put these principles into practice, it will make your documentation better and your project, product or team more successful - that\u2019s a promise.

The system is widely adopted for large and small, open and proprietary documentation projects.

Video Presentation on YouTube:

","tags":["doctype/explanation"]},{"location":"community/contribute.html","title":"Contribute","text":""},{"location":"community/contribute.html#how-to-contribute","title":"How to contribute","text":"

There are a few guidelines that we need contributors to follow so that we are able to process requests as efficiently as possible. If you have any questions or concerns please feel free to contact us at opensource@nike.com.

"},{"location":"community/contribute.html#getting-started","title":"Getting Started","text":"
  • Review our Code of Conduct
  • Make sure you have a GitHub account
  • Submit a ticket for your issue, assuming one does not already exist.
    • Clearly describe the issue including steps to reproduce when it is a bug.
    • Make sure you fill in the earliest version that you know has the issue.
  • Fork the repository on GitHub
"},{"location":"community/contribute.html#making-changes","title":"Making Changes","text":"
  • Create a feature branch off of main before you start your work.
    • Please avoid working directly on the main branch.
  • Set up the required package manager, hatch
  • Set up the dev environment (see below)
  • Make commits of logical units.
    • You may be asked to squash unnecessary commits down to logical units.
  • Check for unnecessary whitespace with git diff --check before committing.
  • Write meaningful, descriptive commit messages.
  • Please follow existing code conventions when working on a file
  • Make sure to check the code against the standards (see below)
  • Make sure to test the code before you push changes (see below)
"},{"location":"community/contribute.html#submitting-changes","title":"\ud83e\udd1d Submitting Changes","text":"
  • Push your changes to a topic branch in your fork of the repository.
  • Submit a pull request to the repository in the Nike-Inc organization.
  • After feedback has been given we expect responses within two weeks. After two weeks we may close the pull request if it isn't showing any activity.
  • Bug fixes or features that lack appropriate tests may not be considered for merge.
  • Changes that lower test coverage may not be considered for merge.
"},{"location":"community/contribute.html#make-commands","title":"\ud83d\udd28 Make commands","text":"

We use make for managing different steps of setup and maintenance in the project. You can install make by following the instructions here

For a full list of available make commands, you can run:

make help\n
"},{"location":"community/contribute.html#package-manager","title":"\ud83d\udce6 Package manager","text":"

We use hatch as our package manager.

Note: Please DO NOT use pip or conda to install the dependencies. Instead, use hatch.

To install hatch, run the following command:

make init\n

or,

make hatch-install\n

This will install hatch using brew if you are on a Mac.

If you are on a different OS, you can follow the instructions here

"},{"location":"community/contribute.html#dev-environment-setup","title":"\ud83d\udccc Dev Environment Setup","text":"

To comply with our standards, make sure to install the required development packages:

make dev\n

This will install all the required packages for development in the project under the .venv directory. Use this virtual environment to run the code and tests during local development.

"},{"location":"community/contribute.html#linting-and-standards","title":"\ud83e\uddf9 Linting and Standards","text":"

We use ruff, pylint, isort, black and mypy to maintain standards in the codebase.

Run the following two commands to check the codebase for any issues:

make check\n
This will run all the checks including pylint and mypy.

make fmt\n
This will format the codebase using black, isort, and ruff.

Make sure that the linters and formatters do not report any errors or warnings before submitting a pull request.

"},{"location":"community/contribute.html#testing","title":"\ud83e\uddea Testing","text":"

We use pytest to test our code.

You can run the tests by running one of the following commands:

make cov  # to run the tests and check the coverage\nmake all-tests  # to run all the tests\nmake spark-tests  # to run the spark tests\nmake non-spark-tests  # to run the non-spark tests\n

Make sure that all tests pass and that you have adequate coverage before submitting a pull request.

"},{"location":"community/contribute.html#additional-resources","title":"Additional Resources","text":"
  • General GitHub documentation
  • GitHub pull request documentation
  • Nike's Code of Conduct
  • Nike's Individual Contributor License Agreement
  • Nike OSS
"},{"location":"includes/glossary.html","title":"Glossary","text":""},{"location":"includes/glossary.html#pydantic","title":"Pydantic","text":"

Pydantic is a Python library for data validation and settings management using Python type annotations. It allows Koheesio to bring in strong typing and a high level of type safety. Essentially, it allows Koheesio to consider configurations of a pipeline (i.e. the settings used inside Steps, Tasks, etc.) as data that can be validated and structured.
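
For illustration, a minimal plain-Pydantic sketch of the kind of validation described here (the model and its fields are made up for this example and are not part of Koheesio):

from pydantic import BaseModel, ValidationError\n\n\nclass TableConfig(BaseModel):\n    \"\"\"Hypothetical configuration model\"\"\"\n    catalog: str\n    table: str\n    partitions: int = 1\n\n\n# Values are parsed and validated against the type annotations\nconfig = TableConfig(catalog=\"catalog\", table=\"table_name\", partitions=\"4\")\nprint(config.partitions)  # 4, coerced to int\n\n# Invalid input raises a clear validation error\ntry:\n    TableConfig(catalog=\"catalog\", table=\"table_name\", partitions=\"not-a-number\")\nexcept ValidationError as e:\n    print(e)\n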

"},{"location":"includes/glossary.html#pyspark","title":"PySpark","text":"

PySpark is a Python library for Apache Spark, a powerful open-source data processing engine. It allows Koheesio to handle large-scale data processing tasks efficiently.

"},{"location":"misc/info.html","title":"Info","text":"

{{ macros_info() }}

"},{"location":"reference/concepts/concepts.html","title":"Concepts","text":"

The framework architecture is built from a set of core components. Each of the implementations that the framework provides out of the box can be swapped out for a custom implementation, as long as it matches the API.

The core components are the following:

Note: click on a 'Concept' below to go to the corresponding module. The module documentation provides greater detail on the specifics of the implementation.

"},{"location":"reference/concepts/concepts.html#step","title":"Step","text":"

A custom unit of logic that can be executed. A Step is an atomic operation and serves as the building block of data pipelines built with the framework. A step can be seen as an operation that takes a set of inputs and returns a set of outputs. This does not imply that steps are stateless (e.g. data writes)! This concept is visualized in the figure below.

\nflowchart LR\n\n%% Should render like this\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 1 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502                  \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 1 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2502                  \u2502        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n%%                    \u2502                  \u2502\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u2502                  \u2502        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 2 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502       Step       \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 2 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2502                  \u2502        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n%%                    \u2502                  \u2502\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u2502                  \u2502        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 3 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502                  \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 3 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n\n%% &nbsp; is for increasing the box size without having to mess with CSS settings\nStep[\"\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;\n&nbsp;\n&nbsp;\nStep\n&nbsp;\n&nbsp;\n&nbsp;\n\"]\n\nI1[\"Input 1\"] ---> Step\nI2[\"Input 2\"] ---> Step\nI3[\"Input 3\"] ---> Step\n\nStep ---> O1[\"Output 1\"]\nStep ---> O2[\"Output 2\"]\nStep ---> O3[\"Output 3\"]\n

Step is the core abstraction of the framework. Meaning, that it is the core building block of the framework and is used to define all the operations that can be executed.

Please see the Step documentation for more details.

"},{"location":"reference/concepts/concepts.html#task","title":"Task","text":"

The unit of work of one execution of the framework.

An execution usually consists of an Extract - Transform - Load approach of one data object. Tasks typically consist of a series of Steps.

Please see the Task documentation for more details.

"},{"location":"reference/concepts/concepts.html#context","title":"Context","text":"

The Context is used to configure the environment where a Task or Step runs.

It is often based on configuration files and can be used to adapt behaviour of a Task or Step based on the environment it runs in.

Please see the Context documentation for more details.

"},{"location":"reference/concepts/concepts.html#logger","title":"logger","text":"

A logger object to log messages with different levels.

Please see the Logging documentation for more details.

The interactions between the base concepts of the model are visible in the diagram below:

---\ntitle: Koheesio Class Diagram\n---\nclassDiagram\n    Step .. Task\n    Step .. Transformation\n    Step .. Reader\n    Step .. Writer\n\n    class Context\n\n    class LoggingFactory\n\n    class Task{\n        <<abstract>>\n        + List~Step~ steps\n        ...\n        + execute() Output\n    }\n\n    class Step{\n        <<abstract>>\n        ...\n        Output: ...\n        + execute() Output\n    }\n\n    class Transformation{\n        <<abstract>>\n        + df: DataFrame\n        ...\n        Output:\n        + df: DataFrame\n        + transform(df: DataFrame) DataFrame\n    }\n\n    class Reader{\n        <<abstract>>\n        ...\n        Output:\n        + df: DataFrame\n        + read() DataFrame\n    }\n\n    class Writer{\n        <<abstract>>\n        + df: DataFrame\n        ...\n        + write(df: DataFrame)\n    }
"},{"location":"reference/concepts/context.html","title":"Context in Koheesio","text":"

In the Koheesio framework, the Context class plays a pivotal role. It serves as a flexible and powerful tool for managing configuration data and shared variables across tasks and steps in your application.

Context behaves much like a Python dictionary, but with additional features that enhance its usability and flexibility. It allows you to store and retrieve values, including complex Python objects, with ease. You can access these values using dictionary-like methods or as class attributes, providing a simple and intuitive interface.

Moreover, Context supports nested keys and recursive merging of contexts, making it a versatile tool for managing complex configurations. It also provides serialization and deserialization capabilities, allowing you to easily save and load configurations in JSON, YAML, or TOML formats.

Whether you're setting up the environment for a Task or Step, or managing variables shared across multiple tasks, Context provides a robust and efficient solution.

This document will guide you through its key features and show you how to leverage its capabilities in your Koheesio applications.

"},{"location":"reference/concepts/context.html#api-reference","title":"API Reference","text":"

See API Reference for a detailed description of the Context class and its methods.

"},{"location":"reference/concepts/context.html#key-features","title":"Key Features","text":"
  • Accessing Values: Context simplifies accessing configuration values. You can access them using dictionary-like methods or as class attributes. This allows for a more intuitive interaction with the Context object. For example:

    context = Context({\"bronze_table\": \"catalog.schema.table_name\"})\nprint(context.bronze_table)  # Outputs: catalog.schema.table_name\n
  • Nested Keys: Context supports nested keys, allowing you to access and add nested keys in a straightforward way. This is useful when dealing with complex configurations that require a hierarchical structure. For example:

    context = Context({\"bronze\": {\"table\": \"catalog.schema.table_name\"}})\nprint(context.bronze.table)  # Outputs: catalog.schema.table_name\n
  • Merging Contexts: You can merge two Contexts together, with the incoming Context having priority. Recursive merging is also supported. This is particularly useful when you want to update a Context with new data without losing the existing values. For example:

    context1 = Context({\"bronze_table\": \"catalog.schema.table_name\"})\ncontext2 = Context({\"silver_table\": \"catalog.schema.table_name\"})\ncontext1.merge(context2)\nprint(context1.silver_table)  # Outputs: catalog.schema.table_name\n
  • Adding Keys: You can add keys to a Context by using the add method. This allows you to dynamically update the Context as needed. For example:

    context.add(\"silver_table\", \"catalog.schema.table_name\")\n
  • Checking Key Existence: You can check if a key exists in a Context by using the contains method. This is useful when you want to ensure a key is present before attempting to access its value. For example:

    context.contains(\"silver_table\")  # Returns: True\n
  • Getting Key-Value Pair: You can get a key-value pair from a Context by using the get_item method. This can be useful when you want to extract a specific piece of data from the Context. For example:

    context.get_item(\"silver_table\")  # Returns: {\"silver_table\": \"catalog.schema.table_name\"}\n
  • Converting to Dictionary: You can convert a Context to a dictionary by using the to_dict method. This can be useful when you need to interact with code that expects a standard Python dictionary. For example:

    context_dict = context.to_dict()\n
  • Creating from Dictionary: You can create a Context from a dictionary by using the from_dict method. This allows you to easily convert existing data structures into a Context. For example:

    context = Context.from_dict({\"bronze_table\": \"catalog.schema.table_name\"})\n
"},{"location":"reference/concepts/context.html#advantages-over-a-dictionary","title":"Advantages over a Dictionary","text":"

While a dictionary can be used to store configuration values, Context provides several advantages:

  • Support for nested keys: Unlike a standard Python dictionary, Context allows you to access nested keys as if they were attributes. This makes it easier to work with complex, hierarchical data.

  • Recursive merging of two Contexts: Context allows you to merge two Contexts together, with the incoming Context having priority. This is useful when you want to update a Context with new data without losing the existing values.

  • Accessing keys as if they were class attributes: This provides a more intuitive way to interact with the Context, as you can use dot notation to access values.

  • Code completion in IDEs: Because you can access keys as if they were attributes, IDEs can provide code completion for Context keys. This can make your coding process more efficient and less error-prone.

  • Easy creation from a YAML, JSON, or TOML file: Context provides methods to easily load data from YAML or JSON files, making it a great tool for managing configuration data.
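
To make the contrast concrete, here is a short sketch comparing plain-dictionary access with Context access, using the same kind of data as the examples above:

from koheesio.context import Context\n\ndata = {\"database\": {\"bronze_table\": \"catalog.schema.bronze_table\"}}\n\n# Plain dictionary: nested values are only reachable via chained keys\nprint(data[\"database\"][\"bronze_table\"])\n\n# Context: nested keys are also reachable as attributes\ncontext = Context(data)\nprint(context.database.bronze_table)\n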

"},{"location":"reference/concepts/context.html#data-formats-and-serialization","title":"Data Formats and Serialization","text":"

Context leverages JSON, YAML, and TOML for serialization and deserialization. These formats are widely used in the industry and provide a balance between readability and ease of use.

  • JSON: A lightweight data-interchange format that is easy for humans to read and write and easy for machines to parse and generate. It's widely used for APIs and web-based applications.

  • YAML: A human-friendly data serialization standard often used for configuration files. It's more readable than JSON and supports complex data structures.

  • TOML: A minimal configuration file format that's easy to read due to its clear and simple syntax. It's often used for configuration files in Python applications.

"},{"location":"reference/concepts/context.html#examples","title":"Examples","text":"

In this section, we provide a variety of examples to demonstrate the capabilities of the Context class in Koheesio.

"},{"location":"reference/concepts/context.html#basic-operations","title":"Basic Operations","text":"

Here are some basic operations you can perform with Context. These operations form the foundation of how you interact with a Context object:

# Create a Context\ncontext = Context({\"bronze_table\": \"catalog.schema.table_name\"})\n\n# Access a value\nvalue = context.bronze_table\n\n# Add a key\ncontext.add(\"silver_table\", \"catalog.schema.table_name\")\n\n# Merge two Contexts\ncontext.merge(Context({\"silver_table\": \"catalog.schema.table_name\"}))\n
"},{"location":"reference/concepts/context.html#serialization-and-deserialization","title":"Serialization and Deserialization","text":"

Context supports serialization and deserialization to and from JSON, YAML, and TOML formats. This allows you to easily save and load Context data:

# Load context from a JSON file\ncontext = Context.from_json(\"path/to/context.json\")\n\n# Save context to a JSON file\ncontext.to_json(\"path/to/context.json\")\n\n# Load context from a YAML file\ncontext = Context.from_yaml(\"path/to/context.yaml\")\n\n# Save context to a YAML file\ncontext.to_yaml(\"path/to/context.yaml\")\n\n# Load context from a TOML file\ncontext = Context.from_toml(\"path/to/context.toml\")\n\n# Save context to a TOML file\ncontext.to_toml(\"path/to/context.toml\")\n
"},{"location":"reference/concepts/context.html#nested-keys","title":"Nested Keys","text":"

Context supports nested keys, allowing you to create hierarchical configurations. This is useful when dealing with complex data structures:

# Create a Context with nested keys\ncontext = Context({\n    \"database\": {\n        \"bronze_table\": \"catalog.schema.bronze_table\",\n        \"silver_table\": \"catalog.schema.silver_table\",\n        \"gold_table\": \"catalog.schema.gold_table\"\n    }\n})\n\n# Access a nested key\nprint(context.database.bronze_table)  # Outputs: catalog.schema.bronze_table\n
"},{"location":"reference/concepts/context.html#recursive-merging","title":"Recursive Merging","text":"

Context also supports recursive merging, allowing you to merge two Contexts together at all levels of their hierarchy. This is particularly useful when you want to update a Context with new data without losing the existing values:

# Create two Contexts with nested keys\ncontext1 = Context({\n    \"database\": {\n        \"bronze_table\": \"catalog.schema.bronze_table\",\n        \"silver_table\": \"catalog.schema.silver_table\"\n    }\n})\n\ncontext2 = Context({\n    \"database\": {\n        \"silver_table\": \"catalog.schema.new_silver_table\",\n        \"gold_table\": \"catalog.schema.gold_table\"\n    }\n})\n\n# Merge the two Contexts\ncontext1.merge(context2)\n\n# Print the merged Context\nprint(context1.to_dict())  \n# Outputs: \n# {\n#     \"database\": {\n#         \"bronze_table\": \"catalog.schema.bronze_table\",\n#         \"silver_table\": \"catalog.schema.new_silver_table\",\n#         \"gold_table\": \"catalog.schema.gold_table\"\n#     }\n# }\n
"},{"location":"reference/concepts/context.html#jsonpickle-and-complex-python-objects","title":"Jsonpickle and Complex Python Objects","text":"

The Context class in Koheesio also uses jsonpickle for serialization and deserialization of complex Python objects to and from JSON. This allows you to convert complex Python objects, including custom classes, into a format that can be easily stored and transferred.

Here's an example of how this works:

# Import necessary modules\nfrom koheesio.context import Context\n\n# Initialize SnowflakeReader and store in a Context\nsnowflake_reader = SnowflakeReader(...)  # fill in with necessary arguments\ncontext = Context({\"snowflake_reader\": snowflake_reader})\n\n# Serialize the Context to a JSON string\njson_str = context.to_json()\n\n# Print the serialized Context\nprint(json_str)\n\n# Deserialize the JSON string back into a Context\ndeserialized_context = Context.from_json(json_str)\n\n# Access the deserialized SnowflakeReader\ndeserialized_snowflake_reader = deserialized_context.snowflake_reader\n\n# Now you can use the deserialized SnowflakeReader as you would the original\n

This feature is particularly useful when you need to save the state of your application, transfer it over a network, or store it in a database. When you're ready to use the stored data, you can easily convert it back into the original Python objects.

However, there are a few things to keep in mind:

  1. The classes you're serializing must be importable (i.e., they must be in the Python path) when you're deserializing the JSON. jsonpickle needs to be able to import the class to reconstruct the object. This holds true for most Koheesio classes, as they are designed to be importable and reconstructible.

  2. Not all Python objects can be serialized. For example, objects that hold a reference to a file or a network connection can't be serialized because their state can't be easily captured in a static file.

  3. As mentioned in the code comments, jsonpickle is not secure against malicious data. You should only deserialize data that you trust.

So, while the Context class provides a powerful tool for handling complex Python objects, it's important to be aware of these limitations.

"},{"location":"reference/concepts/context.html#conclusion","title":"Conclusion","text":"

In this document, we've covered the key features of the Context class in the Koheesio framework, including its ability to handle complex Python objects, support for nested keys and recursive merging, and its serialization and deserialization capabilities.

Whether you're setting up the environment for a Task or Step, or managing variables shared across multiple tasks, Context provides a robust and efficient solution.

"},{"location":"reference/concepts/context.html#further-reading","title":"Further Reading","text":"

For more information, you can refer to the following resources:

  • Python jsonpickle Documentation
  • Python JSON Documentation
  • Python YAML Documentation
  • Python TOML Documentation

Refer to the API documentation for more details on the Context class and its methods.

"},{"location":"reference/concepts/logger.html","title":"Python Logger Code Instructions","text":"

Here you can find instructions on how to use the Koheesio Logging Factory.

"},{"location":"reference/concepts/logger.html#logging-factory","title":"Logging Factory","text":"

The LoggingFactory class is a factory for creating and configuring loggers. To use it, follow these steps:

  1. Import the necessary modules:

    from koheesio.logger import LoggingFactory\n
  2. Initialize logging factory for koheesio modules:

    factory = LoggingFactory(name=\"replace_koheesio_parent_name\", env=\"local\", logger_id=\"your_run_id\")\n# Or use default \nfactory = LoggingFactory()\n# Or just specify log level for koheesio modules\nfactory = LoggingFactory(level=\"DEBUG\")\n
  3. Create a logger by calling the create_logger method of the LoggingFactory class; you can also inherit from the koheesio logger:

    logger = LoggingFactory.get_logger(name=factory.LOGGER_NAME)\n# Or, for koheesio modules, inherit from the koheesio logger\nlogger = LoggingFactory.get_logger(name=factory.LOGGER_NAME, inherit_from_koheesio=True)\n

  4. You can now use the logger object to log messages:

    logger.debug(\"Debug message\")\nlogger.info(\"Info message\")\nlogger.warning(\"Warning message\")\nlogger.error(\"Error message\")\nlogger.critical(\"Critical message\")\n
  5. (Optional) You can add additional handlers to the logger by calling the add_handlers method of the LoggingFactory class:

    handlers = [\n    (\"your_handler_module.YourHandlerClass\", {\"level\": \"INFO\"}),\n    # Add more handlers if needed\n]\nfactory.add_handlers(handlers)\n
  6. (Optional) You can create child loggers based on the parent logger by calling the get_logger method of the LoggingFactory class:

    child_logger = factory.get_logger(name=\"your_child_logger_name\")\n
  7. (Optional) Get an independent logger without inheritance

    If you need an independent logger without inheriting from the LoggingFactory logger, you can use the get_logger method:

    your_logger = factory.get_logger(name=\"your_logger_name\", inherit=False)\n

By setting inherit to False, you will obtain a logger that is not tied to the LoggingFactory logger hierarchy; only the message format will be the same, and you can change that as well. This allows you to have an independent logger with its own configuration. You can use the your_logger object to log messages:

your_logger.debug(\"Debug message\")\nyour_logger.info(\"Info message\")\nyour_logger.warning(\"Warning message\")\nyour_logger.error(\"Error message\")\nyour_logger.critical(\"Critical message\")\n
  8. (Optional) You can use Masked types to mask secrets/tokens/passwords in output. The Masked types are special types provided by the koheesio library to handle sensitive data that should not be logged or printed in plain text. They wrap sensitive data and override its string representation to prevent accidental exposure of the data. Here are some examples of how to use Masked types:

    import logging\nfrom koheesio.logger import MaskedString, MaskedInt, MaskedFloat, MaskedDict\n\n# Set up logging\nlogger = logging.getLogger(__name__)\nlogging.basicConfig(level=logging.INFO)\n\n# Using MaskedString\nmasked_string = MaskedString(\"my secret string\")\nlogger.info(masked_string)  # This will not log the actual string\n\n# Using MaskedInt\nmasked_int = MaskedInt(12345)\nlogger.info(masked_int)  # This will not log the actual integer\n\n# Using MaskedFloat\nmasked_float = MaskedFloat(3.14159)\nlogger.info(masked_float)  # This will not log the actual float\n\n# Using MaskedDict\nmasked_dict = MaskedDict({\"key\": \"value\"})\nlogger.info(masked_dict)  # This will not log the actual dictionary\n

Please make sure to replace \"your_logger_name\", \"your_run_id\", \"your_handler_module.YourHandlerClass\", \"your_child_logger_name\", and other placeholders with your own values according to your application's requirements.

By following these steps, you can obtain an independent logger without inheriting from the LoggingFactory logger. This allows you to customize the logger configuration and use it separately in your code.

Note: Ensure that you have imported the necessary modules, instantiated the LoggingFactory class, and customized the logger name and other parameters according to your application's requirements.

"},{"location":"reference/concepts/logger.html#example","title":"Example","text":"
import logging\n\n# Step 2: Instantiate the LoggingFactory class\nfactory = LoggingFactory(env=\"local\")\n\n# Step 3: Create an independent logger with a custom log level\nyour_logger = factory.get_logger(\"your_logger\", inherit_from_koheesio=False)\nyour_logger.setLevel(logging.DEBUG)\n\n# Step 4: Create a logger using the create_logger method from LoggingFactory with a different log level\nfactory_logger = LoggingFactory(level=\"WARNING\").get_logger(name=factory.LOGGER_NAME)\n\n# Step 5: Create a child logger with a debug level\nchild_logger = factory.get_logger(name=\"child\")\nchild_logger.setLevel(logging.DEBUG)\n\nchild2_logger = factory.get_logger(name=\"child2\")\nchild2_logger.setLevel(logging.INFO)\n\n# Step 6: Log messages at different levels for both loggers\nyour_logger.debug(\"Debug message\")  # This message will be displayed\nyour_logger.info(\"Info message\")  # This message will be displayed\nyour_logger.warning(\"Warning message\")  # This message will be displayed\nyour_logger.error(\"Error message\")  # This message will be displayed\nyour_logger.critical(\"Critical message\")  # This message will be displayed\n\nfactory_logger.debug(\"Debug message\")  # This message will not be displayed\nfactory_logger.info(\"Info message\")  # This message will not be displayed\nfactory_logger.warning(\"Warning message\")  # This message will be displayed\nfactory_logger.error(\"Error message\")  # This message will be displayed\nfactory_logger.critical(\"Critical message\")  # This message will be displayed\n\nchild_logger.debug(\"Debug message\")  # This message will be displayed\nchild_logger.info(\"Info message\")  # This message will be displayed\nchild_logger.warning(\"Warning message\")  # This message will be displayed\nchild_logger.error(\"Error message\")  # This message will be displayed\nchild_logger.critical(\"Critical message\")  # This message will be displayed\n\nchild2_logger.debug(\"Debug message\")  # This message will be displayed\nchild2_logger.info(\"Info message\")  # This message will be displayed\nchild2_logger.warning(\"Warning message\")  # This message will be displayed\nchild2_logger.error(\"Error message\")  # This message will be displayed\nchild2_logger.critical(\"Critical message\")  # This message will be displayed\n

Output:

[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [DEBUG] [your_logger] {__init__.py:<module>:118} - Debug message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [INFO] [your_logger] {__init__.py:<module>:119} - Info message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [WARNING] [your_logger] {__init__.py:<module>:120} - Warning message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [ERROR] [your_logger] {__init__.py:<module>:121} - Error message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [CRITICAL] [your_logger] {__init__.py:<module>:122} - Critical message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [WARNING] [koheesio] {__init__.py:<module>:126} - Warning message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [ERROR] [koheesio] {__init__.py:<module>:127} - Error message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [CRITICAL] [koheesio] {__init__.py:<module>:128} - Critical message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [DEBUG] [koheesio.child] {__init__.py:<module>:130} - Debug message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [INFO] [koheesio.child] {__init__.py:<module>:131} - Info message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [WARNING] [koheesio.child] {__init__.py:<module>:132} - Warning message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [ERROR] [koheesio.child] {__init__.py:<module>:133} - Error message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [CRITICAL] [koheesio.child] {__init__.py:<module>:134} - Critical message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [INFO] [koheesio.child2] {__init__.py:<module>:137} - Info message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [WARNING] [koheesio.child2] {__init__.py:<module>:138} - Warning message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [ERROR] [koheesio.child2] {__init__.py:<module>:139} - Error message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [CRITICAL] [koheesio.child2] {__init__.py:<module>:140} - Critical message\n
"},{"location":"reference/concepts/logger.html#loggeridfilter-class","title":"LoggerIDFilter Class","text":"

The LoggerIDFilter class is a filter that injects run_id information into the log. To use it, follow these steps:

  1. Import the necessary modules:

    import logging\n\nfrom koheesio.logger import LoggerIDFilter\n
  2. Create an instance of the LoggerIDFilter class:

    logger_filter = LoggerIDFilter()\n
  3. Set the LOGGER_ID attribute of the LoggerIDFilter class to the desired run ID:

    LoggerIDFilter.LOGGER_ID = \"your_run_id\"\n
  4. Add the logger_filter to your logger or handler:

    logger = logging.getLogger(\"your_logger_name\")\nlogger.addFilter(logger_filter)\n
"},{"location":"reference/concepts/logger.html#loggingfactory-set-up-optional","title":"LoggingFactory Set Up (Optional)","text":"
  1. Import the LoggingFactory class in your application code.

  2. Set the value for the LOGGER_FILTER variable:

    • If you want to assign a specific logging.Filter instance, replace None with your desired filter instance.
    • If you want to keep the default value of None, leave it unchanged.
  3. Set the value for the LOGGER_LEVEL variable:

    • If you want to use the value from the \"KOHEESIO_LOGGING_LEVEL\" environment variable, leave the code as is.
    • If you want to use a different environment variable or a specific default value, modify the code accordingly.
  4. Set the value for the LOGGER_ENV variable:

    • Replace \"local\" with your desired environment name.
  5. Set the value for the LOGGER_FORMAT variable:

    • If you want to customize the log message format, modify the value within the double quotes.
    • The format should follow the desired log message format pattern.
  6. Set the value for the LOGGER_FORMATTER variable:

    • If you want to assign a specific Formatter instance, replace Formatter(LOGGER_FORMAT) with your desired formatter instance.
    • If you want to keep the default formatter with the defined log message format, leave it unchanged.
  7. Set the value for the CONSOLE_HANDLER variable:

    • If you want to assign a specific logging.Handler instance, replace None with your desired handler instance.
    • If you want to keep the default value of None, leave it unchanged.
  8. Set the value for the ENV variable:

    • Replace None with your desired environment value if applicable.
    • If you don't need to set this variable, leave it as None.
  9. Save the changes to the file.

"},{"location":"reference/concepts/step.html","title":"Steps in Koheesio","text":"

In the Koheesio framework, the Step class and its derivatives play a crucial role. They serve as the building blocks for creating data pipelines, allowing you to define custom units of logic that can be executed. This document will guide you through its key features and show you how to leverage its capabilities in your Koheesio applications.

Several type of Steps are available in Koheesio, including Reader, Transformation, Writer, and Task.

"},{"location":"reference/concepts/step.html#what-is-a-step","title":"What is a Step?","text":"

A Step is an atomic operation serving as the building block of data pipelines built with the Koheesio framework. Tasks typically consist of a series of Steps.

A step can be seen as an operation that takes a set of inputs and returns a set of outputs. This does not imply that steps are stateless (e.g. data writes)! This concept is visualized in the figure below.

\nflowchart LR\n\n%% Should render like this\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 1 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502                  \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 1 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2502                  \u2502        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n%%                    \u2502                  \u2502\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u2502                  \u2502        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 2 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502       Step       \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 2 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2502                  \u2502        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n%%                    \u2502                  \u2502\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510        \u2502                  \u2502        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 3 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502                  \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 3 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518        \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n\n%% &nbsp; is for increasing the box size without having to mess with CSS settings\nStep[\"\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;\n&nbsp;\n&nbsp;\nStep\n&nbsp;\n&nbsp;\n&nbsp;\n\"]\n\nI1[\"Input 1\"] ---> Step\nI2[\"Input 2\"] ---> Step\nI3[\"Input 3\"] ---> Step\n\nStep ---> O1[\"Output 1\"]\nStep ---> O2[\"Output 2\"]\nStep ---> O3[\"Output 3\"]\n
"},{"location":"reference/concepts/step.html#how-to-read-a-step","title":"How to Read a Step?","text":"

A Step in Koheesio is a class that represents a unit of work in a data pipeline. It's similar to a Python built-in data class, but with additional features for execution, validation, and logging.

When you look at a Step, you'll typically see the following components:

  1. Class Definition: The Step is defined as a class that inherits from the base Step class in Koheesio. For example, class MyStep(Step):.

  2. Input Fields: These are defined as class attributes with type annotations, similar to attributes in a Python data class. These fields represent the inputs to the Step. For example, a: str defines an input field a of type str. Additionally, you will often see these fields defined using Pydantic's Field class, which allows for more detailed validation and documentation as well as default values and aliasing.

  3. Output Fields: These are defined in a nested class called Output that inherits from StepOutput. This class represents the output of the Step. For example, class Output(StepOutput): b: str defines an output field b of type str.

  4. Execute Method: This is a method that you need to implement when you create a new Step. It contains the logic of the Step and is where you use the input fields and populate the output fields. For example, def execute(self): self.output.b = f\"{self.a}-some-suffix\".

Here's an example of a Step:

class MyStep(Step):\n    a: str  # input\n\n    class Output(StepOutput):  # output\n        b: str\n\n    def execute(self) -> \"MyStep.Output\":  # quoted annotation: MyStep is not yet defined at this point\n        self.output.b = f\"{self.a}-some-suffix\"\n

In this Step, a is an input field of type str, b is an output field of type str, and the execute method appends -some-suffix to the input a and assigns it to the output b.

When you see a Step, you can think of it as a function where the class attributes are the inputs, the Output class defines the outputs, and the execute method is the function body. The main difference is that a Step also includes automatic validation of inputs and outputs (thanks to Pydantic), logging, and error handling.

"},{"location":"reference/concepts/step.html#understanding-inheritance-in-steps","title":"Understanding Inheritance in Steps","text":"

Inheritance is a core concept in object-oriented programming where a class (child or subclass) inherits properties and methods from another class (parent or superclass). In the context of Koheesio, when you create a new Step, you're creating a subclass that inherits from the base Step class.

When a new Step is defined (like class MyStep(Step):), it inherits all the properties and methods from the Step class. This includes the execute method, which is then overridden to provide the specific functionality for that Step.

Here's a simple breakdown:

  1. Parent Class (Superclass): This is the Step class in Koheesio. It provides the basic structure and functionalities of a Step, including input and output validation, logging, and error handling.

  2. Child Class (Subclass): This is the new Step you define, like MyStep. It inherits all the properties and methods from the Step class and can add or override them as needed.

  3. Inheritance: This is the process where MyStep inherits the properties and methods from the Step class. In Python, this is done by mentioning the parent class in parentheses when defining the child class, like class MyStep(Step):.

  4. Overriding: This is when you provide a new implementation of a method in the child class that is already defined in the parent class. In the case of Steps, you override the execute method to define the specific logic of your Step.

Understanding inheritance is key to understanding how Steps work in Koheesio. It allows you to leverage the functionalities provided by the Step class and focus on implementing the specific logic of your Step.

"},{"location":"reference/concepts/step.html#benefits-of-using-steps-in-data-pipelines","title":"Benefits of Using Steps in Data Pipelines","text":"

The concept of a Step is beneficial when creating Data Pipelines or Data Products for several reasons:

  1. Modularity: Each Step represents a self-contained unit of work, which makes the pipeline modular. This makes it easier to understand, test, and maintain the pipeline. If a problem arises, you can pinpoint which step is causing the issue.

  2. Reusability: Steps can be reused across different pipelines. Once a Step is defined, it can be used in any number of pipelines. This promotes code reuse and consistency across projects.

  3. Readability: Steps make the pipeline code more readable. Each Step has a clear input, output, and execution logic, which makes it easier to understand what each part of the pipeline is doing.

  4. Validation: Steps automatically validate their inputs and outputs. This ensures that the data flowing into and out of each step is of the expected type and format, which can help catch errors early.

  5. Logging: Steps automatically log the start and end of their execution, along with the input and output data. This can be very useful for debugging and understanding the flow of data through the pipeline.

  6. Error Handling: Steps provide built-in error handling. If an error occurs during the execution of a step, it is caught, logged, and then re-raised. This provides a clear indication of where the error occurred.

  7. Scalability: Steps can be easily parallelized or distributed, which is crucial for processing large datasets. This is especially true for steps that are designed to work with distributed computing frameworks like Apache Spark.

By using the concept of a Step, you can create data pipelines that are modular, reusable, readable, and robust, while also being easier to debug and scale.

"},{"location":"reference/concepts/step.html#compared-to-a-regular-pydantic-basemodel","title":"Compared to a regular Pydantic Basemodel","text":"

A Step in Koheesio, while built on top of Pydantic's BaseModel, provides additional features specifically designed for creating data pipelines. Here are some key differences:

  1. Execution Method: A Step includes an execute method that needs to be implemented. This method contains the logic of the step and is automatically decorated with functionalities such as logging and output validation.

  2. Input and Output Validation: A Step uses Pydantic models to define and validate its inputs and outputs. This ensures that the data flowing into and out of the step is of the expected type and format.

  3. Automatic Logging: A Step automatically logs the start and end of its execution, along with the input and output data. This is done through the do_execute decorator applied to the execute method.

  4. Error Handling: A Step provides built-in error handling. If an error occurs during the execution of the step, it is caught, logged, and then re-raised. This should help in debugging and understanding the flow of data.

  5. Serialization: A Step can be serialized to a YAML string using the to_yaml method. This can be useful for saving and loading steps.

  6. Lazy Mode Support: The StepOutput class in a Step supports lazy mode, which allows validation of the items stored in the class to be called at will instead of being forced to run it upfront.

In contrast, a regular Pydantic BaseModel is a simple data validation model that doesn't include these additional features. It's used for data parsing and validation, but doesn't include methods for execution, automatic logging, error handling, or serialization to YAML.
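
As a rough sketch of the difference, compare a plain Pydantic model with the MyStep example used elsewhere in this document (the import path for Step and StepOutput is assumed for this sketch):

from pydantic import BaseModel\n\nfrom koheesio.steps import Step, StepOutput  # import path assumed for this sketch\n\n\n# A plain Pydantic model: parsing and validation only\nclass MyConfig(BaseModel):\n    a: str\n\n\n# A Step: the same validation, plus an execute method with a validated Output, logging and error handling\nclass MyStep(Step):\n    a: str\n\n    class Output(StepOutput):\n        b: str\n\n    def execute(self):\n        self.output.b = f\"{self.a}-some-suffix\"\n\n\nMyConfig(a=\"hello\")       # only validates the input\nstep = MyStep(a=\"hello\")\nstep.execute()            # runs the logic, validates and logs the output\nprint(step.output.b)      # hello-some-suffix\n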

"},{"location":"reference/concepts/step.html#key-features-of-a-step","title":"Key Features of a Step","text":""},{"location":"reference/concepts/step.html#defining-a-step","title":"Defining a Step","text":"

To define a new step, you subclass the Step class and implement the execute method. The inputs of the step can be accessed using self.input_name. The output of the step can be accessed using self.output.output_name. For example:

class MyStep(Step):\n    input1: str = Field(...)\n    input2: int = Field(...)\n\n    class Output(StepOutput):\n        output1: str = Field(...)\n\n    def execute(self):\n        # Your logic here\n        self.output.output1 = \"result\"\n
"},{"location":"reference/concepts/step.html#running-a-step","title":"Running a Step","text":"

To run a step, you can call the execute method. You can also use the run method, which is an alias to execute. For example:

step = MyStep(input1=\"value1\", input2=2)\nstep.execute()\n
"},{"location":"reference/concepts/step.html#accessing-step-output","title":"Accessing Step Output","text":"

The output of a step can be accessed using self.output.output_name. For example:

step = MyStep(input1=\"value1\", input2=2)\nstep.execute()\nprint(step.output.output1)  # Outputs: \"result\"\n
"},{"location":"reference/concepts/step.html#serializing-a-step","title":"Serializing a Step","text":"

You can serialize a step to a YAML string using the to_yaml method. For example:

step = MyStep(input1=\"value1\", input2=2)\nyaml_str = step.to_yaml()\n
"},{"location":"reference/concepts/step.html#getting-step-description","title":"Getting Step Description","text":"

You can get the description of a step using the get_description method. For example:

step = MyStep(input1=\"value1\", input2=2)\ndescription = step.get_description()\n
"},{"location":"reference/concepts/step.html#defining-a-step-with-multiple-inputs-and-outputs","title":"Defining a Step with Multiple Inputs and Outputs","text":"

Here's an example of how to define a new step with multiple inputs and outputs:

class MyStep(Step):\n    input1: str = Field(...)\n    input2: int = Field(...)\n    input3: int = Field(...)\n\n    class Output(StepOutput):\n        output1: str = Field(...)\n        output2: int = Field(...)\n\n    def execute(self):\n        # Your logic here\n        self.output.output1 = \"result\"\n        self.output.output2 = self.input2 + self.input3\n
"},{"location":"reference/concepts/step.html#running-a-step-with-multiple-inputs","title":"Running a Step with Multiple Inputs","text":"

To run a step with multiple inputs, you can do the following:

step = MyStep(input1=\"value1\", input2=2, input3=3)\nstep.execute()\n
"},{"location":"reference/concepts/step.html#accessing-multiple-step-outputs","title":"Accessing Multiple Step Outputs","text":"

The outputs of a step can be accessed using self.output.output_name. For example:

step = MyStep(input1=\"value1\", input2=2, input3=3)\nstep.execute()\nprint(step.output.output1)  # Outputs: \"result\"\nprint(step.output.output2)  # Outputs: 5\n
"},{"location":"reference/concepts/step.html#special-features","title":"Special Features","text":""},{"location":"reference/concepts/step.html#the-execute-method","title":"The Execute method","text":"

The execute method in the Step class is automatically decorated with the StepMetaClass._execute_wrapper function due to the metaclass StepMetaClass. This provides several advantages:

  1. Automatic Output Validation: The decorator ensures that the output of the execute method is always a StepOutput instance. This means that the output is automatically validated against the defined output model, ensuring data integrity and consistency.

  2. Logging: The decorator provides automatic logging at the start and end of the execute method. This includes logging the input and output of the step, which can be useful for debugging and understanding the flow of data.

  3. Error Handling: If an error occurs during the execution of the Step, the decorator catches the exception and logs an error message before re-raising the exception. This provides a clear indication of where the error occurred.

  4. Simplifies Step Implementation: Since the decorator handles output validation, logging, and error handling, the user can focus on implementing the logic of the execute method without worrying about these aspects.

  5. Consistency: By automatically decorating the execute method, the library ensures that these features are consistently applied across all steps, regardless of who implements them or how they are used. This makes the behavior of steps predictable and consistent.

  6. Prevents Double Wrapping: The decorator checks if the function is already wrapped with StepMetaClass._execute_wrapper and prevents double wrapping. This ensures that the decorator doesn't interfere with itself if execute is overridden in subclasses.

Notice that you never have to explicitly return anything from the execute method. The StepMetaClass._execute_wrapper decorator takes care of that for you.
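
To make the automatic output validation concrete, here is a minimal sketch (reusing the Step and StepOutput classes used throughout this page): a step that declares an output field but forgets to set it. The exact exception type raised by the wrapper is an assumption and may differ per Koheesio version, but the output model is expected to fail validation rather than pass silently.

class IncompleteStep(Step):\n    class Output(StepOutput):\n        required_value: str\n\n    def execute(self):\n        # intentionally forget to set self.output.required_value\n        pass\n\nstep = IncompleteStep()\ntry:\n    step.execute()\nexcept Exception as err:  # exact exception type depends on the Koheesio version\n    print(f\"output validation failed: {err}\")\n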

Implementation examples of a custom metaclass that can be used to override the default behavior of StepMetaClass._execute_wrapper:

    class MyMetaClass(StepMetaClass):\n        @classmethod\n        def _log_end_message(cls, step: Step, skip_logging: bool = False, *args, **kwargs):\n            print(\"It's me from custom meta class\")\n            super()._log_end_message(step, skip_logging, *args, **kwargs)\n\n    class MyMetaClass2(StepMetaClass):\n        @classmethod\n        def _validate_output(cls, step: Step, skip_validating: bool = False, *args, **kwargs):\n            # always put a dummy value in the output\n            step.output.dummy_value = \"dummy\"\n\n    class YourClassWithCustomMeta(Step, metaclass=MyMetaClass):\n        def execute(self):\n            self.log.info(f\"This is from the execute method of {self.__class__.__name__}\")\n\n    class YourClassWithCustomMeta2(Step, metaclass=MyMetaClass2):\n        def execute(self):\n            self.log.info(f\"This is from the execute method of {self.__class__.__name__}\")\n
"},{"location":"reference/concepts/step.html#sparkstep","title":"SparkStep","text":"

The SparkStep class is a subclass of Step designed for steps that interact with Spark. It extends the Step class with SparkSession support: the spark property gives access to the active SparkSession instance. A SparkStep is expected to produce a Spark DataFrame as output, although this is optional.

"},{"location":"reference/concepts/step.html#using-a-sparkstep","title":"Using a SparkStep","text":"

Here's an example of how to use a SparkStep:

class MySparkStep(SparkStep):\n    input1: str = Field(...)\n\n    class Output(StepOutput):\n        output1: DataFrame = Field(...)\n\n    def execute(self):\n        # Your logic here\n        df = self.spark.read.text(self.input1)\n        self.output.output1 = df\n

To run a SparkStep, you can do the following:

step = MySparkStep(input1=\"path/to/textfile\")\nstep.execute()\n

To access the output of a SparkStep, you can do the following:

step = MySparkStep(input1=\"path/to/textfile\")\nstep.execute()\ndf = step.output.output1\ndf.show()\n
"},{"location":"reference/concepts/step.html#conclusion","title":"Conclusion","text":"

In this document, we've covered the key features of the Step class in the Koheesio framework, including its ability to define custom units of logic, manage inputs and outputs, and support for serialization. The automatic decoration of the execute method provides several advantages that simplify step implementation and ensure consistency across all steps.

Whether you're defining a new operation in your data pipeline or managing the flow of data between steps, Step provides a robust and efficient solution.

"},{"location":"reference/concepts/step.html#further-reading","title":"Further Reading","text":"

For more information, you can refer to the following resources:

  • Python Pydantic Documentation
  • Python YAML Documentation

Refer to the API documentation for more details on the Step class and its methods.

"},{"location":"reference/spark/readers.html","title":"Reader Module","text":"

The Reader module in Koheesio provides a set of classes for reading data from various sources. A Reader is a type of SparkStep that reads data from a source based on the input parameters and stores the result in self.output.df for subsequent steps.

"},{"location":"reference/spark/readers.html#what-is-a-reader","title":"What is a Reader?","text":"

A Reader is a subclass of SparkStep that reads data from a source and stores the result. The source could be a file, a database, a web API, or any other data source. The result is stored in a DataFrame, which is accessible through the df property of the Reader.

"},{"location":"reference/spark/readers.html#api-reference","title":"API Reference","text":"

See API Reference for a detailed description of the Reader class and its methods.

"},{"location":"reference/spark/readers.html#key-features-of-a-reader","title":"Key Features of a Reader","text":"
  1. Read Method: The Reader class provides a read method that calls the execute method and returns the resulting DataFrame. Essentially, calling .read() is a shorthand for running execute() and returning self.output.df. This allows you to read data from a Reader without having to call the execute method directly, which simplifies the usage of a Reader.

Here's an example of how to use the .read() method:

# Instantiate the Reader\nmy_reader = MyReader()\n\n# Use the .read() method to get the data as a DataFrame\ndf = my_reader.read()\n\n# Now df is a DataFrame with the data read by MyReader\n

In this example, MyReader is a subclass of Reader that you've defined. After creating an instance of MyReader, you call the .read() method to read the data and get it back as a DataFrame. The DataFrame df now contains the data read by MyReader.

  2. DataFrame Property: The Reader class provides a df property as a shorthand for accessing self.output.df. If self.output.df is None, the execute method is run first. This property ensures that the data is loaded and ready to be used, even if the execute method hasn't been explicitly called.

Here's an example of how to use the df property:

# Instantiate the Reader\nmy_reader = MyReader()\n\n# Use the df property to get the data as a DataFrame\ndf = my_reader.df\n\n# Now df is a DataFrame with the data read by MyReader\n

In this example, MyReader is a subclass of Reader that you've defined. After creating an instance of MyReader, you access the df property to get the data as a DataFrame. The DataFrame df now contains the data read by MyReader.

  3. SparkSession: Every Reader has a SparkSession available as self.spark. This is the currently active SparkSession, which can be used to perform distributed data processing tasks.

Here's an example of how to use the spark property:

# Instantiate the Reader\nmy_reader = MyReader()\n\n# Use the spark property to get the SparkSession\nspark = my_reader.spark\n\n# Now spark is the SparkSession associated with MyReader\n

In this example, MyReader is a subclass of Reader that you've defined. After creating an instance of MyReader, you access the spark property to get the SparkSession. The SparkSession spark can now be used to perform distributed data processing tasks.

"},{"location":"reference/spark/readers.html#how-to-define-a-reader","title":"How to Define a Reader?","text":"

To define a Reader, you create a subclass of the Reader class and implement the execute method. The execute method should read from the source and store the result in self.output.df. This is an abstract method, which means it must be implemented in any subclass of Reader.

Here's an example of a Reader:

class MyReader(Reader):\n  def execute(self):\n    # read data from source\n    data = read_from_source()\n    # store result in self.output.df\n    self.output.df = data\n
"},{"location":"reference/spark/readers.html#understanding-inheritance-in-readers","title":"Understanding Inheritance in Readers","text":"

Just like a Step, a Reader is defined as a subclass that inherits from the base Reader class. This means it inherits all the properties and methods from the Reader class and can add or override them as needed. The main method that needs to be overridden is the execute method, which should implement the logic for reading data from the source and storing it in self.output.df.

"},{"location":"reference/spark/readers.html#benefits-of-using-readers-in-data-pipelines","title":"Benefits of Using Readers in Data Pipelines","text":"

Using Reader classes in your data pipelines has several benefits:

  1. Simplicity: Readers abstract away the details of reading data from various sources, allowing you to focus on the logic of your pipeline.

  2. Consistency: By using Readers, you ensure that data is read in a consistent manner across different parts of your pipeline.

  3. Flexibility: Readers can be easily swapped out for different data sources without changing the rest of your pipeline.

  4. Efficiency: Readers automatically manage resources like connections and file handles, ensuring efficient use of resources.

By using the concept of a Reader, you can create data pipelines that are simple, consistent, flexible, and efficient.

"},{"location":"reference/spark/readers.html#examples-of-reader-classes-in-koheesio","title":"Examples of Reader Classes in Koheesio","text":"

Koheesio provides a variety of Reader subclasses for reading data from different sources. Here are just a few examples:

  1. Teradata Reader: A Reader subclass for reading data from Teradata databases. It's defined in the koheesio/steps/readers/teradata.py file.

  2. Snowflake Reader: A Reader subclass for reading data from Snowflake databases. It's defined in the koheesio/steps/readers/snowflake.py file.

  3. Box Reader: A Reader subclass for reading data from Box. It's defined in the koheesio/steps/integrations/box.py file.

These are just a few examples of the many Reader subclasses available in Koheesio. Each Reader subclass is designed to read data from a specific source. They all inherit from the base Reader class and implement the execute method to read data from their respective sources and store it in self.output.df.

Please note that this is not an exhaustive list. Koheesio provides many more Reader subclasses for a wide range of data sources. For a complete list, please refer to the Koheesio documentation or the source code.

More readers can be found in the koheesio/steps/readers module.

"},{"location":"reference/spark/transformations.html","title":"Transformation Module","text":"

The Transformation module in Koheesio provides a set of classes for transforming data within a DataFrame. A Transformation is a type of SparkStep that takes a DataFrame as input, applies a transformation, and returns a DataFrame as output. The transformation logic is implemented in the execute method of each Transformation subclass.

"},{"location":"reference/spark/transformations.html#what-is-a-transformation","title":"What is a Transformation?","text":"

A Transformation is a subclass of SparkStep that applies a transformation to a DataFrame and stores the result. The transformation could be any operation that modifies the data or structure of the DataFrame, such as adding a new column, filtering rows, or aggregating data.

Using Transformation classes ensures that data is transformed in a consistent manner across different parts of your pipeline. This can help avoid errors and make your code easier to understand and maintain.

"},{"location":"reference/spark/transformations.html#api-reference","title":"API Reference","text":"

See API Reference for a detailed description of the Transformation classes and their methods.

"},{"location":"reference/spark/transformations.html#types-of-transformations","title":"Types of Transformations","text":"

There are three main types of transformations in Koheesio:

  1. Transformation: This is the base class for all transformations. It takes a DataFrame as input and returns a DataFrame as output. The transformation logic is implemented in the execute method.

  2. ColumnsTransformation: This is an extended Transformation class with a preset validator for handling column(s) data. It standardizes the input for a single column or multiple columns. If more than one column is passed, the transformation will be run in a loop against all the given columns.

  3. ColumnsTransformationWithTarget: This is an extended ColumnsTransformation class with an additional target_column field. This field can be used to store the result of the transformation in a new column. If the target_column is not provided, the result will be stored in the source column.

Each type of transformation has its own use cases and advantages. The right one to use depends on the specific requirements of your data pipeline.

"},{"location":"reference/spark/transformations.html#how-to-define-a-transformation","title":"How to Define a Transformation","text":"

To define a Transformation, you create a subclass of the Transformation class and implement the execute method. The execute method should take a DataFrame from self.input.df, apply a transformation, and store the result in self.output.df.

Transformation classes abstract away some of the details of transforming data, allowing you to focus on the logic of your pipeline. This can make your code cleaner and easier to read.

Here's an example of a Transformation:

class MyTransformation(Transformation):\n    def execute(self):\n        # get data from self.input.df\n        data = self.input.df\n        # apply transformation\n        transformed_data = apply_transformation(data)\n        # store result in self.output.df\n        self.output.df = transformed_data\n

In this example, MyTransformation is a subclass of Transformation that you've defined. The execute method gets the data from self.input.df, applies a transformation called apply_transformation (undefined in this example), and stores the result in self.output.df.

"},{"location":"reference/spark/transformations.html#how-to-define-a-columnstransformation","title":"How to Define a ColumnsTransformation","text":"

To define a ColumnsTransformation, you create a subclass of the ColumnsTransformation class and implement the execute method. The execute method should apply a transformation to the specified columns of the DataFrame.

ColumnsTransformation classes can be easily swapped out for different data transformations without changing the rest of your pipeline. This can make your pipeline more flexible and easier to modify or extend.

Here's an example of a ColumnsTransformation:

from pyspark.sql import functions as f\nfrom koheesio.steps.transformations import ColumnsTransformation\n\nclass AddOne(ColumnsTransformation):\n    def execute(self):\n        df = self.df\n        # accumulate the changes so every selected column gets +1, not just the last one\n        for column in self.get_columns():\n            df = df.withColumn(column, f.col(column) + 1)\n        self.output.df = df\n

In this example, AddOne is a subclass of ColumnsTransformation that you've defined. The execute method adds 1 to each column in self.get_columns().

The ColumnsTransformation class has a ColumnConfig class that can be used to configure the behavior of the class. This class has the following fields:

  • run_for_all_data_type: Allows the transformation to run against all columns of a given data type.
  • limit_data_type: Limits the transformation to one or more specific data types.
  • data_type_strict_mode: Toggles strict mode for data type validation. This only takes effect when limit_data_type is set.

Note that data types need to be specified as a SparkDatatype enum. Users should not have to interact with the ColumnConfig class directly.
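
For authors of custom transformations, here is a minimal sketch of how such a configuration could look. The nested-class override pattern and the import path for SparkDatatype are assumptions and may differ per Koheesio version:

from pyspark.sql import functions as f\nfrom koheesio.steps.transformations import ColumnsTransformation\nfrom koheesio.utils import SparkDatatype  # import path is an assumption\n\nclass AddOneToIntegers(ColumnsTransformation):\n    class ColumnConfig(ColumnsTransformation.ColumnConfig):\n        # run against every integer column when no columns are given\n        run_for_all_data_type = [SparkDatatype.INTEGER]\n        # limit the transformation to integer columns only\n        limit_data_type = [SparkDatatype.INTEGER]\n        # raise instead of silently skipping columns of other types\n        data_type_strict_mode = True\n\n    def execute(self):\n        df = self.df\n        for column in self.get_columns():\n            df = df.withColumn(column, f.col(column) + 1)\n        self.output.df = df\n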

"},{"location":"reference/spark/transformations.html#how-to-define-a-columnstransformationwithtarget","title":"How to Define a ColumnsTransformationWithTarget","text":"

To define a ColumnsTransformationWithTarget, you create a subclass of the ColumnsTransformationWithTarget class and implement the func method. The func method should return the transformation that will be applied to the column(s). The execute method, which is already preset, will use the get_columns_with_target method to loop over all the columns and apply this function to transform the DataFrame.

Here's an example of a ColumnsTransformationWithTarget:

from pyspark.sql import Column\nfrom koheesio.steps.transformations import ColumnsTransformationWithTarget\n\nclass AddOneWithTarget(ColumnsTransformationWithTarget):\n    def func(self, col: Column):\n        return col + 1\n

In this example, AddOneWithTarget is a subclass of ColumnsTransformationWithTarget that you've defined. The func method adds 1 to the values of a given column.

The ColumnsTransformationWithTarget class has an additional target_column field. This field can be used to store the result of the transformation in a new column. If more than one column is passed, the target_column is used as a suffix on each source column name. If the target_column is not provided, the result is stored in (and overwrites) the source column(s).

The ColumnsTransformationWithTarget class also has a get_columns_with_target method. This method returns an iterator of the columns and handles the target_column as well.
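
As a usage sketch for the AddOneWithTarget class above (an active SparkSession named spark is assumed; the columns argument name and the resulting suffixed column names follow the description above rather than a verified API):

df = spark.createDataFrame([(1, 10), (2, 20)], [\"a\", \"b\"])\n\n# transform columns \"a\" and \"b\"; with target_column set and multiple columns passed,\n# the result for each column is expected to land in a suffixed column such as \"a_plus_one\"\nstep = AddOneWithTarget(df=df, columns=[\"a\", \"b\"], target_column=\"plus_one\")\noutput_df = step.execute().df\n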

"},{"location":"reference/spark/transformations.html#key-features-of-a-transformation","title":"Key Features of a Transformation","text":"
  1. Execute Method: The Transformation class provides an execute method to implement in your subclass. This method should take a DataFrame from self.input.df, apply a transformation, and store the result in self.output.df.

    For ColumnsTransformation and ColumnsTransformationWithTarget, the execute method is already implemented in the base class. Instead of overriding execute, you implement a func method in your subclass. This func method should return the transformation to be applied to each column. The execute method will then apply this func to each column in a loop.

  2. DataFrame Property: The Transformation class provides a df property as a shorthand for accessing self.input.df. This property ensures that the data is ready to be transformed, even if the execute method hasn't been explicitly called. This is useful for 'early validation' of the input data.

  3. SparkSession: Every Transformation has a SparkSession available as self.spark. This is the currently active SparkSession, which can be used to perform distributed data processing tasks.

  4. Columns Property: The ColumnsTransformation and ColumnsTransformationWithTarget classes provide a columns property. This property standardizes the input for a single column or multiple columns. If more than one column is passed, the transformation will be run in a loop against all the given columns.

  5. Target Column Property: The ColumnsTransformationWithTarget class provides a target_column property. This field can be used to store the result of the transformation in a new column. If the target_column is not provided, the result will be stored in the source column.

"},{"location":"reference/spark/transformations.html#examples-of-transformation-classes-in-koheesio","title":"Examples of Transformation Classes in Koheesio","text":"

Koheesio provides a variety of Transformation subclasses for transforming data in different ways. Here are some examples:

  • DataframeLookup: This transformation joins two dataframes together based on a list of join mappings. It allows you to specify the join type and join hint, and it supports selecting specific target columns from the right dataframe.

    Here's an example of how to use the DataframeLookup transformation:

    from pyspark.sql import SparkSession\nfrom koheesio.steps.transformations import DataframeLookup, JoinMapping, TargetColumn, JoinType\n\nspark = SparkSession.builder.getOrCreate()\nleft_df = spark.createDataFrame([(1, \"A\"), (2, \"B\")], [\"id\", \"value\"])\nright_df = spark.createDataFrame([(1, \"A\"), (3, \"C\")], [\"id\", \"value\"])\n\nlookup = DataframeLookup(\n    df=left_df,\n    other=right_df,\n    on=JoinMapping(source_column=\"id\", other_column=\"id\"),\n    targets=TargetColumn(target_column=\"value\", target_column_alias=\"right_value\"),\n    how=JoinType.LEFT,\n)\n\noutput_df = lookup.execute().df\n
  • HashUUID5: This transformation is a subclass of Transformation and provides an interface to generate a UUID5 hash for each row in the DataFrame. The hash is generated based on the values of the specified source columns.

    Here's an example of how to use the HashUUID5 transformation:

    from pyspark.sql import SparkSession\nfrom koheesio.steps.transformations import HashUUID5\n\nspark = SparkSession.builder.getOrCreate()\ndf = spark.createDataFrame([(1, \"A\"), (2, \"B\")], [\"id\", \"value\"])\n\nhash_transform = HashUUID5(\n    df=df,\n    source_columns=[\"id\", \"value\"],\n    target_column=\"hash\"\n)\n\noutput_df = hash_transform.execute().df\n

In this example, HashUUID5 is a subclass of Transformation. After creating an instance of HashUUID5, you call the execute method to apply the transformation. The execute method generates a UUID5 hash for each row in the DataFrame based on the values of the id and value columns and stores the result in a new column named hash.

"},{"location":"reference/spark/transformations.html#benefits-of-using-koheesio-transformations","title":"Benefits of using Koheesio Transformations","text":"

Using a Koheesio Transformation over plain Spark provides several benefits:

  1. Consistency: By using Transformation classes, you ensure that data is transformed in a consistent manner across different parts of your pipeline. This can help avoid errors and make your code easier to understand and maintain.

  2. Abstraction: Transformation classes abstract away the details of transforming data, allowing you to focus on the logic of your pipeline. This can make your code cleaner and easier to read.

  3. Flexibility: Transformation classes can be easily swapped out for different data transformations without changing the rest of your pipeline. This can make your pipeline more flexible and easier to modify or extend.

  4. Early Input Validation: As a Transformation is a type of SparkStep, which in turn is a Step and a type of Pydantic BaseModel, all inputs are validated when an instance of a Transformation class is created. This early validation helps catch errors related to invalid input, such as an invalid column name, before the PySpark pipeline starts executing. This can help avoid unnecessary computation and make your data pipelines more robust and reliable.

  5. Ease of Testing: Transformation classes are designed to be easily testable. This can make it easier to write unit tests for your data pipeline, helping to ensure its correctness and reliability.

  6. Robustness: Koheesio has been extensively tested with hundreds of unit tests, ensuring that the Transformation classes work as expected under a wide range of conditions. This makes your data pipelines more robust and less likely to fail due to unexpected inputs or edge cases.

By using the concept of a Transformation, you can create data pipelines that are simple, consistent, flexible, efficient, and reliable.

"},{"location":"reference/spark/transformations.html#advanced-usage-of-transformations","title":"Advanced Usage of Transformations","text":"

Transformations can be combined and chained together to create complex data processing pipelines. Here's an example of how to chain transformations:

from pyspark.sql import SparkSession\nfrom koheesio.steps.transformations import HashUUID5\nfrom koheesio.steps.transformations import DataframeLookup, JoinMapping, TargetColumn, JoinType\n\n# Create a SparkSession\nspark = SparkSession.builder.getOrCreate()\n\n# Define two DataFrames\ndf1 = spark.createDataFrame([(1, \"A\"), (2, \"B\")], [\"id\", \"value\"])\ndf2 = spark.createDataFrame([(1, \"C\"), (3, \"D\")], [\"id\", \"value\"])\n\n# Define the first transformation\nlookup = DataframeLookup(\n    other=df2,\n    on=JoinMapping(source_column=\"id\", other_column=\"id\"),\n    targets=TargetColumn(target_column=\"value\", target_column_alias=\"right_value\"),\n    how=JoinType.LEFT,\n)\n\n# Apply the first transformation\noutput_df = lookup.transform(df1)\n\n# Define the second transformation\nhash_transform = HashUUID5(\n    source_columns=[\"id\", \"value\", \"right_value\"],\n    target_column=\"hash\"\n)\n\n# Apply the second transformation\noutput_df2 = hash_transform.transform(output_df)\n

In this example, DataframeLookup is a subclass of ColumnsTransformation and HashUUID5 is a subclass of Transformation. After creating instances of DataframeLookup and HashUUID5, you call the transform method to apply each transformation. The transform method of DataframeLookup performs a left join with df2 on the id column and adds the value column from df2 to the result DataFrame as right_value. The transform method of HashUUID5 generates a UUID5 hash for each row in the DataFrame based on the values of the id, value, and right_value columns and stores the result in a new column named hash.

"},{"location":"reference/spark/transformations.html#troubleshooting-transformations","title":"Troubleshooting Transformations","text":"

If you encounter an error when using a transformation, here are some steps you can take to troubleshoot:

  1. Check the Input Data: Make sure the input DataFrame to the transformation is correct. You can use the show method of the DataFrame to print the first few rows of the DataFrame.

  2. Check the Transformation Parameters: Make sure the parameters passed to the transformation are correct. For example, if you're using a DataframeLookup, make sure the join mappings and target columns are correctly specified.

  3. Check the Transformation Logic: If the input data and parameters are correct, there might be an issue with the transformation logic. You can use PySpark's logging utilities to log intermediate results and debug the transformation logic.

  4. Check the Output Data: If the transformation executes without errors but the output data is not as expected, you can use the show method of the DataFrame to print the first few rows of the output DataFrame. This can help you identify any issues with the transformation logic.

"},{"location":"reference/spark/transformations.html#conclusion","title":"Conclusion","text":"

The Transformation module in Koheesio provides a powerful and flexible way to transform data in a DataFrame. By using Transformation classes, you can create data pipelines that are simple, consistent, flexible, efficient, and reliable. Whether you're performing simple transformations like adding a new column, or complex transformations like joining multiple DataFrames, the Transformation module has you covered.

"},{"location":"reference/spark/writers.html","title":"Writer Module","text":"

The Writer module in Koheesio provides a set of classes for writing data to various destinations. A Writer is a type of SparkStep that takes data from self.input.df and writes it to a destination based on the output parameters.

"},{"location":"reference/spark/writers.html#what-is-a-writer","title":"What is a Writer?","text":"

A Writer is a subclass of SparkStep that writes data to a destination. The data to be written is taken from a DataFrame, which is accessible through the df property of the Writer.

"},{"location":"reference/spark/writers.html#how-to-define-a-writer","title":"How to Define a Writer?","text":"

To define a Writer, you create a subclass of the Writer class and implement the execute method. The execute method should take data from self.input.df and write it to the destination.

Here's an example of a Writer:

class MyWriter(Writer):\n    def execute(self):\n        # get data from self.input.df\n        data = self.input.df\n        # write data to destination\n        write_to_destination(data)\n
"},{"location":"reference/spark/writers.html#key-features-of-a-writer","title":"Key Features of a Writer","text":"
  1. Write Method: The Writer class provides a write method that calls the execute method and writes the data to the destination. Essentially, calling .write() is a shorthand for calling .execute(): it runs the step and performs the write. This allows you to write data with a Writer without having to call the execute method directly, which simplifies the usage of a Writer.

    Here's an example of how to use the .write() method:

    # Instantiate the Writer\nmy_writer = MyWriter()\n\n# Use the .write() method to write the data\nmy_writer.write()\n\n# The data from MyWriter's DataFrame is now written to the destination\n

    In this example, MyWriter is a subclass of Writer that you've defined. After creating an instance of MyWriter, you call the .write() method to write the data to the destination. The data from MyWriter's DataFrame is now written to the destination.

  2. DataFrame Property: The Writer class provides a df property as a shorthand for accessing self.input.df. This property ensures that the data is ready to be written, even if the execute method hasn't been explicitly called.

    Here's an example of how to use the df property:

    # Instantiate the Writer\nmy_writer = MyWriter()\n\n# Use the df property to get the data as a DataFrame\ndf = my_writer.df\n\n# Now df is a DataFrame with the data that will be written by MyWriter\n

    In this example, MyWriter is a subclass of Writer that you've defined. After creating an instance of MyWriter, you access the df property to get the data as a DataFrame. The DataFrame df now contains the data that will be written by MyWriter.

  3. SparkSession: Every Writer has a SparkSession available as self.spark. This is the currently active SparkSession, which can be used to perform distributed data processing tasks.

    Here's an example of how to use the spark property:

    # Instantiate the Writer\nmy_writer = MyWriter()\n\n# Use the spark property to get the SparkSession\nspark = my_writer.spark\n\n# Now spark is the SparkSession associated with MyWriter\n

    In this example, MyWriter is a subclass of Writer that you've defined. After creating an instance of MyWriter, you access the spark property to get the SparkSession. The SparkSession spark can now be used to perform distributed data processing tasks.

"},{"location":"reference/spark/writers.html#understanding-inheritance-in-writers","title":"Understanding Inheritance in Writers","text":"

Just like a Step, a Writer is defined as a subclass that inherits from the base Writer class. This means it inherits all the properties and methods from the Writer class and can add or override them as needed. The main method that needs to be overridden is the execute method, which should implement the logic for writing data from self.input.df to the destination.

"},{"location":"reference/spark/writers.html#examples-of-writer-classes-in-koheesio","title":"Examples of Writer Classes in Koheesio","text":"

Koheesio provides a variety of Writer subclasses for writing data to different destinations. Here are just a few examples:

  • BoxFileWriter
  • DeltaTableStreamWriter
  • DeltaTableWriter
  • DummyWriter
  • ForEachBatchStreamWriter
  • KafkaWriter
  • SnowflakeWriter
  • StreamWriter

Please note that this is not an exhaustive list. Koheesio provides many more Writer subclasses for a wide range of data destinations. For a complete list, please refer to the Koheesio documentation or the source code.

"},{"location":"reference/spark/writers.html#benefits-of-using-writers-in-data-pipelines","title":"Benefits of Using Writers in Data Pipelines","text":"

Using Writer classes in your data pipelines has several benefits:

  1. Simplicity: Writers abstract away the details of writing data to various destinations, allowing you to focus on the logic of your pipeline.
  2. Consistency: By using Writers, you ensure that data is written in a consistent manner across different parts of your pipeline.
  3. Flexibility: Writers can be easily swapped out for different data destinations without changing the rest of your pipeline.
  4. Efficiency: Writers automatically manage resources like connections and file handles, ensuring efficient use of resources.
  5. Early Input Validation: As a Writer is a type of SparkStep, which in turn is a Step and a type of Pydantic BaseModel, all inputs are validated when an instance of a Writer class is created. This early validation helps catch errors related to invalid input, such as an invalid URL for a database, before the PySpark pipeline starts executing. This can help avoid unnecessary computation and make your data pipelines more robust and reliable.

By using the concept of a Writer, you can create data pipelines that are simple, consistent, flexible, efficient, and reliable.

"},{"location":"tutorials/advanced-data-processing.html","title":"Advanced Data Processing with Koheesio","text":"

In this guide, we will explore some advanced data processing techniques using Koheesio. We will cover topics such as complex transformations, handling large datasets, and optimizing performance.

"},{"location":"tutorials/advanced-data-processing.html#complex-transformations","title":"Complex Transformations","text":"

Koheesio provides a variety of built-in transformations, but sometimes you may need to perform more complex operations on your data. In such cases, you can create custom transformations.

Here's an example of a custom transformation that normalizes a column in a DataFrame:

from pyspark.sql import DataFrame\nfrom koheesio.spark.transformations.transform import Transform\n\ndef normalize_column(df: DataFrame, column: str) -> DataFrame:\n    max_value = df.agg({column: \"max\"}).collect()[0][0]\n    min_value = df.agg({column: \"min\"}).collect()[0][0]\n    return df.withColumn(column, (df[column] - min_value) / (max_value - min_value))\n\n\nclass NormalizeColumnTransform(Transform):\n    column: str\n\n    def transform(self, df: DataFrame) -> DataFrame:\n        return normalize_column(df, self.column)\n
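
A usage sketch for this transform follows (the sample DataFrame and column name are hypothetical, and the constructor and transform signature simply follow the class definition above; whether the base Transform class expects additional constructor arguments depends on the Koheesio version):

from pyspark.sql import SparkSession\n\nspark = SparkSession.builder.getOrCreate()\ndf = spark.createDataFrame([(1, 1000.0), (2, 2000.0), (3, 3000.0)], [\"id\", \"salary\"])\n\n# scale the \"salary\" column to the [0, 1] range\nnormalize = NormalizeColumnTransform(column=\"salary\")\nnormalized_df = normalize.transform(df)\nnormalized_df.show()\n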
"},{"location":"tutorials/advanced-data-processing.html#handling-large-datasets","title":"Handling Large Datasets","text":"

When working with large datasets, it's important to manage resources effectively to ensure good performance. Koheesio provides several features to help with this.

"},{"location":"tutorials/advanced-data-processing.html#partitioning","title":"Partitioning","text":"

Partitioning is a technique that divides your data into smaller, more manageable pieces, called partitions. Koheesio allows you to specify the partitioning scheme for your data when writing it to a target.

from koheesio.steps.writers.delta import DeltaTableWriter\nfrom koheesio.tasks.etl_task import EtlTask\n\nclass MyTask(EtlTask):\n    target = DeltaTableWriter(table=\"my_table\", partitionBy=[\"column1\", \"column2\"])\n
"},{"location":"tutorials/getting-started.html","title":"Getting Started with Koheesio","text":""},{"location":"tutorials/getting-started.html#requirements","title":"Requirements","text":"
  • Python 3.9+
"},{"location":"tutorials/getting-started.html#installation","title":"Installation","text":""},{"location":"tutorials/getting-started.html#poetry","title":"Poetry","text":"

If you're using Poetry, add the following entry to the pyproject.toml file:

pyproject.toml
[[tool.poetry.source]]\nname = \"nike\"\nurl = \"https://artifactory.nike.com/artifactory/api/pypi/python-virtual/simple\"\nsecondary = true\n
poetry add koheesio\n
"},{"location":"tutorials/getting-started.html#pip","title":"pip","text":"

If you're using pip, run the following command to install Koheesio:

Requires pip.

pip install koheesio\n
"},{"location":"tutorials/getting-started.html#basic-usage","title":"Basic Usage","text":"

Once you've installed Koheesio, you can start using it in your Python scripts. Here's a basic example:

from koheesio import Step\n\n# Define a step\nclass MyStep(Step):\n    def execute(self):\n        ...  # Your step logic here\n\n# Create an instance of the step\nstep = MyStep()\n\n# Run the step\nstep.execute()\n
"},{"location":"tutorials/getting-started.html#advanced-usage","title":"Advanced Usage","text":"
from pyspark.sql.functions import lit\nfrom pyspark.sql import DataFrame, SparkSession\n\n# Step 1: import Koheesio dependencies\nfrom koheesio.context import Context\nfrom koheesio.steps.readers.dummy import DummyReader\nfrom koheesio.steps.transformations.camel_to_snake import CamelToSnakeTransformation\nfrom koheesio.steps.writers.dummy import DummyWriter\nfrom koheesio.tasks.etl_task import EtlTask\n\n# Step 2: Set up a SparkSession\nspark = SparkSession.builder.getOrCreate()\n\n# Step 3: Configure your Context\ncontext = Context({\n    \"source\": DummyReader(),\n    \"transformations\": [CamelToSnakeTransformation()],\n    \"target\": DummyWriter(),\n    \"my_favorite_movie\": \"inception\",\n})\n\n# Step 4: Create a Task\nclass MyFavoriteMovieTask(EtlTask):\n    my_favorite_movie: str\n\n    def transform(self, df: DataFrame = None) -> DataFrame:\n        df = df.withColumn(\"MyFavoriteMovie\", lit(self.my_favorite_movie))\n        return super().transform(df)\n\n# Step 5: Run your Task\ntask = MyFavoriteMovieTask(**context)\ntask.run()\n
"},{"location":"tutorials/getting-started.html#contributing","title":"Contributing","text":"

If you want to contribute to Koheesio, check out the CONTRIBUTING.md file in this repository. It contains guidelines for contributing, including how to submit issues and pull requests.

"},{"location":"tutorials/getting-started.html#testing","title":"Testing","text":"

To run the tests for Koheesio, use the following command:

make dev-test\n

This will run all the tests in the tests directory.

"},{"location":"tutorials/hello-world.html","title":"Simple Examples","text":""},{"location":"tutorials/hello-world.html#creating-a-custom-step","title":"Creating a Custom Step","text":"

This example demonstrates how to use the SparkStep class from the koheesio library to create a custom step named HelloWorldStep.

"},{"location":"tutorials/hello-world.html#code","title":"Code","text":"
from koheesio.steps.step import SparkStep\n\nclass HelloWorldStep(SparkStep):\n    message: str\n\n    def execute(self) -> SparkStep.Output:\n        # create a DataFrame with a single row containing the message\n        self.output.df = self.spark.createDataFrame([(1, self.message)], [\"id\", \"message\"])\n
"},{"location":"tutorials/hello-world.html#usage","title":"Usage","text":"
hello_world_step = HelloWorldStep(message=\"Hello, World!\")\nhello_world_step.execute()\n\nhello_world_step.output.df.show()\n
"},{"location":"tutorials/hello-world.html#understanding-the-code","title":"Understanding the Code","text":"

The HelloWorldStep class is a SparkStep in Koheesio, designed to generate a DataFrame with a single row containing a custom message. Here's a more detailed overview:

  • HelloWorldStep inherits from SparkStep, a fundamental building block in Koheesio for creating data processing steps with Apache Spark.
  • It has a message attribute. When creating an instance of HelloWorldStep, you can pass a custom message that will be used in the DataFrame.
  • SparkStep has a spark attribute, which is the active SparkSession. This is the entry point for any Spark functionality, allowing the step to interact with the Spark cluster.
  • SparkStep also includes an Output class, used to store the output of the step. In this case, Output has a df attribute to store the output DataFrame.
  • The execute method creates a DataFrame with the custom message and stores it in output.df. It doesn't return a value explicitly; instead, the output DataFrame can be accessed via output.df.
  • Koheesio uses pydantic for automatic validation of the step's input and output, ensuring they are correctly defined and of the correct types.

Note: Pydantic is a data validation library that provides a way to validate that the data (in this case, the input and output of the step) conforms to the expected format.

"},{"location":"tutorials/hello-world.html#creating-a-custom-task","title":"Creating a Custom Task","text":"

This example demonstrates how to use the EtlTask from the koheesio library to create a custom task named MyFavoriteMovieTask.

"},{"location":"tutorials/hello-world.html#code_1","title":"Code","text":"
from typing import Any, Optional\nfrom pyspark.sql import DataFrame, functions as f\nfrom koheesio.steps.transformations import Transform\nfrom koheesio.tasks.etl_task import EtlTask\n\n\ndef add_column(df: DataFrame, target_column: str, value: Any):\n    return df.withColumn(target_column, f.lit(value))\n\n\nclass MyFavoriteMovieTask(EtlTask):\n    my_favorite_movie: str\n\n    def transform(self, df: Optional[DataFrame] = None) -> DataFrame:\n        df = df or self.extract()\n\n        # pre-transformations specific to this class\n        pre_transformations = [\n            Transform(add_column, target_column=\"myFavoriteMovie\", value=self.my_favorite_movie)\n        ]\n\n        # execute transformations one by one\n        for t in pre_transformations:\n            df = t.transform(df)\n\n        self.output.transform_df = df\n        return df\n
"},{"location":"tutorials/hello-world.html#configuration","title":"Configuration","text":"

Here is the sample.yaml configuration file used in this example:

raw_layer:\n  catalog: development\n  schema: my_favorite_team\n  table: some_random_table\nmovies:\n  favorite: Office Space\nhash_settings:\n  source_columns:\n  - id\n  - foo\n  target_column: hash_uuid5\nsource:\n  range: 4\n
"},{"location":"tutorials/hello-world.html#usage_1","title":"Usage","text":"
from pyspark.sql import SparkSession\nfrom koheesio.context import Context\nfrom koheesio.steps.readers import DummyReader\nfrom koheesio.steps.writers.dummy import DummyWriter\n\ncontext = Context.from_yaml(\"sample.yaml\")\n\nSparkSession.builder.getOrCreate()\n\nmy_fav_mov_task = MyFavoriteMovieTask(\n    source=DummyReader(**context.raw_layer),\n    target=DummyWriter(truncate=False),\n    my_favorite_movie=context.movies.favorite,\n)\nmy_fav_mov_task.execute()\n
"},{"location":"tutorials/hello-world.html#understanding-the-code_1","title":"Understanding the Code","text":"

This example creates a MyFavoriteMovieTask that adds a column named myFavoriteMovie to the DataFrame. The value for this column is provided when the task is instantiated.

The MyFavoriteMovieTask class is a custom task that extends the EtlTask from the koheesio library. It demonstrates how to add a custom transformation to a DataFrame. Here's a detailed breakdown:

  • MyFavoriteMovieTask inherits from EtlTask, a base class in Koheesio for creating Extract-Transform-Load (ETL) tasks with Apache Spark.

  • It has a my_favorite_movie attribute. When creating an instance of MyFavoriteMovieTask, you can pass a custom movie title that will be used in the DataFrame.

  • The transform method is where the main logic of the task is implemented. It first extracts the data (if not already provided), then applies a series of transformations to the DataFrame.

  • In this case, the transformation is adding a new column to the DataFrame named myFavoriteMovie, with the value set to the my_favorite_movie attribute. This is done using the add_column function and the Transform class from Koheesio.

  • The transformed DataFrame is then stored in self.output.transform_df.

  • The sample.yaml configuration file is used to provide the context for the task, including the source data and the favorite movie title.

  • In the usage example, an instance of MyFavoriteMovieTask is created with a DummyReader as the source, a DummyWriter as the target, and the favorite movie title from the context. The task is then executed, which runs the transformations and stores the result in self.output.transform_df.

"},{"location":"tutorials/learn-koheesio.html","title":"Learn Koheesio","text":"

Koheesio is designed to simplify the development of data engineering pipelines. It provides a structured way to define and execute data processing tasks, making it easier to build, test, and maintain complex data workflows.

"},{"location":"tutorials/learn-koheesio.html#core-concepts","title":"Core Concepts","text":"

Koheesio is built around several core concepts:

  • Step: The fundamental unit of work in Koheesio. It represents a single operation in a data pipeline, taking in inputs and producing outputs.

    See the Step documentation for more information.

  • Context: A configuration class used to set up the environment for a Task. It can be used to share variables across tasks and adapt the behavior of a Task based on its environment.

    See the Context documentation for more information.

  • Logger: A class for logging messages at different levels.

    See the Logger documentation for more information.

The Logger and Context classes provide support, enabling detailed logging of the pipeline's execution and customization of the pipeline's behavior based on the environment, respectively.

"},{"location":"tutorials/learn-koheesio.html#implementations","title":"Implementations","text":"

In the context of Koheesio, an implementation refers to a specific way of executing Steps, the fundamental units of work in Koheesio. Each implementation uses a different technology or approach to process data and comes with its own set of Steps designed for that technology.

For example, the Spark implementation includes Steps for reading data from a Spark DataFrame, transforming the data using Spark operations, and writing the data to a Spark-supported destination.

Currently, Koheesio supports two implementations: Spark and AsyncIO.

"},{"location":"tutorials/learn-koheesio.html#spark","title":"Spark","text":"

Requires: Apache Spark (pyspark); Installation: pip install koheesio[spark]; Module: koheesio.spark

This implementation uses Apache Spark, a powerful open-source unified analytics engine for large-scale data processing.

Steps that use this implementation can leverage Spark's capabilities for distributed data processing, making it suitable for handling large volumes of data. The Spark implementation includes the following types of Steps:

  • Reader: from koheesio.spark.readers import Reader A type of Step that reads data from a source and stores the result (to make it available for subsequent steps). For more information, see the Reader documentation.

  • Writer: from koheesio.spark.writers import Writer This controls how data is written to the output in both batch and streaming contexts. For more information, see the Writer documentation.

  • Transformation: from koheesio.spark.transformations import Transformation A type of Step that takes a DataFrame as input and returns a DataFrame as output. For more information, see the Transformation documentation.

In any given pipeline, you can expect to use Readers, Writers, and Transformations to express the ETL logic. Readers are responsible for extracting data from various sources, such as databases, files, or APIs. Transformations then process this data, performing operations like filtering, aggregation, or conversion. Finally, Writers handle the loading of the transformed data to the desired destination, which could be a database, a file, or a data stream.

"},{"location":"tutorials/learn-koheesio.html#async","title":"Async","text":"

Module: koheesio.asyncio

This implementation uses Python's asyncio library for writing single-threaded concurrent code using coroutines, multiplexing I/O access over sockets and other resources, running network clients and servers, and other related primitives. Steps that use this implementation can perform data processing tasks asynchronously, which can be beneficial for IO-bound tasks.

"},{"location":"tutorials/learn-koheesio.html#best-practices","title":"Best Practices","text":"

Here are some best practices for using Koheesio:

  1. Use Context: The Context class in Koheesio is designed to behave like a dictionary, but with added features. It's a good practice to use Context to customize the behavior of a task. This allows you to share variables across tasks and adapt the behavior of a task based on its environment; for example, by changing the source or target of the data between development and production environments.

  2. Modular Design: Each step in the pipeline (reading, transformation, writing) should be encapsulated in its own class, making the code easier to understand and maintain. This also promotes re-usability as steps can be reused across different tasks.

  3. Error Handling: Koheesio provides a consistent way to handle errors and exceptions in data processing tasks. Make sure to leverage this feature to make your pipelines robust and fault-tolerant.

  4. Logging: Use the built-in logging feature in Koheesio to log information and errors in data processing tasks. This can be very helpful for debugging and monitoring the pipeline. Koheesio sets the log level to WARNING by default, but you can change it to INFO or DEBUG as needed.

  5. Testing: Each step can be tested independently, making it easier to write unit tests. It's a good practice to write tests for your steps to ensure they are working as expected.

  6. Use Transformations: The Transform class in Koheesio allows you to define transformations on your data. It's a good practice to encapsulate your transformation logic in Transform classes for better readability and maintainability.

  7. Consistent Structure: Koheesio enforces a consistent structure for data processing tasks. Stick to this structure to make your codebase easier to understand for new developers.

  8. Use Readers and Writers: Use the built-in Reader and Writer classes in Koheesio to handle data extraction and loading. This not only simplifies your code but also makes it more robust and efficient.

Remember, these are general best practices and might need to be adapted based on your specific use case and requirements.

"},{"location":"tutorials/learn-koheesio.html#pydantic","title":"Pydantic","text":"

Koheesio Steps are Pydantic models, which means they can be validated and serialized. This makes it easy to define the inputs and outputs of a Step, and to validate them before running the Step. Pydantic models also provide a consistent way to define the schema of the data that a Step expects and produces, making it easier to understand and maintain the code.

Learn more about Pydantic here.
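
As a minimal sketch of what this validation gives you (the step and its fields are hypothetical): invalid input is rejected when the instance is created, before execute() ever runs.

from koheesio import Step\n\nclass GreetStep(Step):\n    name: str\n    repeat: int = 1\n\n    def execute(self):\n        self.log.info(\"hello \" + self.name * self.repeat)\n\n# inputs are validated at construction time, before execute() runs\ntry:\n    GreetStep(name=\"world\", repeat=\"not-a-number\")\nexcept Exception as err:  # pydantic raises a ValidationError here\n    print(err)\n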

"},{"location":"tutorials/onboarding.html","title":"Onboarding","text":"


"},{"location":"tutorials/onboarding.html#onboarding-to-koheesio","title":"Onboarding to Koheesio","text":"

Koheesio is a Python library that simplifies the development of data engineering pipelines. It provides a structured way to define and execute data processing tasks, making it easier to build, test, and maintain complex data workflows.

This guide will walk you through the process of transforming a traditional Spark application into a Koheesio pipeline along with explaining the advantages of using Koheesio over raw Spark.

"},{"location":"tutorials/onboarding.html#traditional-spark-application","title":"Traditional Spark Application","text":"

First let's create a simple Spark application that you might use to process data.

The following Spark application reads a CSV file, performs a transformation, and writes the result to a Delta table. The transformation includes filtering data where age is greater than 18 and performing an aggregation to calculate the average salary per country. The result is then written to a Delta table partitioned by country.

from pyspark.sql import SparkSession\nfrom pyspark.sql.functions import col, avg\n\nspark = SparkSession.builder.getOrCreate()\n\n# Read data from CSV file\ndf = spark.read.csv(\"input.csv\", header=True, inferSchema=True)\n\n# Filter data where age is greater than 18\ndf = df.filter(col(\"age\") > 18)\n\n# Perform aggregation\ndf = df.groupBy(\"country\").agg(avg(\"salary\").alias(\"average_salary\"))\n\n# Write data to Delta table with partitioning\ndf.write.format(\"delta\").partitionBy(\"country\").save(\"/path/to/delta_table\")\n
"},{"location":"tutorials/onboarding.html#transforming-to-koheesio","title":"Transforming to Koheesio","text":"

The same pipeline can be rewritten using Koheesio's EtlTask. In this version, each step (reading, transformations, writing) is encapsulated in its own class, making the code easier to understand and maintain.

First, a CsvReader is defined to read the input CSV file. Then, a DeltaTableWriter is defined to write the result to a Delta table partitioned by country.

Two transformations are defined: one to filter data where age is greater than 18, and another to calculate the average salary per country.

These transformations are then passed to an EtlTask along with the reader and writer. Finally, the EtlTask is executed to run the pipeline.

from koheesio.spark.etl_task import EtlTask\nfrom koheesio.spark.readers.file_loader import CsvReader\nfrom koheesio.spark.writers.delta.batch import DeltaTableWriter\nfrom koheesio.spark.transformations.transform import Transform\nfrom pyspark.sql.functions import col, avg\n\n# Define reader\nreader = CsvReader(path=\"input.csv\", header=True, inferSchema=True)\n\n# Define writer\nwriter = DeltaTableWriter(table=\"delta_table\", partition_by=[\"country\"])\n\n# Define transformations\nage_transformation = Transform(\n    func=lambda df: df.filter(col(\"age\") > 18)\n)\navg_salary_per_country = Transform(\n    func=lambda df: df.groupBy(\"country\").agg(avg(\"salary\").alias(\"average_salary\"))\n)\n\n# Define and execute EtlTask\ntask = EtlTask(\n    source=reader, \n    target=writer, \n    transformations=[\n        age_transformation,\n        avg_salary_per_country\n    ]\n)\ntask.execute()\n
This approach with Koheesio provides several advantages. It makes the code more modular and easier to test. Each step can be tested independently and reused across different tasks. It also makes the pipeline more readable and easier to maintain.

"},{"location":"tutorials/onboarding.html#advantages-of-koheesio","title":"Advantages of Koheesio","text":"

Using Koheesio instead of raw Spark has several advantages:

  • Modularity: Each step in the pipeline (reading, transformation, writing) is encapsulated in its own class, making the code easier to understand and maintain.
  • Reusability: Steps can be reused across different tasks, reducing code duplication.
  • Testability: Each step can be tested independently, making it easier to write unit tests.
  • Flexibility: The behavior of a task can be customized using a Context class.
  • Consistency: Koheesio enforces a consistent structure for data processing tasks, making it easier for new developers to understand the codebase.
  • Error Handling: Koheesio provides a consistent way to handle errors and exceptions in data processing tasks.
  • Logging: Koheesio provides a consistent way to log information and errors in data processing tasks.

In contrast, using the plain PySpark API for transformations can lead to more verbose and less structured code, which can be harder to understand, maintain, and test. It also doesn't provide the same level of error handling, logging, and flexibility as the Koheesio Transform class.

"},{"location":"tutorials/onboarding.html#using-a-context-class","title":"Using a Context Class","text":"

Here's a simple example of how to use a Context class to customize the behavior of a task. The Context class in Koheesio is designed to behave like a dictionary, but with added features.

from koheesio import Context\nfrom koheesio.spark.etl_task import EtlTask\nfrom koheesio.spark.readers.file_loader import CsvReader\nfrom koheesio.spark.writers.delta import DeltaTableWriter\nfrom koheesio.spark.transformations.transform import Transform\n\ncontext = Context({  # this could be stored in a JSON or YAML\n    \"age_threshold\": 18,\n    \"reader_options\": {\n        \"path\": \"input.csv\",\n        \"header\": True,\n        \"inferSchema\": True\n    },\n    \"writer_options\": {\n        \"table\": \"delta_table\",\n        \"partition_by\": [\"country\"]\n    }\n})\n\ntask = EtlTask(\n    source = CsvReader(**context.reader_options),\n    target = DeltaTableWriter(**context.writer_options),\n    transformations = [\n        Transform(func=lambda df: df.filter(df[\"age\"] > context.age_threshold))\n    ]\n)\n\ntask.execute()\n

In this example, we're using CsvReader to read the input data, DeltaTableWriter to write the output data, and a Transform step to filter the data based on the age threshold. The options for the reader and writer are stored in a Context object, which can be easily updated or loaded from a JSON or YAML file.

"},{"location":"tutorials/testing-koheesio-steps.html","title":"Testing Koheesio Tasks","text":"

Testing is a crucial part of any software development process. Koheesio provides a structured way to define and execute data processing tasks, which makes it easier to build, test, and maintain complex data workflows. This guide will walk you through the process of testing Koheesio tasks.

"},{"location":"tutorials/testing-koheesio-steps.html#unit-testing","title":"Unit Testing","text":"

Unit testing involves testing individual components of the software in isolation. In the context of Koheesio, this means testing individual tasks or steps.

Here's an example of how to unit test a Koheesio task:

from koheesio.tasks.etl_task import EtlTask\nfrom koheesio.steps.readers import DummyReader\nfrom koheesio.steps.writers.dummy import DummyWriter\nfrom koheesio.steps.transformations import Transform\nfrom pyspark.sql import SparkSession, DataFrame\nfrom pyspark.sql.functions import col\n\n\ndef filter_age(df: DataFrame) -> DataFrame:\n    return df.filter(col(\"Age\") > 18)\n\n\ndef test_etl_task():\n    # Initialize SparkSession\n    spark = SparkSession.builder.getOrCreate()\n\n    # Create a DataFrame for testing\n    data = [(\"John\", 19), (\"Anna\", 20), (\"Tom\", 18)]\n    df = spark.createDataFrame(data, [\"Name\", \"Age\"])\n\n    # Define the task\n    task = EtlTask(\n        source=DummyReader(df=df),\n        target=DummyWriter(),\n        transformations=[\n            Transform(filter_age)\n        ]\n    )\n\n    # Execute the task\n    task.execute()\n\n    # Assert the result\n    result_df = task.output.df\n    assert result_df.count() == 2\n    assert result_df.filter(\"Name == 'Tom'\").count() == 0\n

In this example, we're testing an EtlTask that reads data from a DataFrame, applies a filter transformation, and writes the result to another DataFrame. The test asserts that the task correctly filters out rows where the age is less than or equal to 18.

"},{"location":"tutorials/testing-koheesio-steps.html#integration-testing","title":"Integration Testing","text":"

Integration testing involves testing the interactions between different components of the software. In the context of Koheesio, this means testing how data flows end-to-end through one or more tasks.

We'll create a simple test for a hypothetical EtlTask that uses DeltaReader and DeltaWriter, using pytest and unittest.mock to mock the responses of the reader and writer. Assume that the EtlTask is defined in a module named my_module: it reads data from a Delta table, applies a transformation, and writes the result to another Delta table.

Here's the task definition in my_module.py, followed by an integration test for it:

# my_module.py\nfrom koheesio.tasks.etl_task import EtlTask\nfrom koheesio.spark.readers.delta import DeltaReader\nfrom koheesio.steps.writers.delta import DeltaWriter\nfrom koheesio.steps.transformations import Transform\nfrom koheesio.context import Context\nfrom pyspark.sql.functions import col\n\n\ndef filter_age(df):\n    return df.filter(col(\"Age\") > 18)\n\n\ncontext = Context({\n    \"reader_options\": {\n        \"table\": \"input_table\"\n    },\n    \"writer_options\": {\n        \"table\": \"output_table\"\n    }\n})\n\ntask = EtlTask(\n    source=DeltaReader(**context.reader_options),\n    target=DeltaWriter(**context.writer_options),\n    transformations=[\n        Transform(filter_age)\n    ]\n)\n

Now, let's write the test itself, using pytest fixtures to provide a Spark session, a test context, and a test DataFrame.

# test_my_module.py\nimport pytest\nfrom unittest.mock import patch\nfrom pyspark.sql import SparkSession\nfrom koheesio.context import Context\nfrom koheesio.steps.readers import Reader\nfrom koheesio.steps.writers import Writer\n\nfrom my_module import task\n\n@pytest.fixture(scope=\"module\")\ndef spark():\n    return SparkSession.builder.getOrCreate()\n\n@pytest.fixture(scope=\"module\")\ndef test_context():\n    return Context({\n        \"reader_options\": {\n            \"table\": \"test_input_table\"\n        },\n        \"writer_options\": {\n            \"table\": \"test_output_table\"\n        }\n    })\n\n@pytest.fixture(scope=\"module\")\ndef test_df(spark):\n    data = [(\"John\", 19), (\"Anna\", 20), (\"Tom\", 18)]\n    return spark.createDataFrame(data, [\"Name\", \"Age\"])\n\ndef test_etl_task(spark, test_context, test_df):\n    # test_context mirrors the production context with test table names\n    # Mock the read method of the Reader class so no real Delta table is needed\n    with patch.object(Reader, \"read\", return_value=test_df) as mock_read:\n        # Mock the write method of the Writer class so nothing is actually written\n        with patch.object(Writer, \"write\") as mock_write:\n            # Execute the task\n            task.execute()\n\n            # Assert the result\n            result_df = task.output.df\n            assert result_df.count() == 2\n            assert result_df.filter(\"Name == 'Tom'\").count() == 0\n\n            # The table options are bound when DeltaReader and DeltaWriter are\n            # constructed in my_module, so read and write are invoked without\n            # those arguments; assert that each was called exactly once\n            mock_read.assert_called_once()\n            mock_write.assert_called_once()\n

In this test, we're patching the read and write methods of the base Reader and Writer classes (which DeltaReader and DeltaWriter inherit), so the task consumes a test DataFrame instead of a real Delta table and writes nothing. We assert that read and write are each invoked exactly once, and that the task correctly filters out rows where the age is less than or equal to 18.

"},{"location":"misc/tags.html","title":"{{ page.title }}","text":""},{"location":"misc/tags.html#doctypeexplanation","title":"doctype/explanation","text":"
  • Approach documentation
"},{"location":"misc/tags.html#doctypehow-to","title":"doctype/how-to","text":"
  • How to
"}]} \ No newline at end of file