From 0dcfac777f632071df65687a2e7504bb50d21053 Mon Sep 17 00:00:00 2001 From: Danny Meijer <10511979+dannymeijer@users.noreply.github.com> Date: Thu, 30 May 2024 19:10:56 +0100 Subject: [PATCH] Deployed e6bc5b5 to dev with MkDocs 1.6.0 and mike 2.1.1 --- dev/index.html | 2 +- dev/search/search_index.json | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/dev/index.html b/dev/index.html index 80d5d06..fdc5c93 100644 --- a/dev/index.html +++ b/dev/index.html @@ -3744,7 +3744,7 @@
┌─────────┐ ┌──────────────────┐ ┌──────────┐
│ Input 1 │───────▶│ ├───────▶│ Output 1 │
-└─────────┘ │ │ └────√─────┘
+└─────────┘ │ │ └──────────┘
│ │
┌─────────┐ │ │ ┌──────────┐
│ Input 2 │───────▶│ Step │───────▶│ Output 2 │
diff --git a/dev/search/search_index.json b/dev/search/search_index.json
index c4cd013..007d764 100644
--- a/dev/search/search_index.json
+++ b/dev/search/search_index.json
@@ -1 +1 @@
-{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"index.html","title":"Home","text":""},{"location":"index.html#koheesio","title":"Koheesio","text":"CI/CD Package Meta Koheesio, named after the Finnish word for cohesion, is a robust Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.
The framework is versatile, aiming to support multiple implementations and working seamlessly with various data processing libraries or frameworks. This ensures that Koheesio can handle any data processing task, regardless of the underlying technology or data scale.
Koheesio uses Pydantic for strong typing, data validation, and settings management, ensuring a high level of type safety and structured configurations within pipeline components.
Koheesio's goal is to ensure predictable pipeline execution through a solid foundation of well-tested code and a rich set of features, making it an excellent choice for developers and organizations seeking to build robust and adaptable Data Pipelines.
"},{"location":"index.html#what-sets-koheesio-apart-from-other-libraries","title":"What sets Koheesio apart from other libraries?\"","text":"Koheesio encapsulates years of data engineering expertise, fostering a collaborative and innovative community. While similar libraries exist, Koheesio's focus on data pipelines, integration with PySpark, and specific design for tasks like data transformation, ETL jobs, data validation, and large-scale data processing sets it apart.
Koheesio aims to provide a rich set of features, including readers, writers, and transformations, for any type of data processing. Koheesio is not in competition with other libraries; its aim is to offer wide-ranging support and focus on utility in a multitude of scenarios. Our preference is for integration, not competition.
We invite contributions from all, promoting collaboration and innovation in the data engineering community.
"},{"location":"index.html#koheesio-core-components","title":"Koheesio Core Components","text":"Here are the key components included in Koheesio:
- Step: This is the fundamental unit of work in Koheesio. It represents a single operation in a data pipeline, taking in inputs and producing outputs.
\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 1 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Output 1 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u221a\u2500\u2500\u2500\u2500\u2500\u2518\n \u2502 \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 2 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Step \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Output 2 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u2502 \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 3 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Output 3 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n
- Context: This is a configuration class used to set up the environment for a Task. It can be used to share variables across tasks and adapt the behavior of a Task based on its environment.
- Logger: This is a class for logging messages at different levels.
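As a minimal sketch of how these components fit together (the top-level imports and the nested Output pattern are assumptions based on the API reference below, not verbatim from this page):
from koheesio import Context, Step\n\n\nclass GreetStep(Step):\n    greeting: str  # typed input, validated by Pydantic\n\n    class Output(Step.Output):\n        message: str  # typed output\n\n    def execute(self):\n        self.log.info(\"Running GreetStep\")  # every Koheesio model exposes a logger via .log\n        self.output.message = f\"{self.greeting}, world!\"\n\n\n# a Context can hold shared configuration for the step\ncontext = Context({\"greeting\": \"Hello\"})\nstep = GreetStep(**context.to_dict())\nprint(step.execute().message)  # \"Hello, world!\"\n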
"},{"location":"index.html#installation","title":"Installation","text":"You can install Koheesio using either pip or poetry.
"},{"location":"index.html#using-pip","title":"Using Pip","text":"To install Koheesio using pip, run the following command in your terminal:
pip install koheesio\n
"},{"location":"index.html#using-hatch","title":"Using Hatch","text":"If you're using Hatch for package management, you can add Koheesio to your project by simply adding koheesio to your pyproject.toml
.
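For example, a minimal PEP 621-style entry could look like the following (the version pin is left to you):
[project]\ndependencies = [\n    \"koheesio\",\n]\n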
"},{"location":"index.html#using-poetry","title":"Using Poetry","text":"If you're using poetry for package management, you can add Koheesio to your project with the following command:
poetry add koheesio\n
or add the following line to your pyproject.toml
(under [tool.poetry.dependencies]
), making sure to replace ...
with the version you want to have installed:
koheesio = {version = \"...\"}\n
"},{"location":"index.html#extras","title":"Extras","text":"Koheesio also provides some additional features that can be useful in certain scenarios. These include:
-
Spark Expectations: Available through the koheesio.steps.integration.spark.dq.spark_expectations
module; installable through the se
extra.
- Spark Expectations provides data quality checks for Spark DataFrames.
- For more information, refer to the Spark Expectations docs.
-
Box: Available through the koheesio.steps.integration.box
module; installable through the box
extra.
- Box is a cloud content management and file sharing service for businesses.
-
SFTP: Available through the koheesio.steps.integration.spark.sftp
module; installable through the sftp
extra.
- SFTP is a network protocol used for secure file transfer over Secure Shell (SSH).
Note: Some of the steps require extra dependencies. See the Extras section for additional info. Extras can be added to Poetry by adding extras=['name_of_the_extra']
to the toml entry mentioned above.
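For example, to pull in one or more extras with pip (pick only the extras you need):
pip install \"koheesio[se,box,sftp]\"\n
or, for Poetry, extend the toml entry shown earlier:
koheesio = {version = \"...\", extras = [\"se\"]}\n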
"},{"location":"index.html#contributing","title":"Contributing","text":""},{"location":"index.html#how-to-contribute","title":"How to Contribute","text":"We welcome contributions to our project! Here's a brief overview of our development process:
-
Code Standards: We use pylint
, black
, and mypy
to maintain code standards. Please ensure your code passes these checks by running make check
. The linter should report no errors or warnings before you submit a pull request.
-
Testing: We use pytest
for testing. Run the tests with make test
and ensure all tests pass before submitting a pull request.
-
Release Process: We aim for frequent releases. Typically when we have a new feature or bugfix, a developer with admin rights will create a new release on GitHub and publish the new version to PyPI.
For more detailed information, please refer to our contribution guidelines. We also adhere to Nike's Code of Conduct and Nike's Individual Contributor License Agreement.
"},{"location":"index.html#additional-resources","title":"Additional Resources","text":" - General GitHub documentation
- GitHub pull request documentation
- Nike OSS
"},{"location":"api_reference/index.html","title":"API Reference","text":""},{"location":"api_reference/index.html#koheesio.ABOUT","title":"koheesio.ABOUT module-attribute
","text":"ABOUT = _about()\n
"},{"location":"api_reference/index.html#koheesio.VERSION","title":"koheesio.VERSION module-attribute
","text":"VERSION = __version__\n
"},{"location":"api_reference/index.html#koheesio.BaseModel","title":"koheesio.BaseModel","text":"Base model for all models.
Extends pydantic BaseModel with some additional configuration. To be used as a base class for all models in Koheesio instead of pydantic.BaseModel.
Additional methods and properties: Different Modes This BaseModel class supports lazy mode. This means that validation of the items stored in the class can be called at will instead of being forced to run it upfront.
-
Normal mode: you need to know the values ahead of time
normal_mode = YourOwnModel(a=\"foo\", b=42)\n
-
Lazy mode: being able to defer the validation until later
lazy_mode = YourOwnModel.lazy()\nlazy_mode.a = \"foo\"\nlazy_mode.b = 42\nlazy_mode.validate_output()\n
The prime advantage of using lazy mode is that you don't have to know all your outputs up front and can add them as they become available, while still being able to validate that you have collected all your output at the end. -
With statements: With statements are also allowed. The validate_output
method from the earlier example will run upon exit of the with-statement.
with YourOwnModel.lazy() as with_output:\n with_output.a = \"foo\"\n with_output.b = 42\n
Note that a lazy mode BaseModel object is required to work with a with-statement.
Examples:
from koheesio.models import BaseModel\n\n\nclass Person(BaseModel):\n name: str\n age: int\n\n\n# Using the lazy method to create an instance without immediate validation\nperson = Person.lazy()\n\n# Setting attributes\nperson.name = \"John Doe\"\nperson.age = 30\n\n# Now we validate the instance\nperson.validate_output()\n\nprint(person)\n
In this example, the Person instance is created without immediate validation. The attributes name and age are set afterward. The validate_output
method is then called to validate the instance.
Koheesio specific configuration: Koheesio models are configured differently from Pydantic defaults. The following configuration is used:
-
extra=\"allow\"
This setting allows for extra fields that are not specified in the model definition. If a field is present in the data but not in the model, it will not raise an error. Pydantic default is \"ignore\", which means that extra attributes are ignored.
-
arbitrary_types_allowed=True
This setting allows for fields in the model to be of any type. This is useful when you want to include fields in your model that are not standard Python types. Pydantic default is False, which means that fields must be of a standard Python type.
-
populate_by_name=True
This setting allows an aliased field to be populated by its name as given by the model attribute, as well as the alias. This was known as allow_population_by_field_name in pydantic v1. Pydantic default is False, which means that fields can only be populated by their alias.
-
validate_assignment=False
This setting determines whether the model should be revalidated when the data is changed. If set to True
, every time a field is assigned a new value, the entire model is validated again.
Pydantic default is (also) False
, which means that the model is not revalidated when the data is changed. The default behavior of Pydantic is to validate the data when the model is created. In case the user changes the data after the model is created, the model is not revalidated.
-
revalidate_instances=\"subclass-instances\"
This setting determines whether to revalidate models during validation if the instance is a subclass of the model. This is important as inheritance is used a lot in Koheesio. Pydantic default is never
, which means that the model and dataclass instances are not revalidated during validation.
-
validate_default=True
This setting determines whether to validate default values during validation. When set to True, default values are checked during the validation process. We opt to set this to True, as we are attempting to make sure that the data is valid prior to running or executing any Step. Pydantic default is False, which means that default values are not validated during validation.
-
frozen=False
This setting determines whether the model is immutable. If set to True, once a model is created, its fields cannot be changed. Pydantic default is also False, which means that the model is mutable.
-
coerce_numbers_to_str=True
This setting determines whether to convert number fields to strings. When set to True, enables automatic coercion of any Number
type to str
. Pydantic doesn't allow number types (int
, float
, Decimal
) to be coerced as type str
by default.
-
use_enum_values=True
This setting determines whether to use the values of Enum fields. If set to True, the actual value of the Enum is used instead of the reference. Pydantic default is False, which means that the reference to the Enum is used.
"},{"location":"api_reference/index.html#koheesio.BaseModel--fields","title":"Fields","text":"Every Koheesio BaseModel has two fields: name
and description
. These fields are used to provide a name and a description to the model.
-
name
: This is the name of the Model. If not provided, it defaults to the class name.
-
description
: This is the description of the Model. It has several default behaviors:
- If not provided, it defaults to the docstring of the class.
- If the docstring is not provided, it defaults to the name of the class.
- For multi-line descriptions, it has the following behaviors:
- Only the first non-empty line is used.
- Empty lines are removed.
- Only the first 3 lines are considered.
- Only the first 120 characters are considered.
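A short sketch of these defaults (reusing the import shown in the examples above; the field name is arbitrary):
from koheesio.models import BaseModel\n\n\nclass Person(BaseModel):\n    \"\"\"A person with an age.\"\"\"\n\n    age: int\n\n\nperson = Person(age=30)\nprint(person.name)  # 'Person' - defaults to the class name\nprint(person.description)  # 'A person with an age.' - defaults to the docstring\n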
"},{"location":"api_reference/index.html#koheesio.BaseModel--validators","title":"Validators","text":" _set_name_and_description
: Set the name and description of the Model as per the rules mentioned above.
"},{"location":"api_reference/index.html#koheesio.BaseModel--properties","title":"Properties","text":" log
: Returns a logger with the name of the class.
"},{"location":"api_reference/index.html#koheesio.BaseModel--class-methods","title":"Class Methods","text":" from_basemodel
: Returns a new BaseModel instance based on the data of another BaseModel. from_context
: Creates BaseModel instance from a given Context. from_dict
: Creates BaseModel instance from a given dictionary. from_json
: Creates BaseModel instance from a given JSON string. from_toml
: Creates BaseModel object from a given toml file. from_yaml
: Creates BaseModel object from a given yaml file. lazy
: Constructs the model without doing validation.
"},{"location":"api_reference/index.html#koheesio.BaseModel--dunder-methods","title":"Dunder Methods","text":" __add__
: Allows adding two BaseModel instances together. __enter__
: Allows for using the model in a with-statement. __exit__
: Allows for using the model in a with-statement. __setitem__
: Set Item dunder method for BaseModel. __getitem__
: Get Item dunder method for BaseModel.
"},{"location":"api_reference/index.html#koheesio.BaseModel--instance-methods","title":"Instance Methods","text":" hasattr
: Check if given key is present in the model. get
: Get an attribute of the model, but don't fail if not present. merge
: Merge key,value map with self. set
: Allows for subscribing / assigning to class[key]
. to_context
: Converts the BaseModel instance to a Context object. to_dict
: Converts the BaseModel instance to a dictionary. to_json
: Converts the BaseModel instance to a JSON string. to_yaml
: Converts the BaseModel instance to a YAML string.
"},{"location":"api_reference/index.html#koheesio.BaseModel.description","title":"description class-attribute
instance-attribute
","text":"description: Optional[str] = Field(default=None, description='Description of the Model')\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.log","title":"log property
","text":"log: Logger\n
Returns a logger with the name of the class
"},{"location":"api_reference/index.html#koheesio.BaseModel.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(extra='allow', arbitrary_types_allowed=True, populate_by_name=True, validate_assignment=False, revalidate_instances='subclass-instances', validate_default=True, frozen=False, coerce_numbers_to_str=True, use_enum_values=True)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.name","title":"name class-attribute
instance-attribute
","text":"name: Optional[str] = Field(default=None, description='Name of the Model')\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_basemodel","title":"from_basemodel classmethod
","text":"from_basemodel(basemodel: BaseModel, **kwargs)\n
Returns a new BaseModel instance based on the data of another BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_basemodel(cls, basemodel: BaseModel, **kwargs):\n \"\"\"Returns a new BaseModel instance based on the data of another BaseModel\"\"\"\n kwargs = {**basemodel.model_dump(), **kwargs}\n return cls(**kwargs)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_context","title":"from_context classmethod
","text":"from_context(context: Context) -> BaseModel\n
Creates BaseModel instance from a given Context
You have to make sure that the Context object has the necessary attributes to create the model.
Examples:
class SomeStep(BaseModel):\n foo: str\n\n\ncontext = Context(foo=\"bar\")\nsome_step = SomeStep.from_context(context)\nprint(some_step.foo) # prints 'bar'\n
Parameters:
Name Type Description Default context
Context
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_context(cls, context: Context) -> BaseModel:\n \"\"\"Creates BaseModel instance from a given Context\n\n You have to make sure that the Context object has the necessary attributes to create the model.\n\n Examples\n --------\n ```python\n class SomeStep(BaseModel):\n foo: str\n\n\n context = Context(foo=\"bar\")\n some_step = SomeStep.from_context(context)\n print(some_step.foo) # prints 'bar'\n ```\n\n Parameters\n ----------\n context: Context\n\n Returns\n -------\n BaseModel\n \"\"\"\n return cls(**context)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_dict","title":"from_dict classmethod
","text":"from_dict(data: Dict[str, Any]) -> BaseModel\n
Creates BaseModel instance from a given dictionary
Parameters:
Name Type Description Default data
Dict[str, Any]
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_dict(cls, data: Dict[str, Any]) -> BaseModel:\n \"\"\"Creates BaseModel instance from a given dictionary\n\n Parameters\n ----------\n data: Dict[str, Any]\n\n Returns\n -------\n BaseModel\n \"\"\"\n return cls(**data)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_json","title":"from_json classmethod
","text":"from_json(json_file_or_str: Union[str, Path]) -> BaseModel\n
Creates BaseModel instance from a given JSON string
BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored in the BaseModel object, which is not possible with the standard json library.
See Also Context.from_json : Deserializes a JSON string to a Context object
Parameters:
Name Type Description Default json_file_or_str
Union[str, Path]
Pathlike string or Path that points to the json file or string containing json
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_json(cls, json_file_or_str: Union[str, Path]) -> BaseModel:\n \"\"\"Creates BaseModel instance from a given JSON string\n\n BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses\n jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored\n in the BaseModel object, which is not possible with the standard json library.\n\n See Also\n --------\n Context.from_json : Deserializes a JSON string to a Context object\n\n Parameters\n ----------\n json_file_or_str : Union[str, Path]\n Pathlike string or Path that points to the json file or string containing json\n\n Returns\n -------\n BaseModel\n \"\"\"\n _context = Context.from_json(json_file_or_str)\n return cls.from_context(_context)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_toml","title":"from_toml classmethod
","text":"from_toml(toml_file_or_str: Union[str, Path]) -> BaseModel\n
Creates BaseModel object from a given toml file
Note: BaseModel offloads the serialization and deserialization of the TOML string to Context class.
Parameters:
Name Type Description Default toml_file_or_str
Union[str, Path]
Pathlike string or Path that points to the toml file, or string containing toml
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_toml(cls, toml_file_or_str: Union[str, Path]) -> BaseModel:\n \"\"\"Creates BaseModel object from a given toml file\n\n Note: BaseModel offloads the serialization and deserialization of the TOML string to Context class.\n\n Parameters\n ----------\n toml_file_or_str: str or Path\n Pathlike string or Path that points to the toml file, or string containing toml\n\n Returns\n -------\n BaseModel\n \"\"\"\n _context = Context.from_toml(toml_file_or_str)\n return cls.from_context(_context)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_yaml","title":"from_yaml classmethod
","text":"from_yaml(yaml_file_or_str: str) -> BaseModel\n
Creates BaseModel object from a given yaml file
Note: BaseModel offloads the serialization and deserialization of the YAML string to Context class.
Parameters:
Name Type Description Default yaml_file_or_str
str
Pathlike string or Path that points to the yaml file, or string containing yaml
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_yaml(cls, yaml_file_or_str: str) -> BaseModel:\n \"\"\"Creates BaseModel object from a given yaml file\n\n Note: BaseModel offloads the serialization and deserialization of the YAML string to Context class.\n\n Parameters\n ----------\n yaml_file_or_str: str or Path\n Pathlike string or Path that points to the yaml file, or string containing yaml\n\n Returns\n -------\n BaseModel\n \"\"\"\n _context = Context.from_yaml(yaml_file_or_str)\n return cls.from_context(_context)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.get","title":"get","text":"get(key: str, default: Optional[Any] = None)\n
Get an attribute of the model, but don't fail if not present
Similar to dict.get()
Examples:
step_output = StepOutput(foo=\"bar\")\nstep_output.get(\"foo\") # returns 'bar'\nstep_output.get(\"non_existent_key\", \"oops\") # returns 'oops'\n
Parameters:
Name Type Description Default key
str
name of the key to get
required default
Optional[Any]
Default value in case the attribute does not exist
None
Returns:
Type Description Any
The value of the attribute
Source code in src/koheesio/models/__init__.py
def get(self, key: str, default: Optional[Any] = None):\n \"\"\"Get an attribute of the model, but don't fail if not present\n\n Similar to dict.get()\n\n Examples\n --------\n ```python\n step_output = StepOutput(foo=\"bar\")\n step_output.get(\"foo\") # returns 'bar'\n step_output.get(\"non_existent_key\", \"oops\") # returns 'oops'\n ```\n\n Parameters\n ----------\n key: str\n name of the key to get\n default: Optional[Any]\n Default value in case the attribute does not exist\n\n Returns\n -------\n Any\n The value of the attribute\n \"\"\"\n if self.hasattr(key):\n return self.__getitem__(key)\n return default\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.hasattr","title":"hasattr","text":"hasattr(key: str) -> bool\n
Check if given key is present in the model
Parameters:
Name Type Description Default key
str
required Returns:
Type Description bool
Source code in src/koheesio/models/__init__.py
def hasattr(self, key: str) -> bool:\n \"\"\"Check if given key is present in the model\n\n Parameters\n ----------\n key: str\n\n Returns\n -------\n bool\n \"\"\"\n return hasattr(self, key)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.lazy","title":"lazy classmethod
","text":"lazy()\n
Constructs the model without doing validation
Essentially an alias to BaseModel.construct()
Source code in src/koheesio/models/__init__.py
@classmethod\ndef lazy(cls):\n \"\"\"Constructs the model without doing validation\n\n Essentially an alias to BaseModel.construct()\n \"\"\"\n return cls.model_construct()\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.merge","title":"merge","text":"merge(other: Union[Dict, BaseModel])\n
Merge key,value map with self
Functionally similar to adding two dicts together; like running {**dict_a, **dict_b}
.
Examples:
step_output = StepOutput(foo=\"bar\")\nstep_output.merge({\"lorem\": \"ipsum\"}) # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n
Parameters:
Name Type Description Default other
Union[Dict, BaseModel]
Dict or another instance of a BaseModel class that will be added to self
required Source code in src/koheesio/models/__init__.py
def merge(self, other: Union[Dict, BaseModel]):\n \"\"\"Merge key,value map with self\n\n Functionally similar to adding two dicts together; like running `{**dict_a, **dict_b}`.\n\n Examples\n --------\n ```python\n step_output = StepOutput(foo=\"bar\")\n step_output.merge({\"lorem\": \"ipsum\"}) # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n ```\n\n Parameters\n ----------\n other: Union[Dict, BaseModel]\n Dict or another instance of a BaseModel class that will be added to self\n \"\"\"\n if isinstance(other, BaseModel):\n other = other.model_dump() # ensures we really have a dict\n\n for k, v in other.items():\n self.set(k, v)\n\n return self\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.set","title":"set","text":"set(key: str, value: Any)\n
Allows for subscribing / assigning to class[key]
.
Examples:
step_output = StepOutput(foo=\"bar\")\nstep_output.set(foo\", \"baz\") # overwrites 'foo' to be 'baz'\n
Parameters:
Name Type Description Default key
str
The key of the attribute to assign to
required value
Any
Value that should be assigned to the given key
required Source code in src/koheesio/models/__init__.py
def set(self, key: str, value: Any):\n    \"\"\"Allows for subscribing / assigning to `class[key]`.\n\n    Examples\n    --------\n    ```python\n    step_output = StepOutput(foo=\"bar\")\n    step_output.set(\"foo\", \"baz\") # overwrites 'foo' to be 'baz'\n    ```\n\n    Parameters\n    ----------\n    key: str\n        The key of the attribute to assign to\n    value: Any\n        Value that should be assigned to the given key\n    \"\"\"\n    self.__setitem__(key, value)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.to_context","title":"to_context","text":"to_context() -> Context\n
Converts the BaseModel instance to a Context object
Returns:
Type Description Context
Source code in src/koheesio/models/__init__.py
def to_context(self) -> Context:\n \"\"\"Converts the BaseModel instance to a Context object\n\n Returns\n -------\n Context\n \"\"\"\n return Context(**self.to_dict())\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.to_dict","title":"to_dict","text":"to_dict() -> Dict[str, Any]\n
Converts the BaseModel instance to a dictionary
Returns:
Type Description Dict[str, Any]
Source code in src/koheesio/models/__init__.py
def to_dict(self) -> Dict[str, Any]:\n \"\"\"Converts the BaseModel instance to a dictionary\n\n Returns\n -------\n Dict[str, Any]\n \"\"\"\n return self.model_dump()\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.to_json","title":"to_json","text":"to_json(pretty: bool = False)\n
Converts the BaseModel instance to a JSON string
BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored in the BaseModel object, which is not possible with the standard json library.
See Also Context.to_json : Serializes a Context object to a JSON string
Parameters:
Name Type Description Default pretty
bool
Toggles whether to return a pretty json string or not
False
Returns:
Type Description str
containing all parameters of the BaseModel instance
Source code in src/koheesio/models/__init__.py
def to_json(self, pretty: bool = False):\n \"\"\"Converts the BaseModel instance to a JSON string\n\n BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses\n jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored\n in the BaseModel object, which is not possible with the standard json library.\n\n See Also\n --------\n Context.to_json : Serializes a Context object to a JSON string\n\n Parameters\n ----------\n pretty : bool, optional, default=False\n Toggles whether to return a pretty json string or not\n\n Returns\n -------\n str\n containing all parameters of the BaseModel instance\n \"\"\"\n _context = self.to_context()\n return _context.to_json(pretty=pretty)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.to_yaml","title":"to_yaml","text":"to_yaml(clean: bool = False) -> str\n
Converts the BaseModel instance to a YAML string
BaseModel offloads the serialization and deserialization of the YAML string to Context class.
Parameters:
Name Type Description Default clean
bool
Toggles whether to remove !!python/object:...
from yaml or not. Default: False
False
Returns:
Type Description str
containing all parameters of the BaseModel instance
Source code in src/koheesio/models/__init__.py
def to_yaml(self, clean: bool = False) -> str:\n \"\"\"Converts the BaseModel instance to a YAML string\n\n BaseModel offloads the serialization and deserialization of the YAML string to Context class.\n\n Parameters\n ----------\n clean: bool\n Toggles whether to remove `!!python/object:...` from yaml or not.\n Default: False\n\n Returns\n -------\n str\n containing all parameters of the BaseModel instance\n \"\"\"\n _context = self.to_context()\n return _context.to_yaml(clean=clean)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.validate","title":"validate","text":"validate() -> BaseModel\n
Validate the BaseModel instance
This method is used to validate the BaseModel instance. It is used in conjunction with the lazy method to validate the instance after all the attributes have been set.
This method is intended to be used with the lazy
method. The lazy
method is used to create an instance of the BaseModel without immediate validation. The validate
method is then used to validate the instance after.
Note: in the Pydantic BaseModel, the validate
method throws a deprecated warning. This is because Pydantic recommends using the validate_model
method instead. However, we are using the validate
method here in a different context and a slightly different way.
Examples:
class FooModel(BaseModel):\n foo: str\n lorem: str\n\n\nfoo_model = FooModel.lazy()\nfoo_model.foo = \"bar\"\nfoo_model.lorem = \"ipsum\"\nfoo_model.validate()\n
In this example, the foo_model
instance is created without immediate validation. The attributes foo and lorem are set afterward. The validate
method is then called to validate the instance. Returns:
Type Description BaseModel
The BaseModel instance
Source code in src/koheesio/models/__init__.py
def validate(self) -> BaseModel:\n \"\"\"Validate the BaseModel instance\n\n This method is used to validate the BaseModel instance. It is used in conjunction with the lazy method to\n validate the instance after all the attributes have been set.\n\n This method is intended to be used with the `lazy` method. The `lazy` method is used to create an instance of\n the BaseModel without immediate validation. The `validate` method is then used to validate the instance after.\n\n > Note: in the Pydantic BaseModel, the `validate` method throws a deprecated warning. This is because Pydantic\n recommends using the `validate_model` method instead. However, we are using the `validate` method here in a\n different context and a slightly different way.\n\n Examples\n --------\n ```python\n class FooModel(BaseModel):\n foo: str\n lorem: str\n\n\n foo_model = FooModel.lazy()\n foo_model.foo = \"bar\"\n foo_model.lorem = \"ipsum\"\n foo_model.validate()\n ```\n In this example, the `foo_model` instance is created without immediate validation. The attributes foo and lorem\n are set afterward. The `validate` method is then called to validate the instance.\n\n Returns\n -------\n BaseModel\n The BaseModel instance\n \"\"\"\n return self.model_validate(self.model_dump())\n
"},{"location":"api_reference/index.html#koheesio.Context","title":"koheesio.Context","text":"Context(*args, **kwargs)\n
The Context class is a key component of the Koheesio framework, designed to manage configuration data and shared variables across tasks and steps in your application. It behaves much like a dictionary, but with added functionalities.
Key Features - Nested keys: Supports accessing and adding nested keys similar to dictionary keys.
- Recursive merging: Merges two Contexts together, with the incoming Context having priority.
- Serialization/Deserialization: Easily created from a yaml, toml, or json file, or a dictionary, and can be converted back to a dictionary.
- Handling complex Python objects: Uses jsonpickle for serialization and deserialization of complex Python objects to and from JSON.
For a comprehensive guide on the usage, examples, and additional features of the Context class, please refer to the reference/concepts/context section of the Koheesio documentation.
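A brief sketch of nested keys and recursive merging (the top-level import is an assumption based on this reference; see the method docs below for the exact signatures):
from koheesio import Context\n\nctx = Context({\"db\": {\"host\": \"localhost\", \"port\": 5432}})\nctx.get(\"db.host\")  # 'localhost' - nested keys use dotted notation\n\noverride = Context({\"db\": {\"host\": \"prod-server\"}})\nmerged = ctx.merge(override, recursive=True)\nmerged.get(\"db.host\")  # 'prod-server' - the incoming context has priority\nmerged.get(\"db.port\")  # 5432 - preserved by the recursive merge\n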
Methods:
Name Description add
Add a key/value pair to the context.
get
Get value of a given key.
get_item
Acts just like .get
, except that it returns the key also.
contains
Check if the context contains a given key.
merge
Merge this context with the context of another, where the incoming context has priority.
to_dict
Returns all parameters of the context as a dict.
from_dict
Creates Context object from the given dict.
from_yaml
Creates Context object from a given yaml file.
from_json
Creates Context object from a given json file.
Dunder methods - _
_iter__()
: Allows for iteration across a Context. __len__()
: Returns the length of the Context. __getitem__(item)
: Makes class subscriptable.
Inherited from Mapping items()
: Returns all items of the Context. keys()
: Returns all keys of the Context. values()
: Returns all values of the Context.
Source code in src/koheesio/context.py
def __init__(self, *args, **kwargs):\n    \"\"\"Initializes the Context object with given arguments.\"\"\"\n    for arg in args:\n        if isinstance(arg, dict):\n            kwargs.update(arg)\n        if isinstance(arg, Context):\n            kwargs.update(arg.to_dict())\n\n    for key, value in kwargs.items():\n        self.__dict__[key] = self.process_value(value)\n
"},{"location":"api_reference/index.html#koheesio.Context.add","title":"add","text":"add(key: str, value: Any) -> Context\n
Add a key/value pair to the context
Source code in src/koheesio/context.py
def add(self, key: str, value: Any) -> Context:\n \"\"\"Add a key/value pair to the context\"\"\"\n self.__dict__[key] = value\n return self\n
"},{"location":"api_reference/index.html#koheesio.Context.contains","title":"contains","text":"contains(key: str) -> bool\n
Check if the context contains a given key
Parameters:
Name Type Description Default key
str
required Returns:
Type Description bool
Source code in src/koheesio/context.py
def contains(self, key: str) -> bool:\n \"\"\"Check if the context contains a given key\n\n Parameters\n ----------\n key: str\n\n Returns\n -------\n bool\n \"\"\"\n try:\n self.get(key, safe=False)\n return True\n except KeyError:\n return False\n
"},{"location":"api_reference/index.html#koheesio.Context.from_dict","title":"from_dict classmethod
","text":"from_dict(kwargs: dict) -> Context\n
Creates Context object from the given dict
Parameters:
Name Type Description Default kwargs
dict
required Returns:
Type Description Context
Source code in src/koheesio/context.py
@classmethod\ndef from_dict(cls, kwargs: dict) -> Context:\n \"\"\"Creates Context object from the given dict\n\n Parameters\n ----------\n kwargs: dict\n\n Returns\n -------\n Context\n \"\"\"\n return cls(kwargs)\n
"},{"location":"api_reference/index.html#koheesio.Context.from_json","title":"from_json classmethod
","text":"from_json(json_file_or_str: Union[str, Path]) -> Context\n
Creates Context object from a given json file
Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be stored in the Context object, which is not possible with the standard json library.
Why jsonpickle? (from https://jsonpickle.github.io/)
Data serialized with python\u2019s pickle (or cPickle or dill) is not easily readable outside of python. Using the json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports json.
Security (from https://jsonpickle.github.io/)
jsonpickle should be treated the same as the Python stdlib pickle module from a security perspective.
Parameters:
Name Type Description Default json_file_or_str
Union[str, Path]
Pathlike string or Path that points to the json file or string containing json
required Returns:
Type Description Context
Source code in src/koheesio/context.py
@classmethod\ndef from_json(cls, json_file_or_str: Union[str, Path]) -> Context:\n \"\"\"Creates Context object from a given json file\n\n Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be\n stored in the Context object, which is not possible with the standard json library.\n\n Why jsonpickle?\n ---------------\n (from https://jsonpickle.github.io/)\n\n > Data serialized with python\u2019s pickle (or cPickle or dill) is not easily readable outside of python. Using the\n json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex\n data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports\n json.\n\n Security\n --------\n (from https://jsonpickle.github.io/)\n\n > jsonpickle should be treated the same as the Python stdlib pickle module from a security perspective.\n\n ### ! Warning !\n > The jsonpickle module is not secure. Only unpickle data you trust.\n It is possible to construct malicious pickle data which will execute arbitrary code during unpickling.\n Never unpickle data that could have come from an untrusted source, or that could have been tampered with.\n Consider signing data with an HMAC if you need to ensure that it has not been tampered with.\n Safer deserialization approaches, such as reading JSON directly, may be more appropriate if you are processing\n untrusted data.\n\n Parameters\n ----------\n json_file_or_str : Union[str, Path]\n Pathlike string or Path that points to the json file or string containing json\n\n Returns\n -------\n Context\n \"\"\"\n json_str = json_file_or_str\n\n # check if json_str is pathlike\n if (json_file := Path(json_file_or_str)).exists():\n json_str = json_file.read_text(encoding=\"utf-8\")\n\n json_dict = jsonpickle.loads(json_str)\n return cls.from_dict(json_dict)\n
"},{"location":"api_reference/index.html#koheesio.Context.from_json--warning","title":"! Warning !","text":"The jsonpickle module is not secure. Only unpickle data you trust. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with. Consider signing data with an HMAC if you need to ensure that it has not been tampered with. Safer deserialization approaches, such as reading JSON directly, may be more appropriate if you are processing untrusted data.
"},{"location":"api_reference/index.html#koheesio.Context.from_toml","title":"from_toml classmethod
","text":"from_toml(toml_file_or_str: Union[str, Path]) -> Context\n
Creates Context object from a given toml file
Parameters:
Name Type Description Default toml_file_or_str
Union[str, Path]
Pathlike string or Path that points to the toml file or string containing toml
required Returns:
Type Description Context
Source code in src/koheesio/context.py
@classmethod\ndef from_toml(cls, toml_file_or_str: Union[str, Path]) -> Context:\n \"\"\"Creates Context object from a given toml file\n\n Parameters\n ----------\n toml_file_or_str: Union[str, Path]\n Pathlike string or Path that points to the toml file or string containing toml\n\n Returns\n -------\n Context\n \"\"\"\n toml_str = toml_file_or_str\n\n # check if toml_str is pathlike\n if (toml_file := Path(toml_file_or_str)).exists():\n toml_str = toml_file.read_text(encoding=\"utf-8\")\n\n toml_dict = tomli.loads(toml_str)\n return cls.from_dict(toml_dict)\n
"},{"location":"api_reference/index.html#koheesio.Context.from_yaml","title":"from_yaml classmethod
","text":"from_yaml(yaml_file_or_str: str) -> Context\n
Creates Context object from a given yaml file
Parameters:
Name Type Description Default yaml_file_or_str
str
Pathlike string or Path that points to the yaml file, or string containing yaml
required Returns:
Type Description Context
Source code in src/koheesio/context.py
@classmethod\ndef from_yaml(cls, yaml_file_or_str: str) -> Context:\n \"\"\"Creates Context object from a given yaml file\n\n Parameters\n ----------\n yaml_file_or_str: str or Path\n Pathlike string or Path that points to the yaml file, or string containing yaml\n\n Returns\n -------\n Context\n \"\"\"\n yaml_str = yaml_file_or_str\n\n # check if yaml_str is pathlike\n if (yaml_file := Path(yaml_file_or_str)).exists():\n yaml_str = yaml_file.read_text(encoding=\"utf-8\")\n\n # Bandit: disable yaml.load warning\n yaml_dict = yaml.load(yaml_str, Loader=yaml.Loader) # nosec B506: yaml_load\n\n return cls.from_dict(yaml_dict)\n
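For instance, a yaml string (or a path to a yaml file) can be loaded and serialized back; a sketch with illustrative values only:
from koheesio import Context\n\nyaml_str = \"env: dev\\ndb:\\n  host: localhost\\n\"\nctx = Context.from_yaml(yaml_str)\nctx.get(\"db.host\")  # 'localhost'\nprint(ctx.to_yaml())  # serializes the Context back to a yaml string\n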
"},{"location":"api_reference/index.html#koheesio.Context.get","title":"get","text":"get(key: str, default: Any = None, safe: bool = True) -> Any\n
Get value of a given key
The key can either be an actual key (top level) or the key of a nested value. Behaves a lot like a dict's .get()
method otherwise.
Parameters:
Name Type Description Default key
str
Can be a real key, or can be a dotted notation of a nested key
required default
Any
Default value to return
None
safe
bool
Toggles whether to fail or not when item cannot be found
True
Returns:
Type Description Any
Value of the requested item
Example Example of a nested call:
context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\ncontext.get(\"a.b\")\n
Returns c
Source code in src/koheesio/context.py
def get(self, key: str, default: Any = None, safe: bool = True) -> Any:\n \"\"\"Get value of a given key\n\n The key can either be an actual key (top level) or the key of a nested value.\n Behaves a lot like a dict's `.get()` method otherwise.\n\n Parameters\n ----------\n key:\n Can be a real key, or can be a dotted notation of a nested key\n default:\n Default value to return\n safe:\n Toggles whether to fail or not when item cannot be found\n\n Returns\n -------\n Any\n Value of the requested item\n\n Example\n -------\n Example of a nested call:\n\n ```python\n context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\n context.get(\"a.b\")\n ```\n\n Returns `c`\n \"\"\"\n try:\n if \".\" not in key:\n return self.__dict__[key]\n\n # handle nested keys\n nested_keys = key.split(\".\")\n value = self # parent object\n for k in nested_keys:\n value = value[k] # iterate through nested values\n return value\n\n except (AttributeError, KeyError, TypeError) as e:\n if not safe:\n raise KeyError(f\"requested key '{key}' does not exist in {self}\") from e\n return default\n
"},{"location":"api_reference/index.html#koheesio.Context.get_all","title":"get_all","text":"get_all() -> dict\n
alias to to_dict()
Source code in src/koheesio/context.py
def get_all(self) -> dict:\n \"\"\"alias to to_dict()\"\"\"\n return self.to_dict()\n
"},{"location":"api_reference/index.html#koheesio.Context.get_item","title":"get_item","text":"get_item(key: str, default: Any = None, safe: bool = True) -> Dict[str, Any]\n
Acts just like .get
, except that it returns the key also
Returns:
Type Description Dict[str, Any]
key/value-pair of the requested item
Example Example of a nested call:
context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\ncontext.get_item(\"a.b\")\n
Returns {'a.b': 'c'}
Source code in src/koheesio/context.py
def get_item(self, key: str, default: Any = None, safe: bool = True) -> Dict[str, Any]:\n \"\"\"Acts just like `.get`, except that it returns the key also\n\n Returns\n -------\n Dict[str, Any]\n key/value-pair of the requested item\n\n Example\n -------\n Example of a nested call:\n\n ```python\n context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\n context.get_item(\"a.b\")\n ```\n\n Returns `{'a.b': 'c'}`\n \"\"\"\n value = self.get(key, default, safe)\n return {key: value}\n
"},{"location":"api_reference/index.html#koheesio.Context.merge","title":"merge","text":"merge(context: Context, recursive: bool = False) -> Context\n
Merge this context with the context of another, where the incoming context has priority.
Parameters:
Name Type Description Default context
Context
Another Context class
required recursive
bool
Recursively merge two dictionaries to an arbitrary depth
False
Returns:
Type Description Context
updated context
Source code in src/koheesio/context.py
def merge(self, context: Context, recursive: bool = False) -> Context:\n \"\"\"Merge this context with the context of another, where the incoming context has priority.\n\n Parameters\n ----------\n context: Context\n Another Context class\n recursive: bool\n Recursively merge two dictionaries to an arbitrary depth\n\n Returns\n -------\n Context\n updated context\n \"\"\"\n if recursive:\n return Context.from_dict(self._recursive_merge(target_context=self, merge_context=context).to_dict())\n\n # just merge on the top level keys\n return Context.from_dict({**self.to_dict(), **context.to_dict()})\n
"},{"location":"api_reference/index.html#koheesio.Context.process_value","title":"process_value","text":"process_value(value: Any) -> Any\n
Processes the given value, converting dictionaries to Context objects as needed.
Source code in src/koheesio/context.py
def process_value(self, value: Any) -> Any:\n \"\"\"Processes the given value, converting dictionaries to Context objects as needed.\"\"\"\n if isinstance(value, dict):\n return self.from_dict(value)\n\n if isinstance(value, (list, set)):\n return [self.from_dict(v) if isinstance(v, dict) else v for v in value]\n\n return value\n
"},{"location":"api_reference/index.html#koheesio.Context.to_dict","title":"to_dict","text":"to_dict() -> Dict[str, Any]\n
Returns all parameters of the context as a dict
Returns:
Type Description dict
containing all parameters of the context
Source code in src/koheesio/context.py
def to_dict(self) -> Dict[str, Any]:\n \"\"\"Returns all parameters of the context as a dict\n\n Returns\n -------\n dict\n containing all parameters of the context\n \"\"\"\n result = {}\n\n for key, value in self.__dict__.items():\n if isinstance(value, Context):\n result[key] = value.to_dict()\n elif isinstance(value, list):\n result[key] = [e.to_dict() if isinstance(e, Context) else e for e in value]\n else:\n result[key] = value\n\n return result\n
"},{"location":"api_reference/index.html#koheesio.Context.to_json","title":"to_json","text":"to_json(pretty: bool = False) -> str\n
Returns all parameters of the context as a json string
Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be stored in the Context object, which is not possible with the standard json library.
Why jsonpickle? (from https://jsonpickle.github.io/)
Data serialized with python's pickle (or cPickle or dill) is not easily readable outside of python. Using the json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports json.
Parameters:
Name Type Description Default pretty
bool
Toggles whether to return a pretty json string or not
False
Returns:
Type Description str
containing all parameters of the context
Source code in src/koheesio/context.py
def to_json(self, pretty: bool = False) -> str:\n \"\"\"Returns all parameters of the context as a json string\n\n Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be\n stored in the Context object, which is not possible with the standard json library.\n\n Why jsonpickle?\n ---------------\n (from https://jsonpickle.github.io/)\n\n > Data serialized with python's pickle (or cPickle or dill) is not easily readable outside of python. Using the\n json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex\n data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports\n json.\n\n Parameters\n ----------\n pretty : bool, optional, default=False\n Toggles whether to return a pretty json string or not\n\n Returns\n -------\n str\n containing all parameters of the context\n \"\"\"\n d = self.to_dict()\n return jsonpickle.dumps(d, indent=4) if pretty else jsonpickle.dumps(d)\n
"},{"location":"api_reference/index.html#koheesio.Context.to_yaml","title":"to_yaml","text":"to_yaml(clean: bool = False) -> str\n
Returns all parameters of the context as a yaml string
Parameters:
Name Type Description Default clean
bool
Toggles whether to remove !!python/object:...
from yaml or not. Default: False
False
Returns:
Type Description str
containing all parameters of the context
Source code in src/koheesio/context.py
def to_yaml(self, clean: bool = False) -> str:\n \"\"\"Returns all parameters of the context as a yaml string\n\n Parameters\n ----------\n clean: bool\n Toggles whether to remove `!!python/object:...` from yaml or not.\n Default: False\n\n Returns\n -------\n str\n containing all parameters of the context\n \"\"\"\n # sort_keys=False to preserve order of keys\n yaml_str = yaml.dump(self.to_dict(), sort_keys=False)\n\n # remove `!!python/object:...` from yaml\n if clean:\n remove_pattern = re.compile(r\"!!python/object:.*?\\n\")\n yaml_str = re.sub(remove_pattern, \"\\n\", yaml_str)\n\n return yaml_str\n
"},{"location":"api_reference/index.html#koheesio.ExtraParamsMixin","title":"koheesio.ExtraParamsMixin","text":"Mixin class that adds support for arbitrary keyword arguments to Pydantic models.
The keyword arguments are extracted from the model's values
and moved to a params
dictionary.
"},{"location":"api_reference/index.html#koheesio.ExtraParamsMixin.extra_params","title":"extra_params cached
property
","text":"extra_params: Dict[str, Any]\n
Extract params (passed as arbitrary kwargs) from values and move them to params dict
"},{"location":"api_reference/index.html#koheesio.ExtraParamsMixin.params","title":"params class-attribute
instance-attribute
","text":"params: Dict[str, Any] = Field(default_factory=dict)\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory","title":"koheesio.LoggingFactory","text":"LoggingFactory(name: Optional[str] = None, env: Optional[str] = None, level: Optional[str] = None, logger_id: Optional[str] = None)\n
Logging factory to be used to generate logger instances.
Parameters:
Name Type Description Default name
Optional[str]
None
env
Optional[str]
None
logger_id
Optional[str]
None
Source code in src/koheesio/logger.py
def __init__(\n self,\n name: Optional[str] = None,\n env: Optional[str] = None,\n level: Optional[str] = None,\n logger_id: Optional[str] = None,\n):\n \"\"\"Logging factory to be used in pipeline.Prepare logger instance.\n\n Parameters\n ----------\n name logger name.\n env environment (\"local\", \"qa\", \"prod).\n logger_id unique identifier for the logger.\n \"\"\"\n\n LoggingFactory.LOGGER_NAME = name or LoggingFactory.LOGGER_NAME\n LoggerIDFilter.LOGGER_ID = logger_id or LoggerIDFilter.LOGGER_ID\n LoggingFactory.LOGGER_FILTER = LoggingFactory.LOGGER_FILTER or LoggerIDFilter()\n LoggingFactory.ENV = env or LoggingFactory.ENV\n\n console_handler = logging.StreamHandler(sys.stdout if LoggingFactory.ENV == \"local\" else sys.stderr)\n console_handler.setFormatter(LoggingFactory.LOGGER_FORMATTER)\n console_handler.addFilter(LoggingFactory.LOGGER_FILTER)\n # WARNING is default level for root logger in python\n logging.basicConfig(level=logging.WARNING, handlers=[console_handler], force=True)\n\n LoggingFactory.CONSOLE_HANDLER = console_handler\n\n logger = getLogger(LoggingFactory.LOGGER_NAME)\n logger.setLevel(level or LoggingFactory.LOGGER_LEVEL)\n LoggingFactory.LOGGER = logger\n
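A usage sketch (the module path follows the source reference above; the name and level shown are placeholders):
from koheesio.logger import LoggingFactory\n\nLoggingFactory(name=\"my_pipeline\", env=\"local\", level=\"INFO\")\nlogger = LoggingFactory.get_logger(\"ingest\", inherit_from_koheesio=True)\nlogger.info(\"Logger configured via LoggingFactory\")\n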
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.CONSOLE_HANDLER","title":"CONSOLE_HANDLER class-attribute
instance-attribute
","text":"CONSOLE_HANDLER: Optional[Handler] = None\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.ENV","title":"ENV class-attribute
instance-attribute
","text":"ENV: Optional[str] = None\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER","title":"LOGGER class-attribute
instance-attribute
","text":"LOGGER: Optional[Logger] = None\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_ENV","title":"LOGGER_ENV class-attribute
instance-attribute
","text":"LOGGER_ENV: str = 'local'\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_FILTER","title":"LOGGER_FILTER class-attribute
instance-attribute
","text":"LOGGER_FILTER: Optional[Filter] = None\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_FORMAT","title":"LOGGER_FORMAT class-attribute
instance-attribute
","text":"LOGGER_FORMAT: str = '[%(logger_id)s] [%(asctime)s] [%(levelname)s] [%(name)s] {%(module)s.py:%(funcName)s:%(lineno)d} - %(message)s'\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_FORMATTER","title":"LOGGER_FORMATTER class-attribute
instance-attribute
","text":"LOGGER_FORMATTER: Formatter = Formatter(LOGGER_FORMAT)\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_LEVEL","title":"LOGGER_LEVEL class-attribute
instance-attribute
","text":"LOGGER_LEVEL: str = get('KOHEESIO_LOGGING_LEVEL', 'WARNING')\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_NAME","title":"LOGGER_NAME class-attribute
instance-attribute
","text":"LOGGER_NAME: str = 'koheesio'\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.add_handlers","title":"add_handlers staticmethod
","text":"add_handlers(handlers: List[Tuple[str, Dict]]) -> None\n
Add handlers to existing root logger.
Parameters:
Name Type Description Default handler_class
required handlers_config
required Source code in src/koheesio/logger.py
@staticmethod\ndef add_handlers(handlers: List[Tuple[str, Dict]]) -> None:\n \"\"\"Add handlers to existing root logger.\n\n Parameters\n ----------\n handler_class handler module and class for importing.\n handlers_config configuration for handler.\n\n \"\"\"\n for handler_module_class, handler_conf in handlers:\n handler_class: logging.Handler = import_class(handler_module_class)\n handler_level = handler_conf.pop(\"level\") if \"level\" in handler_conf else \"WARNING\"\n # noinspection PyCallingNonCallable\n handler = handler_class(**handler_conf)\n handler.setLevel(handler_level)\n handler.addFilter(LoggingFactory.LOGGER_FILTER)\n handler.setFormatter(LoggingFactory.LOGGER_FORMATTER)\n LoggingFactory.LOGGER.addHandler(handler)\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.get_logger","title":"get_logger staticmethod
","text":"get_logger(name: str, inherit_from_koheesio: bool = False) -> Logger\n
Provide logger. If inherit_from_koheesio then inherit from LoggingFactory.PIPELINE_LOGGER_NAME.
Parameters:
Name Type Description Default name
str
required inherit_from_koheesio
bool
False
Returns:
Name Type Description logger
Logger
Source code in src/koheesio/logger.py
@staticmethod\ndef get_logger(name: str, inherit_from_koheesio: bool = False) -> Logger:\n \"\"\"Provide logger. If inherit_from_koheesio then inherit from LoggingFactory.PIPELINE_LOGGER_NAME.\n\n Parameters\n ----------\n name: Name of logger.\n inherit_from_koheesio: Inherit logger from koheesio\n\n Returns\n -------\n logger: Logger\n\n \"\"\"\n if inherit_from_koheesio:\n LoggingFactory.__check_koheesio_logger_initialized()\n name = f\"{LoggingFactory.LOGGER_NAME}.{name}\"\n\n return getLogger(name)\n
"},{"location":"api_reference/index.html#koheesio.Step","title":"koheesio.Step","text":"Base class for a step
A custom unit of logic that can be executed.
The Step class is designed to be subclassed. To create a new step, one would subclass Step and implement the def execute(self) method, specifying the expected inputs and outputs.
Note: since the Step class is metaclassed, the execute method is wrapped with the do_execute function, making it always return the Step's output. Hence, an explicit return is not needed when implementing execute.
Methods and Attributes The Step class has several attributes and methods.
Background A Step is an atomic operation and serves as the building block of data pipelines built with the framework. Tasks typically consist of a series of Steps.
A step can be seen as an operation on a set of inputs that returns a set of outputs. This does not, however, imply that steps are stateless (e.g. data writes)!
The diagram serves to illustrate the concept of a Step:
\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 1 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 1 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u2502 \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 2 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Step \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 2 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u2502 \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 3 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 3 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n
Steps are built on top of Pydantic, a data validation and settings management library that uses Python type annotations. This allows for the automatic validation of the inputs and outputs of a Step.
- Step inherits from BaseModel, which is a Pydantic class used to define data models. This allows Step to automatically validate data against the defined fields and their types.
- Step is metaclassed by StepMetaClass, which is a custom metaclass that wraps the
execute
method of the Step class with the _execute_wrapper
function. This ensures that the execute
method always returns the output of the Step along with providing logging and validation of the output. - Step has an
Output
class, which is a subclass of StepOutput
. This class is used to validate the output of the Step. The Output
class is defined as an inner class of the Step class. The Output
class can be accessed through the Step.Output
attribute. - The
Output
class can be extended to add additional fields to the output of the Step.
Examples:
class MyStep(Step):\n a: str # input\n\n class Output(StepOutput): # output\n b: str\n\n def execute(self) -> MyStep.Output:\n self.output.b = f\"{self.a}-some-suffix\"\n
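For illustration, a step defined this way could be used as follows (a minimal sketch; per the notes above, execute returns the validated Output, which is also reachable lazily via step.output):
step = MyStep(a=\"foo\")\nresult = step.execute() # the metaclass wrapper makes execute return MyStep.Output\nprint(result.b) # foo-some-suffix\nprint(step.output.b) # the same output is also available through the lazy output attribute\n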
"},{"location":"api_reference/index.html#koheesio.Step--input","title":"INPUT","text":"The following fields are available by default on the Step class: - name
: Name of the Step. If not set, the name of the class will be used. - description
: Description of the Step. If not set, the docstring of the class will be used. If the docstring contains multiple lines, only the first line will be used.
When subclassing a Step, any additional pydantic field will be treated as input
to the Step. See also the explanation on the .execute()
method below.
"},{"location":"api_reference/index.html#koheesio.Step--output","title":"OUTPUT","text":"Every Step has an Output
class, which is a subclass of StepOutput
. This class is used to validate the output of the Step. The Output
class is defined as an inner class of the Step class. The Output
class can be accessed through the Step.Output
attribute. The Output
class can be extended to add additional fields to the output of the Step. See also the explanation on the .execute()
.
Output
: A nested class representing the output of the Step used to validate the output of the Step and based on the StepOutput class. output
: Allows you to interact with the Output of the Step lazily (see above and StepOutput)
When subclassing a Step, any additional pydantic field added to the nested Output
class will be treated as output
of the Step. See also the description of StepOutput
for more information.
"},{"location":"api_reference/index.html#koheesio.Step--methods","title":"Methods:","text":" execute
: Abstract method to implement for new steps. - The Inputs of the step can be accessed using
self.input_name
. - The output of the step can be accessed using
self.output.output_name
.
run
: Alias to .execute() method. You can use this to run the step, but execute is preferred. to_yaml
: YAML dump the step get_description
: Get the description of the Step
When subclassing a Step, execute
is the only method that needs to be implemented. Any additional method added to the class will be treated as a method of the Step.
Note: since the Step class is meta-classed, the execute method is automatically wrapped with the do_execute
function, making it always return a StepOutput. See also the explanation on the do_execute
function.
"},{"location":"api_reference/index.html#koheesio.Step--class-methods","title":"class methods:","text":" from_step
: Returns a new Step instance based on the data of another Step instance. for example: MyStep.from_step(other_step, a=\"foo\")
get_description
: Get the description of the Step
"},{"location":"api_reference/index.html#koheesio.Step--dunder-methods","title":"dunder methods:","text":" __getattr__
: Allows input to be accessed through self.input_name
__repr__
and __str__
: String representation of a step
"},{"location":"api_reference/index.html#koheesio.Step.output","title":"output property
writable
","text":"output: Output\n
Interact with the output of the Step
"},{"location":"api_reference/index.html#koheesio.Step.Output","title":"Output","text":"Output class for Step
"},{"location":"api_reference/index.html#koheesio.Step.execute","title":"execute abstractmethod
","text":"execute()\n
Abstract method to implement for new steps.
The Inputs of the step can be accessed using self.input_name
Note: since the Step class is meta-classed, the execute method is wrapped with the do_execute
function, making it always return the Step's output
Source code in src/koheesio/steps/__init__.py
@abstractmethod\ndef execute(self):\n \"\"\"Abstract method to implement for new steps.\n\n The Inputs of the step can be accessed using `self.input_name`\n\n Note: since the Step class is meta-classed, the execute method is wrapped with the `do_execute` function, making\n it always return the Step's output\n \"\"\"\n raise NotImplementedError\n
"},{"location":"api_reference/index.html#koheesio.Step.from_step","title":"from_step classmethod
","text":"from_step(step: Step, **kwargs)\n
Returns a new Step instance based on the data of another Step or BaseModel instance
Source code in src/koheesio/steps/__init__.py
@classmethod\ndef from_step(cls, step: Step, **kwargs):\n \"\"\"Returns a new Step instance based on the data of another Step or BaseModel instance\"\"\"\n return cls.from_basemodel(step, **kwargs)\n
"},{"location":"api_reference/index.html#koheesio.Step.repr_json","title":"repr_json","text":"repr_json(simple=False) -> str\n
dump the step to json, meant for representation
Note: use to_json if you want to dump the step to json for serialization. This method is meant for representation purposes only!
Examples:
>>> step = MyStep(a=\"foo\")\n>>> print(step.repr_json())\n{\"input\": {\"a\": \"foo\"}}\n
Parameters:
Name Type Description Default simple
When toggled to True, a briefer output will be produced. This is friendlier for logging purposes
False
Returns:
Type Description str
A string, which is valid json
Source code in src/koheesio/steps/__init__.py
def repr_json(self, simple=False) -> str:\n \"\"\"dump the step to json, meant for representation\n\n Note: use to_json if you want to dump the step to json for serialization\n This method is meant for representation purposes only!\n\n Examples\n --------\n ```python\n >>> step = MyStep(a=\"foo\")\n >>> print(step.repr_json())\n {\"input\": {\"a\": \"foo\"}}\n ```\n\n Parameters\n ----------\n simple: bool\n When toggled to True, a briefer output will be produced. This is friendlier for logging purposes\n\n Returns\n -------\n str\n A string, which is valid json\n \"\"\"\n model_dump_options = dict(warnings=\"none\", exclude_unset=True)\n\n _result = {}\n\n # extract input\n _input = self.model_dump(**model_dump_options)\n\n # remove name and description from input and add to result if simple is not set\n name = _input.pop(\"name\", None)\n description = _input.pop(\"description\", None)\n if not simple:\n if name:\n _result[\"name\"] = name\n if description:\n _result[\"description\"] = description\n else:\n model_dump_options[\"exclude\"] = {\"name\", \"description\"}\n\n # extract output\n _output = self.output.model_dump(**model_dump_options)\n\n # add output to result\n if _output:\n _result[\"output\"] = _output\n\n # add input to result\n _result[\"input\"] = _input\n\n class MyEncoder(json.JSONEncoder):\n \"\"\"Custom JSON Encoder to handle non-serializable types\"\"\"\n\n def default(self, o: Any) -> Any:\n try:\n return super().default(o)\n except TypeError:\n return o.__class__.__name__\n\n # Use MyEncoder when converting the dictionary to a JSON string\n json_str = json.dumps(_result, cls=MyEncoder)\n\n return json_str\n
"},{"location":"api_reference/index.html#koheesio.Step.repr_yaml","title":"repr_yaml","text":"repr_yaml(simple=False) -> str\n
dump the step to yaml, meant for representation
Note: use to_yaml if you want to dump the step to yaml for serialization. This method is meant for representation purposes only!
Examples:
>>> step = MyStep(a=\"foo\")\n>>> print(step.repr_yaml())\ninput:\n a: foo\n
Parameters:
Name Type Description Default simple
When toggled to True, a briefer output will be produced. This is friendlier for logging purposes
False
Returns:
Type Description str
A string, which is valid yaml
Source code in src/koheesio/steps/__init__.py
def repr_yaml(self, simple=False) -> str:\n \"\"\"dump the step to yaml, meant for representation\n\n Note: use to_yaml if you want to dump the step to yaml for serialization\n This method is meant for representation purposes only!\n\n Examples\n --------\n ```python\n >>> step = MyStep(a=\"foo\")\n >>> print(step.repr_yaml())\n input:\n a: foo\n ```\n\n Parameters\n ----------\n simple: bool\n When toggled to True, a briefer output will be produced. This is friendlier for logging purposes\n\n Returns\n -------\n str\n A string, which is valid yaml\n \"\"\"\n json_str = self.repr_json(simple=simple)\n\n # Parse the JSON string back into a dictionary\n _result = json.loads(json_str)\n\n return yaml.dump(_result)\n
"},{"location":"api_reference/index.html#koheesio.Step.run","title":"run","text":"run()\n
Alias to .execute()
Source code in src/koheesio/steps/__init__.py
def run(self):\n \"\"\"Alias to .execute()\"\"\"\n return self.execute()\n
"},{"location":"api_reference/index.html#koheesio.StepOutput","title":"koheesio.StepOutput","text":"Class for the StepOutput model
Usage Setting up the StepOutput class is done like this:
class YourOwnOutput(StepOutput):\n a: str\n b: int\n
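A brief, illustrative sketch of using such an output class; the field values behave like regular Pydantic attributes and validate_output() runs the underlying model validation:
out = YourOwnOutput(a=\"foo\", b=42)\nout.b = 43 # fields can be set like regular attributes\nvalidated = out.validate_output() # wraps BaseModel validation and returns a StepOutput\n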
"},{"location":"api_reference/index.html#koheesio.StepOutput.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(validate_default=False, defer_build=True)\n
"},{"location":"api_reference/index.html#koheesio.StepOutput.validate_output","title":"validate_output","text":"validate_output() -> StepOutput\n
Validate the output of the Step
Essentially, this method is a wrapper around the validate method of the BaseModel class
Source code in src/koheesio/steps/__init__.py
def validate_output(self) -> StepOutput:\n \"\"\"Validate the output of the Step\n\n Essentially, this method is a wrapper around the validate method of the BaseModel class\n \"\"\"\n validated_model = self.validate()\n return StepOutput.from_basemodel(validated_model)\n
"},{"location":"api_reference/index.html#koheesio.print_logo","title":"koheesio.print_logo","text":"print_logo()\n
Source code in src/koheesio/__init__.py
def print_logo():\n global _logo_printed\n global _koheesio_print_logo\n\n if not _logo_printed and _koheesio_print_logo:\n print(ABOUT)\n _logo_printed = True\n
"},{"location":"api_reference/context.html","title":"Context","text":"The Context module is a part of the Koheesio framework and is primarily used for managing the environment configuration where a Task or Step runs. It helps in adapting the behavior of a Task/Step based on the environment it operates in, thereby avoiding the repetition of configuration values across different tasks.
The Context class, which is a key component of this module, functions similarly to a dictionary but with additional features. It supports operations like handling nested keys, recursive merging of contexts, and serialization/deserialization to and from various formats like JSON, YAML, and TOML.
For a comprehensive guide on the usage, examples, and additional features of the Context class, please refer to the reference/concepts/context section of the Koheesio documentation.
"},{"location":"api_reference/context.html#koheesio.context.Context","title":"koheesio.context.Context","text":"Context(*args, **kwargs)\n
The Context class is a key component of the Koheesio framework, designed to manage configuration data and shared variables across tasks and steps in your application. It behaves much like a dictionary, but with added functionalities.
Key Features - Nested keys: Supports accessing and adding nested keys similar to dictionary keys.
- Recursive merging: Merges two Contexts together, with the incoming Context having priority.
- Serialization/Deserialization: Easily created from a yaml, toml, or json file, or a dictionary, and can be converted back to a dictionary.
- Handling complex Python objects: Uses jsonpickle for serialization and deserialization of complex Python objects to and from JSON.
For a comprehensive guide on the usage, examples, and additional features of the Context class, please refer to the reference/concepts/context section of the Koheesio documentation.
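To illustrate the features listed above, a minimal sketch of creating a Context, adding values, and reading nested keys:
context = Context({\"env\": \"dev\", \"db\": {\"host\": \"localhost\", \"port\": 5432}})\ncontext.add(\"retries\", 3) # add a key/value pair\nprint(context.get(\"db.host\")) # nested keys use dotted notation -> localhost\nprint(context.to_dict()) # convert the Context (and nested Contexts) back to a plain dict\n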
Methods:
Name Description add
Add a key/value pair to the context.
get
Get value of a given key.
get_item
Acts just like .get
, except that it returns the key also.
contains
Check if the context contains a given key.
merge
Merge this context with the context of another, where the incoming context has priority.
to_dict
Returns all parameters of the context as a dict.
from_dict
Creates Context object from the given dict.
from_yaml
Creates Context object from a given yaml file.
from_json
Creates Context object from a given json file.
Dunder methods - __iter__()
: Allows for iteration across a Context. __len__()
: Returns the length of the Context. __getitem__(item)
: Makes class subscriptable.
Inherited from Mapping items()
: Returns all items of the Context. keys()
: Returns all keys of the Context. values()
: Returns all values of the Context.
Source code in src/koheesio/context.py
def __init__(self, *args, **kwargs):\n \"\"\"Initializes the Context object with given arguments.\"\"\"\n for arg in args:\n if isinstance(arg, dict):\n kwargs.update(arg)\n if isinstance(arg, Context):\n kwargs = kwargs.update(arg.to_dict())\n\n for key, value in kwargs.items():\n self.__dict__[key] = self.process_value(value)\n
"},{"location":"api_reference/context.html#koheesio.context.Context.add","title":"add","text":"add(key: str, value: Any) -> Context\n
Add a key/value pair to the context
Source code in src/koheesio/context.py
def add(self, key: str, value: Any) -> Context:\n \"\"\"Add a key/value pair to the context\"\"\"\n self.__dict__[key] = value\n return self\n
"},{"location":"api_reference/context.html#koheesio.context.Context.contains","title":"contains","text":"contains(key: str) -> bool\n
Check if the context contains a given key
Parameters:
Name Type Description Default key
str
required Returns:
Type Description bool
Source code in src/koheesio/context.py
def contains(self, key: str) -> bool:\n \"\"\"Check if the context contains a given key\n\n Parameters\n ----------\n key: str\n\n Returns\n -------\n bool\n \"\"\"\n try:\n self.get(key, safe=False)\n return True\n except KeyError:\n return False\n
"},{"location":"api_reference/context.html#koheesio.context.Context.from_dict","title":"from_dict classmethod
","text":"from_dict(kwargs: dict) -> Context\n
Creates Context object from the given dict
Parameters:
Name Type Description Default kwargs
dict
required Returns:
Type Description Context
Source code in src/koheesio/context.py
@classmethod\ndef from_dict(cls, kwargs: dict) -> Context:\n \"\"\"Creates Context object from the given dict\n\n Parameters\n ----------\n kwargs: dict\n\n Returns\n -------\n Context\n \"\"\"\n return cls(kwargs)\n
"},{"location":"api_reference/context.html#koheesio.context.Context.from_json","title":"from_json classmethod
","text":"from_json(json_file_or_str: Union[str, Path]) -> Context\n
Creates Context object from a given json file
Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be stored in the Context object, which is not possible with the standard json library.
Why jsonpickle? (from https://jsonpickle.github.io/)
Data serialized with python\u2019s pickle (or cPickle or dill) is not easily readable outside of python. Using the json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports json.
Security (from https://jsonpickle.github.io/)
jsonpickle should be treated the same as the Python stdlib pickle module from a security perspective.
Parameters:
Name Type Description Default json_file_or_str
Union[str, Path]
Pathlike string or Path that points to the json file or string containing json
required Returns:
Type Description Context
Source code in src/koheesio/context.py
@classmethod\ndef from_json(cls, json_file_or_str: Union[str, Path]) -> Context:\n \"\"\"Creates Context object from a given json file\n\n Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be\n stored in the Context object, which is not possible with the standard json library.\n\n Why jsonpickle?\n ---------------\n (from https://jsonpickle.github.io/)\n\n > Data serialized with python\u2019s pickle (or cPickle or dill) is not easily readable outside of python. Using the\n json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex\n data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports\n json.\n\n Security\n --------\n (from https://jsonpickle.github.io/)\n\n > jsonpickle should be treated the same as the Python stdlib pickle module from a security perspective.\n\n ### ! Warning !\n > The jsonpickle module is not secure. Only unpickle data you trust.\n It is possible to construct malicious pickle data which will execute arbitrary code during unpickling.\n Never unpickle data that could have come from an untrusted source, or that could have been tampered with.\n Consider signing data with an HMAC if you need to ensure that it has not been tampered with.\n Safer deserialization approaches, such as reading JSON directly, may be more appropriate if you are processing\n untrusted data.\n\n Parameters\n ----------\n json_file_or_str : Union[str, Path]\n Pathlike string or Path that points to the json file or string containing json\n\n Returns\n -------\n Context\n \"\"\"\n json_str = json_file_or_str\n\n # check if json_str is pathlike\n if (json_file := Path(json_file_or_str)).exists():\n json_str = json_file.read_text(encoding=\"utf-8\")\n\n json_dict = jsonpickle.loads(json_str)\n return cls.from_dict(json_dict)\n
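As an illustration, from_json accepts either a path to a json file or a json string directly:
context = Context.from_json('{\"a\": {\"b\": \"c\"}}')\nprint(context.get(\"a.b\")) # c\n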
"},{"location":"api_reference/context.html#koheesio.context.Context.from_json--warning","title":"! Warning !","text":"The jsonpickle module is not secure. Only unpickle data you trust. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with. Consider signing data with an HMAC if you need to ensure that it has not been tampered with. Safer deserialization approaches, such as reading JSON directly, may be more appropriate if you are processing untrusted data.
"},{"location":"api_reference/context.html#koheesio.context.Context.from_toml","title":"from_toml classmethod
","text":"from_toml(toml_file_or_str: Union[str, Path]) -> Context\n
Creates Context object from a given toml file
Parameters:
Name Type Description Default toml_file_or_str
Union[str, Path]
Pathlike string or Path that points to the toml file or string containing toml
required Returns:
Type Description Context
Source code in src/koheesio/context.py
@classmethod\ndef from_toml(cls, toml_file_or_str: Union[str, Path]) -> Context:\n \"\"\"Creates Context object from a given toml file\n\n Parameters\n ----------\n toml_file_or_str: Union[str, Path]\n Pathlike string or Path that points to the toml file or string containing toml\n\n Returns\n -------\n Context\n \"\"\"\n toml_str = toml_file_or_str\n\n # check if toml_str is pathlike\n if (toml_file := Path(toml_file_or_str)).exists():\n toml_str = toml_file.read_text(encoding=\"utf-8\")\n\n toml_dict = tomli.loads(toml_str)\n return cls.from_dict(toml_dict)\n
"},{"location":"api_reference/context.html#koheesio.context.Context.from_yaml","title":"from_yaml classmethod
","text":"from_yaml(yaml_file_or_str: str) -> Context\n
Creates Context object from a given yaml file
Parameters:
Name Type Description Default yaml_file_or_str
str
Pathlike string or Path that points to the yaml file, or string containing yaml
required Returns:
Type Description Context
Source code in src/koheesio/context.py
@classmethod\ndef from_yaml(cls, yaml_file_or_str: str) -> Context:\n \"\"\"Creates Context object from a given yaml file\n\n Parameters\n ----------\n yaml_file_or_str: str or Path\n Pathlike string or Path that points to the yaml file, or string containing yaml\n\n Returns\n -------\n Context\n \"\"\"\n yaml_str = yaml_file_or_str\n\n # check if yaml_str is pathlike\n if (yaml_file := Path(yaml_file_or_str)).exists():\n yaml_str = yaml_file.read_text(encoding=\"utf-8\")\n\n # Bandit: disable yaml.load warning\n yaml_dict = yaml.load(yaml_str, Loader=yaml.Loader) # nosec B506: yaml_load\n\n return cls.from_dict(yaml_dict)\n
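As an illustration, from_yaml likewise accepts a path or a yaml string:
yaml_str = \"\"\"\ndb:\n  host: localhost\n  port: 5432\n\"\"\"\ncontext = Context.from_yaml(yaml_str)\nprint(context.get(\"db.port\")) # 5432\n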
"},{"location":"api_reference/context.html#koheesio.context.Context.get","title":"get","text":"get(key: str, default: Any = None, safe: bool = True) -> Any\n
Get value of a given key
The key can either be an actual key (top level) or the key of a nested value. Behaves a lot like a dict's .get()
method otherwise.
Parameters:
Name Type Description Default key
str
Can be a real key, or can be a dotted notation of a nested key
required default
Any
Default value to return
None
safe
bool
Toggles whether to fail or not when item cannot be found
True
Returns:
Type Description Any
Value of the requested item
Example Example of a nested call:
context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\ncontext.get(\"a.b\")\n
Returns c
Source code in src/koheesio/context.py
def get(self, key: str, default: Any = None, safe: bool = True) -> Any:\n \"\"\"Get value of a given key\n\n The key can either be an actual key (top level) or the key of a nested value.\n Behaves a lot like a dict's `.get()` method otherwise.\n\n Parameters\n ----------\n key:\n Can be a real key, or can be a dotted notation of a nested key\n default:\n Default value to return\n safe:\n Toggles whether to fail or not when item cannot be found\n\n Returns\n -------\n Any\n Value of the requested item\n\n Example\n -------\n Example of a nested call:\n\n ```python\n context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\n context.get(\"a.b\")\n ```\n\n Returns `c`\n \"\"\"\n try:\n if \".\" not in key:\n return self.__dict__[key]\n\n # handle nested keys\n nested_keys = key.split(\".\")\n value = self # parent object\n for k in nested_keys:\n value = value[k] # iterate through nested values\n return value\n\n except (AttributeError, KeyError, TypeError) as e:\n if not safe:\n raise KeyError(f\"requested key '{key}' does not exist in {self}\") from e\n return default\n
"},{"location":"api_reference/context.html#koheesio.context.Context.get_all","title":"get_all","text":"get_all() -> dict\n
alias to to_dict()
Source code in src/koheesio/context.py
def get_all(self) -> dict:\n \"\"\"alias to to_dict()\"\"\"\n return self.to_dict()\n
"},{"location":"api_reference/context.html#koheesio.context.Context.get_item","title":"get_item","text":"get_item(key: str, default: Any = None, safe: bool = True) -> Dict[str, Any]\n
Acts just like .get
, except that it returns the key also
Returns:
Type Description Dict[str, Any]
key/value-pair of the requested item
Example Example of a nested call:
context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\ncontext.get_item(\"a.b\")\n
Returns {'a.b': 'c'}
Source code in src/koheesio/context.py
def get_item(self, key: str, default: Any = None, safe: bool = True) -> Dict[str, Any]:\n \"\"\"Acts just like `.get`, except that it returns the key also\n\n Returns\n -------\n Dict[str, Any]\n key/value-pair of the requested item\n\n Example\n -------\n Example of a nested call:\n\n ```python\n context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\n context.get_item(\"a.b\")\n ```\n\n Returns `{'a.b': 'c'}`\n \"\"\"\n value = self.get(key, default, safe)\n return {key: value}\n
"},{"location":"api_reference/context.html#koheesio.context.Context.merge","title":"merge","text":"merge(context: Context, recursive: bool = False) -> Context\n
Merge this context with the context of another, where the incoming context has priority.
Parameters:
Name Type Description Default context
Context
Another Context class
required recursive
bool
Recursively merge two dictionaries to an arbitrary depth
False
Returns:
Type Description Context
updated context
Source code in src/koheesio/context.py
def merge(self, context: Context, recursive: bool = False) -> Context:\n \"\"\"Merge this context with the context of another, where the incoming context has priority.\n\n Parameters\n ----------\n context: Context\n Another Context class\n recursive: bool\n Recursively merge two dictionaries to an arbitrary depth\n\n Returns\n -------\n Context\n updated context\n \"\"\"\n if recursive:\n return Context.from_dict(self._recursive_merge(target_context=self, merge_context=context).to_dict())\n\n # just merge on the top level keys\n return Context.from_dict({**self.to_dict(), **context.to_dict()})\n
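A short sketch of merging two contexts, where the incoming context wins on conflicts and recursive=True also merges nested keys:
base = Context({\"db\": {\"host\": \"localhost\", \"port\": 5432}})\noverride = Context({\"db\": {\"host\": \"db.prod.internal\"}})\nmerged = base.merge(override, recursive=True)\nprint(merged.get(\"db.host\")) # db.prod.internal (incoming context has priority)\nprint(merged.get(\"db.port\")) # 5432 is kept because the merge is recursive\n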
"},{"location":"api_reference/context.html#koheesio.context.Context.process_value","title":"process_value","text":"process_value(value: Any) -> Any\n
Processes the given value, converting dictionaries to Context objects as needed.
Source code in src/koheesio/context.py
def process_value(self, value: Any) -> Any:\n \"\"\"Processes the given value, converting dictionaries to Context objects as needed.\"\"\"\n if isinstance(value, dict):\n return self.from_dict(value)\n\n if isinstance(value, (list, set)):\n return [self.from_dict(v) if isinstance(v, dict) else v for v in value]\n\n return value\n
"},{"location":"api_reference/context.html#koheesio.context.Context.to_dict","title":"to_dict","text":"to_dict() -> Dict[str, Any]\n
Returns all parameters of the context as a dict
Returns:
Type Description dict
containing all parameters of the context
Source code in src/koheesio/context.py
def to_dict(self) -> Dict[str, Any]:\n \"\"\"Returns all parameters of the context as a dict\n\n Returns\n -------\n dict\n containing all parameters of the context\n \"\"\"\n result = {}\n\n for key, value in self.__dict__.items():\n if isinstance(value, Context):\n result[key] = value.to_dict()\n elif isinstance(value, list):\n result[key] = [e.to_dict() if isinstance(e, Context) else e for e in value]\n else:\n result[key] = value\n\n return result\n
"},{"location":"api_reference/context.html#koheesio.context.Context.to_json","title":"to_json","text":"to_json(pretty: bool = False) -> str\n
Returns all parameters of the context as a json string
Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be stored in the Context object, which is not possible with the standard json library.
Why jsonpickle? (from https://jsonpickle.github.io/)
Data serialized with python's pickle (or cPickle or dill) is not easily readable outside of python. Using the json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports json.
Parameters:
Name Type Description Default pretty
bool
Toggles whether to return a pretty json string or not
False
Returns:
Type Description str
containing all parameters of the context
Source code in src/koheesio/context.py
def to_json(self, pretty: bool = False) -> str:\n \"\"\"Returns all parameters of the context as a json string\n\n Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be\n stored in the Context object, which is not possible with the standard json library.\n\n Why jsonpickle?\n ---------------\n (from https://jsonpickle.github.io/)\n\n > Data serialized with python's pickle (or cPickle or dill) is not easily readable outside of python. Using the\n json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex\n data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports\n json.\n\n Parameters\n ----------\n pretty : bool, optional, default=False\n Toggles whether to return a pretty json string or not\n\n Returns\n -------\n str\n containing all parameters of the context\n \"\"\"\n d = self.to_dict()\n return jsonpickle.dumps(d, indent=4) if pretty else jsonpickle.dumps(d)\n
"},{"location":"api_reference/context.html#koheesio.context.Context.to_yaml","title":"to_yaml","text":"to_yaml(clean: bool = False) -> str\n
Returns all parameters of the context as a yaml string
Parameters:
Name Type Description Default clean
bool
Toggles whether to remove !!python/object:...
from yaml or not. Default: False
False
Returns:
Type Description str
containing all parameters of the context
Source code in src/koheesio/context.py
def to_yaml(self, clean: bool = False) -> str:\n \"\"\"Returns all parameters of the context as a yaml string\n\n Parameters\n ----------\n clean: bool\n Toggles whether to remove `!!python/object:...` from yaml or not.\n Default: False\n\n Returns\n -------\n str\n containing all parameters of the context\n \"\"\"\n # sort_keys=False to preserve order of keys\n yaml_str = yaml.dump(self.to_dict(), sort_keys=False)\n\n # remove `!!python/object:...` from yaml\n if clean:\n remove_pattern = re.compile(r\"!!python/object:.*?\\n\")\n yaml_str = re.sub(remove_pattern, \"\\n\", yaml_str)\n\n return yaml_str\n
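For illustration, dumping a Context to yaml, optionally stripping the !!python/object tags that complex values can produce:
context = Context({\"name\": \"my_pipeline\", \"settings\": {\"retries\": 3}})\nprint(context.to_yaml()) # yaml string, keys in insertion order\nprint(context.to_yaml(clean=True)) # same, with any !!python/object:... tags removed\n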
"},{"location":"api_reference/intro_api.html","title":"Intro api","text":""},{"location":"api_reference/intro_api.html#api-reference","title":"API Reference","text":"You can navigate the API by clicking on the modules listed on the left to access the documentation.
"},{"location":"api_reference/logger.html","title":"Logger","text":"Loggers are used to log messages from your application.
For a comprehensive guide on the usage, examples, and additional features of the logging classes, please refer to the reference/concepts/logging section of the Koheesio documentation.
Classes:
Name Description LoggingFactory
Logging factory to be used to generate logger instances.
Masked
Represents a masked value.
MaskedString
Represents a masked string value.
MaskedInt
Represents a masked integer value.
MaskedFloat
Represents a masked float value.
MaskedDict
Represents a masked dictionary value.
LoggerIDFilter
Filter which injects run_id information into the log.
Functions:
Name Description warn
Issue a warning.
"},{"location":"api_reference/logger.html#koheesio.logger.T","title":"koheesio.logger.T module-attribute
","text":"T = TypeVar('T')\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggerIDFilter","title":"koheesio.logger.LoggerIDFilter","text":"Filter which injects run_id information into the log.
"},{"location":"api_reference/logger.html#koheesio.logger.LoggerIDFilter.LOGGER_ID","title":"LOGGER_ID class-attribute
instance-attribute
","text":"LOGGER_ID: str = str(uuid4())\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggerIDFilter.filter","title":"filter","text":"filter(record)\n
Source code in src/koheesio/logger.py
def filter(self, record):\n record.logger_id = LoggerIDFilter.LOGGER_ID\n\n return True\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory","title":"koheesio.logger.LoggingFactory","text":"LoggingFactory(name: Optional[str] = None, env: Optional[str] = None, level: Optional[str] = None, logger_id: Optional[str] = None)\n
Logging factory to be used to generate logger instances.
Parameters:
Name Type Description Default name
Optional[str]
None
env
Optional[str]
None
logger_id
Optional[str]
None
Source code in src/koheesio/logger.py
def __init__(\n self,\n name: Optional[str] = None,\n env: Optional[str] = None,\n level: Optional[str] = None,\n logger_id: Optional[str] = None,\n):\n \"\"\"Logging factory to be used in pipeline. Prepare logger instance.\n\n Parameters\n ----------\n name logger name.\n env environment (\"local\", \"qa\", \"prod\").\n level logging level.\n logger_id unique identifier for the logger.\n \"\"\"\n\n LoggingFactory.LOGGER_NAME = name or LoggingFactory.LOGGER_NAME\n LoggerIDFilter.LOGGER_ID = logger_id or LoggerIDFilter.LOGGER_ID\n LoggingFactory.LOGGER_FILTER = LoggingFactory.LOGGER_FILTER or LoggerIDFilter()\n LoggingFactory.ENV = env or LoggingFactory.ENV\n\n console_handler = logging.StreamHandler(sys.stdout if LoggingFactory.ENV == \"local\" else sys.stderr)\n console_handler.setFormatter(LoggingFactory.LOGGER_FORMATTER)\n console_handler.addFilter(LoggingFactory.LOGGER_FILTER)\n # WARNING is default level for root logger in python\n logging.basicConfig(level=logging.WARNING, handlers=[console_handler], force=True)\n\n LoggingFactory.CONSOLE_HANDLER = console_handler\n\n logger = getLogger(LoggingFactory.LOGGER_NAME)\n logger.setLevel(level or LoggingFactory.LOGGER_LEVEL)\n LoggingFactory.LOGGER = logger\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.CONSOLE_HANDLER","title":"CONSOLE_HANDLER class-attribute
instance-attribute
","text":"CONSOLE_HANDLER: Optional[Handler] = None\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.ENV","title":"ENV class-attribute
instance-attribute
","text":"ENV: Optional[str] = None\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER","title":"LOGGER class-attribute
instance-attribute
","text":"LOGGER: Optional[Logger] = None\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_ENV","title":"LOGGER_ENV class-attribute
instance-attribute
","text":"LOGGER_ENV: str = 'local'\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_FILTER","title":"LOGGER_FILTER class-attribute
instance-attribute
","text":"LOGGER_FILTER: Optional[Filter] = None\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_FORMAT","title":"LOGGER_FORMAT class-attribute
instance-attribute
","text":"LOGGER_FORMAT: str = '[%(logger_id)s] [%(asctime)s] [%(levelname)s] [%(name)s] {%(module)s.py:%(funcName)s:%(lineno)d} - %(message)s'\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_FORMATTER","title":"LOGGER_FORMATTER class-attribute
instance-attribute
","text":"LOGGER_FORMATTER: Formatter = Formatter(LOGGER_FORMAT)\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_LEVEL","title":"LOGGER_LEVEL class-attribute
instance-attribute
","text":"LOGGER_LEVEL: str = get('KOHEESIO_LOGGING_LEVEL', 'WARNING')\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_NAME","title":"LOGGER_NAME class-attribute
instance-attribute
","text":"LOGGER_NAME: str = 'koheesio'\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.add_handlers","title":"add_handlers staticmethod
","text":"add_handlers(handlers: List[Tuple[str, Dict]]) -> None\n
Add handlers to existing root logger.
Parameters:
Name Type Description Default handlers
List[Tuple[str, Dict]] List of tuples, each pairing a handler module and class path with its configuration dict.
required Source code in src/koheesio/logger.py
@staticmethod\ndef add_handlers(handlers: List[Tuple[str, Dict]]) -> None:\n \"\"\"Add handlers to existing root logger.\n\n Parameters\n ----------\n handler_class handler module and class for importing.\n handlers_config configuration for handler.\n\n \"\"\"\n for handler_module_class, handler_conf in handlers:\n handler_class: logging.Handler = import_class(handler_module_class)\n handler_level = handler_conf.pop(\"level\") if \"level\" in handler_conf else \"WARNING\"\n # noinspection PyCallingNonCallable\n handler = handler_class(**handler_conf)\n handler.setLevel(handler_level)\n handler.addFilter(LoggingFactory.LOGGER_FILTER)\n handler.setFormatter(LoggingFactory.LOGGER_FORMATTER)\n LoggingFactory.LOGGER.addHandler(handler)\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.get_logger","title":"get_logger staticmethod
","text":"get_logger(name: str, inherit_from_koheesio: bool = False) -> Logger\n
Provide logger. If inherit_from_koheesio then inherit from LoggingFactory.PIPELINE_LOGGER_NAME.
Parameters:
Name Type Description Default name
str
required inherit_from_koheesio
bool
False
Returns:
Name Type Description logger
Logger
Source code in src/koheesio/logger.py
@staticmethod\ndef get_logger(name: str, inherit_from_koheesio: bool = False) -> Logger:\n \"\"\"Provide logger. If inherit_from_koheesio then inherit from LoggingFactory.PIPELINE_LOGGER_NAME.\n\n Parameters\n ----------\n name: Name of logger.\n inherit_from_koheesio: Inherit logger from koheesio\n\n Returns\n -------\n logger: Logger\n\n \"\"\"\n if inherit_from_koheesio:\n LoggingFactory.__check_koheesio_logger_initialized()\n name = f\"{LoggingFactory.LOGGER_NAME}.{name}\"\n\n return getLogger(name)\n
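A minimal usage sketch based on the signatures above (the names and values shown are illustrative):
LoggingFactory(name=\"my_pipeline\", env=\"local\", level=\"INFO\")\nlogger = LoggingFactory.get_logger(name=\"ingest\", inherit_from_koheesio=True)\nlogger.info(\"hello\") # emitted through the 'my_pipeline.ingest' logger using the configured handler and format\n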
"},{"location":"api_reference/logger.html#koheesio.logger.Masked","title":"koheesio.logger.Masked","text":"Masked(value: T)\n
Represents a masked value.
Parameters:
Name Type Description Default value
T
The value to be masked.
required Attributes:
Name Type Description _value
T
The original value.
Methods:
Name Description __repr__
Returns a string representation of the masked value.
__str__
Returns a string representation of the masked value.
__get_validators__
Returns a generator of validators for the masked value.
validate
Validates the masked value.
Source code in src/koheesio/logger.py
def __init__(self, value: T):\n self._value = value\n
"},{"location":"api_reference/logger.html#koheesio.logger.Masked.validate","title":"validate classmethod
","text":"validate(v: Any, _values)\n
Validate the input value and return an instance of the class.
Parameters:
Name Type Description Default v
Any
The input value to validate.
required _values
Any
Additional values used for validation.
required Returns:
Name Type Description instance
cls
An instance of the class.
Source code in src/koheesio/logger.py
@classmethod\ndef validate(cls, v: Any, _values):\n \"\"\"\n Validate the input value and return an instance of the class.\n\n Parameters\n ----------\n v : Any\n The input value to validate.\n _values : Any\n Additional values used for validation.\n\n Returns\n -------\n instance : cls\n An instance of the class.\n\n \"\"\"\n return cls(v)\n
"},{"location":"api_reference/logger.html#koheesio.logger.MaskedDict","title":"koheesio.logger.MaskedDict","text":"MaskedDict(value: T)\n
Represents a masked dictionary value.
Source code in src/koheesio/logger.py
def __init__(self, value: T):\n self._value = value\n
"},{"location":"api_reference/logger.html#koheesio.logger.MaskedFloat","title":"koheesio.logger.MaskedFloat","text":"MaskedFloat(value: T)\n
Represents a masked float value.
Source code in src/koheesio/logger.py
def __init__(self, value: T):\n self._value = value\n
"},{"location":"api_reference/logger.html#koheesio.logger.MaskedInt","title":"koheesio.logger.MaskedInt","text":"MaskedInt(value: T)\n
Represents a masked integer value.
Source code in src/koheesio/logger.py
def __init__(self, value: T):\n self._value = value\n
"},{"location":"api_reference/logger.html#koheesio.logger.MaskedString","title":"koheesio.logger.MaskedString","text":"MaskedString(value: T)\n
Represents a masked string value.
Source code in src/koheesio/logger.py
def __init__(self, value: T):\n self._value = value\n
"},{"location":"api_reference/utils.html","title":"Utils","text":"Utility functions
"},{"location":"api_reference/utils.html#koheesio.utils.convert_str_to_bool","title":"koheesio.utils.convert_str_to_bool","text":"convert_str_to_bool(value) -> Any\n
Converts a string to a boolean if the string is either 'true' or 'false'
Source code in src/koheesio/utils.py
def convert_str_to_bool(value) -> Any:\n \"\"\"Converts a string to a boolean if the string is either 'true' or 'false'\"\"\"\n if isinstance(value, str) and (v := value.lower()) in [\"true\", \"false\"]:\n value = v == \"true\"\n return value\n
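For illustration:
convert_str_to_bool(\"True\") # True (matching is case-insensitive)\nconvert_str_to_bool(\"false\") # False\nconvert_str_to_bool(\"yes\") # returned unchanged: 'yes'\n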
"},{"location":"api_reference/utils.html#koheesio.utils.get_args_for_func","title":"koheesio.utils.get_args_for_func","text":"get_args_for_func(func: Callable, params: Dict) -> Tuple[Callable, Dict[str, Any]]\n
Helper function that matches keyword arguments (params) on a given function
This function uses inspect to extract the signature on the passed Callable, and then uses functools.partial to construct a new Callable (partial) function on which the input was mapped.
Example input_dict = {\"a\": \"foo\", \"b\": \"bar\"}\n\n\ndef example_func(a: str):\n return a\n\n\nfunc, kwargs = get_args_for_func(example_func, input_dict)\n
In this example, - func
would be a callable with the input mapped toward it (i.e. can be called like any normal function) - kwargs
would be a dict holding just the output needed to be able to run the function (e.g. {\"a\": \"foo\"})
Parameters:
Name Type Description Default func
Callable
The function to inspect
required params
Dict
Dictionary with keyword values that will be mapped on the 'func'
required Returns:
Type Description Tuple[Callable, Dict[str, Any]]
- Callable a partial() func with the found keyword values mapped toward it
- Dict[str, Any] the keyword args that match the func
Source code in src/koheesio/utils.py
def get_args_for_func(func: Callable, params: Dict) -> Tuple[Callable, Dict[str, Any]]:\n \"\"\"Helper function that matches keyword arguments (params) on a given function\n\n This function uses inspect to extract the signature on the passed Callable, and then uses functools.partial to\n construct a new Callable (partial) function on which the input was mapped.\n\n Example\n -------\n ```python\n input_dict = {\"a\": \"foo\", \"b\": \"bar\"}\n\n\n def example_func(a: str):\n return a\n\n\n func, kwargs = get_args_for_func(example_func, input_dict)\n ```\n\n In this example,\n - `func` would be a callable with the input mapped toward it (i.e. can be called like any normal function)\n - `kwargs` would be a dict holding just the output needed to be able to run the function (e.g. {\"a\": \"foo\"})\n\n Parameters\n ----------\n func: Callable\n The function to inspect\n params: Dict\n Dictionary with keyword values that will be mapped on the 'func'\n\n Returns\n -------\n Tuple[Callable, Dict[str, Any]]\n - Callable\n a partial() func with the found keyword values mapped toward it\n - Dict[str, Any]\n the keyword args that match the func\n \"\"\"\n _kwargs = {k: v for k, v in params.items() if k in inspect.getfullargspec(func).args}\n return (\n partial(func, **_kwargs),\n _kwargs,\n )\n
"},{"location":"api_reference/utils.html#koheesio.utils.get_project_root","title":"koheesio.utils.get_project_root","text":"get_project_root() -> Path\n
Returns project root path.
Source code in src/koheesio/utils.py
def get_project_root() -> Path:\n \"\"\"Returns project root path.\"\"\"\n cmd = Path(__file__)\n return Path([i for i in cmd.parents if i.as_uri().endswith(\"src\")][0]).parent\n
"},{"location":"api_reference/utils.html#koheesio.utils.get_random_string","title":"koheesio.utils.get_random_string","text":"get_random_string(length: int = 64, prefix: Optional[str] = None) -> str\n
Generate a random string of specified length
Source code in src/koheesio/utils.py
def get_random_string(length: int = 64, prefix: Optional[str] = None) -> str:\n \"\"\"Generate a random string of specified length\"\"\"\n if prefix:\n return f\"{prefix}_{uuid.uuid4().hex}\"[0:length]\n return f\"{uuid.uuid4().hex}\"[0:length]\n
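For illustration (output values are random, shown here as examples only):
get_random_string(length=8) # e.g. 'd41d8cd9'\nget_random_string(length=12, prefix=\"tmp\") # e.g. 'tmp_9e107d9d'\n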
"},{"location":"api_reference/utils.html#koheesio.utils.import_class","title":"koheesio.utils.import_class","text":"import_class(module_class: str) -> Any\n
Import class and module based on provided string.
Parameters:
Name Type Description Default module_class
str
required Returns:
Type Description object Class from specified input string.
Source code in src/koheesio/utils.py
def import_class(module_class: str) -> Any:\n \"\"\"Import class and module based on provided string.\n\n Parameters\n ----------\n module_class module+class to be imported.\n\n Returns\n -------\n object Class from specified input string.\n\n \"\"\"\n module_path, class_name = module_class.rsplit(\".\", 1)\n module = import_module(module_path)\n\n return getattr(module, class_name)\n
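For illustration, the dotted path is split into a module and a class name, which is then imported and returned:
handler_class = import_class(\"logging.handlers.RotatingFileHandler\")\nhandler = handler_class(\"app.log\", maxBytes=1024, backupCount=1) # instantiate the imported class\n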
"},{"location":"api_reference/asyncio/index.html","title":"Asyncio","text":"This module provides classes for asynchronous steps in the koheesio package.
"},{"location":"api_reference/asyncio/index.html#koheesio.asyncio.AsyncStep","title":"koheesio.asyncio.AsyncStep","text":"Asynchronous step class that inherits from Step and uses the AsyncStepMetaClass metaclass.
Attributes:
Name Type Description Output
AsyncStepOutput
The output class for the asynchronous step.
"},{"location":"api_reference/asyncio/index.html#koheesio.asyncio.AsyncStep.Output","title":"Output","text":"Output class for asyncio step.
This class represents the output of the asyncio step. It inherits from the AsyncStepOutput class.
"},{"location":"api_reference/asyncio/index.html#koheesio.asyncio.AsyncStepMetaClass","title":"koheesio.asyncio.AsyncStepMetaClass","text":"Metaclass for asynchronous steps.
This metaclass is used to define asynchronous steps in the Koheesio framework. It inherits from the StepMetaClass and provides additional functionality for executing asynchronous steps.
Attributes: None
Methods: _execute_wrapper: Wrapper method for executing asynchronous steps.
"},{"location":"api_reference/asyncio/index.html#koheesio.asyncio.AsyncStepOutput","title":"koheesio.asyncio.AsyncStepOutput","text":"Represents the output of an asynchronous step.
This class extends the base Step.Output
class and provides additional functionality for merging key-value maps.
Attributes:
Name Type Description ...
Methods:
Name Description merge
Merge key-value map with self.
"},{"location":"api_reference/asyncio/index.html#koheesio.asyncio.AsyncStepOutput.merge","title":"merge","text":"merge(other: Union[Dict, StepOutput])\n
Merge key,value map with self
Examples:
step_output = StepOutput(foo=\"bar\")\nstep_output.merge({\"lorem\": \"ipsum\"}) # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n
Functionally similar to adding two dicts together; like running {**dict_a, **dict_b}
.
Parameters:
Name Type Description Default other
Union[Dict, StepOutput]
Dict or another instance of a StepOutputs class that will be added to self
required Source code in src/koheesio/asyncio/__init__.py
def merge(self, other: Union[Dict, StepOutput]):\n \"\"\"Merge key,value map with self\n\n Examples\n --------\n ```python\n step_output = StepOutput(foo=\"bar\")\n step_output.merge({\"lorem\": \"ipsum\"}) # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n ```\n\n Functionally similar to adding two dicts together; like running `{**dict_a, **dict_b}`.\n\n Parameters\n ----------\n other: Union[Dict, StepOutput]\n Dict or another instance of a StepOutputs class that will be added to self\n \"\"\"\n if isinstance(other, StepOutput):\n other = other.model_dump() # ensures we really have a dict\n\n if not iscoroutine(other):\n for k, v in other.items():\n self.set(k, v)\n\n return self\n
"},{"location":"api_reference/asyncio/http.html","title":"Http","text":"This module contains async implementation of HTTP step.
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpGetStep","title":"koheesio.asyncio.http.AsyncHttpGetStep","text":"Represents an asynchronous HTTP GET step.
This class inherits from the AsyncHttpStep class and specifies the HTTP method as GET.
Attributes: method (HttpMethod): The HTTP method for the step, set to HttpMethod.GET.
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpGetStep.method","title":"method class-attribute
instance-attribute
","text":"method: HttpMethod = GET\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep","title":"koheesio.asyncio.http.AsyncHttpStep","text":"Asynchronous HTTP step for making HTTP requests using aiohttp.
Parameters:
Name Type Description Default client_session
Optional[ClientSession]
Aiohttp ClientSession.
required url
List[URL]
List of yarl.URL.
required retry_options
Optional[RetryOptionsBase]
Retry options for the request.
required connector
Optional[BaseConnector]
Connector for the aiohttp request.
required headers
Optional[Dict[str, Union[str, SecretStr]]]
Request headers.
required Output responses_urls : Optional[List[Tuple[Dict[str, Any], yarl.URL]]] List of responses from the API and request URL.
Examples:
>>> import asyncio\n>>> from aiohttp import ClientSession\n>>> from aiohttp.connector import TCPConnector\n>>> from aiohttp_retry import ExponentialRetry\n>>> from koheesio.steps.async.http import AsyncHttpStep\n>>> from yarl import URL\n>>> from typing import Dict, Any, Union, List, Tuple\n>>>\n>>> # Initialize the AsyncHttpStep\n>>> async def main():\n>>> session = ClientSession()\n>>> urls = [URL('https://example.com/api/1'), URL('https://example.com/api/2')]\n>>> retry_options = ExponentialRetry()\n>>> connector = TCPConnector(limit=10)\n>>> headers = {'Content-Type': 'application/json'}\n>>> step = AsyncHttpStep(\n>>> client_session=session,\n>>> url=urls,\n>>> retry_options=retry_options,\n>>> connector=connector,\n>>> headers=headers\n>>> )\n>>>\n>>> # Execute the step\n>>> responses_urls= await step.get()\n>>>\n>>> return responses_urls\n>>>\n>>> # Run the main function\n>>> responses_urls = asyncio.run(main())\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.client_session","title":"client_session class-attribute
instance-attribute
","text":"client_session: Optional[ClientSession] = Field(default=None, description='Aiohttp ClientSession', exclude=True)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.connector","title":"connector class-attribute
instance-attribute
","text":"connector: Optional[BaseConnector] = Field(default=None, description='Connector for the aiohttp request', exclude=True)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.headers","title":"headers class-attribute
instance-attribute
","text":"headers: Dict[str, Union[str, SecretStr]] = Field(default_factory=dict, description='Request headers', alias='header', exclude=True)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.method","title":"method class-attribute
instance-attribute
","text":"method: Union[str, HttpMethod] = Field(default=GET, description=\"What type of Http call to perform. One of 'get', 'post', 'put', 'delete'. Defaults to 'get'.\")\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.retry_options","title":"retry_options class-attribute
instance-attribute
","text":"retry_options: Optional[RetryOptionsBase] = Field(default=None, description='Retry options for the request', exclude=True)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.timeout","title":"timeout class-attribute
instance-attribute
","text":"timeout: None = Field(default=None, description='[Optional] Request timeout')\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.url","title":"url class-attribute
instance-attribute
","text":"url: List[URL] = Field(default=None, alias='urls', description='Expecting list, as there is no value in executing async request for one value.\\n yarl.URL is preferable, because params/data can be injected into URL instance', exclude=True)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.Output","title":"Output","text":"Output class for Step
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.Output.responses_urls","title":"responses_urls class-attribute
instance-attribute
","text":"responses_urls: Optional[List[Tuple[Dict[str, Any], URL]]] = Field(default=None, description='List of responses from the API and request URL', repr=False)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.delete","title":"delete async
","text":"delete() -> List[Tuple[Dict[str, Any], URL]]\n
Make DELETE requests.
Returns:
Type Description List[Tuple[Dict[str, Any], URL]]
A list of response data and corresponding request URLs.
Source code in src/koheesio/asyncio/http.py
async def delete(self) -> List[Tuple[Dict[str, Any], yarl.URL]]:\n \"\"\"\n Make DELETE requests.\n\n Returns\n -------\n List[Tuple[Dict[str, Any], yarl.URL]]\n A list of response data and corresponding request URLs.\n \"\"\"\n tasks = self.__tasks_generator(method=HttpMethod.DELETE)\n responses_urls = await self._execute(tasks=tasks)\n\n return responses_urls\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.execute","title":"execute","text":"execute() -> Output\n
Execute the step.
Raises:
Type Description ValueError
If the specified HTTP method is not implemented in AsyncHttpStep.
Source code in src/koheesio/asyncio/http.py
def execute(self) -> AsyncHttpStep.Output:\n \"\"\"\n Execute the step.\n\n Raises\n ------\n ValueError\n If the specified HTTP method is not implemented in AsyncHttpStep.\n \"\"\"\n # By design asyncio does not allow its event loop to be nested. This presents a practical problem:\n # When in an environment where the event loop is already running\n # it\u2019s impossible to run tasks and wait for the result.\n # Trying to do so will give the error \u201cRuntimeError: This event loop is already running\u201d.\n # The issue pops up in various environments, such as web servers, GUI applications and in\n # Jupyter/DataBricks notebooks.\n nest_asyncio.apply()\n\n map_method_func = {\n HttpMethod.GET: self.get,\n HttpMethod.POST: self.post,\n HttpMethod.PUT: self.put,\n HttpMethod.DELETE: self.delete,\n }\n\n if self.method not in map_method_func:\n raise ValueError(f\"Method {self.method} not implemented in AsyncHttpStep.\")\n\n self.output.responses_urls = asyncio.run(map_method_func[self.method]())\n\n return self.output\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.get","title":"get async
","text":"get() -> List[Tuple[Dict[str, Any], URL]]\n
Make GET requests.
Returns:
Type Description List[Tuple[Dict[str, Any], URL]]
A list of response data and corresponding request URLs.
Source code in src/koheesio/asyncio/http.py
async def get(self) -> List[Tuple[Dict[str, Any], yarl.URL]]:\n \"\"\"\n Make GET requests.\n\n Returns\n -------\n List[Tuple[Dict[str, Any], yarl.URL]]\n A list of response data and corresponding request URLs.\n \"\"\"\n tasks = self.__tasks_generator(method=HttpMethod.GET)\n responses_urls = await self._execute(tasks=tasks)\n\n return responses_urls\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.get_headers","title":"get_headers","text":"get_headers()\n
Get the request headers.
Returns:
Type Description Optional[Dict[str, Union[str, SecretStr]]]
The request headers.
Source code in src/koheesio/asyncio/http.py
def get_headers(self):\n \"\"\"\n Get the request headers.\n\n Returns\n -------\n Optional[Dict[str, Union[str, SecretStr]]]\n The request headers.\n \"\"\"\n _headers = None\n\n if self.headers:\n _headers = {k: v.get_secret_value() if isinstance(v, SecretStr) else v for k, v in self.headers.items()}\n\n for k, v in self.headers.items():\n if isinstance(v, SecretStr):\n self.headers[k] = v.get_secret_value()\n\n return _headers or self.headers\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.get_options","title":"get_options","text":"get_options()\n
Get the options of the step.
Source code in src/koheesio/asyncio/http.py
def get_options(self):\n \"\"\"\n Get the options of the step.\n \"\"\"\n warnings.warn(\"get_options is not implemented in AsyncHttpStep.\")\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.post","title":"post async
","text":"post() -> List[Tuple[Dict[str, Any], URL]]\n
Make POST requests.
Returns:
Type Description List[Tuple[Dict[str, Any], URL]]
A list of response data and corresponding request URLs.
Source code in src/koheesio/asyncio/http.py
async def post(self) -> List[Tuple[Dict[str, Any], yarl.URL]]:\n \"\"\"\n Make POST requests.\n\n Returns\n -------\n List[Tuple[Dict[str, Any], yarl.URL]]\n A list of response data and corresponding request URLs.\n \"\"\"\n tasks = self.__tasks_generator(method=HttpMethod.POST)\n responses_urls = await self._execute(tasks=tasks)\n\n return responses_urls\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.put","title":"put async
","text":"put() -> List[Tuple[Dict[str, Any], URL]]\n
Make PUT requests.
Returns:
Type Description List[Tuple[Dict[str, Any], URL]]
A list of response data and corresponding request URLs.
Source code in src/koheesio/asyncio/http.py
async def put(self) -> List[Tuple[Dict[str, Any], yarl.URL]]:\n \"\"\"\n Make PUT requests.\n\n Returns\n -------\n List[Tuple[Dict[str, Any], yarl.URL]]\n A list of response data and corresponding request URLs.\n \"\"\"\n tasks = self.__tasks_generator(method=HttpMethod.PUT)\n responses_urls = await self._execute(tasks=tasks)\n\n return responses_urls\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.request","title":"request async
","text":"request(method: HttpMethod, url: URL, **kwargs) -> Tuple[Dict[str, Any], URL]\n
Make an HTTP request.
Parameters:
Name Type Description Default method
HttpMethod
The HTTP method to use for the request.
required url
URL
The URL to make the request to.
required kwargs
Any
Additional keyword arguments to pass to the request.
{}
Returns:
Type Description Tuple[Dict[str, Any], URL]
A tuple containing the response data and the request URL.
Source code in src/koheesio/asyncio/http.py
async def request(\n self,\n method: HttpMethod,\n url: yarl.URL,\n **kwargs,\n) -> Tuple[Dict[str, Any], yarl.URL]:\n \"\"\"\n Make an HTTP request.\n\n Parameters\n ----------\n method : HttpMethod\n The HTTP method to use for the request.\n url : yarl.URL\n The URL to make the request to.\n kwargs : Any\n Additional keyword arguments to pass to the request.\n\n Returns\n -------\n Tuple[Dict[str, Any], yarl.URL]\n A tuple containing the response data and the request URL.\n \"\"\"\n async with self.__retry_client.request(method=method, url=url, **kwargs) as response:\n res = await response.json()\n\n return (res, response.request_info.url)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.set_outputs","title":"set_outputs","text":"set_outputs(response)\n
Set the outputs of the step.
Parameters:
Name Type Description Default response
Any
The response data.
required Source code in src/koheesio/asyncio/http.py
def set_outputs(self, response):\n \"\"\"\n Set the outputs of the step.\n\n Parameters\n ----------\n response : Any\n The response data.\n \"\"\"\n warnings.warn(\"set outputs is not implemented in AsyncHttpStep.\")\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.validate_timeout","title":"validate_timeout","text":"validate_timeout(timeout)\n
Validate the 'timeout' field.
Parameters:
Name Type Description Default timeout
Any
The value of the 'timeout' field.
required Raises:
Type Description ValueError
If a 'timeout' value is provided, as it is not allowed in AsyncHttpStep.
Source code in src/koheesio/asyncio/http.py
@field_validator(\"timeout\")\ndef validate_timeout(cls, timeout):\n \"\"\"\n Validate the 'data' field.\n\n Parameters\n ----------\n data : Any\n The value of the 'timeout' field.\n\n Raises\n ------\n ValueError\n If 'data' is not allowed in AsyncHttpStep.\n \"\"\"\n if timeout:\n raise ValueError(\"timeout is not allowed in AsyncHttpStep. Provide timeout through retry_options.\")\n
"},{"location":"api_reference/integrations/index.html","title":"Integrations","text":"Nothing to see here, move along.
"},{"location":"api_reference/integrations/box.html","title":"Box","text":"Box Module
This module is used to facilitate various interactions with the Box service. The implementation is based on the functionality available in the Box Python SDK: https://github.com/box/box-python-sdk
Prerequisites - Box Application is created in the developer portal using the JWT auth method (Developer Portal - My Apps - Create)
- Application is authorized for the enterprise (Developer Portal - MyApp - Authorization)
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box","title":"koheesio.integrations.box.Box","text":"Box(**data)\n
Configuration details required for the authentication can be obtained in the Box Developer Portal by generating the Public / Private key pair in \"Application Name -> Configuration -> Add and Manage Public Keys\".
The downloaded JSON file will look like this:
{\n \"boxAppSettings\": {\n \"clientID\": \"client_id\",\n \"clientSecret\": \"client_secret\",\n \"appAuth\": {\n \"publicKeyID\": \"public_key_id\",\n \"privateKey\": \"private_key\",\n \"passphrase\": \"pass_phrase\"\n }\n },\n \"enterpriseID\": \"123456\"\n}\n
This class is used as a base for the rest of the Box integrations; however, it can also be used separately to obtain the Box client, which is created at class initialization. Examples:
b = Box(\n client_id=\"client_id\",\n client_secret=\"client_secret\",\n enterprise_id=\"enterprise_id\",\n jwt_key_id=\"jwt_key_id\",\n rsa_private_key_data=\"rsa_private_key_data\",\n rsa_private_key_passphrase=\"rsa_private_key_passphrase\",\n)\nb.client\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
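Because the field aliases match the keys of that JSON file, the client can also be configured straight from it. A sketch, assuming the downloaded configuration is saved locally as box_config.json (a hypothetical path):
import json\n\nfrom koheesio.integrations.box import Box\n\nwith open(\"box_config.json\") as f:  # hypothetical path to the downloaded JWT config\n    conf = json.load(f)\n\napp = conf[\"boxAppSettings\"]\nb = Box(\n    clientID=app[\"clientID\"],\n    clientSecret=app[\"clientSecret\"],\n    publicKeyID=app[\"appAuth\"][\"publicKeyID\"],\n    privateKey=app[\"appAuth\"][\"privateKey\"],\n    passphrase=app[\"appAuth\"][\"passphrase\"],\n    enterpriseID=conf[\"enterpriseID\"],\n)\nb.client\n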
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.auth_options","title":"auth_options property
","text":"auth_options\n
Get a dictionary of authentication options that can be readily used in the child classes
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.client","title":"client class-attribute
instance-attribute
","text":"client: SkipValidation[Client] = None\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.client_id","title":"client_id class-attribute
instance-attribute
","text":"client_id: Union[SecretStr, SecretBytes] = Field(default=..., alias='clientID', description='Client ID from the Box Developer console.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.client_secret","title":"client_secret class-attribute
instance-attribute
","text":"client_secret: Union[SecretStr, SecretBytes] = Field(default=..., alias='clientSecret', description='Client Secret from the Box Developer console.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.enterprise_id","title":"enterprise_id class-attribute
instance-attribute
","text":"enterprise_id: Union[SecretStr, SecretBytes] = Field(default=..., alias='enterpriseID', description='Enterprise ID from the Box Developer console.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.jwt_key_id","title":"jwt_key_id class-attribute
instance-attribute
","text":"jwt_key_id: Union[SecretStr, SecretBytes] = Field(default=..., alias='publicKeyID', description='PublicKeyID for the public/private generated key pair.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.rsa_private_key_data","title":"rsa_private_key_data class-attribute
instance-attribute
","text":"rsa_private_key_data: Union[SecretStr, SecretBytes] = Field(default=..., alias='privateKey', description='Private key generated in the app management console.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.rsa_private_key_passphrase","title":"rsa_private_key_passphrase class-attribute
instance-attribute
","text":"rsa_private_key_passphrase: Union[SecretStr, SecretBytes] = Field(default=..., alias='passphrase', description='Private key passphrase generated in the app management console.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/integrations/box.py
def execute(self):\n # Plug to be able to unit test ABC\n pass\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.init_client","title":"init_client","text":"init_client()\n
Set up the Box client.
Source code in src/koheesio/integrations/box.py
def init_client(self):\n \"\"\"Set up the Box client.\"\"\"\n if not self.client:\n self.client = Client(JWTAuth(**self.auth_options))\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvFileReader","title":"koheesio.integrations.box.BoxCsvFileReader","text":"BoxCsvFileReader(**data)\n
This class facilitates reading one or multiple CSV files with the same structure directly from Box and producing a Spark DataFrame.
Notes To manually identify the ID of the file in Box, open the file through Web UI, and copy ID from the page URL, e.g. https://foo.ent.box.com/file/1234567890 , where 1234567890 is the ID.
Examples:
from koheesio.steps.integrations.box import BoxCsvFileReader\nfrom pyspark.sql.types import StructType\n\nschema = StructType(...)\nb = BoxCsvFileReader(\n client_id=\"\",\n client_secret=\"\",\n enterprise_id=\"\",\n jwt_key_id=\"\",\n rsa_private_key_data=\"\",\n rsa_private_key_passphrase=\"\",\n file=[\"1\", \"2\"],\n schema=schema,\n).execute()\nb.df.show()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvFileReader.file","title":"file class-attribute
instance-attribute
","text":"file: Union[str, list[str]] = Field(default=..., description='ID or list of IDs for the files to read.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvFileReader.execute","title":"execute","text":"execute()\n
Loop through the list of provided file identifiers and load data into dataframe. For traceability purposes the following columns will be added to the dataframe: * meta_file_id: the identifier of the file on Box * meta_file_name: name of the file
Returns:
Type Description DataFrame
Source code in src/koheesio/integrations/box.py
def execute(self):\n \"\"\"\n Loop through the list of provided file identifiers and load data into dataframe.\n For traceability purposes the following columns will be added to the dataframe:\n * meta_file_id: the identifier of the file on Box\n * meta_file_name: name of the file\n\n Returns\n -------\n DataFrame\n \"\"\"\n df = None\n for f in self.file:\n self.log.debug(f\"Reading contents of file with the ID '{f}' into Spark DataFrame\")\n file = self.client.file(file_id=f)\n data = file.content().decode(\"utf-8\").splitlines()\n rdd = self.spark.sparkContext.parallelize(data)\n temp_df = self.spark.read.csv(rdd, header=True, schema=self.schema_, **self.params)\n temp_df = (\n temp_df\n # fmt: off\n .withColumn(\"meta_file_id\", lit(file.object_id))\n .withColumn(\"meta_file_name\", lit(file.get().name))\n .withColumn(\"meta_load_timestamp\", expr(\"to_utc_timestamp(current_timestamp(), current_timezone())\"))\n # fmt: on\n )\n\n df = temp_df if not df else df.union(temp_df)\n\n self.output.df = df\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvPathReader","title":"koheesio.integrations.box.BoxCsvPathReader","text":"BoxCsvPathReader(**data)\n
Read all CSV files from the specified path into the dataframe. Files can be filtered using the regular expression in the 'filter' parameter. The default behavior is to read all CSV / TXT files from the specified path.
Notes The class does not contain archival capability as it is presumed that the user wants to make sure that the full pipeline is successful (for example, the source data was transformed and saved) prior to moving the source files. Use BoxToBoxFileMove class instead and provide the list of IDs from 'file_id' output.
Examples:
from koheesio.steps.integrations.box import BoxCsvPathReader\n\nauth_params = {...}\nb = BoxCsvPathReader(**auth_params, path=\"foo/bar/\").execute()\nb.df # Spark Dataframe\n... # do something with the dataframe\nfrom koheesio.steps.integrations.box import BoxToBoxFileMove\n\nbm = BoxToBoxFileMove(**auth_params, file=b.file_id, path=\"/foo/bar/archive\")\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvPathReader.filter","title":"filter class-attribute
instance-attribute
","text":"filter: Optional[str] = Field(default='.csv|.txt$', description='[Optional] Regexp to filter folder contents')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvPathReader.path","title":"path class-attribute
instance-attribute
","text":"path: str = Field(default=..., description='Box path')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvPathReader.execute","title":"execute","text":"execute()\n
Identify the list of files from the source Box path that match the desired filter and load them into a DataFrame
Source code in src/koheesio/integrations/box.py
def execute(self):\n \"\"\"\n Identify the list of files from the source Box path that match the desired filter and load them into a DataFrame\n \"\"\"\n folder = BoxFolderGet.from_step(self).execute().folder\n\n # Identify the list of files that should be processed\n files = [item for item in folder.get_items() if item.type == \"file\" and re.search(self.filter, item.name)]\n\n if len(files) > 0:\n self.log.info(\n f\"A total of {len(files)} files that match the filter '{self.filter}' have been detected in {self.path}.\"\n f\" They will be loaded into a Spark DataFrame: {files}\"\n )\n else:\n raise BoxPathIsEmptyError(f\"Path '{self.path}' is empty or none of the files match the filter '{self.filter}'\")\n\n file = [file_id.object_id for file_id in files]\n self.output.df = BoxCsvFileReader.from_step(self, file=file).read()\n self.output.file = file # e.g. if files should be archived after pipeline is successful\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase","title":"koheesio.integrations.box.BoxFileBase","text":"BoxFileBase(**data)\n
Generic class to facilitate interactions with Box files.
The Box SDK provides a File class that has various properties and methods to interact with Box files. The object can be obtained in multiple ways: * provide a Box file identifier to the file
parameter (the identifier can be obtained, for example, from URL) * provide existing object to file
parameter (boxsdk.object.file.File)
Notes Refer to BoxFolderBase for more info about the folder
and path
parameters
See Also boxsdk.object.file.File
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
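Subclasses only need to implement action; execute resolves the folder (from either folder or path) and each file identifier before calling it. A hypothetical sketch (the subclass name and log message are made up for illustration, and auth_params follows the same pattern as the other examples on this page):
from boxsdk.object.file import File\nfrom boxsdk.object.folder import Folder\n\nfrom koheesio.integrations.box import BoxFileBase\n\n\nclass LogFileNames(BoxFileBase):\n    \"\"\"Hypothetical example: log the name of every file resolved by execute.\"\"\"\n\n    def action(self, file: File, folder: Folder):\n        self.log.info(f\"Would process '{file.get().name}' in folder '{folder.get().name}'\")\n\n\nauth_params = {...}\nLogFileNames(**auth_params, file=[\"1\", \"2\"], path=\"/foo/bar\").execute()\n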
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase.files","title":"files class-attribute
instance-attribute
","text":"files: conlist(Union[File, str], min_length=1) = Field(default=..., alias='file', description='List of Box file objects or identifiers')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase.folder","title":"folder class-attribute
instance-attribute
","text":"folder: Optional[Union[Folder, str]] = Field(default=None, description='Existing folder object or folder identifier')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase.path","title":"path class-attribute
instance-attribute
","text":"path: Optional[str] = Field(default=None, description='Path to the Box folder, for example: `folder/sub-folder/lz')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase.action","title":"action","text":"action(file: File, folder: Folder)\n
Abstract class for File level actions.
Source code in src/koheesio/integrations/box.py
def action(self, file: File, folder: Folder):\n \"\"\"\n Abstract class for File level actions.\n \"\"\"\n raise NotImplementedError\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase.execute","title":"execute","text":"execute()\n
Generic execute method for all BoxToBox interactions. Deals with getting the correct folder and file objects from various parameter inputs
Source code in src/koheesio/integrations/box.py
def execute(self):\n \"\"\"\n Generic execute method for all BoxToBox interactions. Deals with getting the correct folder and file objects\n from various parameter inputs\n \"\"\"\n if self.path:\n _folder = BoxFolderGet.from_step(self).execute().folder\n else:\n _folder = self.client.folder(folder_id=self.folder) if isinstance(self.folder, str) else self.folder\n\n for _file in self.files:\n _file = self.client.file(file_id=_file) if isinstance(_file, str) else _file\n self.action(file=_file, folder=_folder)\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter","title":"koheesio.integrations.box.BoxFileWriter","text":"BoxFileWriter(**data)\n
Write file or a file-like object to Box.
Examples:
from koheesio.steps.integrations.box import BoxFileWriter\n\nauth_params = {...}\nf1 = BoxFileWriter(**auth_params, path=\"/foo/bar\", file=\"path/to/my/file.ext\").execute()\n# or\nimport io\n\nb = io.BytesIO(b\"my-sample-data\")\nf2 = BoxFileWriter(**auth_params, path=\"/foo/bar\", file=b, name=\"file.ext\").execute()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.description","title":"description class-attribute
instance-attribute
","text":"description: Optional[str] = Field(None, description='Optional description to add to the file in Box')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.file","title":"file class-attribute
instance-attribute
","text":"file: Union[str, BytesIO] = Field(default=..., description='Path to file or a file-like object')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.file_name","title":"file_name class-attribute
instance-attribute
","text":"file_name: Optional[str] = Field(default=None, description=\"When file path or name is provided to 'file' parameter, this will override the original name.When binary stream is provided, the 'name' should be used to set the desired name for the Box file.\")\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.Output","title":"Output","text":"Output class for BoxFileWriter.
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.Output.file","title":"file class-attribute
instance-attribute
","text":"file: File = Field(default=..., description='File object in Box')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.Output.shared_link","title":"shared_link class-attribute
instance-attribute
","text":"shared_link: str = Field(default=..., description='Shared link for the Box file')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.action","title":"action","text":"action()\n
Source code in src/koheesio/integrations/box.py
def action(self):\n _file = self.file\n _name = self.file_name\n\n if isinstance(_file, str):\n _name = _name if _name else PurePath(_file).name\n with open(_file, \"rb\") as f:\n _file = BytesIO(f.read())\n\n folder: Folder = BoxFolderGet.from_step(self, create_sub_folders=True).execute().folder\n folder.preflight_check(size=0, name=_name)\n\n self.log.info(f\"Uploading file '{_name}' to Box folder '{folder.get().name}'...\")\n _box_file: File = folder.upload_stream(file_stream=_file, file_name=_name, file_description=self.description)\n\n self.output.file = _box_file\n self.output.shared_link = _box_file.get_shared_link()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.execute","title":"execute","text":"execute() -> Output\n
Source code in src/koheesio/integrations/box.py
def execute(self) -> Output:\n self.action()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.validate_name_for_binary_data","title":"validate_name_for_binary_data","text":"validate_name_for_binary_data(values)\n
Validate 'file_name' parameter when providing a binary input for 'file'.
Source code in src/koheesio/integrations/box.py
@model_validator(mode=\"before\")\ndef validate_name_for_binary_data(cls, values):\n \"\"\"Validate 'file_name' parameter when providing a binary input for 'file'.\"\"\"\n file, file_name = values.get(\"file\"), values.get(\"file_name\")\n if not isinstance(file, str) and not file_name:\n raise AttributeError(\"The parameter 'file_name' is mandatory when providing a binary input for 'file'.\")\n\n return values\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase","title":"koheesio.integrations.box.BoxFolderBase","text":"BoxFolderBase(**data)\n
Generic class to facilitate interactions with Box folders.
The Box SDK provides a Folder class that has various properties and methods to interact with Box folders. The object can be obtained in multiple ways: * provide a Box folder identifier to the folder
parameter (the identifier can be obtained, for example, from URL) * provide existing object to folder
parameter (boxsdk.object.folder.Folder) * provide filesystem-like path to path
parameter
See Also boxsdk.object.folder.Folder
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.folder","title":"folder class-attribute
instance-attribute
","text":"folder: Optional[Union[Folder, str]] = Field(default=None, description='Existing folder object or folder identifier')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.path","title":"path class-attribute
instance-attribute
","text":"path: Optional[str] = Field(default=None, description='Path to the Box folder, for example: `folder/sub-folder/lz')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.root","title":"root class-attribute
instance-attribute
","text":"root: Optional[Union[Folder, str]] = Field(default='0', description='Folder object or identifier of the folder that should be used as root')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.Output","title":"Output","text":"Define outputs for the BoxFolderBase class
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.Output.folder","title":"folder class-attribute
instance-attribute
","text":"folder: Optional[Folder] = Field(default=None, description='Box folder object')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.action","title":"action","text":"action()\n
Placeholder for the 'action' method, which should be implemented in the child classes
Returns:
Type Description Folder or None
Source code in src/koheesio/integrations/box.py
def action(self):\n \"\"\"\n Placeholder for 'action' method, that should be implemented in the child classes\n\n Returns\n -------\n Folder or None\n \"\"\"\n raise NotImplementedError\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.execute","title":"execute","text":"execute() -> Output\n
Source code in src/koheesio/integrations/box.py
def execute(self) -> Output:\n self.output.folder = self.action()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.validate_folder_or_path","title":"validate_folder_or_path","text":"validate_folder_or_path()\n
Validations for 'folder' and 'path' parameter usage
Source code in src/koheesio/integrations/box.py
@model_validator(mode=\"after\")\ndef validate_folder_or_path(self):\n \"\"\"\n Validations for 'folder' and 'path' parameter usage\n \"\"\"\n folder_value = self.folder\n path_value = self.path\n\n if folder_value and path_value:\n raise AttributeError(\"Cannot user 'folder' and 'path' parameter at the same time\")\n\n if not folder_value and not path_value:\n raise AttributeError(\"Neither 'folder' nor 'path' parameters are set\")\n\n return self\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderCreate","title":"koheesio.integrations.box.BoxFolderCreate","text":"BoxFolderCreate(**data)\n
Explicitly create the new Box folder object and parent directories.
Examples:
from koheesio.steps.integrations.box import BoxFolderCreate\n\nauth_params = {...}\nfolder = BoxFolderCreate(**auth_params, path=\"/foo/bar\").execute()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderCreate.create_sub_folders","title":"create_sub_folders class-attribute
instance-attribute
","text":"create_sub_folders: bool = Field(default=True, description='Create sub-folders recursively if the path does not exist.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderCreate.validate_folder","title":"validate_folder","text":"validate_folder(folder)\n
Validate 'folder' parameter
Source code in src/koheesio/integrations/box.py
@field_validator(\"folder\")\ndef validate_folder(cls, folder):\n \"\"\"\n Validate 'folder' parameter\n \"\"\"\n if folder:\n raise AttributeError(\"Only 'path' parameter is allowed in the context of folder creation.\")\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderDelete","title":"koheesio.integrations.box.BoxFolderDelete","text":"BoxFolderDelete(**data)\n
Delete existing Box folder based on object, identifier or path.
Examples:
from koheesio.steps.integrations.box import BoxFolderDelete\n\nauth_params = {...}\nBoxFolderDelete(**auth_params, path=\"/foo/bar\").execute()\n# or\nBoxFolderDelete(**auth_params, folder=\"1\").execute()\n# or\nfolder = BoxFolderGet(**auth_params, path=\"/foo/bar\").execute().folder\nBoxFolderDelete(**auth_params, folder=folder).execute()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderDelete.action","title":"action","text":"action()\n
Delete folder action
Returns:
Type Description None
Source code in src/koheesio/integrations/box.py
def action(self):\n \"\"\"\n Delete folder action\n\n Returns\n -------\n None\n \"\"\"\n if self.folder:\n folder = self._obj_from_id\n else: # path\n folder = BoxFolderGet.from_step(self).action()\n\n self.log.info(f\"Deleting Box folder '{folder}'...\")\n folder.delete()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderGet","title":"koheesio.integrations.box.BoxFolderGet","text":"BoxFolderGet(**data)\n
Get the Box folder object for an existing folder or create a new folder and parent directories.
Examples:
from koheesio.steps.integrations.box import BoxFolderGet\n\nauth_params = {...}\nfolder = BoxFolderGet(**auth_params, path=\"/foo/bar\").execute().folder\n# or\nfolder = BoxFolderGet(**auth_params, path=\"1\").execute().folder\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderGet.create_sub_folders","title":"create_sub_folders class-attribute
instance-attribute
","text":"create_sub_folders: Optional[bool] = Field(False, description='Create sub-folders recursively if the path does not exist.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderGet.action","title":"action","text":"action()\n
Get folder action
Returns:
Name Type Description folder
Folder
Box Folder object as specified in Box SDK
Source code in src/koheesio/integrations/box.py
def action(self):\n \"\"\"\n Get folder action\n\n Returns\n -------\n folder: Folder\n Box Folder object as specified in Box SDK\n \"\"\"\n current_folder_object = None\n\n if self.folder:\n current_folder_object = self._obj_from_id\n\n if self.path:\n cleaned_path_parts = [p for p in PurePath(self.path).parts if p.strip() not in [None, \"\", \" \", \"/\"]]\n current_folder_object = self.client.folder(folder_id=self.root) if isinstance(self.root, str) else self.root\n\n for next_folder_name in cleaned_path_parts:\n current_folder_object = self._get_or_create_folder(current_folder_object, next_folder_name)\n\n self.log.info(f\"Folder identified or created: {current_folder_object}\")\n return current_folder_object\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderNotFoundError","title":"koheesio.integrations.box.BoxFolderNotFoundError","text":"Error when a provided box path does not exist.
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxPathIsEmptyError","title":"koheesio.integrations.box.BoxPathIsEmptyError","text":"Exception when provided Box path is empty or no files matched the mask.
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase","title":"koheesio.integrations.box.BoxReaderBase","text":"BoxReaderBase(**data)\n
Base class for Box readers.
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase.params","title":"params class-attribute
instance-attribute
","text":"params: Optional[Dict[str, Any]] = Field(default_factory=dict, description='[Optional] Set of extra parameters that should be passed to the Spark reader.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase.schema_","title":"schema_ class-attribute
instance-attribute
","text":"schema_: Optional[StructType] = Field(None, alias='schema', description='[Optional] Schema that will be applied during the creation of Spark DataFrame')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase.Output","title":"Output","text":"Make default reader output optional to gracefully handle 'no-files / folder' cases.
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase.Output.df","title":"df class-attribute
instance-attribute
","text":"df: Optional[DataFrame] = Field(default=None, description='The Spark DataFrame')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase.execute","title":"execute abstractmethod
","text":"execute() -> Output\n
Source code in src/koheesio/integrations/box.py
@abstractmethod\ndef execute(self) -> Output:\n raise NotImplementedError\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxToBoxFileCopy","title":"koheesio.integrations.box.BoxToBoxFileCopy","text":"BoxToBoxFileCopy(**data)\n
Copy one or multiple files to the target Box path.
Examples:
from koheesio.steps.integrations.box import BoxToBoxFileCopy\n\nauth_params = {...}\nBoxToBoxFileCopy(**auth_params, file=[\"1\", \"2\"], path=\"/foo/bar\").execute()\n# or\nBoxToBoxFileCopy(**auth_params, file=[\"1\", \"2\"], folder=\"1\").execute()\n# or\nfolder = BoxFolderGet(**auth_params, path=\"/foo/bar\").execute().folder\nBoxToBoxFileCopy(**auth_params, file=[File(), File()], folder=folder).execute()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxToBoxFileCopy.action","title":"action","text":"action(file: File, folder: Folder)\n
Copy file to the desired destination and extend file description with the processing info
Parameters:
Name Type Description Default file
File
File object as specified in Box SDK
required folder
Folder
Folder object as specified in Box SDK
required Source code in src/koheesio/integrations/box.py
def action(self, file: File, folder: Folder):\n \"\"\"\n Copy file to the desired destination and extend file description with the processing info\n\n Parameters\n ----------\n file: File\n File object as specified in Box SDK\n folder: Folder\n Folder object as specified in Box SDK\n \"\"\"\n self.log.info(f\"Copying '{file.get()}' to '{folder.get()}'...\")\n file.copy(parent_folder=folder).update_info(\n data={\"description\": \"\\n\".join([f\"File processed on {datetime.utcnow()}\", file.get()[\"description\"]])}\n )\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxToBoxFileMove","title":"koheesio.integrations.box.BoxToBoxFileMove","text":"BoxToBoxFileMove(**data)\n
Move one or multiple files to the target Box path
Examples:
from koheesio.steps.integrations.box import BoxToBoxFileMove\n\nauth_params = {...}\nBoxToBoxFileMove(**auth_params, file=[\"1\", \"2\"], path=\"/foo/bar\").execute()\n# or\nBoxToBoxFileMove(**auth_params, file=[\"1\", \"2\"], folder=\"1\").execute()\n# or\nfolder = BoxFolderGet(**auth_params, path=\"/foo/bar\").execute().folder\nBoxToBoxFileMove(**auth_params, file=[File(), File()], folder=folder).execute()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxToBoxFileMove.action","title":"action","text":"action(file: File, folder: Folder)\n
Move file to the desired destination and extend file description with the processing info
Parameters:
Name Type Description Default file
File
File object as specified in Box SDK
required folder
Folder
Folder object as specified in Box SDK
required Source code in src/koheesio/integrations/box.py
def action(self, file: File, folder: Folder):\n \"\"\"\n Move file to the desired destination and extend file description with the processing info\n\n Parameters\n ----------\n file: File\n File object as specified in Box SDK\n folder: Folder\n Folder object as specified in Box SDK\n \"\"\"\n self.log.info(f\"Moving '{file.get()}' to '{folder.get()}'...\")\n file.move(parent_folder=folder).update_info(\n data={\"description\": \"\\n\".join([f\"File processed on {datetime.utcnow()}\", file.get()[\"description\"]])}\n )\n
"},{"location":"api_reference/integrations/spark/index.html","title":"Spark","text":""},{"location":"api_reference/integrations/spark/sftp.html","title":"Sftp","text":"This module contains the SFTPWriter class and the SFTPWriteMode enum.
The SFTPWriter class is used to write data to a file on an SFTP server. It uses the Paramiko library to establish an SFTP connection and write data to the server. The data to be written is provided by a BufferWriter, which generates the data in a buffer. See the docstring of the SFTPWriter class for more details. Refer to koheesio.spark.writers.buffer for more details on the BufferWriter interface.
The SFTPWriteMode enum defines the different write modes that the SFTPWriter can use. These modes determine how the SFTPWriter behaves when the file it is trying to write to already exists on the server. For more details on each mode, see the docstring of the SFTPWriteMode enum.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode","title":"koheesio.integrations.spark.sftp.SFTPWriteMode","text":"The different write modes for the SFTPWriter.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--overwrite","title":"OVERWRITE:","text":" - If the file exists, it will be overwritten.
- If it does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--append","title":"APPEND:","text":" - If the file exists, the new data will be appended to it.
- If it does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--ignore","title":"IGNORE:","text":" - If the file exists, the method will return without writing anything.
- If it does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--exclusive","title":"EXCLUSIVE:","text":" - If the file exists, an error will be raised.
- If it does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--backup","title":"BACKUP:","text":" - If the file exists and the new data is different from the existing data, a backup will be created and the file will be overwritten.
- If it does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--update","title":"UPDATE:","text":" - If the file exists and the new data is different from the existing data, the file will be overwritten.
- If the file exists and the new data is the same as the existing data, the method will return without writing anything.
- If the file does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.APPEND","title":"APPEND class-attribute
instance-attribute
","text":"APPEND = 'append'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.BACKUP","title":"BACKUP class-attribute
instance-attribute
","text":"BACKUP = 'backup'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.EXCLUSIVE","title":"EXCLUSIVE class-attribute
instance-attribute
","text":"EXCLUSIVE = 'exclusive'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.IGNORE","title":"IGNORE class-attribute
instance-attribute
","text":"IGNORE = 'ignore'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.OVERWRITE","title":"OVERWRITE class-attribute
instance-attribute
","text":"OVERWRITE = 'overwrite'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.UPDATE","title":"UPDATE class-attribute
instance-attribute
","text":"UPDATE = 'update'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.write_mode","title":"write_mode property
","text":"write_mode\n
Return the write mode for the given SFTPWriteMode.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.from_string","title":"from_string classmethod
","text":"from_string(mode: str)\n
Return the SFTPWriteMode for the given string.
Source code in src/koheesio/integrations/spark/sftp.py
@classmethod\ndef from_string(cls, mode: str):\n \"\"\"Return the SFTPWriteMode for the given string.\"\"\"\n return cls[mode.upper()]\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter","title":"koheesio.integrations.spark.sftp.SFTPWriter","text":"Write a Dataframe to SFTP through a BufferWriter
Concept - This class uses Paramiko to connect to an SFTP server and write the contents of a buffer to a file on the server.
- This implementation takes inspiration from https://github.com/springml/spark-sftp
Parameters:
Name Type Description Default path
Union[str, Path]
Path to the folder to write to
required file_name
Optional[str]
Name of the file. If not provided, the file name is expected to be part of the path. Make sure to add the desired file extension.
None
host
str
SFTP Host
required port
int
SFTP Port
required username
SecretStr
SFTP Server Username
None
password
SecretStr
SFTP Server Password
None
buffer_writer
BufferWriter
This is the writer that will generate the body of the file that will be written to the specified file through SFTP. Details on how the DataFrame is written to the buffer should be implemented in the implementation of the BufferWriter class. Any BufferWriter can be used here, as long as it implements the BufferWriter interface.
required mode
Write mode: overwrite, append, ignore, exclusive, backup, or update. See the docstring of SFTPWriteMode for more details.
required"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.buffer_writer","title":"buffer_writer class-attribute
instance-attribute
","text":"buffer_writer: InstanceOf[BufferWriter] = Field(default=..., description='This is the writer that will generate the body of the file that will be written to the specified file through SFTP. Details on how the DataFrame is written to the buffer should be implemented in the implementation of the BufferWriter class. Any BufferWriter can be used here, as long as it implements the BufferWriter interface.')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.client","title":"client property
","text":"client: SFTPClient\n
Return the SFTP client. If it doesn't exist, create it.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.file_name","title":"file_name class-attribute
instance-attribute
","text":"file_name: Optional[str] = Field(default=None, description='Name of the file. If not provided, the file name is expected to be part of the path. Make sure to add the desired file extension!', alias='filename')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.host","title":"host class-attribute
instance-attribute
","text":"host: str = Field(default=..., description='SFTP Host')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.mode","title":"mode class-attribute
instance-attribute
","text":"mode: SFTPWriteMode = Field(default=OVERWRITE, description='Write mode: overwrite, append, ignore, exclusive, backup, or update.' + __doc__)\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.password","title":"password class-attribute
instance-attribute
","text":"password: Optional[SecretStr] = Field(default=None, description='SFTP Server Password')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.path","title":"path class-attribute
instance-attribute
","text":"path: Union[str, Path] = Field(default=..., description='Path to the folder to write to', alias='prefix')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.port","title":"port class-attribute
instance-attribute
","text":"port: int = Field(default=..., description='SFTP Port')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.transport","title":"transport property
","text":"transport\n
Return the transport for the SFTP connection. If it doesn't exist, create it.
If the username and password are provided, use them to connect to the SFTP server.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.username","title":"username class-attribute
instance-attribute
","text":"username: Optional[SecretStr] = Field(default=None, description='SFTP Server Username')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.write_mode","title":"write_mode property
","text":"write_mode\n
Return the write mode for the given SFTPWriteMode.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.check_file_exists","title":"check_file_exists","text":"check_file_exists(file_path: str) -> bool\n
Check if a file exists on the SFTP server.
Source code in src/koheesio/integrations/spark/sftp.py
def check_file_exists(self, file_path: str) -> bool:\n \"\"\"\n Check if a file exists on the SFTP server.\n \"\"\"\n try:\n self.client.stat(file_path)\n return True\n except IOError:\n return False\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/integrations/spark/sftp.py
def execute(self):\n buffer_output: InstanceOf[BufferWriter.Output] = self.buffer_writer.write(self.df)\n\n # write buffer to the SFTP server\n try:\n self._handle_write_mode(self.path.as_posix(), buffer_output)\n finally:\n self._close_client()\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.validate_path_and_file_name","title":"validate_path_and_file_name","text":"validate_path_and_file_name(data: dict) -> dict\n
Validate the path, make sure path and file_name are Path objects.
Source code in src/koheesio/integrations/spark/sftp.py
@model_validator(mode=\"before\")\ndef validate_path_and_file_name(cls, data: dict) -> dict:\n \"\"\"Validate the path, make sure path and file_name are Path objects.\"\"\"\n path_or_str = data.get(\"path\")\n\n if isinstance(path_or_str, str):\n # make sure the path is a Path object\n path_or_str = Path(path_or_str)\n\n if not isinstance(path_or_str, Path):\n raise ValueError(f\"Invalid path: {path_or_str}\")\n\n if file_name := data.get(\"file_name\", data.get(\"filename\")):\n path_or_str = path_or_str / file_name\n try:\n del data[\"filename\"]\n except KeyError:\n pass\n data[\"file_name\"] = file_name\n\n data[\"path\"] = path_or_str\n return data\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.validate_sftp_host","title":"validate_sftp_host","text":"validate_sftp_host(v) -> str\n
Validate the host
Source code in src/koheesio/integrations/spark/sftp.py
@field_validator(\"host\")\ndef validate_sftp_host(cls, v) -> str:\n \"\"\"Validate the host\"\"\"\n # remove the sftp:// prefix if present\n if v.startswith(\"sftp://\"):\n v = v.replace(\"sftp://\", \"\")\n\n # remove the trailing slash if present\n if v.endswith(\"/\"):\n v = v[:-1]\n\n return v\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.write_file","title":"write_file","text":"write_file(file_path: str, buffer_output: InstanceOf[Output])\n
Using Paramiko, write the data in the buffer to SFTP.
Source code in src/koheesio/integrations/spark/sftp.py
def write_file(self, file_path: str, buffer_output: InstanceOf[BufferWriter.Output]):\n \"\"\"\n Using Paramiko, write the data in the buffer to SFTP.\n \"\"\"\n with self.client.open(file_path, self.write_mode) as file:\n self.log.debug(f\"Writing file {file_path} to SFTP...\")\n file.write(buffer_output.read())\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendCsvToSftp","title":"koheesio.integrations.spark.sftp.SendCsvToSftp","text":"Write a DataFrame to an SFTP server as a CSV file.
This class uses the PandasCsvBufferWriter to generate the CSV data and the SFTPWriter to write the data to the SFTP server.
Example from koheesio.spark.writers import SendCsvToSftp\n\nwriter = SendCsvToSftp(\n # SFTP Parameters\n host=\"sftp.example.com\",\n port=22,\n username=\"user\",\n password=\"password\",\n path=\"/path/to/folder\",\n file_name=\"file.tsv.gz\",\n # CSV Parameters\n header=True,\n sep=\" \",\n quote='\"',\n timestampFormat=\"%Y-%m-%d\",\n lineSep=os.linesep,\n compression=\"gzip\",\n index=False,\n)\n\nwriter.write(df)\n
In this example, the DataFrame df
is written to the file file.csv.gz
in the folder /path/to/folder
on the SFTP server. The file is written as a CSV file with a tab delimiter (TSV), double quotes as the quote character, and gzip compression.
Parameters:
Name Type Description Default path
Union[str, Path]
Path to the folder to write to.
required file_name
Optional[str]
Name of the file. If not provided, it's expected to be part of the path.
required host
str
SFTP Host.
required port
int
SFTP Port.
required username
SecretStr
SFTP Server Username.
required password
SecretStr
SFTP Server Password.
required mode
Write mode: overwrite, append, ignore, exclusive, backup, or update.
required header
Whether to write column names as the first line. Default is True.
required sep
Field delimiter for the output file. Default is ','.
required quote
Character used to quote fields. Default is '\"'.
required quoteAll
Whether all values should be enclosed in quotes. Default is False.
required escape
Character used to escape sep and quote when needed. Default is '\\'.
required timestampFormat
Date format for datetime objects. Default is '%Y-%m-%dT%H:%M:%S.%f'.
required lineSep
Character used as line separator. Default is os.linesep.
required compression
Compression to use for the output data. Default is None.
required For
required"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendCsvToSftp.buffer_writer","title":"buffer_writer class-attribute
instance-attribute
","text":"buffer_writer: PandasCsvBufferWriter = Field(default=None, validate_default=False)\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendCsvToSftp.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/integrations/spark/sftp.py
def execute(self):\n SFTPWriter.execute(self)\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendCsvToSftp.set_up_buffer_writer","title":"set_up_buffer_writer","text":"set_up_buffer_writer() -> SendCsvToSftp\n
Set up the buffer writer, passing all CSV related options to it.
Source code in src/koheesio/integrations/spark/sftp.py
@model_validator(mode=\"after\")\ndef set_up_buffer_writer(self) -> \"SendCsvToSftp\":\n \"\"\"Set up the buffer writer, passing all CSV related options to it.\"\"\"\n self.buffer_writer = PandasCsvBufferWriter(**self.get_options(options_type=\"kohesio_pandas_buffer_writer\"))\n return self\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendJsonToSftp","title":"koheesio.integrations.spark.sftp.SendJsonToSftp","text":"Write a DataFrame to an SFTP server as a JSON file.
This class uses the PandasJsonBufferWriter to generate the JSON data and the SFTPWriter to write the data to the SFTP server.
Example from koheesio.spark.writers import SendJsonToSftp\n\nwriter = SendJsonToSftp(\n # SFTP Parameters (Inherited from SFTPWriter)\n host=\"sftp.example.com\",\n port=22,\n username=\"user\",\n password=\"password\",\n path=\"/path/to/folder\",\n file_name=\"file.json.gz\",\n # JSON Parameters (Inherited from PandasJsonBufferWriter)\n orient=\"records\",\n date_format=\"iso\",\n double_precision=2,\n date_unit=\"ms\",\n lines=False,\n compression=\"gzip\",\n index=False,\n)\n\nwriter.write(df)\n
In this example, the DataFrame df
is written to the file file.json.gz
in the folder /path/to/folder
on the SFTP server. The file is written as a JSON file with gzip compression.
Parameters:
Name Type Description Default path
Union[str, Path]
Path to the folder on the SFTP server.
required file_name
Optional[str]
Name of the file, including extension. If not provided, expected to be part of the path.
required host
str
SFTP Host.
required port
int
SFTP Port.
required username
SecretStr
SFTP Server Username.
required password
SecretStr
SFTP Server Password.
required mode
Write mode: overwrite, append, ignore, exclusive, backup, or update.
required orient
Format of the JSON string. Default is 'records'.
required lines
If True, output is one JSON object per line. Only used when orient='records'. Default is True.
required date_format
Type of date conversion. Default is 'iso'.
required double_precision
Decimal places for encoding floating point values. Default is 10.
required force_ascii
If True, encoded string is ASCII. Default is True.
required compression
Compression to use for output data. Default is None.
required See Also For more details on the JSON parameters, refer to the PandasJsonBufferWriter class documentation.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendJsonToSftp.buffer_writer","title":"buffer_writer class-attribute
instance-attribute
","text":"buffer_writer: PandasJsonBufferWriter = Field(default=None, validate_default=False)\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendJsonToSftp.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/integrations/spark/sftp.py
def execute(self):\n SFTPWriter.execute(self)\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendJsonToSftp.set_up_buffer_writer","title":"set_up_buffer_writer","text":"set_up_buffer_writer() -> SendJsonToSftp\n
Set up the buffer writer, passing all JSON related options to it.
Source code in src/koheesio/integrations/spark/sftp.py
@model_validator(mode=\"after\")\ndef set_up_buffer_writer(self) -> \"SendJsonToSftp\":\n \"\"\"Set up the buffer writer, passing all JSON related options to it.\"\"\"\n self.buffer_writer = PandasJsonBufferWriter(\n **self.get_options(), compression=self.compression, columns=self.columns\n )\n return self\n
"},{"location":"api_reference/integrations/spark/dq/index.html","title":"Dq","text":""},{"location":"api_reference/integrations/spark/dq/spark_expectations.html","title":"Spark expectations","text":"Koheesio step for running data quality rules with Spark Expectations engine.
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation","title":"koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation","text":"Run DQ rules for an input dataframe with Spark Expectations engine.
References Spark Expectations: https://engineering.nike.com/spark-expectations/1.0.0/
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.drop_meta_column","title":"drop_meta_column class-attribute
instance-attribute
","text":"drop_meta_column: bool = Field(default=False, alias='drop_meta_columns', description='Whether to drop meta columns added by spark expectations on the output df')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.enable_debugger","title":"enable_debugger class-attribute
instance-attribute
","text":"enable_debugger: bool = Field(default=False, alias='debugger', description='...')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.error_writer_format","title":"error_writer_format class-attribute
instance-attribute
","text":"error_writer_format: Optional[str] = Field(default='delta', alias='dataframe_writer_format', description='The format used to write to the err table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.error_writer_mode","title":"error_writer_mode class-attribute
instance-attribute
","text":"error_writer_mode: Optional[Union[str, BatchOutputMode]] = Field(default=APPEND, alias='dataframe_writer_mode', description='The write mode that will be used to write to the err table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.error_writing_options","title":"error_writing_options class-attribute
instance-attribute
","text":"error_writing_options: Optional[Dict[str, str]] = Field(default_factory=dict, alias='error_writing_options', description='Options for writing to the error table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.format","title":"format class-attribute
instance-attribute
","text":"format: str = Field(default='delta', alias='dataframe_writer_format', description='The format used to write to the stats and err table. Separate output formats can be specified for each table using the error_writer_format and stats_writer_format params')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.mode","title":"mode class-attribute
instance-attribute
","text":"mode: Union[str, BatchOutputMode] = Field(default=APPEND, alias='dataframe_writer_mode', description='The write mode that will be used to write to the err and stats table. Separate output modes can be specified for each table using the error_writer_mode and stats_writer_mode params')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.product_id","title":"product_id class-attribute
instance-attribute
","text":"product_id: str = Field(default=..., description='Spark Expectations product identifier')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.rules_table","title":"rules_table class-attribute
instance-attribute
","text":"rules_table: str = Field(default=..., alias='product_rules_table', description='DQ rules table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.se_user_conf","title":"se_user_conf class-attribute
instance-attribute
","text":"se_user_conf: Dict[str, Any] = Field(default={se_notifications_enable_email: False, se_notifications_enable_slack: False}, alias='user_conf', description='SE user provided confs', validate_default=False)\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.statistics_streaming","title":"statistics_streaming class-attribute
instance-attribute
","text":"statistics_streaming: Dict[str, Any] = Field(default={se_enable_streaming: False}, alias='stats_streaming_options', description='SE stats streaming options ', validate_default=False)\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.statistics_table","title":"statistics_table class-attribute
instance-attribute
","text":"statistics_table: str = Field(default=..., alias='dq_stats_table_name', description='DQ stats table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.stats_writer_format","title":"stats_writer_format class-attribute
instance-attribute
","text":"stats_writer_format: Optional[str] = Field(default='delta', alias='stats_writer_format', description='The format used to write to the stats table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.stats_writer_mode","title":"stats_writer_mode class-attribute
instance-attribute
","text":"stats_writer_mode: Optional[Union[str, BatchOutputMode]] = Field(default=APPEND, alias='stats_writer_mode', description='The write mode that will be used to write to the stats table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.target_table","title":"target_table class-attribute
instance-attribute
","text":"target_table: str = Field(default=..., alias='target_table_name', description=\"The table that will contain good records. Won't write to it, but will write to the err table with same name plus _err suffix\")\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.Output","title":"Output","text":"Output of the SparkExpectationsTransformation step.
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.Output.error_table_writer","title":"error_table_writer class-attribute
instance-attribute
","text":"error_table_writer: WrappedDataFrameWriter = Field(default=..., description='Spark Expectations error table writer')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.Output.rules_df","title":"rules_df class-attribute
instance-attribute
","text":"rules_df: DataFrame = Field(default=..., description='Output dataframe')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.Output.se","title":"se class-attribute
instance-attribute
","text":"se: SparkExpectations = Field(default=..., description='Spark Expectations object')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.Output.stats_table_writer","title":"stats_table_writer class-attribute
instance-attribute
","text":"stats_table_writer: WrappedDataFrameWriter = Field(default=..., description='Spark Expectations stats table writer')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.execute","title":"execute","text":"execute() -> Output\n
Apply data quality rules to a dataframe using the out-of-the-box SE decorator
Source code in src/koheesio/integrations/spark/dq/spark_expectations.py
def execute(self) -> Output:\n \"\"\"\n Apply data quality rules to a dataframe using the out-of-the-box SE decorator\n \"\"\"\n # read rules table\n rules_df = self.spark.read.table(self.rules_table).cache()\n self.output.rules_df = rules_df\n\n @self._se.with_expectations(\n target_table=self.target_table,\n user_conf=self.se_user_conf,\n # Below params are `False` by default, however exposing them here for extra visibility\n # The writes can be handled by downstream Koheesio steps\n write_to_table=False,\n write_to_temp_table=False,\n )\n def inner(df: DataFrame) -> DataFrame:\n \"\"\"Just a wrapper to be able to use Spark Expectations decorator\"\"\"\n return df\n\n output_df = inner(self.df)\n\n if self.drop_meta_column:\n output_df = output_df.drop(\"meta_dq_run_id\", \"meta_dq_run_datetime\")\n\n self.output.df = output_df\n
"},{"location":"api_reference/models/index.html","title":"Models","text":"Models package creates models that can be used to base other classes on.
- Every model should be at least a pydantic BaseModel, but can also be a Step, or a StepOutput.
- Every model is expected to be an ABC (Abstract Base Class)
- Optionally a model can inherit ExtraParamsMixin, which unpacks kwargs into the extra_params dict property, removing the need to create a dict before passing kwargs to a model initializer.
A Model class can be exceptionally handy when you need similar Pydantic models in multiple places, for example across Transformation and Reader classes.
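A hedged illustration of that idea; the class and field names below are made up for the example.
from koheesio.models import BaseModel\n\n\nclass MyConnectionDetails(BaseModel):\n    \"\"\"Hypothetical shared model holding fields that several classes need.\"\"\"\n\n    url: str\n    username: str\n\n\nclass MyReader(MyConnectionDetails):\n    \"\"\"Hypothetical reader-style class reusing the shared fields.\"\"\"\n\n\nclass MyTransformation(MyConnectionDetails):\n    \"\"\"Hypothetical transformation-style class reusing the shared fields.\"\"\"\n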
"},{"location":"api_reference/models/index.html#koheesio.models.ListOfColumns","title":"koheesio.models.ListOfColumns module-attribute
","text":"ListOfColumns = Annotated[List[str], BeforeValidator(_list_of_columns_validation)]\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel","title":"koheesio.models.BaseModel","text":"Base model for all models.
Extends pydantic BaseModel with some additional configuration. To be used as a base class for all models in Koheesio instead of pydantic.BaseModel.
Additional methods and properties: Different Modes This BaseModel class supports lazy mode. This means that validation of the items stored in the class can be run on demand, instead of being forced to run upfront.
-
Normal mode: you need to know the values ahead of time
normal_mode = YourOwnModel(a=\"foo\", b=42)\n
-
Lazy mode: being able to defer the validation until later
lazy_mode = YourOwnModel.lazy()\nlazy_mode.a = \"foo\"\nlazy_mode.b = 42\nlazy_mode.validate_output()\n
The prime advantage of using lazy mode is that you don't have to know all your outputs up front and can add them as they become available, while still being able to validate that you have collected all your output at the end. -
With statements: With statements are also allowed. The validate_output
method from the earlier example will run upon exit of the with-statement.
with YourOwnModel.lazy() as with_output:\n with_output.a = \"foo\"\n with_output.b = 42\n
Note that a lazy-mode BaseModel object is required to work with a with-statement.
Examples:
from koheesio.models import BaseModel\n\n\nclass Person(BaseModel):\n name: str\n age: int\n\n\n# Using the lazy method to create an instance without immediate validation\nperson = Person.lazy()\n\n# Setting attributes\nperson.name = \"John Doe\"\nperson.age = 30\n\n# Now we validate the instance\nperson.validate_output()\n\nprint(person)\n
In this example, the Person instance is created without immediate validation. The attributes name and age are set afterward. The validate_output
method is then called to validate the instance.
Koheesio specific configuration: Koheesio models are configured differently from Pydantic defaults. The following configuration is used:
-
extra=\"allow\"
This setting allows for extra fields that are not specified in the model definition. If a field is present in the data but not in the model, it will not raise an error. Pydantic default is \"ignore\", which means that extra attributes are ignored.
-
arbitrary_types_allowed=True
This setting allows for fields in the model to be of any type. This is useful when you want to include fields in your model that are not standard Python types. Pydantic default is False, which means that fields must be of a standard Python type.
-
populate_by_name=True
This setting allows an aliased field to be populated by its name as given by the model attribute, as well as the alias. This was known as allow_population_by_field_name in pydantic v1. Pydantic default is False, which means that fields can only be populated by their alias.
-
validate_assignment=False
This setting determines whether the model should be revalidated when the data is changed. If set to True
, every time a field is assigned a new value, the entire model is validated again.
Pydantic default is (also) False
, which means that the model is not revalidated when the data is changed. The default behavior of Pydantic is to validate the data when the model is created. In case the user changes the data after the model is created, the model is not revalidated.
-
revalidate_instances=\"subclass-instances\"
This setting determines whether to revalidate models during validation if the instance is a subclass of the model. This is important as inheritance is used a lot in Koheesio. Pydantic default is never
, which means that the model and dataclass instances are not revalidated during validation.
-
validate_default=True
This setting determines whether to validate default values during validation. When set to True, default values are checked during the validation process. We opt to set this to True, as we are attempting to make sure that the data is valid prior to running / executing any Step. Pydantic default is False, which means that default values are not validated during validation.
-
frozen=False
This setting determines whether the model is immutable. If set to True, once a model is created, its fields cannot be changed. Pydantic default is also False, which means that the model is mutable.
-
coerce_numbers_to_str=True
This setting determines whether to convert number fields to strings. When set to True, enables automatic coercion of any Number
type to str
. Pydantic doesn't allow number types (int
, float
, Decimal
) to be coerced as type str
by default.
-
use_enum_values=True
This setting determines whether to use the values of Enum fields. If set to True, the actual value of the Enum is used instead of the reference. Pydantic default is False, which means that the reference to the Enum is used.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--fields","title":"Fields","text":"Every Koheesio BaseModel has two fields: name
and description
. These fields are used to provide a name and a description to the model.
-
name
: This is the name of the Model. If not provided, it defaults to the class name.
-
description
: This is the description of the Model. It has several default behaviors (illustrated in the sketch after this list):
- If not provided, it defaults to the docstring of the class.
- If the docstring is not provided, it defaults to the name of the class.
- For multi-line descriptions, it has the following behaviors:
- Only the first non-empty line is used.
- Empty lines are removed.
- Only the first 3 lines are considered.
- Only the first 120 characters are considered.
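A minimal sketch of these name and description defaults; the Person class is illustrative.
from koheesio.models import BaseModel\n\n\nclass Person(BaseModel):\n    \"\"\"Represents a person.\"\"\"\n\n\nperson = Person()\nprint(person.name)  # expected to fall back to the class name: 'Person'\nprint(person.description)  # expected to fall back to the docstring: 'Represents a person.'\n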
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--validators","title":"Validators","text":" _set_name_and_description
: Set the name and description of the Model as per the rules mentioned above.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--properties","title":"Properties","text":" log
: Returns a logger with the name of the class.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--class-methods","title":"Class Methods","text":" from_basemodel
: Returns a new BaseModel instance based on the data of another BaseModel. from_context
: Creates BaseModel instance from a given Context. from_dict
: Creates BaseModel instance from a given dictionary. from_json
: Creates BaseModel instance from a given JSON string. from_toml
: Creates BaseModel object from a given toml file. from_yaml
: Creates BaseModel object from a given yaml file. lazy
: Constructs the model without doing validation.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--dunder-methods","title":"Dunder Methods","text":" __add__
: Allows to add two BaseModel instances together. __enter__
: Allows for using the model in a with-statement. __exit__
: Allows for using the model in a with-statement. __setitem__
: Set Item dunder method for BaseModel. __getitem__
: Get Item dunder method for BaseModel.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--instance-methods","title":"Instance Methods","text":" hasattr
: Check if given key is present in the model. get
: Get an attribute of the model, but don't fail if not present. merge
: Merge key,value map with self. set
: Allows for subscribing / assigning to class[key]
. to_context
: Converts the BaseModel instance to a Context object. to_dict
: Converts the BaseModel instance to a dictionary. to_json
: Converts the BaseModel instance to a JSON string. to_yaml
: Converts the BaseModel instance to a YAML string.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.description","title":"description class-attribute
instance-attribute
","text":"description: Optional[str] = Field(default=None, description='Description of the Model')\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.log","title":"log property
","text":"log: Logger\n
Returns a logger with the name of the class
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(extra='allow', arbitrary_types_allowed=True, populate_by_name=True, validate_assignment=False, revalidate_instances='subclass-instances', validate_default=True, frozen=False, coerce_numbers_to_str=True, use_enum_values=True)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.name","title":"name class-attribute
instance-attribute
","text":"name: Optional[str] = Field(default=None, description='Name of the Model')\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_basemodel","title":"from_basemodel classmethod
","text":"from_basemodel(basemodel: BaseModel, **kwargs)\n
Returns a new BaseModel instance based on the data of another BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_basemodel(cls, basemodel: BaseModel, **kwargs):\n \"\"\"Returns a new BaseModel instance based on the data of another BaseModel\"\"\"\n kwargs = {**basemodel.model_dump(), **kwargs}\n return cls(**kwargs)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_context","title":"from_context classmethod
","text":"from_context(context: Context) -> BaseModel\n
Creates BaseModel instance from a given Context
You have to make sure that the Context object has the necessary attributes to create the model.
Examples:
class SomeStep(BaseModel):\n foo: str\n\n\ncontext = Context(foo=\"bar\")\nsome_step = SomeStep.from_context(context)\nprint(some_step.foo) # prints 'bar'\n
Parameters:
Name Type Description Default context
Context
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_context(cls, context: Context) -> BaseModel:\n \"\"\"Creates BaseModel instance from a given Context\n\n You have to make sure that the Context object has the necessary attributes to create the model.\n\n Examples\n --------\n ```python\n class SomeStep(BaseModel):\n foo: str\n\n\n context = Context(foo=\"bar\")\n some_step = SomeStep.from_context(context)\n print(some_step.foo) # prints 'bar'\n ```\n\n Parameters\n ----------\n context: Context\n\n Returns\n -------\n BaseModel\n \"\"\"\n return cls(**context)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_dict","title":"from_dict classmethod
","text":"from_dict(data: Dict[str, Any]) -> BaseModel\n
Creates BaseModel instance from a given dictionary
Parameters:
Name Type Description Default data
Dict[str, Any]
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_dict(cls, data: Dict[str, Any]) -> BaseModel:\n \"\"\"Creates BaseModel instance from a given dictionary\n\n Parameters\n ----------\n data: Dict[str, Any]\n\n Returns\n -------\n BaseModel\n \"\"\"\n return cls(**data)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_json","title":"from_json classmethod
","text":"from_json(json_file_or_str: Union[str, Path]) -> BaseModel\n
Creates BaseModel instance from a given JSON string
BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored in the BaseModel object, which is not possible with the standard json library.
See Also Context.from_json : Deserializes a JSON string to a Context object
Parameters:
Name Type Description Default json_file_or_str
Union[str, Path]
Pathlike string or Path that points to the json file or string containing json
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_json(cls, json_file_or_str: Union[str, Path]) -> BaseModel:\n \"\"\"Creates BaseModel instance from a given JSON string\n\n BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses\n jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored\n in the BaseModel object, which is not possible with the standard json library.\n\n See Also\n --------\n Context.from_json : Deserializes a JSON string to a Context object\n\n Parameters\n ----------\n json_file_or_str : Union[str, Path]\n Pathlike string or Path that points to the json file or string containing json\n\n Returns\n -------\n BaseModel\n \"\"\"\n _context = Context.from_json(json_file_or_str)\n return cls.from_context(_context)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_toml","title":"from_toml classmethod
","text":"from_toml(toml_file_or_str: Union[str, Path]) -> BaseModel\n
Creates BaseModel object from a given toml file
Note: BaseModel offloads the serialization and deserialization of the TOML string to Context class.
Parameters:
Name Type Description Default toml_file_or_str
Union[str, Path]
Pathlike string or Path that points to the toml file, or string containing toml
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_toml(cls, toml_file_or_str: Union[str, Path]) -> BaseModel:\n \"\"\"Creates BaseModel object from a given toml file\n\n Note: BaseModel offloads the serialization and deserialization of the TOML string to Context class.\n\n Parameters\n ----------\n toml_file_or_str: str or Path\n Pathlike string or Path that points to the toml file, or string containing toml\n\n Returns\n -------\n BaseModel\n \"\"\"\n _context = Context.from_toml(toml_file_or_str)\n return cls.from_context(_context)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_yaml","title":"from_yaml classmethod
","text":"from_yaml(yaml_file_or_str: str) -> BaseModel\n
Creates BaseModel object from a given yaml file
Note: BaseModel offloads the serialization and deserialization of the YAML string to Context class.
Parameters:
Name Type Description Default yaml_file_or_str
str
Pathlike string or Path that points to the yaml file, or string containing yaml
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_yaml(cls, yaml_file_or_str: str) -> BaseModel:\n \"\"\"Creates BaseModel object from a given yaml file\n\n Note: BaseModel offloads the serialization and deserialization of the YAML string to Context class.\n\n Parameters\n ----------\n yaml_file_or_str: str or Path\n Pathlike string or Path that points to the yaml file, or string containing yaml\n\n Returns\n -------\n BaseModel\n \"\"\"\n _context = Context.from_yaml(yaml_file_or_str)\n return cls.from_context(_context)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.get","title":"get","text":"get(key: str, default: Optional[Any] = None)\n
Get an attribute of the model, but don't fail if not present
Similar to dict.get()
Examples:
step_output = StepOutput(foo=\"bar\")\nstep_output.get(\"foo\") # returns 'bar'\nstep_output.get(\"non_existent_key\", \"oops\") # returns 'oops'\n
Parameters:
Name Type Description Default key
str
name of the key to get
required default
Optional[Any]
Default value in case the attribute does not exist
None
Returns:
Type Description Any
The value of the attribute
Source code in src/koheesio/models/__init__.py
def get(self, key: str, default: Optional[Any] = None):\n \"\"\"Get an attribute of the model, but don't fail if not present\n\n Similar to dict.get()\n\n Examples\n --------\n ```python\n step_output = StepOutput(foo=\"bar\")\n step_output.get(\"foo\") # returns 'bar'\n step_output.get(\"non_existent_key\", \"oops\") # returns 'oops'\n ```\n\n Parameters\n ----------\n key: str\n name of the key to get\n default: Optional[Any]\n Default value in case the attribute does not exist\n\n Returns\n -------\n Any\n The value of the attribute\n \"\"\"\n if self.hasattr(key):\n return self.__getitem__(key)\n return default\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.hasattr","title":"hasattr","text":"hasattr(key: str) -> bool\n
Check if given key is present in the model
Parameters:
Name Type Description Default key
str
required Returns:
Type Description bool
Source code in src/koheesio/models/__init__.py
def hasattr(self, key: str) -> bool:\n \"\"\"Check if given key is present in the model\n\n Parameters\n ----------\n key: str\n\n Returns\n -------\n bool\n \"\"\"\n return hasattr(self, key)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.lazy","title":"lazy classmethod
","text":"lazy()\n
Constructs the model without doing validation
Essentially an alias to BaseModel.construct()
Source code in src/koheesio/models/__init__.py
@classmethod\ndef lazy(cls):\n \"\"\"Constructs the model without doing validation\n\n Essentially an alias to BaseModel.construct()\n \"\"\"\n return cls.model_construct()\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.merge","title":"merge","text":"merge(other: Union[Dict, BaseModel])\n
Merge key,value map with self
Functionally similar to adding two dicts together; like running {**dict_a, **dict_b}
.
Examples:
step_output = StepOutput(foo=\"bar\")\nstep_output.merge({\"lorem\": \"ipsum\"}) # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n
Parameters:
Name Type Description Default other
Union[Dict, BaseModel]
Dict or another instance of a BaseModel class that will be added to self
required Source code in src/koheesio/models/__init__.py
def merge(self, other: Union[Dict, BaseModel]):\n \"\"\"Merge key,value map with self\n\n Functionally similar to adding two dicts together; like running `{**dict_a, **dict_b}`.\n\n Examples\n --------\n ```python\n step_output = StepOutput(foo=\"bar\")\n step_output.merge({\"lorem\": \"ipsum\"}) # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n ```\n\n Parameters\n ----------\n other: Union[Dict, BaseModel]\n Dict or another instance of a BaseModel class that will be added to self\n \"\"\"\n if isinstance(other, BaseModel):\n other = other.model_dump() # ensures we really have a dict\n\n for k, v in other.items():\n self.set(k, v)\n\n return self\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.set","title":"set","text":"set(key: str, value: Any)\n
Allows for subscribing / assigning to class[key]
.
Examples:
step_output = StepOutput(foo=\"bar\")\nstep_output.set(\"foo\", \"baz\") # overwrites 'foo' to be 'baz'\n
Parameters:
Name Type Description Default key
str
The key of the attribute to assign to
required value
Any
Value that should be assigned to the given key
required Source code in src/koheesio/models/__init__.py
def set(self, key: str, value: Any):\n    \"\"\"Allows for subscribing / assigning to `class[key]`.\n\n    Examples\n    --------\n    ```python\n    step_output = StepOutput(foo=\"bar\")\n    step_output.set(\"foo\", \"baz\")  # overwrites 'foo' to be 'baz'\n    ```\n\n    Parameters\n    ----------\n    key: str\n        The key of the attribute to assign to\n    value: Any\n        Value that should be assigned to the given key\n    \"\"\"\n    self.__setitem__(key, value)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.to_context","title":"to_context","text":"to_context() -> Context\n
Converts the BaseModel instance to a Context object
Returns:
Type Description Context
Source code in src/koheesio/models/__init__.py
def to_context(self) -> Context:\n \"\"\"Converts the BaseModel instance to a Context object\n\n Returns\n -------\n Context\n \"\"\"\n return Context(**self.to_dict())\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.to_dict","title":"to_dict","text":"to_dict() -> Dict[str, Any]\n
Converts the BaseModel instance to a dictionary
Returns:
Type Description Dict[str, Any]
Source code in src/koheesio/models/__init__.py
def to_dict(self) -> Dict[str, Any]:\n \"\"\"Converts the BaseModel instance to a dictionary\n\n Returns\n -------\n Dict[str, Any]\n \"\"\"\n return self.model_dump()\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.to_json","title":"to_json","text":"to_json(pretty: bool = False)\n
Converts the BaseModel instance to a JSON string
BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored in the BaseModel object, which is not possible with the standard json library.
See Also Context.to_json : Serializes a Context object to a JSON string
Parameters:
Name Type Description Default pretty
bool
Toggles whether to return a pretty json string or not
False
Returns:
Type Description str
containing all parameters of the BaseModel instance
Source code in src/koheesio/models/__init__.py
def to_json(self, pretty: bool = False):\n \"\"\"Converts the BaseModel instance to a JSON string\n\n BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses\n jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored\n in the BaseModel object, which is not possible with the standard json library.\n\n See Also\n --------\n Context.to_json : Serializes a Context object to a JSON string\n\n Parameters\n ----------\n pretty : bool, optional, default=False\n Toggles whether to return a pretty json string or not\n\n Returns\n -------\n str\n containing all parameters of the BaseModel instance\n \"\"\"\n _context = self.to_context()\n return _context.to_json(pretty=pretty)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.to_yaml","title":"to_yaml","text":"to_yaml(clean: bool = False) -> str\n
Converts the BaseModel instance to a YAML string
BaseModel offloads the serialization and deserialization of the YAML string to Context class.
Parameters:
Name Type Description Default clean
bool
Toggles whether to remove !!python/object:...
from yaml or not. Default: False
False
Returns:
Type Description str
containing all parameters of the BaseModel instance
Source code in src/koheesio/models/__init__.py
def to_yaml(self, clean: bool = False) -> str:\n \"\"\"Converts the BaseModel instance to a YAML string\n\n BaseModel offloads the serialization and deserialization of the YAML string to Context class.\n\n Parameters\n ----------\n clean: bool\n Toggles whether to remove `!!python/object:...` from yaml or not.\n Default: False\n\n Returns\n -------\n str\n containing all parameters of the BaseModel instance\n \"\"\"\n _context = self.to_context()\n return _context.to_yaml(clean=clean)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.validate","title":"validate","text":"validate() -> BaseModel\n
Validate the BaseModel instance
This method is used to validate the BaseModel instance. It is used in conjunction with the lazy method to validate the instance after all the attributes have been set.
This method is intended to be used with the lazy
method. The lazy
method is used to create an instance of the BaseModel without immediate validation. The validate
method is then used to validate the instance after.
Note: in the Pydantic BaseModel, the validate
method raises a deprecation warning. This is because Pydantic recommends using the validate_model
method instead. However, we are using the validate
method here in a different context and a slightly different way.
Examples:
class FooModel(BaseModel):\n foo: str\n lorem: str\n\n\nfoo_model = FooModel.lazy()\nfoo_model.foo = \"bar\"\nfoo_model.lorem = \"ipsum\"\nfoo_model.validate()\n
In this example, the foo_model
instance is created without immediate validation. The attributes foo and lorem are set afterward. The validate
method is then called to validate the instance. Returns:
Type Description BaseModel
The BaseModel instance
Source code in src/koheesio/models/__init__.py
def validate(self) -> BaseModel:\n \"\"\"Validate the BaseModel instance\n\n This method is used to validate the BaseModel instance. It is used in conjunction with the lazy method to\n validate the instance after all the attributes have been set.\n\n This method is intended to be used with the `lazy` method. The `lazy` method is used to create an instance of\n the BaseModel without immediate validation. The `validate` method is then used to validate the instance after.\n\n > Note: in the Pydantic BaseModel, the `validate` method throws a deprecated warning. This is because Pydantic\n recommends using the `validate_model` method instead. However, we are using the `validate` method here in a\n different context and a slightly different way.\n\n Examples\n --------\n ```python\n class FooModel(BaseModel):\n foo: str\n lorem: str\n\n\n foo_model = FooModel.lazy()\n foo_model.foo = \"bar\"\n foo_model.lorem = \"ipsum\"\n foo_model.validate()\n ```\n In this example, the `foo_model` instance is created without immediate validation. The attributes foo and lorem\n are set afterward. The `validate` method is then called to validate the instance.\n\n Returns\n -------\n BaseModel\n The BaseModel instance\n \"\"\"\n return self.model_validate(self.model_dump())\n
"},{"location":"api_reference/models/index.html#koheesio.models.ExtraParamsMixin","title":"koheesio.models.ExtraParamsMixin","text":"Mixin class that adds support for arbitrary keyword arguments to Pydantic models.
The keyword arguments are extracted from the model's values
and moved to a params
dictionary.
"},{"location":"api_reference/models/index.html#koheesio.models.ExtraParamsMixin.extra_params","title":"extra_params cached
property
","text":"extra_params: Dict[str, Any]\n
Extract params (passed as arbitrary kwargs) from values and move them to params dict
"},{"location":"api_reference/models/index.html#koheesio.models.ExtraParamsMixin.params","title":"params class-attribute
instance-attribute
","text":"params: Dict[str, Any] = Field(default_factory=dict)\n
"},{"location":"api_reference/models/sql.html","title":"Sql","text":"This module contains the base class for SQL steps.
"},{"location":"api_reference/models/sql.html#koheesio.models.sql.SqlBaseStep","title":"koheesio.models.sql.SqlBaseStep","text":"Base class for SQL steps
params
are used as placeholders for templating. These are identified with ${placeholder} in the SQL script.
Parameters:
Name Type Description Default sql_path
Path to a SQL file
required sql
SQL script to apply
required params
Placeholders (parameters) for templating. These are identified with ${placeholder}
in the SQL script.
Note: any arbitrary kwargs passed to the class will be added to params.
required"},{"location":"api_reference/models/sql.html#koheesio.models.sql.SqlBaseStep.params","title":"params class-attribute
instance-attribute
","text":"params: Dict[str, Any] = Field(default_factory=dict, description='Placeholders (parameters) for templating. These are identified with ${placeholder} in the SQL script. Note: any arbitrary kwargs passed to the class will be added to params.')\n
"},{"location":"api_reference/models/sql.html#koheesio.models.sql.SqlBaseStep.query","title":"query property
","text":"query\n
Returns the query while performing params replacement
"},{"location":"api_reference/models/sql.html#koheesio.models.sql.SqlBaseStep.sql","title":"sql class-attribute
instance-attribute
","text":"sql: Optional[str] = Field(default=None, description='SQL script to apply')\n
"},{"location":"api_reference/models/sql.html#koheesio.models.sql.SqlBaseStep.sql_path","title":"sql_path class-attribute
instance-attribute
","text":"sql_path: Optional[Union[Path, str]] = Field(default=None, description='Path to a SQL file')\n
"},{"location":"api_reference/notifications/index.html","title":"Notifications","text":"Notification module for sending messages to notification services (e.g. Slack, Email, etc.)
"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity","title":"koheesio.notifications.NotificationSeverity","text":"Enumeration of allowed message severities
"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity.ERROR","title":"ERROR class-attribute
instance-attribute
","text":"ERROR = 'error'\n
"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity.INFO","title":"INFO class-attribute
instance-attribute
","text":"INFO = 'info'\n
"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity.SUCCESS","title":"SUCCESS class-attribute
instance-attribute
","text":"SUCCESS = 'success'\n
"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity.WARN","title":"WARN class-attribute
instance-attribute
","text":"WARN = 'warn'\n
"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity.alert_icon","title":"alert_icon property
","text":"alert_icon: str\n
Return a colored circle in slack markup
"},{"location":"api_reference/notifications/slack.html","title":"Slack","text":"Classes to ease interaction with Slack
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification","title":"koheesio.notifications.slack.SlackNotification","text":"Generic Slack notification class via the Blocks
API
NOTE: channel
parameter is used only with Slack Web API: https://api.slack.com/messaging/sending If webhook is used, the channel specification is not required
Example:
s = SlackNotification(\n url=\"slack-webhook-url\",\n channel=\"channel\",\n message=\"Some *markdown* compatible text\",\n)\ns.execute()\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification.channel","title":"channel class-attribute
instance-attribute
","text":"channel: Optional[str] = Field(default=None, description='Slack channel id')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification.headers","title":"headers class-attribute
instance-attribute
","text":"headers: Optional[Dict[str, Any]] = {'Content-type': 'application/json'}\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification.message","title":"message class-attribute
instance-attribute
","text":"message: str = Field(default=..., description='The message that gets posted to Slack')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification.execute","title":"execute","text":"execute()\n
Generate payload and send post request
Source code in src/koheesio/notifications/slack.py
def execute(self):\n \"\"\"\n Generate payload and send post request\n \"\"\"\n self.data = self.get_payload()\n HttpPostStep.execute(self)\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification.get_payload","title":"get_payload","text":"get_payload()\n
Generate payload with Block Kit
. More details: https://api.slack.com/block-kit
Source code in src/koheesio/notifications/slack.py
def get_payload(self):\n \"\"\"\n Generate payload with `Block Kit`.\n More details: https://api.slack.com/block-kit\n \"\"\"\n payload = {\n \"attachments\": [\n {\n \"blocks\": [\n {\n \"type\": \"section\",\n \"text\": {\n \"type\": \"mrkdwn\",\n \"text\": self.message,\n },\n }\n ],\n }\n ]\n }\n\n if self.channel:\n payload[\"channel\"] = self.channel\n\n return json.dumps(payload)\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity","title":"koheesio.notifications.slack.SlackNotificationWithSeverity","text":"Slack notification class via the Blocks
API with etra severity information and predefined extra fields
Example: from koheesio.steps.integrations.notifications import NotificationSeverity
s = SlackNotificationWithSeverity(\n url=\"slack-webhook-url\",\n channel=\"channel\",\n message=\"Some *markdown* compatible text\"\n severity=NotificationSeverity.ERROR,\n title=\"Title\",\n environment=\"dev\",\n application=\"Application\"\n)\ns.execute()\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.application","title":"application class-attribute
instance-attribute
","text":"application: str = Field(default=..., description='Pipeline or application name')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.environment","title":"environment class-attribute
instance-attribute
","text":"environment: str = Field(default=..., description='Environment description, e.g. dev / qa /prod')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(use_enum_values=False)\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.severity","title":"severity class-attribute
instance-attribute
","text":"severity: NotificationSeverity = Field(default=..., description='Severity of the message')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.timestamp","title":"timestamp class-attribute
instance-attribute
","text":"timestamp: datetime = Field(default=utcnow(), alias='execution_timestamp', description='Pipeline or application execution timestamp')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.title","title":"title class-attribute
instance-attribute
","text":"title: str = Field(default=..., description='Title of your message')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.execute","title":"execute","text":"execute()\n
Generate payload and send post request
Source code in src/koheesio/notifications/slack.py
def execute(self):\n \"\"\"\n Generate payload and send post request\n \"\"\"\n self.message = self.get_payload_message()\n self.data = self.get_payload()\n HttpPostStep.execute(self)\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.get_payload_message","title":"get_payload_message","text":"get_payload_message()\n
Generate payload message based on the predefined set of parameters
Source code in src/koheesio/notifications/slack.py
def get_payload_message(self):\n \"\"\"\n Generate payload message based on the predefined set of parameters\n \"\"\"\n return dedent(\n f\"\"\"\n {self.severity.alert_icon} *{self.severity.name}:* {self.title}\n *Environment:* {self.environment}\n *Application:* {self.application}\n *Message:* {self.message}\n *Timestamp:* {self.timestamp}\n \"\"\"\n )\n
"},{"location":"api_reference/secrets/index.html","title":"Secrets","text":"Module for secret integrations.
Contains abstract class for various secret integrations also known as SecretContext.
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret","title":"koheesio.secrets.Secret","text":"Abstract class for various secret integrations. All secrets are wrapped into Context class for easy access. Either existing context can be provided, or new context will be created and returned at runtime.
Secrets are wrapped into the pydantic.SecretStr.
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.context","title":"context class-attribute
instance-attribute
","text":"context: Optional[Context] = Field(Context({}), description='Existing `Context` instance can be used for secrets, otherwise new empty context will be created.')\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.parent","title":"parent class-attribute
instance-attribute
","text":"parent: Optional[str] = Field(default=..., description='Group secrets from one secure path under this friendly name', pattern='^[a-zA-Z0-9_]+$')\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.root","title":"root class-attribute
instance-attribute
","text":"root: Optional[str] = Field(default='secrets', description='All secrets will be grouped under this root.')\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.Output","title":"Output","text":"Output class for Secret.
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.Output.context","title":"context class-attribute
instance-attribute
","text":"context: Context = Field(default=..., description='Koheesio context')\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.encode_secret_values","title":"encode_secret_values classmethod
","text":"encode_secret_values(data: dict)\n
Encode secret values in the dictionary.
Ensures that all values in the dictionary are wrapped in SecretStr.
Source code in src/koheesio/secrets/__init__.py
@classmethod\ndef encode_secret_values(cls, data: dict):\n \"\"\"Encode secret values in the dictionary.\n\n Ensures that all values in the dictionary are wrapped in SecretStr.\n \"\"\"\n encoded_dict = {}\n for key, value in data.items():\n if isinstance(value, dict):\n encoded_dict[key] = cls.encode_secret_values(value)\n else:\n encoded_dict[key] = SecretStr(value)\n return encoded_dict\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.execute","title":"execute","text":"execute()\n
Main method to handle secrets protection and context creation with \"root-parent-secrets\" structure.
Source code in src/koheesio/secrets/__init__.py
def execute(self):\n \"\"\"\n Main method to handle secrets protection and context creation with \"root-parent-secrets\" structure.\n \"\"\"\n context = Context(self.encode_secret_values(data={self.root: {self.parent: self._get_secrets()}}))\n self.output.context = self.context.merge(context=context)\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.get","title":"get","text":"get() -> Context\n
Convenience method to return context with secrets.
Source code in src/koheesio/secrets/__init__.py
def get(self) -> Context:\n \"\"\"\n Convenience method to return context with secrets.\n \"\"\"\n self.execute()\n return self.output.context\n
"},{"location":"api_reference/secrets/cerberus.html","title":"Cerberus","text":"Module for retrieving secrets from Cerberus.
Secrets are stored as SecretContext and can be accessed accordingly.
See CerberusSecret for more information.
"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret","title":"koheesio.secrets.cerberus.CerberusSecret","text":"Retrieve secrets from Cerberus and wrap them into Context class for easy access. All secrets are stored under the \"secret\" root and \"parent\". \"Parent\" either derived from the secure data path by replacing \"/\" and \"-\", or manually provided by the user. Secrets are wrapped into the pydantic.SecretStr.
Example:
context = {\n \"secrets\": {\n \"parent\": {\n \"webhook\": SecretStr(\"**********\"),\n \"description\": SecretStr(\"**********\"),\n }\n }\n}\n
Values can be decoded like this:
context.secrets.parent.webhook.get_secret_value()\n
or if working with dictionary is preferable: for key, value in context.get_all().items():\n value.get_secret_value()\n
"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret.aws_session","title":"aws_session class-attribute
instance-attribute
","text":"aws_session: Optional[Session] = Field(default=None, description='AWS Session to pass to Cerberus client, can be used for local execution.')\n
"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret.path","title":"path class-attribute
instance-attribute
","text":"path: str = Field(default=..., description=\"Secure data path, eg. 'app/my-sdb/my-secrets'\")\n
"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret.token","title":"token class-attribute
instance-attribute
","text":"token: Optional[SecretStr] = Field(default=get('CERBERUS_TOKEN', None), description='Cerberus token, can be used for local development without AWS auth mechanism.Note: Token has priority over AWS session.')\n
"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret.url","title":"url class-attribute
instance-attribute
","text":"url: str = Field(default=..., description='Cerberus URL, eg. https://cerberus.domain.com')\n
"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret.verbose","title":"verbose class-attribute
instance-attribute
","text":"verbose: bool = Field(default=False, description='Enable verbose for Cerberus client')\n
"},{"location":"api_reference/spark/index.html","title":"Spark","text":"Spark step module
"},{"location":"api_reference/spark/index.html#koheesio.spark.AnalysisException","title":"koheesio.spark.AnalysisException module-attribute
","text":"AnalysisException = AnalysisException\n
"},{"location":"api_reference/spark/index.html#koheesio.spark.DataFrame","title":"koheesio.spark.DataFrame module-attribute
","text":"DataFrame = DataFrame\n
"},{"location":"api_reference/spark/index.html#koheesio.spark.SparkSession","title":"koheesio.spark.SparkSession module-attribute
","text":"SparkSession = SparkSession\n
"},{"location":"api_reference/spark/index.html#koheesio.spark.SparkStep","title":"koheesio.spark.SparkStep","text":"Base class for a Spark step
Extends the Step class with SparkSession support. The following: - Spark steps are expected to return a Spark DataFrame as output. - spark property is available to access the active SparkSession instance.
"},{"location":"api_reference/spark/index.html#koheesio.spark.SparkStep.spark","title":"spark property
","text":"spark: Optional[SparkSession]\n
Get active SparkSession instance
"},{"location":"api_reference/spark/index.html#koheesio.spark.SparkStep.Output","title":"Output","text":"Output class for SparkStep
"},{"location":"api_reference/spark/index.html#koheesio.spark.SparkStep.Output.df","title":"df class-attribute
instance-attribute
","text":"df: Optional[DataFrame] = Field(default=None, description='The Spark DataFrame')\n
"},{"location":"api_reference/spark/index.html#koheesio.spark.current_timestamp_utc","title":"koheesio.spark.current_timestamp_utc","text":"current_timestamp_utc(spark: SparkSession) -> Column\n
Get the current timestamp in UTC
Source code in src/koheesio/spark/__init__.py
def current_timestamp_utc(spark: SparkSession) -> Column:\n \"\"\"Get the current timestamp in UTC\"\"\"\n return F.to_utc_timestamp(F.current_timestamp(), spark.conf.get(\"spark.sql.session.timeZone\"))\n
"},{"location":"api_reference/spark/delta.html","title":"Delta","text":"Module for creating and managing Delta tables.
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep","title":"koheesio.spark.delta.DeltaTableStep","text":"Class for creating and managing Delta tables.
DeltaTable aims to provide a simple interface to create and manage Delta tables. It is a wrapper around the Spark SQL API for Delta tables.
Example from koheesio.steps import DeltaTableStep\n\nDeltaTableStep(\n table=\"my_table\",\n database=\"my_database\",\n catalog=\"my_catalog\",\n create_if_not_exists=True,\n default_create_properties={\n \"delta.randomizeFilePrefixes\": \"true\",\n \"delta.checkpoint.writeStatsAsStruct\": \"true\",\n \"delta.minReaderVersion\": \"2\",\n \"delta.minWriterVersion\": \"5\",\n },\n)\n
Methods:
Name Description get_persisted_properties
Get persisted properties of table.
add_property
Alter table and set table property.
add_properties
Alter table and add properties.
execute
Nothing to execute on a Table.
max_version_ts_of_last_execution
Max version timestamp of last execution. If no timestamp is found, returns 1900-01-01 00:00:00. Note: will raise an error if column VERSION_TIMESTAMP
does not exist.
Properties - name -> str Deprecated. Use
.table_name
instead. - table_name -> str Table name.
- dataframe -> DataFrame Returns a DataFrame to be able to interact with this table.
- columns -> Optional[List[str]] Returns all column names as a list.
- has_change_type -> bool Checks if a column named
_change_type
is present in the table. - exists -> bool Check if table exists.
Parameters:
Name Type Description Default table
str
Table name.
required database
str
Database or Schema name.
None
catalog
str
Catalog name.
None
create_if_not_exists
bool
Force table creation if it doesn't exist. Note: Default properties will be applied to the table during CREATION.
False
default_create_properties
Dict[str, str]
Default table properties to be applied during CREATION if force_creation
True.
{\"delta.randomizeFilePrefixes\": \"true\", \"delta.checkpoint.writeStatsAsStruct\": \"true\", \"delta.minReaderVersion\": \"2\", \"delta.minWriterVersion\": \"5\"}
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.catalog","title":"catalog class-attribute
instance-attribute
","text":"catalog: Optional[str] = Field(default=None, description='Catalog name. Note: Can be ignored if using a SparkCatalog that does not support catalog notation (e.g. Hive)')\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.columns","title":"columns property
","text":"columns: Optional[List[str]]\n
Returns all column names as a list.
Example DeltaTableStep(...).columns\n
Would for example return ['age', 'name']
if the table has columns age
and name
."},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.create_if_not_exists","title":"create_if_not_exists class-attribute
instance-attribute
","text":"create_if_not_exists: bool = Field(default=False, alias='force_creation', description=\"Force table creation if it doesn't exist.Note: Default properties will be applied to the table during CREATION.\")\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.database","title":"database class-attribute
instance-attribute
","text":"database: Optional[str] = Field(default=None, description='Database or Schema name.')\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.dataframe","title":"dataframe property
","text":"dataframe: DataFrame\n
Returns a DataFrame to be able to interact with this table
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.default_create_properties","title":"default_create_properties class-attribute
instance-attribute
","text":"default_create_properties: Dict[str, Union[str, bool, int]] = Field(default={'delta.randomizeFilePrefixes': 'true', 'delta.checkpoint.writeStatsAsStruct': 'true', 'delta.minReaderVersion': '2', 'delta.minWriterVersion': '5'}, description='Default table properties to be applied during CREATION if `create_if_not_exists` True')\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.exists","title":"exists property
","text":"exists: bool\n
Check if table exists
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.has_change_type","title":"has_change_type property
","text":"has_change_type: bool\n
Checks if a column named _change_type
is present in the table
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.is_cdf_active","title":"is_cdf_active property
","text":"is_cdf_active: bool\n
Check if CDF property is set and activated
Returns:
Type Description bool
delta.enableChangeDataFeed property is set to 'true'
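A small sketch, assuming sufficient permissions on the table, of how this check might be combined with add_property to enable Change Data Feed:

dts = DeltaTableStep(table="my_table", database="my_database")
if not dts.is_cdf_active:
    # delta.enableChangeDataFeed is the table property this check inspects
    dts.add_property(key="delta.enableChangeDataFeed", value=True)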
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.table","title":"table instance-attribute
","text":"table: str\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.table_name","title":"table_name property
","text":"table_name: str\n
Fully qualified table name in the form of catalog.database.table
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.add_properties","title":"add_properties","text":"add_properties(properties: Dict[str, Union[str, bool, int]], override: bool = False)\n
Alter table and add properties.
Parameters:
Name Type Description Default properties
Dict[str, Union[str, int, bool]]
Properties to be added to table.
required override
bool
Enable override of existing value for property in table.
False
Source code in src/koheesio/spark/delta.py
def add_properties(self, properties: Dict[str, Union[str, bool, int]], override: bool = False):\n \"\"\"Alter table and add properties.\n\n Parameters\n ----------\n properties : Dict[str, Union[str, int, bool]]\n Properties to be added to table.\n override : bool, optional, default=False\n Enable override of existing value for property in table.\n\n \"\"\"\n for k, v in properties.items():\n v_str = str(v) if not isinstance(v, bool) else str(v).lower()\n self.add_property(key=k, value=v_str, override=override)\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.add_property","title":"add_property","text":"add_property(key: str, value: Union[str, int, bool], override: bool = False)\n
Alter table and set table property.
Parameters:
Name Type Description Default key
str
Property key(name).
required value
Union[str, int, bool]
Property value.
required override
bool
Enable override of existing value for property in table.
False
Source code in src/koheesio/spark/delta.py
def add_property(self, key: str, value: Union[str, int, bool], override: bool = False):\n \"\"\"Alter table and set table property.\n\n Parameters\n ----------\n key: str\n Property key(name).\n value: Union[str, int, bool]\n Property value.\n override: bool\n Enable override of existing value for property in table.\n\n \"\"\"\n persisted_properties = self.get_persisted_properties()\n v_str = str(value) if not isinstance(value, bool) else str(value).lower()\n\n def _alter_table() -> None:\n property_pair = f\"'{key}'='{v_str}'\"\n\n try:\n # noinspection SqlNoDataSourceInspection\n self.spark.sql(f\"ALTER TABLE {self.table_name} SET TBLPROPERTIES ({property_pair})\")\n self.log.debug(f\"Table `{self.table_name}` has been altered. Property `{property_pair}` added.\")\n except Py4JJavaError as e:\n msg = f\"Property `{key}` can not be applied to table `{self.table_name}`. Exception: {e}\"\n self.log.warning(msg)\n warnings.warn(msg)\n\n if self.exists:\n if key in persisted_properties and persisted_properties[key] != v_str:\n if override:\n self.log.debug(\n f\"Property `{key}` presents in `{self.table_name}` and has value `{persisted_properties[key]}`.\"\n f\"Override is enabled.The value will be changed to `{v_str}`.\"\n )\n _alter_table()\n else:\n self.log.debug(\n f\"Skipping adding property `{key}`, because it is already set \"\n f\"for table `{self.table_name}` to `{v_str}`. To override it, provide override=True\"\n )\n else:\n _alter_table()\n else:\n self.default_create_properties[key] = v_str\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.execute","title":"execute","text":"execute()\n
Nothing to execute on a Table
Source code in src/koheesio/spark/delta.py
def execute(self):\n \"\"\"Nothing to execute on a Table\"\"\"\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.get_column_type","title":"get_column_type","text":"get_column_type(column: str) -> Optional[DataType]\n
Get the type of a column in the table.
Parameters:
Name Type Description Default column
str
Column name.
required Returns:
Type Description Optional[DataType]
Column type.
Source code in src/koheesio/spark/delta.py
def get_column_type(self, column: str) -> Optional[DataType]:\n \"\"\"Get the type of a column in the table.\n\n Parameters\n ----------\n column : str\n Column name.\n\n Returns\n -------\n Optional[DataType]\n Column type.\n \"\"\"\n return self.dataframe.schema[column].dataType if self.columns and column in self.columns else None\n
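For illustration, assuming a table with an age column as in the columns example above:

age_type = DeltaTableStep(table="my_table").get_column_type("age")
# the column's Spark DataType, or None if the column does not exist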
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.get_persisted_properties","title":"get_persisted_properties","text":"get_persisted_properties() -> Dict[str, str]\n
Get persisted properties of table.
Returns:
Type Description Dict[str, str]
Persisted properties as a dictionary.
Source code in src/koheesio/spark/delta.py
def get_persisted_properties(self) -> Dict[str, str]:\n \"\"\"Get persisted properties of table.\n\n Returns\n -------\n Dict[str, str]\n Persisted properties as a dictionary.\n \"\"\"\n persisted_properties = {}\n raw_options = self.spark.sql(f\"SHOW TBLPROPERTIES {self.table_name}\").collect()\n\n for ro in raw_options:\n key, value = ro.asDict().values()\n persisted_properties[key] = value\n\n return persisted_properties\n
"},{"location":"api_reference/spark/etl_task.html","title":"Etl task","text":"ETL Task
Extract -> Transform -> Load
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask","title":"koheesio.spark.etl_task.EtlTask","text":"ETL Task
Etl stands for: Extract -> Transform -> Load
This task is a composition of a Reader (extract), a series of Transformations (transform) and a Writer (load). In other words, it reads data from a source, applies a series of transformations, and writes the result to a target.
Parameters:
Name Type Description Default name
str
Name of the task
required description
str
Description of the task
required source
Reader
Source to read from [extract]
required transformations
list[Transformation]
Series of transformations [transform]. The order of the transformations is important!
required target
Writer
Target to write to [load]
required Example from koheesio.tasks import EtlTask\n\nfrom koheesio.steps.readers import CsvReader\nfrom koheesio.steps.transformations.repartition import Repartition\nfrom koheesio.steps.writers.dummy import DummyWriter  # assumed import path for DummyWriter\n\netl_task = EtlTask(\n    name=\"My ETL Task\",\n    description=\"This is an example ETL task\",\n    source=CsvReader(path=\"path/to/source.csv\"),\n    transformations=[Repartition(num_partitions=2)],\n    target=DummyWriter(),\n)\n\netl_task.execute()\n
This code will read from a CSV file, repartition the DataFrame to 2 partitions, and write the result to the console.
Extending the EtlTask The EtlTask is designed to be a simple and flexible way to define ETL processes. It is not designed to be a one-size-fits-all solution, but rather a starting point for building more complex ETL processes. If you need more complex functionality, you can extend the EtlTask class and override the extract
, transform
and load
methods. You can also implement your own execute
method to define the entire ETL process from scratch should you need more flexibility.
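A minimal sketch of what such a subclass could look like; MyEtlTask and the added dropDuplicates step are purely illustrative and assume the transformations operate on a Spark DataFrame:

from pyspark.sql import DataFrame

class MyEtlTask(EtlTask):
    """Hypothetical EtlTask subclass, for illustration only."""

    def transform(self, df: DataFrame) -> DataFrame:
        # run the configured transformations first, then append a custom step
        df = super().transform(df)
        return df.dropDuplicates()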
Advantages of using the EtlTask - It is a simple way to define ETL processes
- It is easy to understand and extend
- It is easy to test and debug
- It is easy to maintain and refactor
- It is easy to integrate with other tools and libraries
- It is easy to use in a production environment
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.etl_date","title":"etl_date class-attribute
instance-attribute
","text":"etl_date: datetime = Field(default=utcnow(), description=\"Date time when this object was created as iso format. Example: '2023-01-24T09:39:23.632374'\")\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.source","title":"source class-attribute
instance-attribute
","text":"source: InstanceOf[Reader] = Field(default=..., description='Source to read from [extract]')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.target","title":"target class-attribute
instance-attribute
","text":"target: InstanceOf[Writer] = Field(default=..., description='Target to write to [load]')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.transformations","title":"transformations class-attribute
instance-attribute
","text":"transformations: conlist(min_length=0, item_type=InstanceOf[Transformation]) = Field(default_factory=list, description='Series of transformations', alias='transforms')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.Output","title":"Output","text":"Output class for EtlTask
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.Output.source_df","title":"source_df class-attribute
instance-attribute
","text":"source_df: DataFrame = Field(default=..., description='The Spark DataFrame produced by .extract() method')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.Output.target_df","title":"target_df class-attribute
instance-attribute
","text":"target_df: DataFrame = Field(default=..., description='The Spark DataFrame used by .load() method')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.Output.transform_df","title":"transform_df class-attribute
instance-attribute
","text":"transform_df: DataFrame = Field(default=..., description='The Spark DataFrame produced by .transform() method')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.execute","title":"execute","text":"execute()\n
Run the ETL process
Source code in src/koheesio/spark/etl_task.py
def execute(self):\n \"\"\"Run the ETL process\"\"\"\n self.log.info(f\"Task started at {self.etl_date}\")\n\n # extract from source\n self.output.source_df = self.extract()\n\n # transform\n self.output.transform_df = self.transform(self.output.source_df)\n\n # load to target\n self.output.target_df = self.load(self.output.transform_df)\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.extract","title":"extract","text":"extract() -> DataFrame\n
Read from Source
logging is handled by the Reader.execute()-method's @do_execute decorator
Source code in src/koheesio/spark/etl_task.py
def extract(self) -> DataFrame:\n \"\"\"Read from Source\n\n logging is handled by the Reader.execute()-method's @do_execute decorator\n \"\"\"\n reader: Reader = self.source\n return reader.read()\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.load","title":"load","text":"load(df: DataFrame) -> DataFrame\n
Write to Target
logging is handled by the Writer.execute()-method's @do_execute decorator
Source code in src/koheesio/spark/etl_task.py
def load(self, df: DataFrame) -> DataFrame:\n \"\"\"Write to Target\n\n logging is handled by the Writer.execute()-method's @do_execute decorator\n \"\"\"\n writer: Writer = self.target\n writer.write(df)\n return df\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.run","title":"run","text":"run()\n
alias of execute
Source code in src/koheesio/spark/etl_task.py
def run(self):\n \"\"\"alias of execute\"\"\"\n return self.execute()\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.transform","title":"transform","text":"transform(df: DataFrame) -> DataFrame\n
Transform recursively
logging is handled by the Transformation.execute()-method's @do_execute decorator
Source code in src/koheesio/spark/etl_task.py
def transform(self, df: DataFrame) -> DataFrame:\n \"\"\"Transform recursively\n\n logging is handled by the Transformation.execute()-method's @do_execute decorator\n \"\"\"\n for t in self.transformations:\n df = t.transform(df)\n return df\n
"},{"location":"api_reference/spark/snowflake.html","title":"Snowflake","text":"Snowflake steps and tasks for Koheesio
Every class in this module is a subclass of Step
or Task
and is used to perform operations on Snowflake.
Notes Every Step in this module is based on SnowflakeBaseModel. The following parameters are available for every Step.
Parameters:
Name Type Description Default url
str
Hostname for the Snowflake account, e.g. <account>.snowflakecomputing.com. Alias for sfURL
. required user
str
Login name for the Snowflake user. Alias for sfUser
.
required password
SecretStr
Password for the Snowflake user. Alias for sfPassword
.
required database
str
The database to use for the session after connecting. Alias for sfDatabase
.
required sfSchema
str
The schema to use for the session after connecting. Alias for schema
(\"schema\" is a reserved name in Pydantic, so we use sfSchema
as main name instead).
required role
str
The default security role to use for the session after connecting. Alias for sfRole
.
required warehouse
str
The default virtual warehouse to use for the session after connecting. Alias for sfWarehouse
.
required authenticator
Optional[str]
Authenticator for the Snowflake user. Example: \"okta.com\".
None
options
Optional[Dict[str, Any]]
Extra options to pass to the Snowflake connector.
{\"sfCompress\": \"on\", \"continue_on_error\": \"off\"}
format
str
The default snowflake
format can be used natively in Databricks, use net.snowflake.spark.snowflake
in other environments and make sure to install required JARs.
\"snowflake\"
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn","title":"koheesio.spark.snowflake.AddColumn","text":"Add an empty column to a Snowflake table with given name and DataType
Example AddColumn(\n database=\"MY_DB\",\n schema_=\"MY_SCHEMA\",\n warehouse=\"MY_WH\",\n user=\"gid.account@nike.com\",\n password=Secret(\"super-secret-password\"),\n role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n table=\"MY_TABLE\",\n col=\"MY_COL\",\n dataType=StringType(),\n).execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.column","title":"column class-attribute
instance-attribute
","text":"column: str = Field(default=..., description='The name of the new column')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.table","title":"table class-attribute
instance-attribute
","text":"table: str = Field(default=..., description='The name of the Snowflake table')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.type","title":"type class-attribute
instance-attribute
","text":"type: DataType = Field(default=..., description='The DataType represented as a Spark DataType')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.Output","title":"Output","text":"Output class for AddColumn
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.Output.query","title":"query class-attribute
instance-attribute
","text":"query: str = Field(default=..., description='Query that was executed to add the column')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n query = f\"ALTER TABLE {self.table} ADD COLUMN {self.column} {map_spark_type(self.type)}\".upper()\n self.output.query = query\n RunQuery(**self.get_options(), query=query).execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame","title":"koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame","text":"Create (or Replace) a Snowflake table which has the same schema as a Spark DataFrame
Can be used as any Transformation. The DataFrame is however left unchanged, and only used for determining the schema of the Snowflake Table that is to be created (or replaced).
Example CreateOrReplaceTableFromDataFrame(\n database=\"MY_DB\",\n schema=\"MY_SCHEMA\",\n warehouse=\"MY_WH\",\n user=\"gid.account@nike.com\",\n password=\"super-secret-password\",\n role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n table=\"MY_TABLE\",\n df=df,\n).execute()\n
Or, as a Transformation:
CreateOrReplaceTableFromDataFrame(\n ...\n table=\"MY_TABLE\",\n).transform(df)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.table","title":"table class-attribute
instance-attribute
","text":"table: str = Field(default=..., alias='table_name', description='The name of the (new) table')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.Output","title":"Output","text":"Output class for CreateOrReplaceTableFromDataFrame
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.Output.input_schema","title":"input_schema class-attribute
instance-attribute
","text":"input_schema: StructType = Field(default=..., description='The original schema from the input DataFrame')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.Output.query","title":"query class-attribute
instance-attribute
","text":"query: str = Field(default=..., description='Query that was executed to create the table')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.Output.snowflake_schema","title":"snowflake_schema class-attribute
instance-attribute
","text":"snowflake_schema: str = Field(default=..., description='Derived Snowflake table schema based on the input DataFrame')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n self.output.df = self.df\n\n input_schema = self.df.schema\n self.output.input_schema = input_schema\n\n snowflake_schema = \", \".join([f\"{c.name} {map_spark_type(c.dataType)}\" for c in input_schema])\n self.output.snowflake_schema = snowflake_schema\n\n table_name = f\"{self.database}.{self.sfSchema}.{self.table}\"\n query = f\"CREATE OR REPLACE TABLE {table_name} ({snowflake_schema})\"\n self.output.query = query\n\n RunQuery(**self.get_options(), query=query).execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.DbTableQuery","title":"koheesio.spark.snowflake.DbTableQuery","text":"Read table from Snowflake using the dbtable
option instead of query
Example DbTableQuery(\n database=\"MY_DB\",\n schema_=\"MY_SCHEMA\",\n warehouse=\"MY_WH\",\n user=\"user\",\n password=Secret(\"super-secret-password\"),\n role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n table=\"db.schema.table\",\n).execute().df\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.DbTableQuery.dbtable","title":"dbtable class-attribute
instance-attribute
","text":"dbtable: str = Field(default=..., alias='table', description='The name of the table')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GetTableSchema","title":"koheesio.spark.snowflake.GetTableSchema","text":"Get the schema from a Snowflake table as a Spark Schema
Notes - This Step will execute a
SELECT * FROM <table> LIMIT 1
query to get the schema of the table. - The schema will be stored in the
table_schema
attribute of the output. table_schema
is used as the attribute name to avoid conflicts with the schema
attribute of Pydantic's BaseModel.
Example schema = (\n GetTableSchema(\n database=\"MY_DB\",\n schema_=\"MY_SCHEMA\",\n warehouse=\"MY_WH\",\n user=\"gid.account@nike.com\",\n password=\"super-secret-password\",\n role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n table=\"MY_TABLE\",\n )\n .execute()\n .table_schema\n)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GetTableSchema.table","title":"table class-attribute
instance-attribute
","text":"table: str = Field(default=..., description='The Snowflake table name')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GetTableSchema.Output","title":"Output","text":"Output class for GetTableSchema
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GetTableSchema.Output.table_schema","title":"table_schema class-attribute
instance-attribute
","text":"table_schema: StructType = Field(default=..., serialization_alias='schema', description='The Spark Schema')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GetTableSchema.execute","title":"execute","text":"execute() -> Output\n
Source code in src/koheesio/spark/snowflake.py
def execute(self) -> Output:\n query = f\"SELECT * FROM {self.table} LIMIT 1\" # nosec B608: hardcoded_sql_expressions\n df = Query(**self.get_options(), query=query).execute().df\n self.output.table_schema = df.schema\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnFullyQualifiedObject","title":"koheesio.spark.snowflake.GrantPrivilegesOnFullyQualifiedObject","text":"Grant Snowflake privileges to a set of roles on a fully qualified object, i.e. database.schema.object_name
This class is a subclass of GrantPrivilegesOnObject
and is used to grant privileges on a fully qualified object. The advantage of using this class is that it sets the object name to be fully qualified, i.e. database.schema.object_name
.
Meaning, you can set the database
, schema
and object
separately and the object name will be set to be fully qualified, i.e. database.schema.object_name
.
Example GrantPrivilegesOnFullyQualifiedObject(\n database=\"MY_DB\",\n schema=\"MY_SCHEMA\",\n warehouse=\"MY_WH\",\n ...\n object=\"MY_TABLE\",\n type=\"TABLE\",\n ...\n)\n
In this example, the object name will be set to be fully qualified, i.e. MY_DB.MY_SCHEMA.MY_TABLE
. If you were to use GrantPrivilegesOnObject
instead, you would have to set the object name to be fully qualified yourself.
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnFullyQualifiedObject.set_object_name","title":"set_object_name","text":"set_object_name()\n
Set the object name to be fully qualified, i.e. database.schema.object_name
Source code in src/koheesio/spark/snowflake.py
@model_validator(mode=\"after\")\ndef set_object_name(self):\n \"\"\"Set the object name to be fully qualified, i.e. database.schema.object_name\"\"\"\n # database, schema, obj_name\n db = self.database\n schema = self.model_dump()[\"sfSchema\"] # since \"schema\" is a reserved name\n obj_name = self.object\n\n self.object = f\"{db}.{schema}.{obj_name}\"\n\n return self\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject","title":"koheesio.spark.snowflake.GrantPrivilegesOnObject","text":"A wrapper on Snowflake GRANT privileges
With this Step, you can grant Snowflake privileges to a set of roles on a table, a view, or an object
See Also https://docs.snowflake.com/en/sql-reference/sql/grant-privilege.html
Parameters:
Name Type Description Default warehouse
str
The name of the warehouse. Alias for sfWarehouse
required user
str
The username. Alias for sfUser
required password
SecretStr
The password. Alias for sfPassword
required role
str
The role name
required object
str
The name of the object to grant privileges on
required type
str
The type of object to grant privileges on, e.g. TABLE, VIEW
required privileges
Union[conlist(str, min_length=1), str]
The Privilege/Permission or list of Privileges/Permissions to grant on the given object.
required roles
Union[conlist(str, min_length=1), str]
The Role or list of Roles to grant the privileges to
required Example GrantPrivilegesOnObject(\n    object=\"MY_TABLE\",\n    type=\"TABLE\",\n    warehouse=\"MY_WH\",\n    user=\"gid.account@nike.com\",\n    password=Secret(\"super-secret-password\"),\n    role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n    permissions=[\"SELECT\", \"INSERT\"],\n).execute()\n
In this example, the APPLICATION.SNOWFLAKE.ADMIN
role will be granted SELECT
and INSERT
privileges on the MY_TABLE
table using the MY_WH
warehouse.
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.object","title":"object class-attribute
instance-attribute
","text":"object: str = Field(default=..., description='The name of the object to grant privileges on')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.privileges","title":"privileges class-attribute
instance-attribute
","text":"privileges: Union[conlist(str, min_length=1), str] = Field(default=..., alias='permissions', description='The Privilege/Permission or list of Privileges/Permissions to grant on the given object. See https://docs.snowflake.com/en/sql-reference/sql/grant-privilege.html')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.roles","title":"roles class-attribute
instance-attribute
","text":"roles: Union[conlist(str, min_length=1), str] = Field(default=..., alias='role', validation_alias='roles', description='The Role or list of Roles to grant the privileges to')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.type","title":"type class-attribute
instance-attribute
","text":"type: str = Field(default=..., description='The type of object to grant privileges on, e.g. TABLE, VIEW')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.Output","title":"Output","text":"Output class for GrantPrivilegesOnObject
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.Output.query","title":"query class-attribute
instance-attribute
","text":"query: conlist(str, min_length=1) = Field(default=..., description='Query that was executed to grant privileges', validate_default=False)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n self.output.query = []\n roles = self.roles\n\n for role in roles:\n query = self.get_query(role)\n self.output.query.append(query)\n RunQuery(**self.get_options(), query=query).execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.get_query","title":"get_query","text":"get_query(role: str)\n
Build the GRANT query
Parameters:
Name Type Description Default role
str
The role name
required Returns:
Name Type Description query
str
The Query that performs the grant
Source code in src/koheesio/spark/snowflake.py
def get_query(self, role: str):\n \"\"\"Build the GRANT query\n\n Parameters\n ----------\n role: str\n The role name\n\n Returns\n -------\n query : str\n The Query that performs the grant\n \"\"\"\n query = f\"GRANT {','.join(self.privileges)} ON {self.type} {self.object} TO ROLE {role}\".upper()\n return query\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.set_roles_privileges","title":"set_roles_privileges","text":"set_roles_privileges(values)\n
Coerce roles and privileges to be lists if they are not already.
Source code in src/koheesio/spark/snowflake.py
@model_validator(mode=\"before\")\ndef set_roles_privileges(cls, values):\n \"\"\"Coerce roles and privileges to be lists if they are not already.\"\"\"\n roles_value = values.get(\"roles\") or values.get(\"role\")\n privileges_value = values.get(\"privileges\")\n\n if not (roles_value and privileges_value):\n raise ValueError(\"You have to specify roles AND privileges when using 'GrantPrivilegesOnObject'.\")\n\n # coerce values to be lists\n values[\"roles\"] = [roles_value] if isinstance(roles_value, str) else roles_value\n values[\"role\"] = values[\"roles\"][0] # hack to keep the validator happy\n values[\"privileges\"] = [privileges_value] if isinstance(privileges_value, str) else privileges_value\n\n return values\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.validate_object_and_object_type","title":"validate_object_and_object_type","text":"validate_object_and_object_type()\n
Validate that the object and type are set.
Source code in src/koheesio/spark/snowflake.py
@model_validator(mode=\"after\")\ndef validate_object_and_object_type(self):\n \"\"\"Validate that the object and type are set.\"\"\"\n object_value = self.object\n if not object_value:\n raise ValueError(\"You must provide an `object`, this should be the name of the object. \")\n\n object_type = self.type\n if not object_type:\n raise ValueError(\n \"You must provide a `type`, e.g. TABLE, VIEW, DATABASE. \"\n \"See https://docs.snowflake.com/en/sql-reference/sql/grant-privilege.html\"\n )\n\n return self\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnTable","title":"koheesio.spark.snowflake.GrantPrivilegesOnTable","text":"Grant Snowflake privileges to a set of roles on a table
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnTable.object","title":"object class-attribute
instance-attribute
","text":"object: str = Field(default=..., alias='table', description='The name of the Table to grant Privileges on. This should be just the name of the table; so without Database and Schema, use sfDatabase/database and sfSchema/schema to set those instead.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnTable.type","title":"type class-attribute
instance-attribute
","text":"type: str = 'TABLE'\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnView","title":"koheesio.spark.snowflake.GrantPrivilegesOnView","text":"Grant Snowflake privileges to a set of roles on a view
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnView.object","title":"object class-attribute
instance-attribute
","text":"object: str = Field(default=..., alias='view', description='The name of the View to grant Privileges on. This should be just the name of the view; so without Database and Schema, use sfDatabase/database and sfSchema/schema to set those instead.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnView.type","title":"type class-attribute
instance-attribute
","text":"type: str = 'VIEW'\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.Query","title":"koheesio.spark.snowflake.Query","text":"Query data from Snowflake and return the result as a DataFrame
Example Query(\n database=\"MY_DB\",\n schema_=\"MY_SCHEMA\",\n warehouse=\"MY_WH\",\n user=\"gid.account@nike.com\",\n password=Secret(\"super-secret-password\"),\n role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n query=\"SELECT * FROM MY_TABLE\",\n).execute().df\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.Query.query","title":"query class-attribute
instance-attribute
","text":"query: str = Field(default=..., description='The query to run')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.Query.get_options","title":"get_options","text":"get_options()\n
add query to options
Source code in src/koheesio/spark/snowflake.py
def get_options(self):\n \"\"\"add query to options\"\"\"\n options = super().get_options()\n options[\"query\"] = self.query\n return options\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.Query.validate_query","title":"validate_query","text":"validate_query(query)\n
Replace escape characters
Source code in src/koheesio/spark/snowflake.py
@field_validator(\"query\")\ndef validate_query(cls, query):\n \"\"\"Replace escape characters\"\"\"\n query = query.replace(\"\\\\n\", \"\\n\").replace(\"\\\\t\", \"\\t\").strip()\n return query\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.RunQuery","title":"koheesio.spark.snowflake.RunQuery","text":"Run a query on Snowflake that does not return a result, e.g. create table statement
This is a wrapper around 'net.snowflake.spark.snowflake.Utils.runQuery' on the JVM
Example RunQuery(\n database=\"MY_DB\",\n schema=\"MY_SCHEMA\",\n warehouse=\"MY_WH\",\n user=\"account\",\n password=\"***\",\n role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n query=\"CREATE TABLE test (col1 string)\",\n).execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.RunQuery.query","title":"query class-attribute
instance-attribute
","text":"query: str = Field(default=..., description='The query to run', alias='sql')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.RunQuery.execute","title":"execute","text":"execute() -> None\n
Source code in src/koheesio/spark/snowflake.py
def execute(self) -> None:\n if not self.query:\n self.log.warning(\"Empty string given as query input, skipping execution\")\n return\n # noinspection PyProtectedMember\n self.spark._jvm.net.snowflake.spark.snowflake.Utils.runQuery(self.get_options(), self.query)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.RunQuery.get_options","title":"get_options","text":"get_options()\n
Source code in src/koheesio/spark/snowflake.py
def get_options(self):\n # Executing the RunQuery without `host` option in Databricks throws:\n # An error occurred while calling z:net.snowflake.spark.snowflake.Utils.runQuery.\n # : java.util.NoSuchElementException: key not found: host\n options = super().get_options()\n options[\"host\"] = options[\"sfURL\"]\n return options\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.RunQuery.validate_query","title":"validate_query","text":"validate_query(query)\n
Replace escape characters
Source code in src/koheesio/spark/snowflake.py
@field_validator(\"query\")\ndef validate_query(cls, query):\n \"\"\"Replace escape characters\"\"\"\n return query.replace(\"\\\\n\", \"\\n\").replace(\"\\\\t\", \"\\t\").strip()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel","title":"koheesio.spark.snowflake.SnowflakeBaseModel","text":"BaseModel for setting up Snowflake Driver options.
Notes - Snowflake is supported natively in Databricks 4.2 and newer: https://docs.snowflake.com/en/user-guide/spark-connector-databricks
- Refer to Snowflake docs for the installation instructions for non-Databricks environments: https://docs.snowflake.com/en/user-guide/spark-connector-install
- Refer to Snowflake docs for connection options: https://docs.snowflake.com/en/user-guide/spark-connector-use#setting-configuration-options-for-the-connector
Parameters:
Name Type Description Default url
str
Hostname for the Snowflake account, e.g. <account>.snowflakecomputing.com. Alias for sfURL
. required user
str
Login name for the Snowflake user. Alias for sfUser
.
required password
SecretStr
Password for the Snowflake user. Alias for sfPassword
.
required database
str
The database to use for the session after connecting. Alias for sfDatabase
.
required sfSchema
str
The schema to use for the session after connecting. Alias for schema
(\"schema\" is a reserved name in Pydantic, so we use sfSchema
as main name instead).
required role
str
The default security role to use for the session after connecting. Alias for sfRole
.
required warehouse
str
The default virtual warehouse to use for the session after connecting. Alias for sfWarehouse
.
required authenticator
Optional[str]
Authenticator for the Snowflake user. Example: \"okta.com\".
None
options
Optional[Dict[str, Any]]
Extra options to pass to the Snowflake connector.
{\"sfCompress\": \"on\", \"continue_on_error\": \"off\"}
format
str
The default snowflake
format can be used natively in Databricks, use net.snowflake.spark.snowflake
in other environments and make sure to install required JARs.
\"snowflake\"
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.authenticator","title":"authenticator class-attribute
instance-attribute
","text":"authenticator: Optional[str] = Field(default=None, description='Authenticator for the Snowflake user', examples=['okta.com'])\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.database","title":"database class-attribute
instance-attribute
","text":"database: str = Field(default=..., alias='sfDatabase', description='The database to use for the session after connecting')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.format","title":"format class-attribute
instance-attribute
","text":"format: str = Field(default='snowflake', description='The default `snowflake` format can be used natively in Databricks, use `net.snowflake.spark.snowflake` in other environments and make sure to install required JARs.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.options","title":"options class-attribute
instance-attribute
","text":"options: Optional[Dict[str, Any]] = Field(default={'sfCompress': 'on', 'continue_on_error': 'off'}, description='Extra options to pass to the Snowflake connector')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.password","title":"password class-attribute
instance-attribute
","text":"password: SecretStr = Field(default=..., alias='sfPassword', description='Password for the Snowflake user')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.role","title":"role class-attribute
instance-attribute
","text":"role: str = Field(default=..., alias='sfRole', description='The default security role to use for the session after connecting')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.sfSchema","title":"sfSchema class-attribute
instance-attribute
","text":"sfSchema: str = Field(default=..., alias='schema', description='The schema to use for the session after connecting')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.url","title":"url class-attribute
instance-attribute
","text":"url: str = Field(default=..., alias='sfURL', description='Hostname for the Snowflake account, e.g. <account>.snowflakecomputing.com', examples=['example.snowflakecomputing.com'])\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.user","title":"user class-attribute
instance-attribute
","text":"user: str = Field(default=..., alias='sfUser', description='Login name for the Snowflake user')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.warehouse","title":"warehouse class-attribute
instance-attribute
","text":"warehouse: str = Field(default=..., alias='sfWarehouse', description='The default virtual warehouse to use for the session after connecting')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.get_options","title":"get_options","text":"get_options()\n
Get the sfOptions as a dictionary.
Source code in src/koheesio/spark/snowflake.py
def get_options(self):\n \"\"\"Get the sfOptions as a dictionary.\"\"\"\n return {\n key: value\n for key, value in {\n \"sfURL\": self.url,\n \"sfUser\": self.user,\n \"sfPassword\": self.password.get_secret_value(),\n \"authenticator\": self.authenticator,\n \"sfDatabase\": self.database,\n \"sfSchema\": self.sfSchema,\n \"sfRole\": self.role,\n \"sfWarehouse\": self.warehouse,\n **self.options,\n }.items()\n if value is not None\n }\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeReader","title":"koheesio.spark.snowflake.SnowflakeReader","text":"Wrapper around JdbcReader for Snowflake.
Example sr = SnowflakeReader(\n url=\"foo.snowflakecomputing.com\",\n user=\"YOUR_USERNAME\",\n password=\"***\",\n database=\"db\",\n schema=\"schema\",\n)\ndf = sr.read()\n
Notes - Snowflake is supported natively in Databricks 4.2 and newer: https://docs.snowflake.com/en/user-guide/spark-connector-databricks
- Refer to Snowflake docs for the installation instructions for non-Databricks environments: https://docs.snowflake.com/en/user-guide/spark-connector-install
- Refer to Snowflake docs for connection options: https://docs.snowflake.com/en/user-guide/spark-connector-use#setting-configuration-options-for-the-connector
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeReader.driver","title":"driver class-attribute
instance-attribute
","text":"driver: Optional[str] = None\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeStep","title":"koheesio.spark.snowflake.SnowflakeStep","text":"Expands the SnowflakeBaseModel so that it can be used as a Step
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeTableStep","title":"koheesio.spark.snowflake.SnowflakeTableStep","text":"Expands the SnowflakeStep, adding a 'table' parameter
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeTableStep.table","title":"table class-attribute
instance-attribute
","text":"table: str = Field(default=..., description='The name of the table')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeTableStep.get_options","title":"get_options","text":"get_options()\n
Source code in src/koheesio/spark/snowflake.py
def get_options(self):\n options = super().get_options()\n options[\"table\"] = self.table\n return options\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeTransformation","title":"koheesio.spark.snowflake.SnowflakeTransformation","text":"Adds Snowflake parameters to the Transformation class
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeWriter","title":"koheesio.spark.snowflake.SnowflakeWriter","text":"Class for writing to Snowflake
See Also - koheesio.steps.writers.Writer
- koheesio.steps.writers.BatchOutputMode
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeWriter.insert_type","title":"insert_type class-attribute
instance-attribute
","text":"insert_type: Optional[BatchOutputMode] = Field(APPEND, alias='mode', description='The insertion type, append or overwrite')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeWriter.table","title":"table class-attribute
instance-attribute
","text":"table: str = Field(default=..., description='Target table name')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeWriter.execute","title":"execute","text":"execute()\n
Write to Snowflake
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n \"\"\"Write to Snowflake\"\"\"\n self.log.debug(f\"writing to {self.table} with mode {self.insert_type}\")\n self.df.write.format(self.format).options(**self.get_options()).option(\"dbtable\", self.table).mode(\n self.insert_type\n ).save()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema","title":"koheesio.spark.snowflake.SyncTableAndDataFrameSchema","text":"Sync the schema's of a Snowflake table and a DataFrame. This will add NULL columns for the columns that are not in both and perform type casts where needed.
The Snowflake table will take priority in case of type conflicts.
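A hedged usage sketch (placeholder connection options and DataFrame), using dry_run first to only report the differences:

result = SyncTableAndDataFrameSchema(
    **sf_options,    # shared Snowflake connection parameters
    table="MY_TABLE",
    df=df,
    dry_run=True,    # only log schema differences, do not alter the table or the DataFrame
).execute()
# with dry_run=False, the aligned DataFrame is available as result.df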
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.df","title":"df class-attribute
instance-attribute
","text":"df: DataFrame = Field(default=..., description='The Spark DataFrame')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.dry_run","title":"dry_run class-attribute
instance-attribute
","text":"dry_run: Optional[bool] = Field(default=False, description='Only show schema differences, do not apply changes')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.table","title":"table class-attribute
instance-attribute
","text":"table: str = Field(default=..., description='The table name')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output","title":"Output","text":"Output class for SyncTableAndDataFrameSchema
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output.new_df_schema","title":"new_df_schema class-attribute
instance-attribute
","text":"new_df_schema: StructType = Field(default=..., description='New DataFrame schema')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output.new_sf_schema","title":"new_sf_schema class-attribute
instance-attribute
","text":"new_sf_schema: StructType = Field(default=..., description='New Snowflake schema')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output.original_df_schema","title":"original_df_schema class-attribute
instance-attribute
","text":"original_df_schema: StructType = Field(default=..., description='Original DataFrame schema')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output.original_sf_schema","title":"original_sf_schema class-attribute
instance-attribute
","text":"original_sf_schema: StructType = Field(default=..., description='Original Snowflake schema')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output.sf_table_altered","title":"sf_table_altered class-attribute
instance-attribute
","text":"sf_table_altered: bool = Field(default=False, description='Flag to indicate whether Snowflake schema has been altered')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n self.log.warning(\"Snowflake table will always take a priority in case of data type conflicts!\")\n\n # spark side\n df_schema = self.df.schema\n self.output.original_df_schema = deepcopy(df_schema) # using deepcopy to avoid storing in place changes\n df_cols = [c.name.lower() for c in df_schema]\n\n # snowflake side\n sf_schema = GetTableSchema(**self.get_options(), table=self.table).execute().table_schema\n self.output.original_sf_schema = sf_schema\n sf_cols = [c.name.lower() for c in sf_schema]\n\n if self.dry_run:\n # Display differences between Spark DataFrame and Snowflake schemas\n # and provide dummy values that are expected as class outputs.\n self.log.warning(f\"Columns to be added to Snowflake table: {set(df_cols) - set(sf_cols)}\")\n self.log.warning(f\"Columns to be added to Spark DataFrame: {set(sf_cols) - set(df_cols)}\")\n\n self.output.new_df_schema = t.StructType()\n self.output.new_sf_schema = t.StructType()\n self.output.df = self.df\n self.output.sf_table_altered = False\n\n else:\n # Add columns to SnowFlake table that exist in DataFrame\n for df_column in df_schema:\n if df_column.name.lower() not in sf_cols:\n AddColumn(\n **self.get_options(),\n table=self.table,\n column=df_column.name,\n type=df_column.dataType,\n ).execute()\n self.output.sf_table_altered = True\n\n if self.output.sf_table_altered:\n sf_schema = GetTableSchema(**self.get_options(), table=self.table).execute().table_schema\n sf_cols = [c.name.lower() for c in sf_schema]\n\n self.output.new_sf_schema = sf_schema\n\n # Add NULL columns to the DataFrame if they exist in SnowFlake but not in the df\n df = self.df\n for sf_col in self.output.original_sf_schema:\n sf_col_name = sf_col.name.lower()\n if sf_col_name not in df_cols:\n sf_col_type = sf_col.dataType\n df = df.withColumn(sf_col_name, f.lit(None).cast(sf_col_type))\n\n # Put DataFrame columns in the same order as the Snowflake table\n df = df.select(*sf_cols)\n\n self.output.df = df\n self.output.new_df_schema = df.schema\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask","title":"koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask","text":"Synchronize a Delta table to a Snowflake table
- Overwrite - only in batch mode
- Append - supports batch and streaming mode
- Merge - only in streaming mode
Example SynchronizeDeltaToSnowflakeTask(\n url=\"acme.snowflakecomputing.com\",\n user=\"admin\",\n role=\"ADMIN\",\n warehouse=\"SF_WAREHOUSE\",\n database=\"SF_DATABASE\",\n schema=\"SF_SCHEMA\",\n source_table=DeltaTableStep(...),\n target_table=\"my_sf_table\",\n key_columns=[\n \"id\",\n ],\n streaming=False,\n).run()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.checkpoint_location","title":"checkpoint_location class-attribute
instance-attribute
","text":"checkpoint_location: Optional[str] = Field(default=None, description='Checkpoint location to use')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.enable_deletion","title":"enable_deletion class-attribute
instance-attribute
","text":"enable_deletion: Optional[bool] = Field(default=False, description='In case of merge synchronisation_mode add deletion statement in merge query.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.key_columns","title":"key_columns class-attribute
instance-attribute
","text":"key_columns: Optional[List[str]] = Field(default_factory=list, description='Key columns on which merge statements will be MERGE statement will be applied.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.non_key_columns","title":"non_key_columns property
","text":"non_key_columns: List[str]\n
Columns of source table that aren't part of the (composite) primary key
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.persist_staging","title":"persist_staging class-attribute
instance-attribute
","text":"persist_staging: Optional[bool] = Field(default=False, description='In case of debugging, set `persist_staging` to True to retain the staging table for inspection after synchronization.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.reader","title":"reader property
","text":"reader\n
DeltaTable reader
Returns: DeltaTableReader that will yield the source delta table\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.schema_tracking_location","title":"schema_tracking_location class-attribute
instance-attribute
","text":"schema_tracking_location: Optional[str] = Field(default=None, description='Schema tracking location to use. Info: https://docs.delta.io/latest/delta-streaming.html#-schema-tracking')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.source_table","title":"source_table class-attribute
instance-attribute
","text":"source_table: DeltaTableStep = Field(default=..., description='Source delta table to synchronize')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.staging_table","title":"staging_table property
","text":"staging_table\n
Intermediate table on snowflake where staging results are stored
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.staging_table_name","title":"staging_table_name class-attribute
instance-attribute
","text":"staging_table_name: Optional[str] = Field(default=None, alias='staging_table', description='Optional snowflake staging name', validate_default=False)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.streaming","title":"streaming class-attribute
instance-attribute
","text":"streaming: Optional[bool] = Field(default=False, description=\"Should synchronisation happen in streaming or in batch mode. Streaming is supported in 'APPEND' and 'MERGE' mode. Batch is supported in 'OVERWRITE' and 'APPEND' mode.\")\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.synchronisation_mode","title":"synchronisation_mode class-attribute
instance-attribute
","text":"synchronisation_mode: BatchOutputMode = Field(default=MERGE, description=\"Determines if synchronisation will 'overwrite' any existing table, 'append' new rows or 'merge' with existing rows.\")\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.target_table","title":"target_table class-attribute
instance-attribute
","text":"target_table: str = Field(default=..., description='Target table in snowflake to synchronize to')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.writer","title":"writer property
","text":"writer: Union[ForEachBatchStreamWriter, SnowflakeWriter]\n
Writer to persist to snowflake
Depending on the configured options, this returns a SnowflakeWriter or a ForEachBatchStreamWriter: - OVERWRITE/APPEND mode yields SnowflakeWriter - MERGE mode yields ForEachBatchStreamWriter
Returns:
Type Description Union[ForEachBatchStreamWriter, SnowflakeWriter]
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.writer_","title":"writer_ class-attribute
instance-attribute
","text":"writer_: Optional[Union[ForEachBatchStreamWriter, SnowflakeWriter]] = None\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.drop_table","title":"drop_table","text":"drop_table(snowflake_table)\n
Drop a given snowflake table
Source code in src/koheesio/spark/snowflake.py
def drop_table(self, snowflake_table):\n \"\"\"Drop a given snowflake table\"\"\"\n self.log.warning(f\"Dropping table {snowflake_table} from snowflake\")\n drop_table_query = f\"\"\"DROP TABLE IF EXISTS {snowflake_table}\"\"\"\n query_executor = RunQuery(**self.get_options(), query=drop_table_query)\n query_executor.execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.execute","title":"execute","text":"execute() -> None\n
Source code in src/koheesio/spark/snowflake.py
def execute(self) -> None:\n # extract\n df = self.extract()\n self.output.source_df = df\n\n # synchronize\n self.output.target_df = df\n self.load(df)\n if not self.persist_staging:\n # If it's a streaming job, await for termination before dropping staging table\n if self.streaming:\n self.writer.await_termination()\n self.drop_table(self.staging_table)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.extract","title":"extract","text":"extract() -> DataFrame\n
Extract source table
Source code in src/koheesio/spark/snowflake.py
def extract(self) -> DataFrame:\n \"\"\"\n Extract source table\n \"\"\"\n if self.synchronisation_mode == BatchOutputMode.MERGE:\n if not self.source_table.is_cdf_active:\n raise RuntimeError(\n f\"Source table {self.source_table.table_name} does not have CDF enabled. \"\n f\"Set TBLPROPERTIES ('delta.enableChangeDataFeed' = true) to enable. \"\n f\"Current properties = {self.source_table_properties}\"\n )\n\n df = self.reader.read()\n self.output.source_df = df\n return df\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.load","title":"load","text":"load(df) -> DataFrame\n
Load source table into snowflake
Source code in src/koheesio/spark/snowflake.py
def load(self, df) -> DataFrame:\n \"\"\"Load source table into snowflake\"\"\"\n if self.synchronisation_mode == BatchOutputMode.MERGE:\n self.log.info(f\"Truncating staging table {self.staging_table}\")\n self.truncate_table(self.staging_table)\n self.writer.write(df)\n self.output.target_df = df\n return df\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.run","title":"run","text":"run()\n
alias of execute
Source code in src/koheesio/spark/snowflake.py
def run(self):\n \"\"\"alias of execute\"\"\"\n return self.execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.truncate_table","title":"truncate_table","text":"truncate_table(snowflake_table)\n
Truncate a given snowflake table
Source code in src/koheesio/spark/snowflake.py
def truncate_table(self, snowflake_table):\n \"\"\"Truncate a given snowflake table\"\"\"\n truncate_query = f\"\"\"TRUNCATE TABLE IF EXISTS {snowflake_table}\"\"\"\n query_executor = RunQuery(\n **self.get_options(),\n query=truncate_query,\n )\n query_executor.execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TableExists","title":"koheesio.spark.snowflake.TableExists","text":"Check if the table exists in Snowflake by using INFORMATION_SCHEMA.
Example k = TableExists(\n url=\"foo.snowflakecomputing.com\",\n user=\"YOUR_USERNAME\",\n password=\"***\",\n database=\"db\",\n schema=\"schema\",\n table=\"table\",\n)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TableExists.Output","title":"Output","text":"Output class for TableExists
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TableExists.Output.exists","title":"exists class-attribute
instance-attribute
","text":"exists: bool = Field(default=..., description='Whether or not the table exists')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TableExists.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n query = (\n dedent(\n # Force upper case, due to case-sensitivity of where clause\n f\"\"\"\n SELECT *\n FROM INFORMATION_SCHEMA.TABLES\n WHERE TABLE_CATALOG = '{self.database}'\n AND TABLE_SCHEMA = '{self.sfSchema}'\n AND TABLE_TYPE = 'BASE TABLE'\n AND upper(TABLE_NAME) = '{self.table.upper()}'\n \"\"\" # nosec B608: hardcoded_sql_expressions\n )\n .upper()\n .strip()\n )\n\n self.log.debug(f\"Query that was executed to check if the table exists:\\n{query}\")\n\n df = Query(**self.get_options(), query=query).read()\n\n exists = df.count() > 0\n self.log.info(f\"Table {self.table} {'exists' if exists else 'does not exist'}\")\n self.output.exists = exists\n
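Continuing the TableExists example above, a small sketch of how the result can be consumed (k is the instance constructed earlier):
k.execute()\nprint(k.output.exists)  # True if a matching table was found in INFORMATION_SCHEMA, False otherwise\n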
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TagSnowflakeQuery","title":"koheesio.spark.snowflake.TagSnowflakeQuery","text":"Provides Snowflake query tag pre-action that can be used to easily find queries through SF history search and further group them for debugging and cost tracking purposes.
Takes in query tag attributes as kwargs and an additional Snowflake options dict that can optionally contain another set of pre-actions to be applied to a query; in that case the existing pre-actions aren't dropped, and the query tag pre-action is added to them.
The passed Snowflake options dictionary is not modified in place; instead, a new dictionary containing the updated pre-actions is returned.
Notes See this article for explanation: https://select.dev/posts/snowflake-query-tags
Arbitrary tags can be applied, such as team, dataset names, business capability, etc.
Example query_tag = AddQueryTag(\n    options={\"preactions\": ...},\n    task_name=\"cleanse_task\",\n    pipeline_name=\"ingestion-pipeline\",\n    etl_date=\"2022-01-01\",\n    pipeline_execution_time=\"2022-01-01T00:00:00\",\n    task_execution_time=\"2022-01-01T01:00:00\",\n    environment=\"dev\",\n    trace_id=\"e0fdec43-a045-46e5-9705-acd4f3f96045\",\n    span_id=\"cb89abea-1c12-471f-8b12-546d2d66f6cb\",\n).execute().options\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TagSnowflakeQuery.options","title":"options class-attribute
instance-attribute
","text":"options: Dict = Field(default_factory=dict, description='Additional Snowflake options, optionally containing additional preactions')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TagSnowflakeQuery.Output","title":"Output","text":"Output class for AddQueryTag
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TagSnowflakeQuery.Output.options","title":"options class-attribute
instance-attribute
","text":"options: Dict = Field(default=..., description='Copy of provided SF options, with added query tag preaction')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TagSnowflakeQuery.execute","title":"execute","text":"execute()\n
Add query tag preaction to Snowflake options
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n \"\"\"Add query tag preaction to Snowflake options\"\"\"\n tag_json = json.dumps(self.extra_params, indent=4, sort_keys=True)\n tag_preaction = f\"ALTER SESSION SET QUERY_TAG = '{tag_json}';\"\n preactions = self.options.get(\"preactions\", \"\")\n preactions = f\"{preactions}\\n{tag_preaction}\".strip()\n updated_options = dict(self.options)\n updated_options[\"preactions\"] = preactions\n self.output.options = updated_options\n
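To illustrate the behaviour of execute() above, a hedged sketch (the pre-action text and tag attributes are illustrative):
step = TagSnowflakeQuery(options={\"preactions\": \"USE ROLE LOADER;\"}, team=\"data-eng\")\nupdated = step.execute().options\n# updated['preactions'] keeps the original pre-action and appends an\n# ALTER SESSION SET QUERY_TAG = '{...}' statement built from the JSON-encoded tag attributes\n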
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.map_spark_type","title":"koheesio.spark.snowflake.map_spark_type","text":"map_spark_type(spark_type: DataType)\n
Translates a Spark DataFrame schema type to the corresponding Snowflake type
Basic Types Snowflake Type StringType STRING NullType STRING BooleanType BOOLEAN Numeric Types Snowflake Type LongType BIGINT IntegerType INT ShortType SMALLINT DoubleType DOUBLE FloatType FLOAT NumericType FLOAT ByteType BINARY Date / Time Types Snowflake Type DateType DATE TimestampType TIMESTAMP Advanced Types Snowflake Type DecimalType DECIMAL MapType VARIANT ArrayType VARIANT StructType VARIANT References - Spark SQL DataTypes: https://spark.apache.org/docs/latest/sql-ref-datatypes.html
- Snowflake DataTypes: https://docs.snowflake.com/en/sql-reference/data-types.html
Parameters:
Name Type Description Default spark_type
DataType
DataType taken out of the StructField
required Returns:
Type Description str
The Snowflake data type
Source code in src/koheesio/spark/snowflake.py
def map_spark_type(spark_type: t.DataType):\n \"\"\"\n Translates Spark DataFrame Schema type to SnowFlake type\n\n | Basic Types | Snowflake Type |\n |-------------------|----------------|\n | StringType | STRING |\n | NullType | STRING |\n | BooleanType | BOOLEAN |\n\n | Numeric Types | Snowflake Type |\n |-------------------|----------------|\n | LongType | BIGINT |\n | IntegerType | INT |\n | ShortType | SMALLINT |\n | DoubleType | DOUBLE |\n | FloatType | FLOAT |\n | NumericType | FLOAT |\n | ByteType | BINARY |\n\n | Date / Time Types | Snowflake Type |\n |-------------------|----------------|\n | DateType | DATE |\n | TimestampType | TIMESTAMP |\n\n | Advanced Types | Snowflake Type |\n |-------------------|----------------|\n | DecimalType | DECIMAL |\n | MapType | VARIANT |\n | ArrayType | VARIANT |\n | StructType | VARIANT |\n\n References\n ----------\n - Spark SQL DataTypes: https://spark.apache.org/docs/latest/sql-ref-datatypes.html\n - Snowflake DataTypes: https://docs.snowflake.com/en/sql-reference/data-types.html\n\n Parameters\n ----------\n spark_type : pyspark.sql.types.DataType\n DataType taken out of the StructField\n\n Returns\n -------\n str\n The Snowflake data type\n \"\"\"\n # StructField means that the entire Field was passed, we need to extract just the dataType before continuing\n if isinstance(spark_type, t.StructField):\n spark_type = spark_type.dataType\n\n # Check if the type is DayTimeIntervalType\n if isinstance(spark_type, t.DayTimeIntervalType):\n warn(\n \"DayTimeIntervalType is being converted to STRING. \"\n \"Consider converting to a more supported date/time/timestamp type in Snowflake.\"\n )\n\n # fmt: off\n # noinspection PyUnresolvedReferences\n data_type_map = {\n # Basic Types\n t.StringType: \"STRING\",\n t.NullType: \"STRING\",\n t.BooleanType: \"BOOLEAN\",\n\n # Numeric Types\n t.LongType: \"BIGINT\",\n t.IntegerType: \"INT\",\n t.ShortType: \"SMALLINT\",\n t.DoubleType: \"DOUBLE\",\n t.FloatType: \"FLOAT\",\n t.NumericType: \"FLOAT\",\n t.ByteType: \"BINARY\",\n t.BinaryType: \"VARBINARY\",\n\n # Date / Time Types\n t.DateType: \"DATE\",\n t.TimestampType: \"TIMESTAMP\",\n t.DayTimeIntervalType: \"STRING\",\n\n # Advanced Types\n t.DecimalType:\n f\"DECIMAL({spark_type.precision},{spark_type.scale})\" # pylint: disable=no-member\n if isinstance(spark_type, t.DecimalType) else \"DECIMAL(38,0)\",\n t.MapType: \"VARIANT\",\n t.ArrayType: \"VARIANT\",\n t.StructType: \"VARIANT\",\n }\n return data_type_map.get(type(spark_type), 'STRING')\n
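A minimal usage sketch based on the mapping above (the return values follow directly from the table; the import path matches this module):
from pyspark.sql import types as t\nfrom koheesio.spark.snowflake import map_spark_type\n\nmap_spark_type(t.StringType())               # 'STRING'\nmap_spark_type(t.DecimalType(16, 2))         # 'DECIMAL(16,2)'\nmap_spark_type(t.ArrayType(t.StringType()))  # 'VARIANT'\n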
"},{"location":"api_reference/spark/utils.html","title":"Utils","text":"Spark Utility functions
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.spark_minor_version","title":"koheesio.spark.utils.spark_minor_version module-attribute
","text":"spark_minor_version: float = get_spark_minor_version()\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype","title":"koheesio.spark.utils.SparkDatatype","text":"Allowed spark datatypes
The following table lists the data types that are supported by Spark SQL.
Data type SQL name ByteType BYTE, TINYINT ShortType SHORT, SMALLINT IntegerType INT, INTEGER LongType LONG, BIGINT FloatType FLOAT, REAL DoubleType DOUBLE DecimalType DECIMAL, DEC, NUMERIC StringType STRING BinaryType BINARY BooleanType BOOLEAN TimestampType TIMESTAMP, TIMESTAMP_LTZ DateType DATE ArrayType ARRAY MapType MAP NullType VOID Not supported yet - TimestampNTZType TIMESTAMP_NTZ
- YearMonthIntervalType INTERVAL YEAR, INTERVAL YEAR TO MONTH, INTERVAL MONTH
- DayTimeIntervalType INTERVAL DAY, INTERVAL DAY TO HOUR, INTERVAL DAY TO MINUTE, INTERVAL DAY TO SECOND, INTERVAL HOUR, INTERVAL HOUR TO MINUTE, INTERVAL HOUR TO SECOND, INTERVAL MINUTE, INTERVAL MINUTE TO SECOND, INTERVAL SECOND
See Also https://spark.apache.org/docs/latest/sql-ref-datatypes.html#supported-data-types
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.ARRAY","title":"ARRAY class-attribute
instance-attribute
","text":"ARRAY = 'array'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.BIGINT","title":"BIGINT class-attribute
instance-attribute
","text":"BIGINT = 'long'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.BINARY","title":"BINARY class-attribute
instance-attribute
","text":"BINARY = 'binary'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.BOOLEAN","title":"BOOLEAN class-attribute
instance-attribute
","text":"BOOLEAN = 'boolean'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.BYTE","title":"BYTE class-attribute
instance-attribute
","text":"BYTE = 'byte'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.DATE","title":"DATE class-attribute
instance-attribute
","text":"DATE = 'date'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.DEC","title":"DEC class-attribute
instance-attribute
","text":"DEC = 'decimal'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.DECIMAL","title":"DECIMAL class-attribute
instance-attribute
","text":"DECIMAL = 'decimal'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.DOUBLE","title":"DOUBLE class-attribute
instance-attribute
","text":"DOUBLE = 'double'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.FLOAT","title":"FLOAT class-attribute
instance-attribute
","text":"FLOAT = 'float'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.INT","title":"INT class-attribute
instance-attribute
","text":"INT = 'integer'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.INTEGER","title":"INTEGER class-attribute
instance-attribute
","text":"INTEGER = 'integer'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.LONG","title":"LONG class-attribute
instance-attribute
","text":"LONG = 'long'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.MAP","title":"MAP class-attribute
instance-attribute
","text":"MAP = 'map'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.NUMERIC","title":"NUMERIC class-attribute
instance-attribute
","text":"NUMERIC = 'decimal'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.REAL","title":"REAL class-attribute
instance-attribute
","text":"REAL = 'float'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.SHORT","title":"SHORT class-attribute
instance-attribute
","text":"SHORT = 'short'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.SMALLINT","title":"SMALLINT class-attribute
instance-attribute
","text":"SMALLINT = 'short'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.STRING","title":"STRING class-attribute
instance-attribute
","text":"STRING = 'string'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.TIMESTAMP","title":"TIMESTAMP class-attribute
instance-attribute
","text":"TIMESTAMP = 'timestamp'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.TIMESTAMP_LTZ","title":"TIMESTAMP_LTZ class-attribute
instance-attribute
","text":"TIMESTAMP_LTZ = 'timestamp'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.TINYINT","title":"TINYINT class-attribute
instance-attribute
","text":"TINYINT = 'byte'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.VOID","title":"VOID class-attribute
instance-attribute
","text":"VOID = 'void'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.spark_type","title":"spark_type property
","text":"spark_type: DataType\n
Returns the spark type for the given enum value
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.from_string","title":"from_string classmethod
","text":"from_string(value: str) -> SparkDatatype\n
Allows for getting the right Enum value by simply passing a string value. This method is not case-sensitive.
Source code in src/koheesio/spark/utils.py
@classmethod\ndef from_string(cls, value: str) -> \"SparkDatatype\":\n \"\"\"Allows for getting the right Enum value by simply passing a string value\n This method is not case-sensitive\n \"\"\"\n return getattr(cls, value.upper())\n
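For illustration, a short sketch of from_string together with the enum values listed above:
from koheesio.spark.utils import SparkDatatype\n\nSparkDatatype.from_string(\"bigint\")  # SparkDatatype.BIGINT\nSparkDatatype.from_string(\"BigInt\")  # same member; the lookup is not case-sensitive\nSparkDatatype.BIGINT.value           # 'long'\n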
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.get_spark_minor_version","title":"koheesio.spark.utils.get_spark_minor_version","text":"get_spark_minor_version() -> float\n
Returns the minor version of the spark instance.
For example, if the spark version is 3.3.2, this function would return 3.3
Source code in src/koheesio/spark/utils.py
def get_spark_minor_version() -> float:\n \"\"\"Returns the minor version of the spark instance.\n\n For example, if the spark version is 3.3.2, this function would return 3.3\n \"\"\"\n return float(\".\".join(spark_version.split(\".\")[:2]))\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.on_databricks","title":"koheesio.spark.utils.on_databricks","text":"on_databricks() -> bool\n
Retrieve if we're running on databricks or elsewhere
Source code in src/koheesio/spark/utils.py
def on_databricks() -> bool:\n \"\"\"Retrieve if we're running on databricks or elsewhere\"\"\"\n dbr_version = os.getenv(\"DATABRICKS_RUNTIME_VERSION\", None)\n return dbr_version is not None and dbr_version != \"\"\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.schema_struct_to_schema_str","title":"koheesio.spark.utils.schema_struct_to_schema_str","text":"schema_struct_to_schema_str(schema: StructType) -> str\n
Converts a StructType to a schema str
Source code in src/koheesio/spark/utils.py
def schema_struct_to_schema_str(schema: StructType) -> str:\n \"\"\"Converts a StructType to a schema str\"\"\"\n if not schema:\n return \"\"\n return \",\\n\".join([f\"{field.name} {field.dataType.typeName().upper()}\" for field in schema.fields])\n
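A small illustration of the conversion above (field names are illustrative):
from pyspark.sql.types import StructType, StructField, LongType, StringType\nfrom koheesio.spark.utils import schema_struct_to_schema_str\n\nschema = StructType([StructField(\"id\", LongType()), StructField(\"name\", StringType())])\nschema_struct_to_schema_str(schema)  # 'id LONG,\\nname STRING'\n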
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.spark_data_type_is_array","title":"koheesio.spark.utils.spark_data_type_is_array","text":"spark_data_type_is_array(data_type: DataType) -> bool\n
Check if the column's dataType is of type ArrayType
Source code in src/koheesio/spark/utils.py
def spark_data_type_is_array(data_type: DataType) -> bool:\n \"\"\"Check if the column's dataType is of type ArrayType\"\"\"\n return isinstance(data_type, ArrayType)\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.spark_data_type_is_numeric","title":"koheesio.spark.utils.spark_data_type_is_numeric","text":"spark_data_type_is_numeric(data_type: DataType) -> bool\n
Check if the column's dataType is of a numeric type (IntegerType, LongType, FloatType, DoubleType or DecimalType)
Source code in src/koheesio/spark/utils.py
def spark_data_type_is_numeric(data_type: DataType) -> bool:\n    \"\"\"Check if the column's dataType is of a numeric type\"\"\"\n    return isinstance(data_type, (IntegerType, LongType, FloatType, DoubleType, DecimalType))\n
"},{"location":"api_reference/spark/readers/index.html","title":"Readers","text":"Readers are a type of Step that read data from a source based on the input parameters and stores the result in self.output.df.
For a comprehensive guide on the usage, examples, and additional features of Reader classes, please refer to the reference/concepts/steps/readers section of the Koheesio documentation.
"},{"location":"api_reference/spark/readers/index.html#koheesio.spark.readers.Reader","title":"koheesio.spark.readers.Reader","text":"Base class for all Readers
A Reader is a Step that reads data from a source based on the input parameters and stores the result in self.output.df (DataFrame).
When implementing a Reader, the execute() method should be implemented. The execute() method should read from the source and store the result in self.output.df.
The Reader class implements a standard read() method that calls the execute() method and returns the result. This method can be used to read data from a Reader without having to call the execute() method directly. The read() method does not need to be implemented in the child class.
Every Reader has a SparkSession available as self.spark. This is the currently active SparkSession.
The Reader class also implements a shorthand for accessing the output Dataframe through the df-property. If the output.df is None, .execute() will be run first.
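A minimal sketch of a custom Reader following the contract described above (the class name and the 'source' it reads from are illustrative):
from koheesio.spark.readers import Reader\n\nclass MyRangeReader(Reader):\n    \"\"\"Illustrative Reader that 'reads' ten rows using the active SparkSession.\"\"\"\n    def execute(self):\n        # read from whichever source -> store the result in self.output.df\n        self.output.df = self.spark.range(10)\n\ndf = MyRangeReader().read()  # read() calls execute() and returns output.df\n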
"},{"location":"api_reference/spark/readers/index.html#koheesio.spark.readers.Reader.df","title":"df property
","text":"df: Optional[DataFrame]\n
Shorthand for accessing self.output.df If the output.df is None, .execute() will be run first
"},{"location":"api_reference/spark/readers/index.html#koheesio.spark.readers.Reader.execute","title":"execute abstractmethod
","text":"execute()\n
Execute on a Reader should handle self.output.df (output) as a minimum. Read from whichever source -> store the result in self.output.df.
Source code in src/koheesio/spark/readers/__init__.py
@abstractmethod\ndef execute(self):\n \"\"\"Execute on a Reader should handle self.output.df (output) as a minimum\n Read from whichever source -> store result in self.output.df\n \"\"\"\n # self.output.df # output dataframe\n ...\n
"},{"location":"api_reference/spark/readers/index.html#koheesio.spark.readers.Reader.read","title":"read","text":"read() -> Optional[DataFrame]\n
Read from a Reader without having to call the execute() method directly
Source code in src/koheesio/spark/readers/__init__.py
def read(self) -> Optional[DataFrame]:\n \"\"\"Read from a Reader without having to call the execute() method directly\"\"\"\n self.execute()\n return self.output.df\n
"},{"location":"api_reference/spark/readers/delta.html","title":"Delta","text":"Read data from a Delta table and return a DataFrame or DataStream
Classes:
Name Description DeltaTableReader
Reads data from a Delta table and returns a DataFrame
DeltaTableStreamReader
Reads data from a Delta table and returns a DataStream
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.STREAMING_ONLY_OPTIONS","title":"koheesio.spark.readers.delta.STREAMING_ONLY_OPTIONS module-attribute
","text":"STREAMING_ONLY_OPTIONS = ['ignore_deletes', 'ignore_changes', 'starting_version', 'starting_timestamp', 'schema_tracking_location']\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.STREAMING_SCHEMA_WARNING","title":"koheesio.spark.readers.delta.STREAMING_SCHEMA_WARNING module-attribute
","text":"STREAMING_SCHEMA_WARNING = '\\nImportant!\\nAlthough you can start the streaming source from a specified version or timestamp, the schema of the streaming source is always the latest schema of the Delta table. You must ensure there is no incompatible schema change to the Delta table after the specified version or timestamp. Otherwise, the streaming source may return incorrect results when reading the data with an incorrect schema.'\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader","title":"koheesio.spark.readers.delta.DeltaTableReader","text":"Reads data from a Delta table and returns a DataFrame Delta Table can be read in batch or streaming mode It also supports reading change data feed (CDF) in both batch mode and streaming mode
Parameters:
Name Type Description Default table
Union[DeltaTableStep, str]
The table to read
required filter_cond
Optional[Union[Column, str]]
Filter condition to apply to the dataframe. Filters can be provided by using Column or string expressions. For example: f.col('state') == 'Ohio'
, state = 'Ohio'
or (col('col1') > 3) & (col('col2') < 9)
required columns
Columns to select from the table. One or many columns can be provided as strings. For example: ['col1', 'col2']
, ['col1']
or 'col1'
required streaming
Optional[bool]
Whether to read the table as a Stream or not
required read_change_feed
bool
readChangeFeed: Change Data Feed (CDF) feature allows Delta tables to track row-level changes between versions of a Delta table. When enabled on a Delta table, the runtime records 'change events' for all the data written into the table. This includes the row data along with metadata indicating whether the specified row was inserted, deleted, or updated. See: https://docs.databricks.com/delta/delta-change-data-feed.html
required starting_version
str
startingVersion: The Delta Lake version to start from. All table changes starting from this version (inclusive) will be read by the streaming source. You can obtain the commit versions from the version column of the DESCRIBE HISTORY command output.
required starting_timestamp
str
startingTimestamp: The timestamp to start from. All table changes committed at or after the timestamp (inclusive) will be read by the streaming source. Either provide a timestamp string (e.g. 2019-01-01T00:00:00.000Z) or a date string (e.g. 2019-01-01)
required ignore_deletes
bool
ignoreDeletes: Ignore transactions that delete data at partition boundaries. Note: Only supported for streaming tables. For more info see https://docs.databricks.com/structured-streaming/delta-lake.html#ignore-updates-and-deletes
required ignore_changes
bool
ignoreChanges: re-process updates if files had to be rewritten in the source table due to a data changing operation such as UPDATE, MERGE INTO, DELETE (within partitions), or OVERWRITE. Unchanged rows may still be emitted, therefore your downstream consumers should be able to handle duplicates. Deletes are not propagated downstream. ignoreChanges subsumes ignoreDeletes. Therefore if you use ignoreChanges, your stream will not be disrupted by either deletions or updates to the source table.
required"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.columns","title":"columns class-attribute
instance-attribute
","text":"columns: Optional[ListOfColumns] = Field(default=None, description=\"Columns to select from the table. One or many columns can be provided as strings. For example: `['col1', 'col2']`, `['col1']` or `'col1'` \")\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.filter_cond","title":"filter_cond class-attribute
instance-attribute
","text":"filter_cond: Optional[Union[Column, str]] = Field(default=None, alias='filterCondition', description=\"Filter condition to apply to the dataframe. Filters can be provided by using Column or string expressions For example: `f.col('state') == 'Ohio'`, `state = 'Ohio'` or `(col('col1') > 3) & (col('col2') < 9)`\")\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.ignore_changes","title":"ignore_changes class-attribute
instance-attribute
","text":"ignore_changes: bool = Field(default=False, alias='ignoreChanges', description='ignoreChanges: re-process updates if files had to be rewritten in the source table due to a data changing operation such as UPDATE, MERGE INTO, DELETE (within partitions), or OVERWRITE. Unchanged rows may still be emitted, therefore your downstream consumers should be able to handle duplicates. Deletes are not propagated downstream. ignoreChanges subsumes ignoreDeletes. Therefore if you use ignoreChanges, your stream will not be disrupted by either deletions or updates to the source table.')\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.ignore_deletes","title":"ignore_deletes class-attribute
instance-attribute
","text":"ignore_deletes: bool = Field(default=False, alias='ignoreDeletes', description='ignoreDeletes: Ignore transactions that delete data at partition boundaries. Note: Only supported for streaming tables. For more info see https://docs.databricks.com/structured-streaming/delta-lake.html#ignore-updates-and-deletes')\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.read_change_feed","title":"read_change_feed class-attribute
instance-attribute
","text":"read_change_feed: bool = Field(default=False, alias='readChangeFeed', description=\"Change Data Feed (CDF) feature allows Delta tables to track row-level changes between versions of a Delta table. When enabled on a Delta table, the runtime records 'change events' for all the data written into the table. This includes the row data along with metadata indicating whether the specified row was inserted, deleted, or updated. See: https://docs.databricks.com/delta/delta-change-data-feed.html\")\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.reader","title":"reader property
","text":"reader: Union[DataStreamReader, DataFrameReader]\n
Return the reader for the DeltaTableReader based on the streaming
attribute
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.schema_tracking_location","title":"schema_tracking_location class-attribute
instance-attribute
","text":"schema_tracking_location: Optional[str] = Field(default=None, alias='schemaTrackingLocation', description='schemaTrackingLocation: Track the location of source schema. Note: Recommend to enable Delta reader version: 3 and writer version: 7 for this option. For more info see https://docs.delta.io/latest/delta-column-mapping.html' + STREAMING_SCHEMA_WARNING)\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.skip_change_commits","title":"skip_change_commits class-attribute
instance-attribute
","text":"skip_change_commits: bool = Field(default=False, alias='skipChangeCommits', description='skipChangeCommits: Skip processing of change commits. Note: Only supported for streaming tables. (not supported in Open Source Delta Implementation). Prefer using skipChangeCommits over ignoreDeletes and ignoreChanges starting DBR12.1 and above. For more info see https://docs.databricks.com/structured-streaming/delta-lake.html#skip-change-commits')\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.starting_timestamp","title":"starting_timestamp class-attribute
instance-attribute
","text":"starting_timestamp: Optional[str] = Field(default=None, alias='startingTimestamp', description='startingTimestamp: The timestamp to start from. All table changes committed at or after the timestamp (inclusive) will be read by the streaming source. Either provide a timestamp string (e.g. 2019-01-01T00:00:00.000Z) or a date string (e.g. 2019-01-01)' + STREAMING_SCHEMA_WARNING)\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.starting_version","title":"starting_version class-attribute
instance-attribute
","text":"starting_version: Optional[str] = Field(default=None, alias='startingVersion', description='startingVersion: The Delta Lake version to start from. All table changes starting from this version (inclusive) will be read by the streaming source. You can obtain the commit versions from the version column of the DESCRIBE HISTORY command output.' + STREAMING_SCHEMA_WARNING)\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.streaming","title":"streaming class-attribute
instance-attribute
","text":"streaming: Optional[bool] = Field(default=False, description='Whether to read the table as a Stream or not')\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.table","title":"table class-attribute
instance-attribute
","text":"table: Union[DeltaTableStep, str] = Field(default=..., description='The table to read')\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.temp_view_name","title":"temp_view_name property
","text":"temp_view_name\n
Get the temporary view name for the dataframe for SQL queries
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.view","title":"view property
","text":"view\n
Create a temporary view of the dataframe for SQL queries
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/readers/delta.py
def execute(self):\n df = self.reader.table(self.table.table_name)\n if self.filter_cond is not None:\n df = df.filter(f.expr(self.filter_cond) if isinstance(self.filter_cond, str) else self.filter_cond)\n if self.columns is not None:\n df = df.select(*self.columns)\n self.output.df = df\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.get_options","title":"get_options","text":"get_options() -> Dict[str, Any]\n
Get the options for the DeltaTableReader based on the streaming
attribute
Source code in src/koheesio/spark/readers/delta.py
def get_options(self) -> Dict[str, Any]:\n \"\"\"Get the options for the DeltaTableReader based on the `streaming` attribute\"\"\"\n options = {\n # Enable Change Data Feed (CDF) feature\n \"readChangeFeed\": self.read_change_feed,\n # Initial position, one of:\n \"startingVersion\": self.starting_version,\n \"startingTimestamp\": self.starting_timestamp,\n }\n\n # Streaming only options\n if self.streaming:\n options = {\n **options,\n # Ignore updates and deletes, one of:\n \"ignoreDeletes\": self.ignore_deletes,\n \"ignoreChanges\": self.ignore_changes,\n \"skipChangeCommits\": self.skip_change_commits,\n \"schemaTrackingLocation\": self.schema_tracking_location,\n }\n # Batch only options\n else:\n pass # there are none... for now :)\n\n def normalize(v: Union[str, bool]):\n \"\"\"normalize values\"\"\"\n # True becomes \"true\", False becomes \"false\"\n v = str(v).lower() if isinstance(v, bool) else v\n return v\n\n # Any options with `value == None` are filtered out\n return {k: normalize(v) for k, v in options.items() if v is not None}\n
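As an illustration of the option handling above, booleans are normalized to lowercase strings and None-valued options are dropped (the table name is illustrative):
reader = DeltaTableReader(table=\"my_schema.my_table\", streaming=True, ignore_deletes=True)\nreader.get_options()\n# {'readChangeFeed': 'false', 'ignoreDeletes': 'true', 'ignoreChanges': 'false', 'skipChangeCommits': 'false'}\n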
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.set_temp_view_name","title":"set_temp_view_name","text":"set_temp_view_name()\n
Set a temporary view name for the dataframe for SQL queries
Source code in src/koheesio/spark/readers/delta.py
@model_validator(mode=\"after\")\ndef set_temp_view_name(self):\n \"\"\"Set a temporary view name for the dataframe for SQL queries\"\"\"\n table_name = self.table.table\n vw_name = get_random_string(prefix=f\"tmp_{table_name}\")\n self.__temp_view_name__ = vw_name\n return self\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableStreamReader","title":"koheesio.spark.readers.delta.DeltaTableStreamReader","text":"Reads data from a Delta table and returns a DataStream
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableStreamReader.streaming","title":"streaming class-attribute
instance-attribute
","text":"streaming: bool = True\n
"},{"location":"api_reference/spark/readers/dummy.html","title":"Dummy","text":"A simple DummyReader that returns a DataFrame with an id-column of the given range
"},{"location":"api_reference/spark/readers/dummy.html#koheesio.spark.readers.dummy.DummyReader","title":"koheesio.spark.readers.dummy.DummyReader","text":"A simple DummyReader that returns a DataFrame with an id-column of the given range
Can be used in place of any Reader without having to read from a real source.
Wraps SparkSession.range(). Output DataFrame will have a single column named \"id\" of type Long and length of the given range.
Parameters:
Name Type Description Default range
int
How large to make the Dataframe
required Example from koheesio.spark.readers.dummy import DummyReader\n\noutput_df = DummyReader(range=100).read()\n
output_df: Output DataFrame will have a single column named \"id\" of type Long
containing 100 rows (0-99).
id 0 1 ... 99"},{"location":"api_reference/spark/readers/dummy.html#koheesio.spark.readers.dummy.DummyReader.range","title":"range class-attribute
instance-attribute
","text":"range: int = Field(default=100, description='How large to make the Dataframe')\n
"},{"location":"api_reference/spark/readers/dummy.html#koheesio.spark.readers.dummy.DummyReader.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/readers/dummy.py
def execute(self):\n self.output.df = self.spark.range(self.range)\n
"},{"location":"api_reference/spark/readers/file_loader.html","title":"File loader","text":"Generic file Readers for different file formats.
Supported file formats: - CSV - Parquet - Avro - JSON - ORC - Text
Examples:
from koheesio.spark.readers import (\n CsvReader,\n ParquetReader,\n AvroReader,\n JsonReader,\n OrcReader,\n)\n\ncsv_reader = CsvReader(path=\"path/to/file.csv\", header=True)\nparquet_reader = ParquetReader(path=\"path/to/file.parquet\")\navro_reader = AvroReader(path=\"path/to/file.avro\")\njson_reader = JsonReader(path=\"path/to/file.json\")\norc_reader = OrcReader(path=\"path/to/file.orc\")\n
For more information about the available options, see Spark's official documentation.
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.AvroReader","title":"koheesio.spark.readers.file_loader.AvroReader","text":"Reads an Avro file.
This class is a convenience class that sets the format
field to FileFormat.avro
.
Extra parameters can be passed to the reader using the extra_params
attribute or as keyword arguments.
Example:
reader = AvroReader(path=\"path/to/file.avro\", mergeSchema=True)\n
Make sure to have the spark-avro
package installed in your environment.
For more information about the available options, see the official documentation.
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.AvroReader.format","title":"format class-attribute
instance-attribute
","text":"format: FileFormat = avro\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.CsvReader","title":"koheesio.spark.readers.file_loader.CsvReader","text":"Reads a CSV file.
This class is a convenience class that sets the format
field to FileFormat.csv
.
Extra parameters can be passed to the reader using the extra_params
attribute or as keyword arguments.
Example:
reader = CsvReader(path=\"path/to/file.csv\", header=True)\n
For more information about the available options, see the official pyspark documentation and read about CSV data source.
Also see the data sources generic options.
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.CsvReader.format","title":"format class-attribute
instance-attribute
","text":"format: FileFormat = csv\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat","title":"koheesio.spark.readers.file_loader.FileFormat","text":"Supported file formats.
This enum represents the supported file formats that can be used with the FileLoader class. The available file formats are: - csv: Comma-separated values format - parquet: Apache Parquet format - avro: Apache Avro format - json: JavaScript Object Notation format - orc: Apache ORC format - text: Plain text format
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.avro","title":"avro class-attribute
instance-attribute
","text":"avro = 'avro'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.csv","title":"csv class-attribute
instance-attribute
","text":"csv = 'csv'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.json","title":"json class-attribute
instance-attribute
","text":"json = 'json'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.orc","title":"orc class-attribute
instance-attribute
","text":"orc = 'orc'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.parquet","title":"parquet class-attribute
instance-attribute
","text":"parquet = 'parquet'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.text","title":"text class-attribute
instance-attribute
","text":"text = 'text'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader","title":"koheesio.spark.readers.file_loader.FileLoader","text":"Generic file reader.
Available file formats:\n- CSV\n- Parquet\n- Avro\n- JSON\n- ORC\n- Text (default)\n\nExtra parameters can be passed to the reader using the `extra_params` attribute or as keyword arguments.\n\nExample:\n```python\nreader = FileLoader(path=\"path/to/textfile.txt\", format=\"text\", header=True, lineSep=\"\n
\") ```
For more information about the available options, see Spark's\n[official pyspark documentation](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.text.html)\nand [read about text data source](https://spark.apache.org/docs/latest/sql-data-sources-text.html).\n\nAlso see the [data sources generic options](https://spark.apache.org/docs/3.5.0/sql-data-sources-generic-options.html).\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader.format","title":"format class-attribute
instance-attribute
","text":"format: FileFormat = Field(default=text, description='File format to read')\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader.path","title":"path class-attribute
instance-attribute
","text":"path: Union[Path, str] = Field(default=..., description='Path to the file to read')\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader.schema_","title":"schema_ class-attribute
instance-attribute
","text":"schema_: Optional[Union[StructType, str]] = Field(default=None, description='Schema to use when reading the file', validate_default=False, alias='schema')\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader.ensure_path_is_str","title":"ensure_path_is_str","text":"ensure_path_is_str(v)\n
Ensure that the path is a string as required by Spark.
Source code in src/koheesio/spark/readers/file_loader.py
@field_validator(\"path\")\ndef ensure_path_is_str(cls, v):\n \"\"\"Ensure that the path is a string as required by Spark.\"\"\"\n if isinstance(v, Path):\n return str(v.absolute().as_posix())\n return v\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader.execute","title":"execute","text":"execute()\n
Reads the file using the specified format, schema, while applying any extra parameters.
Source code in src/koheesio/spark/readers/file_loader.py
def execute(self):\n \"\"\"Reads the file using the specified format, schema, while applying any extra parameters.\"\"\"\n reader = self.spark.read.format(self.format)\n\n if self.schema_:\n reader.schema(self.schema_)\n\n if self.extra_params:\n reader = reader.options(**self.extra_params)\n\n self.output.df = reader.load(self.path)\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.JsonReader","title":"koheesio.spark.readers.file_loader.JsonReader","text":"Reads a JSON file.
This class is a convenience class that sets the format
field to FileFormat.json
.
Extra parameters can be passed to the reader using the extra_params
attribute or as keyword arguments.
Example:
reader = JsonReader(path=\"path/to/file.json\", allowComments=True)\n
For more information about the available options, see the official pyspark documentation and read about JSON data source.
Also see the data sources generic options.
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.JsonReader.format","title":"format class-attribute
instance-attribute
","text":"format: FileFormat = json\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.OrcReader","title":"koheesio.spark.readers.file_loader.OrcReader","text":"Reads an ORC file.
This class is a convenience class that sets the format
field to FileFormat.orc
.
Extra parameters can be passed to the reader using the extra_params
attribute or as keyword arguments.
Example:
reader = OrcReader(path=\"path/to/file.orc\", mergeSchema=True)\n
For more information about the available options, see the official documentation and read about ORC data source.
Also see the data sources generic options.
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.OrcReader.format","title":"format class-attribute
instance-attribute
","text":"format: FileFormat = orc\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.ParquetReader","title":"koheesio.spark.readers.file_loader.ParquetReader","text":"Reads a Parquet file.
This class is a convenience class that sets the format
field to FileFormat.parquet
.
Extra parameters can be passed to the reader using the extra_params
attribute or as keyword arguments.
Example:
reader = ParquetReader(path=\"path/to/file.parquet\", mergeSchema=True)\n
For more information about the available options, see the official pyspark documentation and read about Parquet data source.
Also see the data sources generic options.
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.ParquetReader.format","title":"format class-attribute
instance-attribute
","text":"format: FileFormat = parquet\n
"},{"location":"api_reference/spark/readers/hana.html","title":"Hana","text":"HANA reader.
"},{"location":"api_reference/spark/readers/hana.html#koheesio.spark.readers.hana.HanaReader","title":"koheesio.spark.readers.hana.HanaReader","text":"Wrapper around JdbcReader for SAP HANA
Notes - Refer to JdbcReader for the list of all available parameters.
- Refer to SAP HANA Client Interface Programming Reference docs for the list of all available connection string parameters: https://help.sap.com/docs/SAP_HANA_CLIENT/f1b440ded6144a54ada97ff95dac7adf/109397c2206a4ab2a5386d494f4cf75e.html
Example Note: jars should be added to the Spark session manually. This class does not take care of that.
This example depends on the SAP HANA ngdbc
JAR. e.g. ngdbc-2.5.49.
from koheesio.spark.readers.hana import HanaReader\njdbc_hana = HanaReader(\n url=\"jdbc:sap://<domain_or_ip>:<port>/?<options>\n user=\"YOUR_USERNAME\",\n password=\"***\",\n dbtable=\"schemaname.tablename\"\n)\ndf = jdbc_hana.read()\n
Parameters:
Name Type Description Default url
str
JDBC connection string. Refer to SAP HANA docs for the list of all available connection string parameters. Example: jdbc:sap://:[/?] required user
str
required password
SecretStr
required dbtable
str
Database table name, also include schema name
required options
Optional[Dict[str, Any]]
Extra options to pass to the SAP HANA JDBC driver. Refer to SAP HANA docs for the list of all available connection string parameters. Example: {\"fetchsize\": 2000, \"numPartitions\": 10}
required query
Optional[str]
Query
required format
str
The type of format to load. Defaults to 'jdbc'. Should not be changed.
required driver
str
Driver name. Be aware that the driver jar needs to be passed to the task. Should not be changed.
required"},{"location":"api_reference/spark/readers/hana.html#koheesio.spark.readers.hana.HanaReader.driver","title":"driver class-attribute
instance-attribute
","text":"driver: str = Field(default='com.sap.db.jdbc.Driver', description='Make sure that the necessary JARs are available in the cluster: ngdbc-2-x.x.x.x')\n
"},{"location":"api_reference/spark/readers/hana.html#koheesio.spark.readers.hana.HanaReader.options","title":"options class-attribute
instance-attribute
","text":"options: Optional[Dict[str, Any]] = Field(default={'fetchsize': 2000, 'numPartitions': 10}, description='Extra options to pass to the SAP HANA JDBC driver')\n
"},{"location":"api_reference/spark/readers/jdbc.html","title":"Jdbc","text":"Module for reading data from JDBC sources.
Classes:
Name Description JdbcReader
Reader for JDBC tables.
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader","title":"koheesio.spark.readers.jdbc.JdbcReader","text":"Reader for JDBC tables.
Wrapper around Spark's jdbc read format
Notes - Query has precedence over dbtable. If query and dbtable both are filled in, dbtable will be ignored!
- Extra options to the spark reader can be passed through the
options
input. Refer to Spark documentation for details: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html - Consider using
fetchsize
as one of the options, as it is greatly increases the performance of the reader - Consider using
numPartitions
, partitionColumn
, lowerBound
, upperBound
together with real or synthetic partitioning column as it will improve the reader performance
When implementing a JDBC reader, the get_options()
method should be implemented. The method should return a dict of options required for the specific JDBC driver. The get_options()
method can be overridden in the child class. Additionally, the driver
parameter should be set to the name of the JDBC driver. Be aware that the driver jar needs to be included in the Spark session; this class does not (and can not) take care of that!
Example Note: jars should be added to the Spark session manually. This class does not take care of that.
This example depends on the jar for MS SQL: https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/9.2.1.jre8/mssql-jdbc-9.2.1.jre8.jar
from koheesio.spark.readers.jdbc import JdbcReader\n\njdbc_mssql = JdbcReader(\n driver=\"com.microsoft.sqlserver.jdbc.SQLServerDriver\",\n url=\"jdbc:sqlserver://10.xxx.xxx.xxx:1433;databaseName=YOUR_DATABASE\",\n user=\"YOUR_USERNAME\",\n password=\"***\",\n dbtable=\"schemaname.tablename\",\n options={\"fetchsize\": 100},\n)\ndf = jdbc_mssql.read()\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.dbtable","title":"dbtable class-attribute
instance-attribute
","text":"dbtable: Optional[str] = Field(default=None, description='Database table name, also include schema name')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.driver","title":"driver class-attribute
instance-attribute
","text":"driver: str = Field(default=..., description='Driver name. Be aware that the driver jar needs to be passed to the task')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.format","title":"format class-attribute
instance-attribute
","text":"format: str = Field(default='jdbc', description=\"The type of format to load. Defaults to 'jdbc'.\")\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.options","title":"options class-attribute
instance-attribute
","text":"options: Optional[Dict[str, Any]] = Field(default_factory=dict, description='Extra options to pass to spark reader')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.password","title":"password class-attribute
instance-attribute
","text":"password: SecretStr = Field(default=..., description='Password belonging to the username')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.query","title":"query class-attribute
instance-attribute
","text":"query: Optional[str] = Field(default=None, description='Query')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.url","title":"url class-attribute
instance-attribute
","text":"url: str = Field(default=..., description='URL for the JDBC driver. Note, in some environments you need to use the IP Address instead of the hostname of the server.')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.user","title":"user class-attribute
instance-attribute
","text":"user: str = Field(default=..., description='User to authenticate to the server')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.execute","title":"execute","text":"execute()\n
Wrapper around Spark's jdbc read format
Source code in src/koheesio/spark/readers/jdbc.py
def execute(self):\n \"\"\"Wrapper around Spark's jdbc read format\"\"\"\n\n # Can't have both dbtable and query empty\n if not self.dbtable and not self.query:\n raise ValueError(\"Please do not leave dbtable and query both empty!\")\n\n if self.query and self.dbtable:\n self.log.info(\"Both 'query' and 'dbtable' are filled in, 'dbtable' will be ignored!\")\n\n options = self.get_options()\n\n if pw := self.password:\n options[\"password\"] = pw.get_secret_value()\n\n if query := self.query:\n options[\"query\"] = query\n self.log.info(f\"Executing query: {self.query}\")\n else:\n options[\"dbtable\"] = self.dbtable\n\n self.output.df = self.spark.read.format(self.format).options(**options).load()\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.get_options","title":"get_options","text":"get_options()\n
Dictionary of options required for the specific JDBC driver.
Note: override this method if driver requires custom names, e.g. Snowflake: sfUrl
, sfUser
, etc.
Source code in src/koheesio/spark/readers/jdbc.py
def get_options(self):\n \"\"\"\n Dictionary of options required for the specific JDBC driver.\n\n Note: override this method if driver requires custom names, e.g. Snowflake: `sfUrl`, `sfUser`, etc.\n \"\"\"\n return {\n \"driver\": self.driver,\n \"url\": self.url,\n \"user\": self.user,\n \"password\": self.password,\n **self.options,\n }\n
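To make the note above concrete, a hedged sketch of a subclass that renames options for a driver expecting custom option names (the class name and option keys are illustrative, not an actual Koheesio class):
from koheesio.spark.readers.jdbc import JdbcReader\n\nclass CustomNameJdbcReader(JdbcReader):\n    \"\"\"Illustrative only: a driver that expects 'sfUrl' / 'sfUser' style option names.\"\"\"\n    def get_options(self):\n        return {\n            \"driver\": self.driver,\n            \"sfUrl\": self.url,    # custom key instead of 'url'\n            \"sfUser\": self.user,  # custom key instead of 'user'\n            **self.options,\n        }\n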
"},{"location":"api_reference/spark/readers/kafka.html","title":"Kafka","text":"Module for KafkaReader and KafkaStreamReader.
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader","title":"koheesio.spark.readers.kafka.KafkaReader","text":"Reader for Kafka topics.
Wrapper around Spark's kafka read format. Supports both batch and streaming reads.
Parameters:
Name Type Description Default read_broker
str
Kafka brokers to read from. Should be passed as a single string with multiple brokers passed in a comma separated list
required topic
str
Kafka topic to consume.
required streaming
Optional[bool]
Whether to read the kafka topic as a stream or not.
required params
Optional[Dict[str, str]]
Arbitrary options to be applied when creating NSP Reader. If a user provides values for subscribe
or kafka.bootstrap.servers
, they will be ignored in favor of configuration passed through topic
and read_broker
respectively. Defaults to an empty dictionary.
required Notes - The
read_broker
and topic
parameters are required. - The
streaming
parameter defaults to False
. - The
params
parameter defaults to an empty dictionary. This parameter is also aliased as kafka_options
. - Any extra kafka options can also be passed as key-word arguments; these will be merged with the
params
parameter
Example from koheesio.spark.readers.kafka import KafkaReader\n\nkafka_reader = KafkaReader(\n read_broker=\"kafka-broker-1:9092,kafka-broker-2:9092\",\n topic=\"my-topic\",\n streaming=True,\n # extra kafka options can be passed as key-word arguments\n startingOffsets=\"earliest\",\n)\n
In the example above, the KafkaReader
will read from the my-topic
Kafka topic, using the brokers kafka-broker-1:9092
and kafka-broker-2:9092
. The reader will read the topic as a stream and will start reading from the earliest available offset.
The stream can be started by calling the read
or execute
method on the kafka_reader
object.
Note: The KafkaStreamReader
could be used in the example above to achieve the same result. streaming
would default to True
in that case and could be omitted from the parameters.
See Also - Official Spark Documentation: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.batch_reader","title":"batch_reader property
","text":"batch_reader\n
Returns the Spark read object for batch processing.
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.logged_option_keys","title":"logged_option_keys property
","text":"logged_option_keys\n
Keys that are allowed to be logged for the options.
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.options","title":"options property
","text":"options\n
Merge fixed parameters with arbitrary options provided by user.
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.params","title":"params class-attribute
instance-attribute
","text":"params: Optional[Dict[str, str]] = Field(default_factory=dict, alias='kafka_options', description=\"Arbitrary options to be applied when creating NSP Reader. If a user provides values for 'subscribe' or 'kafka.bootstrap.servers', they will be ignored in favor of configuration passed through 'topic' and 'read_broker' respectively.\")\n
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.read_broker","title":"read_broker class-attribute
instance-attribute
","text":"read_broker: str = Field(..., description='Kafka brokers to read from, should be passed as a single string with multiple brokers passed in a comma separated list')\n
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.reader","title":"reader property
","text":"reader\n
Returns the appropriate reader based on the streaming flag.
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.stream_reader","title":"stream_reader property
","text":"stream_reader\n
Returns the Spark readStream object.
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.streaming","title":"streaming class-attribute
instance-attribute
","text":"streaming: Optional[bool] = Field(default=False, description='Whether to read the kafka topic as a stream or not. Defaults to False.')\n
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.topic","title":"topic class-attribute
instance-attribute
","text":"topic: str = Field(default=..., description='Kafka topic to consume.')\n
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/readers/kafka.py
def execute(self):\n applied_options = {k: v for k, v in self.options.items() if k in self.logged_option_keys}\n self.log.debug(f\"Applying options {applied_options}\")\n\n self.output.df = self.reader.format(\"kafka\").options(**self.options).load()\n
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaStreamReader","title":"koheesio.spark.readers.kafka.KafkaStreamReader","text":"KafkaStreamReader is a KafkaReader that reads data as a stream
This class is identical to KafkaReader, with the streaming
parameter defaulting to True
.
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaStreamReader.streaming","title":"streaming class-attribute
instance-attribute
","text":"streaming: bool = True\n
"},{"location":"api_reference/spark/readers/memory.html","title":"Memory","text":"Create Spark DataFrame directly from the data stored in a Python variable
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.DataFormat","title":"koheesio.spark.readers.memory.DataFormat","text":"Data formats supported by the InMemoryDataReader
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.DataFormat.CSV","title":"CSV class-attribute
instance-attribute
","text":"CSV = 'csv'\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.DataFormat.JSON","title":"JSON class-attribute
instance-attribute
","text":"JSON = 'json'\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader","title":"koheesio.spark.readers.memory.InMemoryDataReader","text":"Directly read data from a Python variable and convert it to a Spark DataFrame.
Read data that is stored in one of the supported formats (see DataFormat) directly from a Python variable and convert it to a Spark DataFrame. Use cases include converting the JSON output of an API into a DataFrame, or reading CSV data received via an API (e.g. the Box API).
The advantage of using this reader is that it allows reading data directly from a Python variable, without the need to store it on disk. This can be useful when the data is small and does not need to be stored permanently.
Parameters:
Name Type Description Default data
Union[str, list, dict, bytes]
Source data
required format
DataFormat
File / data format
required schema_
Optional[StructType]
Schema that will be applied during the creation of Spark DataFrame
None
params
Optional[Dict[str, Any]]
Set of extra parameters that should be passed to the appropriate reader (csv / json). Optionally, the user can pass the parameters that are specific to the reader (e.g. multiLine
for JSON reader) as keyword arguments. These will be merged with the params
parameter.
dict
Example # Read CSV data from a string\ndf1 = InMemoryDataReader(format=DataFormat.CSV, data='foo,bar\\nA,1\\nB,2')\n\n# Read JSON data from a string\ndf2 = InMemoryDataReader(format=DataFormat.JSON, data='{\"foo\": \"A\", \"bar\": 1}')\n\n# Read JSON data from a list of strings\ndf3 = InMemoryDataReader(format=DataFormat.JSON, data=['{\"foo\": \"A\", \"bar\": 1}', '{\"foo\": \"B\", \"bar\": 2}'])\n
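Note that the calls above only configure the reader; as with other Koheesio readers, the DataFrame itself is obtained by calling the read (or execute) method. A minimal sketch:
reader = InMemoryDataReader(format=DataFormat.CSV, data='foo,bar\\nA,1\\nB,2')\ndf = reader.read()  # after execute(), the DataFrame is also available as reader.output.df\n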
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader.data","title":"data class-attribute
instance-attribute
","text":"data: Union[str, list, dict, bytes] = Field(default=..., description='Source data')\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader.format","title":"format class-attribute
instance-attribute
","text":"format: DataFormat = Field(default=..., description='File / data format')\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader.params","title":"params class-attribute
instance-attribute
","text":"params: Optional[Dict[str, Any]] = Field(default_factory=dict, description='[Optional] Set of extra parameters that should be passed to the appropriate reader (csv / json)')\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader.schema_","title":"schema_ class-attribute
instance-attribute
","text":"schema_: Optional[StructType] = Field(default=None, alias='schema', description='[Optional] Schema that will be applied during the creation of Spark DataFrame')\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader.execute","title":"execute","text":"execute()\n
Execute method appropriate to the specific data format
Source code in src/koheesio/spark/readers/memory.py
def execute(self):\n \"\"\"\n Execute method appropriate to the specific data format\n \"\"\"\n _func = getattr(InMemoryDataReader, f\"_{self.format}\")\n _df = partial(_func, self, self._rdd)()\n self.output.df = _df\n
"},{"location":"api_reference/spark/readers/metastore.html","title":"Metastore","text":"Create Spark DataFrame from table in Metastore
"},{"location":"api_reference/spark/readers/metastore.html#koheesio.spark.readers.metastore.MetastoreReader","title":"koheesio.spark.readers.metastore.MetastoreReader","text":"Reader for tables/views from Spark Metastore
Parameters:
Name Type Description Default table
str
Table name in spark metastore
required"},{"location":"api_reference/spark/readers/metastore.html#koheesio.spark.readers.metastore.MetastoreReader.table","title":"table class-attribute
instance-attribute
","text":"table: str = Field(default=..., description='Table name in spark metastore')\n
"},{"location":"api_reference/spark/readers/metastore.html#koheesio.spark.readers.metastore.MetastoreReader.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/readers/metastore.py
def execute(self):\n self.output.df = self.spark.table(self.table)\n
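For illustration, a minimal usage sketch (the table name is a placeholder):
from koheesio.spark.readers.metastore import MetastoreReader\n\n# read a table registered in the metastore into a DataFrame\ndf = MetastoreReader(table=\"my_database.my_table\").read()\n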
"},{"location":"api_reference/spark/readers/rest_api.html","title":"Rest api","text":"This module provides the RestApiReader class for interacting with RESTful APIs.
The RestApiReader class is designed to fetch data from RESTful APIs and store the response in a DataFrame. It supports different transports, e.g. paginated HTTP or async HTTP. The main entry point is the execute
method, which performs the transport.execute() call and provides the data from the API calls.
For more details on how to use this class and its methods, refer to the class docstring.
"},{"location":"api_reference/spark/readers/rest_api.html#koheesio.spark.readers.rest_api.RestApiReader","title":"koheesio.spark.readers.rest_api.RestApiReader","text":"A reader class that executes an API call and stores the response in a DataFrame.
Parameters:
Name Type Description Default transport
Union[InstanceOf[AsyncHttpGetStep], InstanceOf[HttpGetStep]]
The HTTP transport step.
required spark_schema
Union[str, StructType, List[str], Tuple[str, ...], AtomicType]
The pyspark schema of the response.
required Attributes:
Name Type Description transport
Union[InstanceOf[AsyncHttpGetStep], InstanceOf[HttpGetStep]]
The HTTP transport step.
spark_schema
Union[str, StructType, List[str], Tuple[str, ...], AtomicType]
The pyspark schema of the response.
Returns:
Type Description Output
The output of the reader, which includes the DataFrame.
Examples:
Here are some examples of how to use this class:
Example 1: Paginated Transport
import requests\nfrom requests.adapters import HTTPAdapter\nfrom urllib3 import Retry\n\nfrom koheesio.steps.http import PaginatedHtppGetStep\nfrom koheesio.spark.readers.rest_api import RestApiReader\n\n# maximum number of retries for failed requests\nmax_retries = 3\n\nsession = requests.Session()\nretry_logic = Retry(total=max_retries, status_forcelist=[503])\nsession.mount(\"https://\", HTTPAdapter(max_retries=retry_logic))\nsession.mount(\"http://\", HTTPAdapter(max_retries=retry_logic))\n\ntransport = PaginatedHtppGetStep(\n    url=\"https://api.example.com/data?page={page}\",\n    paginate=True,\n    pages=3,\n    session=session,\n)\ntask = RestApiReader(transport=transport, spark_schema=\"id: int, page:int, value: string\")\ntask.execute()\nall_data = [row.asDict() for row in task.output.df.collect()]\n
Example 2: Async Transport
from aiohttp import ClientSession, TCPConnector\nfrom aiohttp_retry import ExponentialRetry\nfrom yarl import URL\n\nfrom koheesio.steps.asyncio.http import AsyncHttpGetStep\nfrom koheesio.spark.readers.rest_api import RestApiReader\n\nsession = ClientSession()\nurls = [URL(\"http://httpbin.org/get\"), URL(\"http://httpbin.org/get\")]\nretry_options = ExponentialRetry()\nconnector = TCPConnector(limit=10)\ntransport = AsyncHttpGetStep(\n client_session=session,\n url=urls,\n retry_options=retry_options,\n connector=connector,\n)\n\ntask = RestApiReader(transport=transport, spark_schema=\"id: int, page:int, value: string\")\ntask.execute()\nall_data = [row.asDict() for row in task.output.df.collect()]\n
"},{"location":"api_reference/spark/readers/rest_api.html#koheesio.spark.readers.rest_api.RestApiReader.spark_schema","title":"spark_schema class-attribute
instance-attribute
","text":"spark_schema: Union[str, StructType, List[str], Tuple[str, ...], AtomicType] = Field(..., description='The pyspark schema of the response')\n
"},{"location":"api_reference/spark/readers/rest_api.html#koheesio.spark.readers.rest_api.RestApiReader.transport","title":"transport class-attribute
instance-attribute
","text":"transport: Union[InstanceOf[AsyncHttpGetStep], InstanceOf[HttpGetStep]] = Field(..., description='HTTP transport step', exclude=True)\n
"},{"location":"api_reference/spark/readers/rest_api.html#koheesio.spark.readers.rest_api.RestApiReader.execute","title":"execute","text":"execute() -> Output\n
Executes the API call and stores the response in a DataFrame.
Returns:
Type Description Output
The output of the reader, which includes the DataFrame.
Source code in src/koheesio/spark/readers/rest_api.py
def execute(self) -> Reader.Output:\n \"\"\"\n Executes the API call and stores the response in a DataFrame.\n\n Returns\n -------\n Reader.Output\n The output of the reader, which includes the DataFrame.\n \"\"\"\n raw_data = self.transport.execute()\n\n if isinstance(raw_data, HttpGetStep.Output):\n data = raw_data.response_json\n elif isinstance(raw_data, AsyncHttpGetStep.Output):\n data = [d for d, _ in raw_data.responses_urls] # type: ignore\n\n if data:\n self.output.df = self.spark.createDataFrame(data=data, schema=self.spark_schema) # type: ignore\n
"},{"location":"api_reference/spark/readers/snowflake.html","title":"Snowflake","text":"Module containing Snowflake reader classes.
This module contains classes for reading data from Snowflake. The classes are used to create a Spark DataFrame from a Snowflake table or a query.
Classes:
Name Description SnowflakeReader
Reader for Snowflake tables.
Query
Reader for Snowflake queries.
DbTableQuery
Reader for Snowflake queries that return a single row.
Notes The classes are defined in the koheesio.steps.integrations.snowflake module; this module simply inherits from the classes defined there.
See Also - koheesio.spark.readers.Reader Base class for all Readers.
- koheesio.steps.integrations.snowflake Module containing Snowflake classes.
More detailed class descriptions can be found in the class docstrings.
"},{"location":"api_reference/spark/readers/spark_sql_reader.html","title":"Spark sql reader","text":"This module contains the SparkSqlReader class which reads the SparkSQL compliant query and returns the dataframe.
"},{"location":"api_reference/spark/readers/spark_sql_reader.html#koheesio.spark.readers.spark_sql_reader.SparkSqlReader","title":"koheesio.spark.readers.spark_sql_reader.SparkSqlReader","text":"SparkSqlReader reads the SparkSQL compliant query and returns the dataframe.
This SQL can originate from a string or a file and may contain placeholders (parameters) for templating. - Placeholders are identified with ${placeholder}. - Placeholders can be passed as explicit params (params) or as implicit params (kwargs).
Example SQL script (example.sql):
SELECT id, id + 1 AS incremented_id, ${dynamic_column} AS extra_column\nFROM ${table_name}\n
Python code:
from koheesio.spark.readers import SparkSqlReader\n\nreader = SparkSqlReader(\n    sql_path=\"example.sql\",\n    # params can also be passed as kwargs\n    dynamic_column=\"name\",\n    table_name=\"my_table\",\n)\nreader.execute()\n
In this example, the SQL script is read from a file and the placeholders are replaced with the given params. The resulting SQL query is:
SELECT id, id + 1 AS incremented_id, name AS extra_column\nFROM my_table\n
The query is then executed and the resulting DataFrame is stored in the output.df
attribute.
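Alternatively, the SQL can be passed as a string via the sql parameter, with the placeholders supplied through params. A minimal sketch based on the parameters documented below:
reader = SparkSqlReader(\n    sql=\"SELECT id, ${dynamic_column} AS extra_column FROM ${table_name}\",\n    params={\"dynamic_column\": \"name\", \"table_name\": \"my_table\"},\n)\nreader.execute()\ndf = reader.output.df\n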
Parameters:
Name Type Description Default sql_path
str or Path
Path to a SQL file
required sql
str
SQL query to execute
required params
dict
Placeholders (parameters) for templating. These are identified with ${placeholder} in the SQL script.
required Notes Any arbitrary kwargs passed to the class will be added to params.
"},{"location":"api_reference/spark/readers/spark_sql_reader.html#koheesio.spark.readers.spark_sql_reader.SparkSqlReader.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/readers/spark_sql_reader.py
def execute(self):\n self.output.df = self.spark.sql(self.query)\n
"},{"location":"api_reference/spark/readers/teradata.html","title":"Teradata","text":"Teradata reader.
"},{"location":"api_reference/spark/readers/teradata.html#koheesio.spark.readers.teradata.TeradataReader","title":"koheesio.spark.readers.teradata.TeradataReader","text":"Wrapper around JdbcReader for Teradata.
Notes - Consider using synthetic partitioning column when using partitioned read:
MOD(HASHBUCKET(HASHROW(<TABLE>.<COLUMN>)), <NUM_PARTITIONS>)
- Relevant jars should be added to the Spark session manually. This class does not take care of that.
See Also - Refer to JdbcReader for the list of all available parameters.
- Refer to Teradata docs for the list of all available connection string parameters: https://teradata-docs.s3.amazonaws.com/doc/connectivity/jdbc/reference/current/jdbcug_chapter_2.html#BABJIHBJ
Example This example depends on the Teradata terajdbc4
JAR. e.g. terajdbc4-17.20.00.15. Keep in mind that older versions of terajdbc4
drivers also require tdgssconfig
JAR.
from koheesio.spark.readers.teradata import TeradataReader\n\ntd = TeradataReader(\n url=\"jdbc:teradata://<domain_or_ip>/logmech=ldap,charset=utf8,database=<db>,type=fastexport, maybenull=on\",\n user=\"YOUR_USERNAME\",\n password=\"***\",\n dbtable=\"schemaname.tablename\",\n)\n
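As a sketch of a partitioned read, the standard Spark JDBC partitioning options can be passed through options; the partition column below is a hypothetical numeric column (or the synthetic expression from the Notes above):
td_partitioned = TeradataReader(\n    url=\"jdbc:teradata://<domain_or_ip>/logmech=ldap,charset=utf8,database=<db>,type=fastexport, maybenull=on\",\n    user=\"YOUR_USERNAME\",\n    password=\"***\",\n    dbtable=\"schemaname.tablename\",\n    options={\n        \"fetchsize\": 2000,\n        \"numPartitions\": 10,\n        \"partitionColumn\": \"id\",  # hypothetical numeric column\n        \"lowerBound\": \"0\",\n        \"upperBound\": \"1000000\",\n    },\n)\n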
Parameters:
Name Type Description Default url
str
JDBC connection string. Refer to Teradata docs for the list of all available connection string parameters. Example: jdbc:teradata://<domain_or_ip>/logmech=ldap,charset=utf8,database=<db>,type=fastexport, maybenull=on
required user
str
Username
required password
SecretStr
Password
required dbtable
str
Database table name, also include schema name
required options
Optional[Dict[str, Any]]
Extra options to pass to the Teradata JDBC driver. Refer to Teradata docs for the list of all available connection string parameters.
{\"fetchsize\": 2000, \"numPartitions\": 10}
query
Optional[str]
Query
None
format
str
The type of format to load. Defaults to 'jdbc'. Should not be changed.
required driver
str
Driver name. Be aware that the driver jar needs to be passed to the task. Should not be changed.
required"},{"location":"api_reference/spark/readers/teradata.html#koheesio.spark.readers.teradata.TeradataReader.driver","title":"driver class-attribute
instance-attribute
","text":"driver: str = Field('com.teradata.jdbc.TeraDriver', description='Make sure that the necessary JARs are available in the cluster: terajdbc4-x.x.x.x')\n
"},{"location":"api_reference/spark/readers/teradata.html#koheesio.spark.readers.teradata.TeradataReader.options","title":"options class-attribute
instance-attribute
","text":"options: Optional[Dict[str, Any]] = Field({'fetchsize': 2000, 'numPartitions': 10}, description='Extra options to pass to the Teradata JDBC driver')\n
"},{"location":"api_reference/spark/readers/databricks/index.html","title":"Databricks","text":""},{"location":"api_reference/spark/readers/databricks/autoloader.html","title":"Autoloader","text":"Read from a location using Databricks' autoloader
Autoloader can ingest JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE file formats.
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader","title":"koheesio.spark.readers.databricks.autoloader.AutoLoader","text":"Read from a location using Databricks' autoloader
Autoloader can ingest JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE file formats.
Notes autoloader
is a Spark Structured Streaming
function!
Although most transformations are compatible with Spark Structured Streaming
, not all of them are. As a result, be mindful with your downstream transformations.
Parameters:
Name Type Description Default format
Union[str, AutoLoaderFormat]
The file format, used in cloudFiles.format
. Autoloader supports JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE file formats.
required location
str
The location where the files are located, used in cloudFiles.location
required schema_location
str
The location for storing inferred schema and supporting schema evolution, used in cloudFiles.schemaLocation
.
required options
Optional[Dict[str, str]]
Extra inputs to provide to the autoloader. For a full list of inputs, see https://docs.databricks.com/ingestion/auto-loader/options.html
{}
Example from koheesio.spark.readers.databricks import AutoLoader, AutoLoaderFormat\n\nresult_df = AutoLoader(\n format=AutoLoaderFormat.JSON,\n location=\"some_s3_path\",\n schema_location=\"other_s3_path\",\n options={\"multiLine\": \"true\"},\n).read()\n
See Also Some other useful documentation:
- autoloader: https://docs.databricks.com/ingestion/auto-loader/index.html
- Spark Structured Streaming: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.format","title":"format class-attribute
instance-attribute
","text":"format: Union[str, AutoLoaderFormat] = Field(default=..., description=__doc__)\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.location","title":"location class-attribute
instance-attribute
","text":"location: str = Field(default=..., description='The location where the files are located, used in `cloudFiles.location`')\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.options","title":"options class-attribute
instance-attribute
","text":"options: Optional[Dict[str, str]] = Field(default_factory=dict, description='Extra inputs to provide to the autoloader. For a full list of inputs, see https://docs.databricks.com/ingestion/auto-loader/options.html')\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.schema_location","title":"schema_location class-attribute
instance-attribute
","text":"schema_location: str = Field(default=..., alias='schemaLocation', description='The location for storing inferred schema and supporting schema evolution, used in `cloudFiles.schemaLocation`.')\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.execute","title":"execute","text":"execute()\n
Reads from the given location with the given options using Autoloader
Source code in src/koheesio/spark/readers/databricks/autoloader.py
def execute(self):\n \"\"\"Reads from the given location with the given options using Autoloader\"\"\"\n self.output.df = self.reader().load(self.location)\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.get_options","title":"get_options","text":"get_options()\n
Get the options for the autoloader
Source code in src/koheesio/spark/readers/databricks/autoloader.py
def get_options(self):\n \"\"\"Get the options for the autoloader\"\"\"\n self.options.update(\n {\n \"cloudFiles.format\": self.format,\n \"cloudFiles.schemaLocation\": self.schema_location,\n }\n )\n return self.options\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.reader","title":"reader","text":"reader()\n
Return the reader for the autoloader
Source code in src/koheesio/spark/readers/databricks/autoloader.py
def reader(self):\n \"\"\"Return the reader for the autoloader\"\"\"\n return self.spark.readStream.format(\"cloudFiles\").options(**self.get_options())\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.validate_format","title":"validate_format","text":"validate_format(format_specified)\n
Validate format
value
Source code in src/koheesio/spark/readers/databricks/autoloader.py
@field_validator(\"format\")\ndef validate_format(cls, format_specified):\n \"\"\"Validate `format` value\"\"\"\n if isinstance(format_specified, str):\n if format_specified.upper() in [f.value.upper() for f in AutoLoaderFormat]:\n format_specified = getattr(AutoLoaderFormat, format_specified.upper())\n return str(format_specified.value)\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat","title":"koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat","text":"The file format, used in cloudFiles.format
Autoloader supports JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE file formats.
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.AVRO","title":"AVRO class-attribute
instance-attribute
","text":"AVRO = 'avro'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.BINARYFILE","title":"BINARYFILE class-attribute
instance-attribute
","text":"BINARYFILE = 'binaryfile'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.CSV","title":"CSV class-attribute
instance-attribute
","text":"CSV = 'csv'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.JSON","title":"JSON class-attribute
instance-attribute
","text":"JSON = 'json'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.ORC","title":"ORC class-attribute
instance-attribute
","text":"ORC = 'orc'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.PARQUET","title":"PARQUET class-attribute
instance-attribute
","text":"PARQUET = 'parquet'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.TEXT","title":"TEXT class-attribute
instance-attribute
","text":"TEXT = 'text'\n
"},{"location":"api_reference/spark/transformations/index.html","title":"Transformations","text":"This module contains the base classes for all transformations.
See class docstrings for more information.
References For a comprehensive guide on the usage, examples, and additional features of Transformation classes, please refer to the reference/concepts/steps/transformations section of the Koheesio documentation.
Classes:
Name Description Transformation
Base class for all transformations
ColumnsTransformation
Extended Transformation class with a preset validator for handling column(s) data
ColumnsTransformationWithTarget
Extended ColumnsTransformation class with an additional target_column
field
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation","title":"koheesio.spark.transformations.ColumnsTransformation","text":"Extended Transformation class with a preset validator for handling column(s) data with a standardized input for a single column or multiple columns.
Concept A ColumnsTransformation is a Transformation with a standardized input for column or columns. The columns
are stored as a list. Either a single string, or a list of strings can be passed to enter the columns
. column
and columns
are aliases to one another - internally the name columns
should be used though.
columns
are stored as a list - either a single string, or a list of strings can be passed to enter the
columns
column
and columns
are aliases to one another - internally the name columns
should be used though.
If more than one column is passed, the behavior of the Class changes this way: - the transformation will be run in a loop against all the given columns
Configuring the ColumnsTransformation The ColumnsTransformation class has a ColumnConfig
class that can be used to configure the behavior of the class. This class has the following fields: - run_for_all_data_type
allows to run the transformation for all columns of a given type.
-
limit_data_type
allows to limit the transformation to a specific data type.
-
data_type_strict_mode
Toggles strict mode for data type validation. Will only work if limit_data_type
is set.
Note that Data types need to be specified as a SparkDatatype enum.
See the docstrings of the ColumnConfig
class for more information. See the SparkDatatype enum for a list of available data types.
Users should not have to interact with the ColumnConfig
class directly.
Parameters:
Name Type Description Default columns
The column (or list of columns) to apply the transformation to. Alias: column
required Example from pyspark.sql import functions as f\nfrom koheesio.steps.transformations import ColumnsTransformation\n\n\nclass AddOne(ColumnsTransformation):\n def execute(self):\n for column in self.get_columns():\n self.output.df = self.df.withColumn(column, f.col(column) + 1)\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.columns","title":"columns class-attribute
instance-attribute
","text":"columns: ListOfColumns = Field(default='', alias='column', description='The column (or list of columns) to apply the transformation to. Alias: column')\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.data_type_strict_mode_is_set","title":"data_type_strict_mode_is_set property
","text":"data_type_strict_mode_is_set: bool\n
Returns True if data_type_strict_mode is set
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.limit_data_type_is_set","title":"limit_data_type_is_set property
","text":"limit_data_type_is_set: bool\n
Returns True if limit_data_type is set
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.run_for_all_is_set","title":"run_for_all_is_set property
","text":"run_for_all_is_set: bool\n
Returns True if the transformation should be run for all columns of a given type
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.ColumnConfig","title":"ColumnConfig","text":"Koheesio ColumnsTransformation specific Config
Parameters:
Name Type Description Default run_for_all_data_type
allows to run the transformation for all columns of a given type. A user can trigger this behavior by either omitting the columns
parameter or by passing a single *
as a column name. In both cases, the run_for_all_data_type
will be used to determine the data type. Value should be be passed as a SparkDatatype enum. (default: [None])
required limit_data_type
allows to limit the transformation to a specific data type. Value should be passed as a SparkDatatype enum. (default: [None])
required data_type_strict_mode
Toggles strict mode for data type validation. Will only work if limit_data_type
is set. - when True, a ValueError will be raised if any column does not adhere to the limit_data_type
- when False, a warning will be thrown and the column will be skipped instead (default: False)
required"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.ColumnConfig.data_type_strict_mode","title":"data_type_strict_mode class-attribute
instance-attribute
","text":"data_type_strict_mode: bool = False\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type: Optional[List[SparkDatatype]] = [None]\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type: Optional[List[SparkDatatype]] = [None]\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.column_type_of_col","title":"column_type_of_col","text":"column_type_of_col(col: Union[str, Column], df: Optional[DataFrame] = None, simple_return_mode: bool = True) -> Union[DataType, str]\n
Returns the dataType of a Column object as a string.
The Column object does not have a type attribute, so we have to ask the DataFrame its schema and find the type based on the column name. We retrieve the name of the column from the Column object by calling toString() from the JVM.
Examples:
input_df: | str_column | int_column | |------------|------------| | hello | 1 | | world | 2 |
# using the AddOne transformation from the example above\nadd_one = AddOne(\n columns=[\"str_column\", \"int_column\"],\n df=input_df,\n)\nadd_one.column_type_of_col(\"str_column\") # returns \"string\"\nadd_one.column_type_of_col(\"int_column\") # returns \"integer\"\n# returns IntegerType\nadd_one.column_type_of_col(\"int_column\", simple_return_mode=False)\n
Parameters:
Name Type Description Default col
Union[str, Column]
The column to check the type of
required df
Optional[DataFrame]
The DataFrame belonging to the column. If not provided, the DataFrame passed to the constructor will be used.
None
simple_return_mode
bool
If True, the return value will be a simple string. If False, the return value will be a SparkDatatype enum.
True
Returns:
Name Type Description datatype
str
The type of the column as a string
Source code in src/koheesio/spark/transformations/__init__.py
def column_type_of_col(\n self, col: Union[str, Column], df: Optional[DataFrame] = None, simple_return_mode: bool = True\n) -> Union[DataType, str]:\n \"\"\"\n Returns the dataType of a Column object as a string.\n\n The Column object does not have a type attribute, so we have to ask the DataFrame its schema and find the type\n based on the column name. We retrieve the name of the column from the Column object by calling toString() from\n the JVM.\n\n Examples\n --------\n __input_df:__\n | str_column | int_column |\n |------------|------------|\n | hello | 1 |\n | world | 2 |\n\n ```python\n # using the AddOne transformation from the example above\n add_one = AddOne(\n columns=[\"str_column\", \"int_column\"],\n df=input_df,\n )\n add_one.column_type_of_col(\"str_column\") # returns \"string\"\n add_one.column_type_of_col(\"int_column\") # returns \"integer\"\n # returns IntegerType\n add_one.column_type_of_col(\"int_column\", simple_return_mode=False)\n ```\n\n Parameters\n ----------\n col: Union[str, Column]\n The column to check the type of\n\n df: Optional[DataFrame]\n The DataFrame belonging to the column. If not provided, the DataFrame passed to the constructor\n will be used.\n\n simple_return_mode: bool\n If True, the return value will be a simple string. If False, the return value will be a SparkDatatype enum.\n\n Returns\n -------\n datatype: str\n The type of the column as a string\n \"\"\"\n df = df or self.df\n if not df:\n raise RuntimeError(\"No valid Dataframe was passed\")\n\n if not isinstance(col, Column):\n col = f.col(col)\n\n # ask the JVM for the name of the column\n # noinspection PyProtectedMember\n col_name = col._jc.toString()\n\n # In order to check the datatype of the column, we have to ask the DataFrame its schema\n df_col = [c for c in df.schema if c.name == col_name][0]\n\n if simple_return_mode:\n return SparkDatatype(df_col.dataType.typeName()).value\n\n return df_col.dataType\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.get_all_columns_of_specific_type","title":"get_all_columns_of_specific_type","text":"get_all_columns_of_specific_type(data_type: Union[str, SparkDatatype]) -> List[str]\n
Get all columns from the dataframe of a given type
A DataFrame needs to be available in order to get the columns. If no DataFrame is available, a ValueError will be raised.
Note: only one data type can be passed to this method. If you want to get columns of multiple data types, you have to call this method multiple times.
Parameters:
Name Type Description Default data_type
Union[str, SparkDatatype]
The data type to get the columns for
required Returns:
Type Description List[str]
A list of column names of the given data type
Source code in src/koheesio/spark/transformations/__init__.py
def get_all_columns_of_specific_type(self, data_type: Union[str, SparkDatatype]) -> List[str]:\n \"\"\"Get all columns from the dataframe of a given type\n\n A DataFrame needs to be available in order to get the columns. If no DataFrame is available, a ValueError will\n be raised.\n\n Note: only one data type can be passed to this method. If you want to get columns of multiple data types, you\n have to call this method multiple times.\n\n Parameters\n ----------\n data_type: Union[str, SparkDatatype]\n The data type to get the columns for\n\n Returns\n -------\n List[str]\n A list of column names of the given data type\n \"\"\"\n if not self.df:\n raise ValueError(\"No dataframe available - cannot get columns\")\n\n expected_data_type = (SparkDatatype.from_string(data_type) if isinstance(data_type, str) else data_type).value\n\n columns_of_given_type: List[str] = [\n col for col in self.df.columns if self.df.schema[col].dataType.typeName() == expected_data_type\n ]\n return columns_of_given_type\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.get_columns","title":"get_columns","text":"get_columns() -> iter\n
Return an iterator of the columns
Source code in src/koheesio/spark/transformations/__init__.py
def get_columns(self) -> iter:\n \"\"\"Return an iterator of the columns\"\"\"\n # If `run_for_all_is_set` is True, we want to run the transformation for all columns of a given type\n if self.run_for_all_is_set:\n columns = []\n for data_type in self.ColumnConfig.run_for_all_data_type:\n columns += self.get_all_columns_of_specific_type(data_type)\n else:\n columns = self.columns\n\n for column in columns:\n if self.is_column_type_correct(column):\n yield column\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.get_limit_data_types","title":"get_limit_data_types","text":"get_limit_data_types()\n
Get the limit_data_type as a list of strings
Source code in src/koheesio/spark/transformations/__init__.py
def get_limit_data_types(self):\n \"\"\"Get the limit_data_type as a list of strings\"\"\"\n return [dt.value for dt in self.ColumnConfig.limit_data_type]\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.is_column_type_correct","title":"is_column_type_correct","text":"is_column_type_correct(column)\n
Check if column type is correct and handle it if not, when limit_data_type is set
Source code in src/koheesio/spark/transformations/__init__.py
def is_column_type_correct(self, column):\n \"\"\"Check if column type is correct and handle it if not, when limit_data_type is set\"\"\"\n if not self.limit_data_type_is_set:\n return True\n\n if self.column_type_of_col(column) in (limit_data_types := self.get_limit_data_types()):\n return True\n\n # Raises a ValueError if the Column object is not of a given type and data_type_strict_mode is set\n if self.data_type_strict_mode_is_set:\n raise ValueError(\n f\"Critical error: {column} is not of type {limit_data_types}. Exception is raised because \"\n f\"`data_type_strict_mode` is set to True for {self.name}.\"\n )\n\n # Otherwise, throws a warning that the Column object is not of a given type\n self.log.warning(f\"Column `{column}` is not of type `{limit_data_types}` and will be skipped.\")\n return False\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.set_columns","title":"set_columns","text":"set_columns(columns_value)\n
Validate columns through the columns configuration provided
Source code in src/koheesio/spark/transformations/__init__.py
@field_validator(\"columns\", mode=\"before\")\ndef set_columns(cls, columns_value):\n \"\"\"Validate columns through the columns configuration provided\"\"\"\n columns = columns_value\n run_for_all_data_type = cls.ColumnConfig.run_for_all_data_type\n\n if run_for_all_data_type and len(columns) == 0:\n columns = [\"*\"]\n\n if columns[0] == \"*\" and not run_for_all_data_type:\n raise ValueError(\"Cannot use '*' as a column name when no run_for_all_data_type is set\")\n\n return columns\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformationWithTarget","title":"koheesio.spark.transformations.ColumnsTransformationWithTarget","text":"Extended ColumnsTransformation class with an additional target_column
field
Using this class makes implementing Transformations significantly easier.
Concept A ColumnsTransformationWithTarget
is a ColumnsTransformation
with an additional target_column
field. This field can be used to store the result of the transformation in a new column.
If the target_column
is not provided, the result will be stored in the source column.
If more than one column is passed, the behavior of the Class changes this way:
- the transformation will be run in a loop against all the given columns
- automatically handles the renaming of the columns when more than one column is passed
- the
target_column
will be used as a suffix. Leaving this blank will result in the original columns being renamed
The func
method should be implemented in the child class. This method should return the transformation that will be applied to the column(s). The execute method (already preset) will use the get_columns_with_target
method to loop over all the columns and apply this function to transform the DataFrame.
Parameters:
Name Type Description Default columns
ListOfColumns
The column (or list of columns) to apply the transformation to. Alias: column. If not provided, the run_for_all_data_type
will be used to determine the data type. If run_for_all_data_type
is not set, the transformation will be run for all columns of a given type.
*
target_column
Optional[str]
The name of the column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this input will be used as a suffix instead.
None
Example Writing your own transformation using the ColumnsTransformationWithTarget
class:
from pyspark.sql import Column\nfrom koheesio.steps.transformations import ColumnsTransformationWithTarget\n\n\nclass AddOneWithTarget(ColumnsTransformationWithTarget):\n def func(self, col: Column):\n return col + 1\n
In the above example, the func
method is implemented to add 1 to the values of a given column.
In order to use this transformation, we can call the transform
method:
from pyspark.sql import SparkSession\n\n# create a DataFrame with 3 rows\ndf = SparkSession.builder.getOrCreate().range(3)\n\noutput_df = AddOneWithTarget(column=\"id\", target_column=\"new_id\").transform(df)\n
The output_df
will now contain the original DataFrame with an additional column called new_id
with the values of id
+ 1.
output_df:
id new_id 0 1 1 2 2 3 Note: The target_column
will be used as a suffix when more than one column is given as source. Leaving this blank will result in the original columns being renamed.
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformationWithTarget.target_column","title":"target_column class-attribute
instance-attribute
","text":"target_column: Optional[str] = Field(default=None, alias='target_suffix', description='The column to store the result in. If not provided, the result will be stored in the sourcecolumn. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix')\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformationWithTarget.execute","title":"execute","text":"execute()\n
Execute on a ColumnsTransformationWithTarget handles self.df (input) and set self.output.df (output) This can be left unchanged, and hence should not be implemented in the child class.
Source code in src/koheesio/spark/transformations/__init__.py
def execute(self):\n \"\"\"Execute on a ColumnsTransformationWithTarget handles self.df (input) and set self.output.df (output)\n This can be left unchanged, and hence should not be implemented in the child class.\n \"\"\"\n df = self.df\n\n for target_column, column in self.get_columns_with_target():\n func = self.func # select the applicable function\n df = df.withColumn(\n target_column,\n func(f.col(column)),\n )\n\n self.output.df = df\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformationWithTarget.func","title":"func abstractmethod
","text":"func(column: Column) -> Column\n
The function that will be run on a single Column of the DataFrame
The func
method should be implemented in the child class. This method should return the transformation that will be applied to the column(s). The execute method (already preset) will use the get_columns_with_target
method to loop over all the columns and apply this function to transform the DataFrame.
Parameters:
Name Type Description Default column
Column
The column to apply the transformation to
required Returns:
Type Description Column
The transformed column
Source code in src/koheesio/spark/transformations/__init__.py
@abstractmethod\ndef func(self, column: Column) -> Column:\n \"\"\"The function that will be run on a single Column of the DataFrame\n\n The `func` method should be implemented in the child class. This method should return the transformation that\n will be applied to the column(s). The execute method (already preset) will use the `get_columns_with_target`\n method to loop over all the columns and apply this function to transform the DataFrame.\n\n Parameters\n ----------\n column: Column\n The column to apply the transformation to\n\n Returns\n -------\n Column\n The transformed column\n \"\"\"\n raise NotImplementedError\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformationWithTarget.get_columns_with_target","title":"get_columns_with_target","text":"get_columns_with_target() -> iter\n
Return an iterator of the columns
Works just like in get_columns from the ColumnsTransformation class except that it handles the target_column
as well.
If more than one column is passed, the behavior of the Class changes this way: - the transformation will be run in a loop against all the given columns - the target_column will be used as a suffix. Leaving this blank will result in the original columns being renamed.
Returns:
Type Description iter
An iterator of tuples containing the target column name and the original column name
Source code in src/koheesio/spark/transformations/__init__.py
def get_columns_with_target(self) -> iter:\n \"\"\"Return an iterator of the columns\n\n Works just like in get_columns from the ColumnsTransformation class except that it handles the `target_column`\n as well.\n\n If more than one column is passed, the behavior of the Class changes this way:\n - the transformation will be run in a loop against all the given columns\n - the target_column will be used as a suffix. Leaving this blank will result in the original columns being\n renamed.\n\n Returns\n -------\n iter\n An iterator of tuples containing the target column name and the original column name\n \"\"\"\n columns = [*self.get_columns()]\n\n for column in columns:\n # ensures that we at least use the original column name\n target_column = self.target_column or column\n\n if len(columns) > 1: # target_column becomes a suffix when more than 1 column is given\n # dict.fromkeys is used to avoid duplicates in the name while maintaining order\n _cols = [column, target_column]\n target_column = \"_\".join(list(dict.fromkeys(_cols)))\n\n yield target_column, column\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.Transformation","title":"koheesio.spark.transformations.Transformation","text":"Base class for all transformations
Concept A Transformation is a Step that takes a DataFrame as input and returns a DataFrame as output. The DataFrame is transformed based on the logic implemented in the execute
method. Any additional parameters that are needed for the transformation can be passed to the constructor.
Parameters:
Name Type Description Default df
The DataFrame to apply the transformation to. If not provided, the DataFrame has to be passed to the transform-method.
required Example from koheesio.steps.transformations import Transformation\nfrom pyspark.sql import functions as f\n\n\nclass AddOne(Transformation):\n def execute(self):\n self.output.df = self.df.withColumn(\"new_column\", f.col(\"old_column\") + 1)\n
In the example above, the execute
method is implemented to add 1 to the values of the old_column
and store the result in a new column called new_column
.
In order to use this transformation, we can call the transform
method:
from pyspark.sql import SparkSession\n\n# create a DataFrame with 3 rows\ndf = SparkSession.builder.getOrCreate().range(3)\n\noutput_df = AddOne().transform(df)\n
The output_df
will now contain the original DataFrame with an additional column called new_column
with the values of old_column
+ 1.
output_df:
id new_column 0 1 1 2 2 3 ... Alternatively, we can pass the DataFrame to the constructor and call the execute
or transform
method without any arguments:
output_df = AddOne(df).transform()\n# or\noutput_df = AddOne(df).execute().output.df\n
Note: that the transform method was not implemented explicitly in the AddOne class. This is because the transform
method is already implemented in the Transformation
class. This means that all classes that inherit from the Transformation class will have the transform
method available. Only the execute method needs to be implemented.
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.Transformation.df","title":"df class-attribute
instance-attribute
","text":"df: Optional[DataFrame] = Field(default=None, description='The Spark DataFrame')\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.Transformation.execute","title":"execute abstractmethod
","text":"execute() -> Output\n
Execute on a Transformation should handle self.df (input) and set self.output.df (output)
This method should be implemented in the child class. The input DataFrame is available as self.df
and the output DataFrame should be stored in self.output.df
.
For example:
def execute(self):\n self.output.df = self.df.withColumn(\"new_column\", f.col(\"old_column\") + 1)\n
The transform method will call this method and return the output DataFrame.
Source code in src/koheesio/spark/transformations/__init__.py
@abstractmethod\ndef execute(self) -> SparkStep.Output:\n \"\"\"Execute on a Transformation should handle self.df (input) and set self.output.df (output)\n\n This method should be implemented in the child class. The input DataFrame is available as `self.df` and the\n output DataFrame should be stored in `self.output.df`.\n\n For example:\n ```python\n def execute(self):\n self.output.df = self.df.withColumn(\"new_column\", f.col(\"old_column\") + 1)\n ```\n\n The transform method will call this method and return the output DataFrame.\n \"\"\"\n # self.df # input dataframe\n # self.output.df # output dataframe\n self.output.df = ... # implement the transformation logic\n raise NotImplementedError\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.Transformation.transform","title":"transform","text":"transform(df: Optional[DataFrame] = None) -> DataFrame\n
Execute the transformation and return the output DataFrame
Note: when creating a child from this, don't implement this transform method. Instead, implement execute!
See Also Transformation.execute
Parameters:
Name Type Description Default df
Optional[DataFrame]
The DataFrame to apply the transformation to. If not provided, the DataFrame passed to the constructor will be used.
None
Returns:
Type Description DataFrame
The transformed DataFrame
Source code in src/koheesio/spark/transformations/__init__.py
def transform(self, df: Optional[DataFrame] = None) -> DataFrame:\n \"\"\"Execute the transformation and return the output DataFrame\n\n Note: when creating a child from this, don't implement this transform method. Instead, implement execute!\n\n See Also\n --------\n `Transformation.execute`\n\n Parameters\n ----------\n df: Optional[DataFrame]\n The DataFrame to apply the transformation to. If not provided, the DataFrame passed to the constructor\n will be used.\n\n Returns\n -------\n DataFrame\n The transformed DataFrame\n \"\"\"\n self.df = df or self.df\n if not self.df:\n raise RuntimeError(\"No valid Dataframe was passed\")\n self.execute()\n return self.output.df\n
"},{"location":"api_reference/spark/transformations/arrays.html","title":"Arrays","text":"A collection of classes for performing various transformations on arrays in PySpark.
These transformations include operations such as removing duplicates, exploding arrays into separate rows, reversing the order of elements, sorting elements, removing certain values, and calculating aggregate statistics like minimum, maximum, sum, mean, and median.
Concept - Every transformation in this module is implemented as a class that inherits from the
ArrayTransformation
class. - The
ArrayTransformation
class is a subclass of ColumnsTransformationWithTarget
- The
ArrayTransformation
class implements the func
method, which is used to define the transformation logic. - The
func
method takes a column
as input and returns a Column
object. - The
Column
object is a PySpark column that can be used to perform transformations on a DataFrame column. - The
ArrayTransformation
limits the data type of the transformation to array by setting the ColumnConfig
class to run_for_all_data_type = [SparkDatatype.ARRAY]
and limit_data_type = [SparkDatatype.ARRAY]
.
See Also - koheesio.spark.transformations Module containing all transformation classes.
- koheesio.spark.transformations.ColumnsTransformationWithTarget Base class for all transformations that operate on columns and have a target column.
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySortAsc","title":"koheesio.spark.transformations.arrays.ArraySortAsc module-attribute
","text":"ArraySortAsc = ArraySort\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayDistinct","title":"koheesio.spark.transformations.arrays.ArrayDistinct","text":"Remove duplicates from array
Example ArrayDistinct(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayDistinct.filter_empty","title":"filter_empty class-attribute
instance-attribute
","text":"filter_empty: bool = Field(default=True, description='Remove null, nan, and empty values from array. Default is True.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayDistinct.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n _fn = F.array_distinct(column)\n\n # noinspection PyUnresolvedReferences\n element_type = self.column_type_of_col(column, None, False).elementType\n is_numeric = spark_data_type_is_numeric(element_type)\n\n if self.filter_empty:\n # Remove null values from array\n if spark_minor_version >= 3.4:\n # Run array_compact if spark version is 3.4 or higher\n # https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.array_compact.html\n # pylint: disable=E0611\n from pyspark.sql.functions import array_compact as _array_compact\n\n _fn = _array_compact(_fn)\n # pylint: enable=E0611\n else:\n # Otherwise, remove null from array using array_except\n _fn = F.array_except(_fn, F.array(F.lit(None)))\n\n # Remove nan or empty values from array (depends on the type of the elements in array)\n if is_numeric:\n # Remove nan from array (float/int/numbers)\n _fn = F.array_except(_fn, F.array(F.lit(float(\"nan\")).cast(element_type)))\n else:\n # Remove empty values from array (string/text)\n _fn = F.array_except(_fn, F.array(F.lit(\"\"), F.lit(\" \")))\n\n return _fn\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMax","title":"koheesio.spark.transformations.arrays.ArrayMax","text":"Return the maximum value in the array
Example ArrayMax(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMax.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n # Call for processing of nan values\n column = super().func(column)\n\n return F.array_max(column)\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMean","title":"koheesio.spark.transformations.arrays.ArrayMean","text":"Return the mean of the values in the array.
Note: Only numeric values are supported for calculating the mean.
Example ArrayMean(column=\"array_column\", target_column=\"average\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMean.func","title":"func","text":"func(column: Column) -> Column\n
Calculate the mean of the values in the array
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n \"\"\"Calculate the mean of the values in the array\"\"\"\n # raise an error if the array contains non-numeric elements\n element_type = self.column_type_of_col(col=column, df=None, simple_return_mode=False).elementType\n\n if not spark_data_type_is_numeric(element_type):\n raise ValueError(\n f\"{column = } contains non-numeric values. The array type is {element_type}. \"\n f\"Only numeric values are supported for calculating a mean.\"\n )\n\n _sum = ArraySum.from_step(self).func(column)\n # Call for processing of nan values\n column = super().func(column)\n _size = F.size(column)\n # return 0 if the size of the array is 0 to avoid division by zero\n return F.when(_size == 0, F.lit(0)).otherwise(_sum / _size)\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMedian","title":"koheesio.spark.transformations.arrays.ArrayMedian","text":"Return the median of the values in the array.
The median is the middle value in a sorted, ascending or descending, list of numbers.
- If the size of the array is even, the median is the average of the two middle numbers.
- If the size of the array is odd, the median is the middle number.
Note: Only numeric values are supported for calculating the median.
Example ArrayMedian(column=\"array_column\", target_column=\"median\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMedian.func","title":"func","text":"func(column: Column) -> Column\n
Calculate the median of the values in the array
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n \"\"\"Calculate the median of the values in the array\"\"\"\n # Call for processing of nan values\n column = super().func(column)\n\n sorted_array = ArraySort.from_step(self).func(column)\n _size: Column = F.size(sorted_array)\n\n # Calculate the middle index. If the size is odd, PySpark discards the fractional part.\n # Use floor function to ensure the result is an integer\n middle: Column = F.floor((_size + 1) / 2).cast(\"int\")\n\n # Define conditions\n is_size_zero: Column = _size == 0\n is_column_null: Column = column.isNull()\n is_size_even: Column = _size % 2 == 0\n\n # Define actions / responses\n # For even-sized arrays, calculate the average of the two middle elements\n average_of_middle_elements = (F.element_at(sorted_array, middle) + F.element_at(sorted_array, middle + 1)) / 2\n # For odd-sized arrays, select the middle element\n middle_element = F.element_at(sorted_array, middle)\n # In case the array is empty, return either None or 0\n none_value = F.lit(None)\n zero_value = F.lit(0)\n\n median = (\n # Check if the size of the array is 0\n F.when(\n is_size_zero,\n # If the size of the array is 0 and the column is null, return None\n # If the size of the array is 0 and the column is not null, return 0\n F.when(is_column_null, none_value).otherwise(zero_value),\n ).otherwise(\n # If the size of the array is not 0, calculate the median\n F.when(is_size_even, average_of_middle_elements).otherwise(middle_element)\n )\n )\n\n return median\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMin","title":"koheesio.spark.transformations.arrays.ArrayMin","text":"Return the minimum value in the array
Example ArrayMin(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMin.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n return F.array_min(column)\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayNullNanProcess","title":"koheesio.spark.transformations.arrays.ArrayNullNanProcess","text":"Process an array by removing NaN and/or NULL values from elements.
Parameters:
Name Type Description Default keep_nan
bool
Whether to keep NaN values in the array. If set to True, the NaN values will be kept in the array.
False
keep_null
bool
Whether to keep NULL values in the array. If set to True, the NULL values will be kept in the array.
False
Returns:
Name Type Description column
Column
The processed column with NaN and/or NULL values removed from elements.
Examples:
>>> input_data = [(1, [1.1, 2.1, 4.1, float(\"nan\")])]\n>>> input_schema = StructType([StructField(\"id\", IntegerType(), True),\n StructField(\"array_float\", ArrayType(FloatType()), True),\n])\n>>> spark = SparkSession.builder.getOrCreate()\n>>> df = spark.createDataFrame(input_data, schema=input_schema)\n>>> transformer = ArrayNullNanProcess(column=\"array_float\", keep_nan=False)\n>>> transformer.transform(df)\n>>> print(transformer.output.df.collect()[0].asDict()[\"array_float\"])\n[1.1, 2.1, 4.1]\n\n>>> input_data = [(1, [1.1, 2.1, 4.1, float(\"nan\")])]\n>>> input_schema = StructType([StructField(\"id\", IntegerType(), True),\n StructField(\"array_float\", ArrayType(FloatType()), True),\n])\n>>> spark = SparkSession.builder.getOrCreate()\n>>> df = spark.createDataFrame(input_data, schema=input_schema)\n>>> transformer = ArrayNullNanProcess(column=\"array_float\", keep_nan=True)\n>>> transformer.transform(df)\n>>> print(transformer.output.df.collect()[0].asDict()[\"array_float\"])\n[1.1, 2.1, 4.1, nan]\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayNullNanProcess.keep_nan","title":"keep_nan class-attribute
instance-attribute
","text":"keep_nan: bool = Field(False, description='Whether to keep nan values in the array. Default is False. If set to True, the nan values will be kept in the array.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayNullNanProcess.keep_null","title":"keep_null class-attribute
instance-attribute
","text":"keep_null: bool = Field(False, description='Whether to keep null values in the array. Default is False. If set to True, the null values will be kept in the array.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayNullNanProcess.func","title":"func","text":"func(column: Column) -> Column\n
Process the given column by removing NaN and/or NULL values from elements.
Parameters: column : Column The column to be processed.
Returns: column : Column The processed column with NaN and/or NULL values removed from elements.
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n \"\"\"\n Process the given column by removing NaN and/or NULL values from elements.\n\n Parameters:\n -----------\n column : Column\n The column to be processed.\n\n Returns:\n --------\n column : Column\n The processed column with NaN and/or NULL values removed from elements.\n \"\"\"\n\n def apply_logic(x: Column):\n if self.keep_nan is False and self.keep_null is False:\n logic = x.isNotNull() & ~F.isnan(x)\n elif self.keep_nan is False:\n logic = ~F.isnan(x)\n elif self.keep_null is False:\n logic = x.isNotNull()\n\n return logic\n\n if self.keep_nan is False or self.keep_null is False:\n column = F.filter(column, apply_logic)\n\n return column\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayRemove","title":"koheesio.spark.transformations.arrays.ArrayRemove","text":"Remove a certain value from the array
Parameters:
Name Type Description Default keep_nan
bool
Whether to keep NaN values in the array. If set to True, the NaN values will be kept in the array.
False
keep_null
bool
Whether to keep NULL values in the array. If set to True, the NULL values will be kept in the array.
False
Example ArrayRemove(column=\"array_column\", value=\"value_to_remove\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayRemove.make_distinct","title":"make_distinct class-attribute
instance-attribute
","text":"make_distinct: bool = Field(default=False, description='Whether to remove duplicates from the array.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayRemove.value","title":"value class-attribute
instance-attribute
","text":"value: Any = Field(default=None, description='The value to remove from the array.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayRemove.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n value = self.value\n\n column = super().func(column)\n\n def filter_logic(x: Column, _val: Any):\n if self.keep_null and self.keep_nan:\n logic = (x != F.lit(_val)) | x.isNull() | F.isnan(x)\n elif self.keep_null:\n logic = (x != F.lit(_val)) | x.isNull()\n elif self.keep_nan:\n logic = (x != F.lit(_val)) | F.isnan(x)\n else:\n logic = x != F.lit(_val)\n\n return logic\n\n # Check if the value is iterable (i.e., a list, tuple, or set)\n if isinstance(value, (list, tuple, set)):\n result = reduce(lambda res, val: F.filter(res, lambda x: filter_logic(x, val)), value, column)\n else:\n # If the value is not iterable, simply remove the value from the array\n result = F.filter(column, lambda x: filter_logic(x, value))\n\n if self.make_distinct:\n result = F.array_distinct(result)\n\n return result\n
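A short sketch showing removal of a single value as well as a list of values (hypothetical numeric data, assuming an active SparkSession named spark):
>>> from koheesio.spark.transformations.arrays import ArrayRemove\n>>> df = spark.createDataFrame([(1, [1.0, 2.0, 2.0, 3.0])], [\"id\", \"values\"])\n>>> ArrayRemove(column=\"values\", value=3.0).transform(df)  # values becomes [1.0, 2.0, 2.0]\n>>> ArrayRemove(column=\"values\", value=[2.0, 3.0], make_distinct=True).transform(df)  # values becomes [1.0]\n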
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayReverse","title":"koheesio.spark.transformations.arrays.ArrayReverse","text":"Reverse the order of elements in the array
Example ArrayReverse(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayReverse.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n return F.reverse(column)\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySort","title":"koheesio.spark.transformations.arrays.ArraySort","text":"Sort the elements in the array
By default, the elements are sorted in ascending order. To sort the elements in descending order, set the reverse
parameter to True.
Example ArraySort(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySort.reverse","title":"reverse class-attribute
instance-attribute
","text":"reverse: bool = Field(default=False, description='Sort the elements in the array in a descending order. Default is False.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySort.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n column = F.array_sort(column)\n if self.reverse:\n # Reverse the order of elements in the array\n column = ArrayReverse.from_step(self).func(column)\n return column\n
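For reference, a minimal sketch of ascending and descending sorting (hypothetical data, assuming an active SparkSession named spark; without a target_column the result replaces the source column):
>>> from koheesio.spark.transformations.arrays import ArraySort\n>>> df = spark.createDataFrame([(1, [3, 1, 2])], [\"id\", \"values\"])\n>>> ArraySort(column=\"values\").transform(df)                 # values becomes [1, 2, 3]\n>>> ArraySort(column=\"values\", reverse=True).transform(df)   # values becomes [3, 2, 1]\n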
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySortDesc","title":"koheesio.spark.transformations.arrays.ArraySortDesc","text":"Sort the elements in the array in descending order
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySortDesc.reverse","title":"reverse class-attribute
instance-attribute
","text":"reverse: bool = True\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySum","title":"koheesio.spark.transformations.arrays.ArraySum","text":"Return the sum of the values in the array
Parameters:
Name Type Description Default keep_nan
bool
Whether to keep NaN values in the array. If set to True, the NaN values will be kept in the array.
False
keep_null
bool
Whether to keep NULL values in the array. If set to True, the NULL values will be kept in the array.
False
Example ArraySum(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySum.func","title":"func","text":"func(column: Column) -> Column\n
Using the aggregate
function to sum the values in the array
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n \"\"\"Using the `aggregate` function to sum the values in the array\"\"\"\n # raise an error if the array contains non-numeric elements\n element_type = self.column_type_of_col(column, None, False).elementType\n if not spark_data_type_is_numeric(element_type):\n raise ValueError(\n f\"{column = } contains non-numeric values. The array type is {element_type}. \"\n f\"Only numeric values are supported for summing.\"\n )\n\n # remove na values from array.\n column = super().func(column)\n\n # Using the `aggregate` function to sum the values in the array by providing the initial value as 0.0 and the\n # lambda function to add the elements together. Pyspark will automatically infer the type of the initial value\n # making 0.0 valid for both integer and float types.\n initial_value = F.lit(0.0)\n return F.aggregate(column, initial_value, lambda accumulator, x: accumulator + x)\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayTransformation","title":"koheesio.spark.transformations.arrays.ArrayTransformation","text":"Base class for array transformations
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayTransformation.ColumnConfig","title":"ColumnConfig","text":"Set the data type of the Transformation to array
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayTransformation.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [ARRAY]\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayTransformation.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [ARRAY]\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayTransformation.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n raise NotImplementedError(\"This is an abstract class\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.Explode","title":"koheesio.spark.transformations.arrays.Explode","text":"Explode the array into separate rows
Example Explode(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.Explode.distinct","title":"distinct class-attribute
instance-attribute
","text":"distinct: bool = Field(False, description='Remove duplicates from the exploded array. Default is False.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.Explode.preserve_nulls","title":"preserve_nulls class-attribute
instance-attribute
","text":"preserve_nulls: bool = Field(True, description='Preserve rows with null values in the exploded array by using explode_outer instead of explode.Default is True.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.Explode.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n if self.distinct:\n column = ArrayDistinct.from_step(self).func(column)\n return F.explode_outer(column) if self.preserve_nulls else F.explode(column)\n
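A minimal sketch of the default behaviour (hypothetical data, assuming an active SparkSession named spark):
>>> from koheesio.spark.transformations.arrays import Explode\n>>> df = spark.createDataFrame([(1, [1, 2]), (2, None)], [\"id\", \"values\"])\n>>> output_df = Explode(column=\"values\").transform(df)\n>>> # each element becomes its own row; id 2 is kept as a null row because preserve_nulls defaults to True (explode_outer)\n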
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ExplodeDistinct","title":"koheesio.spark.transformations.arrays.ExplodeDistinct","text":"Explode the array into separate rows while removing duplicates and empty values
Example ExplodeDistinct(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ExplodeDistinct.distinct","title":"distinct class-attribute
instance-attribute
","text":"distinct: bool = True\n
"},{"location":"api_reference/spark/transformations/camel_to_snake.html","title":"Camel to snake","text":"Class for converting DataFrame column names from camel case to snake case.
"},{"location":"api_reference/spark/transformations/camel_to_snake.html#koheesio.spark.transformations.camel_to_snake.camel_to_snake_re","title":"koheesio.spark.transformations.camel_to_snake.camel_to_snake_re module-attribute
","text":"camel_to_snake_re = compile('([a-z0-9])([A-Z])')\n
"},{"location":"api_reference/spark/transformations/camel_to_snake.html#koheesio.spark.transformations.camel_to_snake.CamelToSnakeTransformation","title":"koheesio.spark.transformations.camel_to_snake.CamelToSnakeTransformation","text":"Converts column names from camel case to snake cases
Parameters:
Name Type Description Default columns
Optional[ListOfColumns]
The column or columns to convert. If no columns are specified, all columns will be converted. A list of columns or a single column can be specified. For example: [\"column1\", \"column2\"]
or \"column1\"
None
Example input_df:
camelCaseColumn snake_case_column ... ... output_df = CamelToSnakeTransformation(column=\"camelCaseColumn\").transform(input_df)\n
output_df:
camel_case_column snake_case_column ... ... In this example, the column camelCaseColumn
is converted to camel_case_column
.
Note: the data in the columns is not changed, only the column names.
"},{"location":"api_reference/spark/transformations/camel_to_snake.html#koheesio.spark.transformations.camel_to_snake.CamelToSnakeTransformation.columns","title":"columns class-attribute
instance-attribute
","text":"columns: Optional[ListOfColumns] = Field(default='', alias='column', description=\"The column or columns to convert. If no columns are specified, all columns will be converted. A list of columns or a single column can be specified. For example: `['column1', 'column2']` or `'column1'` \")\n
"},{"location":"api_reference/spark/transformations/camel_to_snake.html#koheesio.spark.transformations.camel_to_snake.CamelToSnakeTransformation.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/transformations/camel_to_snake.py
def execute(self):\n _df = self.df\n\n # Prepare columns input:\n columns = self.df.columns if self.columns == [\"*\"] else self.columns\n\n for column in columns:\n _df = _df.withColumnRenamed(column, convert_camel_to_snake(column))\n\n self.output.df = _df\n
"},{"location":"api_reference/spark/transformations/camel_to_snake.html#koheesio.spark.transformations.camel_to_snake.convert_camel_to_snake","title":"koheesio.spark.transformations.camel_to_snake.convert_camel_to_snake","text":"convert_camel_to_snake(name: str)\n
Converts a string from camelCase to snake_case.
Parameters: name : str The string to be converted.
Returns: str The converted string in snake_case.
Source code in src/koheesio/spark/transformations/camel_to_snake.py
def convert_camel_to_snake(name: str):\n \"\"\"\n Converts a string from camelCase to snake_case.\n\n Parameters:\n ----------\n name : str\n The string to be converted.\n\n Returns:\n --------\n str\n The converted string in snake_case.\n \"\"\"\n return camel_to_snake_re.sub(r\"\\1_\\2\", name).lower()\n
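For example, the regex inserts an underscore between a lowercase letter or digit and the uppercase letter that follows it, and then lower-cases the result (illustrative inputs):
>>> convert_camel_to_snake(\"camelCaseColumn\")\n'camel_case_column'\n>>> convert_camel_to_snake(\"address2Line\")\n'address2_line'\n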
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html","title":"Cast to datatype","text":"Transformations to cast a column or set of columns to a given datatype.
Each one of these has been vetted to throw warnings when wrong datatypes are passed (so that no job or pipeline errors out).
Furthermore, detailed tests have been added to ensure that types are actually compatible as prescribed.
Concept - One can use the CastToDatatype class directly, or use one of the more specific subclasses.
- Each subclass is compatible with 'run_for_all' - meaning that if no column or columns are specified, it will be run for all compatible data types.
- Compatible Data types are specified in the ColumnConfig class of each subclass, and are documented in the docstring of each subclass.
See class docstrings for more information
Note Dates, Arrays and Maps are not supported by this module.
- for dates, use the koheesio.spark.transformations.date_time module
- for arrays, use the koheesio.spark.transformations.arrays module
Classes:
Name Description CastToDatatype:
Cast a column or set of columns to a given datatype
CastToByte
Cast to Byte (a.k.a. tinyint)
CastToShort
Cast to Short (a.k.a. smallint)
CastToInteger
Cast to Integer (a.k.a. int)
CastToLong
Cast to Long (a.k.a. bigint)
CastToFloat
Cast to Float (a.k.a. real)
CastToDouble
Cast to Double
CastToDecimal
Cast to Decimal (a.k.a. decimal, numeric, dec, BigDecimal)
CastToString
Cast to String
CastToBinary
Cast to Binary (a.k.a. byte array)
CastToBoolean
Cast to Boolean
CastToTimestamp
Cast to Timestamp
Note The following parameters are common to all classes in this module:
Parameters:
Name Type Description Default columns
ListOfColumns
Name of the source column(s). Alias: column
required target_column
str
Name of the target column or alias if more than one column is specified. Alias: target_alias
required datatype
str or SparkDatatype
Datatype to cast to. Choose from SparkDatatype Enum (only needs to be specified in CastToDatatype, all other classes have a fixed datatype)
required"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBinary","title":"koheesio.spark.transformations.cast_to_datatype.CastToBinary","text":"Cast to Binary (a.k.a. byte array)
Unsupported datatypes: Following casts are not supported
will raise an error in Spark:
- float
- double
- decimal
- boolean
- timestamp
- date
- array<...>
- map<...,...>
Supported datatypes: Following casts are supported:
- byte
- short
- integer
- long
- string
Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBinary.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = BINARY\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBinary.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToBinary class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBinary.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBinary.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, STRING]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBoolean","title":"koheesio.spark.transformations.cast_to_datatype.CastToBoolean","text":"Cast to Boolean
Unsupported datatypes: Following casts are not supported
- binary
- array<...>
- map<...,...>
Supported datatypes: Following casts are supported:
- byte
- short
- integer
- long
- float
- double
- decimal
- timestamp
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string converts to null
- date converts to null
- void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBoolean.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = BOOLEAN\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBoolean.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToBoolean class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBoolean.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBoolean.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, TIMESTAMP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToByte","title":"koheesio.spark.transformations.cast_to_datatype.CastToByte","text":"Cast to Byte (a.k.a. tinyint)
Represents 1-byte signed integer numbers. The range of numbers is from -128 to 127.
Unsupported datatypes: Following casts are not supported and will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes: Following casts are supported:
- boolean
- timestamp
- decimal
- double
- float
- long
- integer
- short
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string converts to null
- timestamp range of values too small for timestamp to have any meaning
- date converts to null
- void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToByte.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = BYTE\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToByte.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToByte class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToByte.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, STRING, TIMESTAMP, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToByte.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [SHORT, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, BOOLEAN]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDatatype","title":"koheesio.spark.transformations.cast_to_datatype.CastToDatatype","text":"Cast a column or set of columns to a given datatype
Wrapper around pyspark.sql.Column.cast
Concept This class acts as the base class for all the other CastTo* classes. It is compatible with 'run_for_all' - meaning that if no column or columns are specified, it will be run for all compatible data types.
Example input_df:
c1 c2 1 2 3 4 output_df = CastToDatatype(\n column=\"c1\",\n datatype=\"string\",\n target_alias=\"c1\",\n).transform(input_df)\n
output_df:
c1 c2 \"1\" 2 \"3\" 4 In the example above, the column c1
is cast to a string datatype. The column c2
is not affected.
Parameters:
Name Type Description Default columns
ListOfColumns
Name of the source column(s). Alias: column
required datatype
str or SparkDatatype
Datatype to cast to. Choose from SparkDatatype Enum
required target_column
str
Name of the target column or alias if more than one column is specified. Alias: target_alias
required"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDatatype.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = Field(default=..., description='Datatype. Choose from SparkDatatype Enum')\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDatatype.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/cast_to_datatype.py
def func(self, column: Column) -> Column:\n # This is to let the IDE explicitly know that the datatype is not a string, but a `SparkDatatype` Enum\n datatype: SparkDatatype = self.datatype\n return column.cast(datatype.spark_type())\n
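A compact, runnable version of the example above (assuming an active SparkSession named spark):
>>> from koheesio.spark.transformations.cast_to_datatype import CastToDatatype\n>>> input_df = spark.createDataFrame([(1, 2), (3, 4)], [\"c1\", \"c2\"])\n>>> output_df = CastToDatatype(column=\"c1\", datatype=\"string\", target_alias=\"c1\").transform(input_df)\n>>> output_df.printSchema()  # c1 is now a string column; c2 is unaffected\n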
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDatatype.validate_datatype","title":"validate_datatype","text":"validate_datatype(datatype_value) -> SparkDatatype\n
Validate the datatype.
Source code in src/koheesio/spark/transformations/cast_to_datatype.py
@field_validator(\"datatype\")\ndef validate_datatype(cls, datatype_value) -> SparkDatatype:\n \"\"\"Validate the datatype.\"\"\"\n # handle string input\n try:\n if isinstance(datatype_value, str):\n datatype_value: SparkDatatype = SparkDatatype.from_string(datatype_value)\n return datatype_value\n\n # and let SparkDatatype handle the rest\n datatype_value: SparkDatatype = SparkDatatype.from_string(datatype_value.value)\n\n except AttributeError as e:\n raise AttributeError(f\"Invalid datatype: {datatype_value}\") from e\n\n return datatype_value\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal","title":"koheesio.spark.transformations.cast_to_datatype.CastToDecimal","text":"Cast to Decimal (a.k.a. decimal, numeric, dec, BigDecimal)
Represents arbitrary-precision signed decimal numbers. Backed internally by java.math.BigDecimal
. A BigDecimal consists of an arbitrary precision integer unscaled value and a 32-bit integer scale.
The DecimalType must have a fixed precision (the maximum total number of digits) and scale (the number of digits to the right of the decimal point). For example, (5, 2) can support values from -999.99 to 999.99.
The precision can be up to 38; the scale must be less than or equal to the precision.
Spark sets the default precision and scale to (10, 0) when creating a DecimalType. However, when inferring schema from decimal.Decimal objects, it will be DecimalType(38, 18).
For compatibility reasons, Koheesio sets the default precision and scale to (38, 18) when creating a DecimalType.
Unsupported datatypes: Following casts are not supported and will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes: Following casts are supported:
- byte
- short
- integer
- long
- float
- double
- boolean
- timestamp
- date
- string
- void
- decimal spark will convert existing decimals to null if the precision and scale doesn't fit the data
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string converts to null
- date converts to null
- void skipped by default
Parameters:
Name Type Description Default columns
ListOfColumns
Name of the source column(s). Alias: column
*
target_column
str
Name of the target column or alias if more than one column is specified. Alias: target_alias
required precision
conint(gt=0, le=38)
the maximum (i.e. total) number of digits (default: 38). Must be > 0.
38
scale
conint(ge=0, le=18)
the number of digits on right side of dot. (default: 18). Must be >= 0.
18
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = DECIMAL\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.precision","title":"precision class-attribute
instance-attribute
","text":"precision: conint(gt=0, le=38) = Field(default=38, description='The maximum total number of digits (precision) of the decimal. Must be > 0. Default is 38')\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.scale","title":"scale class-attribute
instance-attribute
","text":"scale: conint(ge=0, le=18) = Field(default=18, description='The number of digits to the right of the decimal point (scale). Must be >= 0. Default is 18')\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToDecimal class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, BOOLEAN, TIMESTAMP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/cast_to_datatype.py
def func(self, column: Column) -> Column:\n return column.cast(self.datatype.spark_type(precision=self.precision, scale=self.scale))\n
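A brief sketch showing an explicit precision and scale (hypothetical data, assuming an active SparkSession named spark):
>>> from koheesio.spark.transformations.cast_to_datatype import CastToDecimal\n>>> df = spark.createDataFrame([(1, 123.456)], [\"id\", \"amount\"])\n>>> output_df = CastToDecimal(column=\"amount\", target_column=\"amount_dec\", precision=10, scale=2).transform(df)\n>>> output_df.printSchema()  # amount_dec is decimal(10,2)\n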
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.validate_scale_and_precisions","title":"validate_scale_and_precisions","text":"validate_scale_and_precisions()\n
Validate the precision and scale values.
Source code in src/koheesio/spark/transformations/cast_to_datatype.py
@model_validator(mode=\"after\")\ndef validate_scale_and_precisions(self):\n \"\"\"Validate the precision and scale values.\"\"\"\n precision_value = self.precision\n scale_value = self.scale\n\n if scale_value == precision_value:\n self.log.warning(\"scale and precision are equal, this will result in a null value\")\n if scale_value > precision_value:\n raise ValueError(\"scale must be < precision\")\n\n return self\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDouble","title":"koheesio.spark.transformations.cast_to_datatype.CastToDouble","text":"Cast to Double
Represents 8-byte double-precision floating point numbers. The range of numbers is from -1.7976931348623157E308 to 1.7976931348623157E308.
Unsupported datatypes: Following casts are not supported and will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes: Following casts are supported:
- byte
- short
- integer
- long
- float
- decimal
- boolean
- timestamp
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string converts to null
- date converts to null
- void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDouble.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = DOUBLE\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDouble.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToDouble class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDouble.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDouble.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, FLOAT, DECIMAL, BOOLEAN, TIMESTAMP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToFloat","title":"koheesio.spark.transformations.cast_to_datatype.CastToFloat","text":"Cast to Float (a.k.a. real)
Represents 4-byte single-precision floating point numbers. The range of numbers is from -3.402823E38 to 3.402823E38.
Unsupported datatypes: Following casts are not supported and will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes: Following casts are supported:
- byte
- short
- integer
- long
- double
- decimal
- boolean
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- timestamp precision is lost (use CastToDouble instead)
- string converts to null
- date converts to null
- void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToFloat.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = FLOAT\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToFloat.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToFloat class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToFloat.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToFloat.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, DOUBLE, DECIMAL, BOOLEAN]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToInteger","title":"koheesio.spark.transformations.cast_to_datatype.CastToInteger","text":"Cast to Integer (a.k.a. int)
Represents 4-byte signed integer numbers. The range of numbers is from -2147483648 to 2147483647.
Unsupported datatypes: Following casts are not supported and will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes: Following casts are supported:
- byte
- short
- long
- float
- double
- decimal
- boolean
- timestamp
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string converts to null
- date converts to null
- void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToInteger.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = INTEGER\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToInteger.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToInteger class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToInteger.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToInteger.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, LONG, FLOAT, DOUBLE, DECIMAL, BOOLEAN, TIMESTAMP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToLong","title":"koheesio.spark.transformations.cast_to_datatype.CastToLong","text":"Cast to Long (a.k.a. bigint)
Represents 8-byte signed integer numbers. The range of numbers is from -9223372036854775808 to 9223372036854775807.
Unsupported datatypes: Following casts are not supported and will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes: Following casts are supported:
- byte
- short
- long
- float
- double
- decimal
- boolean
- timestamp
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string converts to null
- date converts to null
- void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToLong.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = LONG\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToLong.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToLong class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToLong.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToLong.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, INTEGER, FLOAT, DOUBLE, DECIMAL, BOOLEAN, TIMESTAMP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToShort","title":"koheesio.spark.transformations.cast_to_datatype.CastToShort","text":"Cast to Short (a.k.a. smallint)
Represents 2-byte signed integer numbers. The range of numbers is from -32768 to 32767.
Unsupported datatypes: Following casts are not supported and will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes: Following casts are supported:
- byte
- integer
- long
- float
- double
- decimal
- string
- boolean
- timestamp
- date
- void
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string converts to null
- timestamp range of values too small for timestamp to have any meaning
- date converts to null
- void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToShort.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = SHORT\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToShort.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToShort class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToShort.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, STRING, TIMESTAMP, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToShort.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, BOOLEAN]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToString","title":"koheesio.spark.transformations.cast_to_datatype.CastToString","text":"Cast to String
Supported datatypes: Following casts are supported:
- byte
- short
- integer
- long
- float
- double
- decimal
- binary
- boolean
- timestamp
- date
- array
- map
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToString.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = STRING\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToString.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToString class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToString.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToString.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, BINARY, BOOLEAN, TIMESTAMP, DATE, ARRAY, MAP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToTimestamp","title":"koheesio.spark.transformations.cast_to_datatype.CastToTimestamp","text":"Cast to Timestamp
A numeric timestamp is the number of seconds since January 1, 1970 00:00:00.000000000 UTC. Not advised for small integers, as the range of values is too small for a timestamp to have any meaning.
For more fine-grained control over the timestamp format, use the date_time
module. This allows for parsing strings to timestamps and vice versa.
See Also - koheesio.spark.transformations.date_time
- https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#timestamp-pattern
Unsupported datatypes: Following casts are not supported
- binary
- array<...>
- map<...,...>
Supported datatypes: Following casts are supported:
- integer
- long
- float
- double
- decimal
- date
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- boolean: range of values too small for timestamp to have any meaning
- byte: range of values too small for timestamp to have any meaning
- string: converts to null in most cases, use
date_time
module instead - short: range of values too small for timestamp to have any meaning
- void: skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToTimestamp.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = TIMESTAMP\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToTimestamp.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToTimestamp class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToTimestamp.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, BOOLEAN, BYTE, SHORT, STRING, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToTimestamp.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, DATE]\n
"},{"location":"api_reference/spark/transformations/drop_column.html","title":"Drop column","text":"This module defines the DropColumn class, a subclass of ColumnsTransformation.
"},{"location":"api_reference/spark/transformations/drop_column.html#koheesio.spark.transformations.drop_column.DropColumn","title":"koheesio.spark.transformations.drop_column.DropColumn","text":"Drop one or more columns
The DropColumn class is used to drop one or more columns from a PySpark DataFrame. It wraps the pyspark.DataFrame.drop
function and can handle either a single string or a list of strings as input.
If a specified column does not exist in the DataFrame, no error or warning is thrown, and all existing columns will remain.
Expected behavior - When the
column
does not exist, all columns will remain (no error or warning is thrown) - Either a single string, or a list of strings can be specified
Example df:
product amount country Banana lemon orange 1000 USA Carrots Blueberries 1500 USA Beans 1600 USA output_df = DropColumn(column=\"product\").transform(df)\n
output_df:
amount country 1000 USA 1500 USA 1600 USA In this example, the product
column is dropped from the DataFrame df
.
"},{"location":"api_reference/spark/transformations/drop_column.html#koheesio.spark.transformations.drop_column.DropColumn.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/transformations/drop_column.py
def execute(self):\n self.log.info(f\"{self.column=}\")\n self.output.df = self.df.drop(*self.columns)\n
"},{"location":"api_reference/spark/transformations/dummy.html","title":"Dummy","text":"Dummy transformation for testing purposes.
"},{"location":"api_reference/spark/transformations/dummy.html#koheesio.spark.transformations.dummy.DummyTransformation","title":"koheesio.spark.transformations.dummy.DummyTransformation","text":"Dummy transformation for testing purposes.
This transformation adds a new column hello
to the DataFrame with the value world
.
It is intended for testing purposes or for use in examples or reference documentation.
Example input_df:
id 1 output_df = DummyTransformation().transform(input_df)\n
output_df:
id hello 1 world In this example, the hello
column is added to the DataFrame input_df
.
"},{"location":"api_reference/spark/transformations/dummy.html#koheesio.spark.transformations.dummy.DummyTransformation.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/transformations/dummy.py
def execute(self):\n self.output.df = self.df.withColumn(\"hello\", lit(\"world\"))\n
"},{"location":"api_reference/spark/transformations/get_item.html","title":"Get item","text":"Transformation to wrap around the pyspark getItem function
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem","title":"koheesio.spark.transformations.get_item.GetItem","text":"Get item from list or map (dictionary)
Wrapper around pyspark.sql.functions.getItem
GetItem
is strict about the data type of the column. If the column is not a list or a map, an error will be raised.
Note Only MapType and ArrayType are supported.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to get the item from. Alias: column
required target_column
Optional[str]
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
None
key
Union[int, str]
The key (or index) to get from the list or map. If the column is a list (ArrayType), this should be an integer. If the column is a dict (MapType), this should be a string. Alias: index
required Example"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem--example-with-list-arraytype","title":"Example with list (ArrayType)","text":"By specifying an integer for the parameter \"key\", getItem knows to get the element at index n of a list (index starts at 0).
input_df:
id content 1 [1, 2, 3] 2 [4, 5] 3 [6] 4 [] output_df = GetItem(\n column=\"content\",\n index=1, # get the second element of the list\n target_column=\"item\",\n).transform(input_df)\n
output_df:
id content item 1 [1, 2, 3] 2 2 [4, 5] 5 3 [6] null 4 [] null"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem--example-with-a-dict-maptype","title":"Example with a dict (MapType)","text":"input_df:
id content 1 {key1 -> value1} 2 {key1 -> value2} 3 {key2 -> hello} 4 {key2 -> world} output_df = GetItem(\n column=\"content\",\n key=\"key2\",\n target_column=\"item\",\n).transform(input_df)\n
As we request the key to be \"key2\", the first 2 rows will be null, because they do not have \"key2\". output_df:
id content item 1 {key1 -> value1} null 2 {key1 -> value2} null 3 {key2 -> hello} hello 4 {key2 -> world} world"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.key","title":"key class-attribute
instance-attribute
","text":"key: Union[int, str] = Field(default=..., alias='index', description='The key (or index) to get from the list or map. If the column is a list (ArrayType), this should be an integer. If the column is a dict (MapType), this should be a string. Alias: index')\n
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.ColumnConfig","title":"ColumnConfig","text":"Limit the data types to ArrayType and MapType.
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.ColumnConfig.data_type_strict_mode","title":"data_type_strict_mode class-attribute
instance-attribute
","text":"data_type_strict_mode = True\n
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = run_for_all_data_type\n
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [ARRAY, MAP]\n
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/get_item.py
def func(self, column: Column) -> Column:\n return get_item(column, self.key)\n
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.get_item","title":"koheesio.spark.transformations.get_item.get_item","text":"get_item(column: Column, key: Union[str, int])\n
Wrapper around pyspark.sql.functions.getItem
Parameters:
Name Type Description Default column
Column
The column to get the item from
required key
Union[str, int]
The key (or index) to get from the list or map. If the column is a list (ArrayType), this should be an integer. If the column is a dict (MapType), this should be a string.
required Returns:
Type Description Column
The column with the item
Source code in src/koheesio/spark/transformations/get_item.py
def get_item(column: Column, key: Union[str, int]):\n \"\"\"\n Wrapper around pyspark.sql.functions.getItem\n\n Parameters\n ----------\n column : Column\n The column to get the item from\n key : Union[str, int]\n The key (or index) to get from the list or map. If the column is a list (ArrayType), this should be an integer.\n If the column is a dict (MapType), this should be a string.\n\n Returns\n -------\n Column\n The column with the item\n \"\"\"\n return column.getItem(key)\n
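For example (illustrative data, assuming an active SparkSession named spark):
>>> from pyspark.sql import functions as F\n>>> from koheesio.spark.transformations.get_item import get_item\n>>> df = spark.createDataFrame([(1, [\"a\", \"b\", \"c\"])], [\"id\", \"letters\"])\n>>> df.select(get_item(F.col(\"letters\"), 1).alias(\"second\")).show()  # \"b\" (index starts at 0)\n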
"},{"location":"api_reference/spark/transformations/hash.html","title":"Hash","text":"Module for hashing data using SHA-2 family of hash functions
See the docstring of the Sha2Hash class for more information.
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.HASH_ALGORITHM","title":"koheesio.spark.transformations.hash.HASH_ALGORITHM module-attribute
","text":"HASH_ALGORITHM = Literal[224, 256, 384, 512]\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.STRING","title":"koheesio.spark.transformations.hash.STRING module-attribute
","text":"STRING = STRING\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.Sha2Hash","title":"koheesio.spark.transformations.hash.Sha2Hash","text":"hash the value of 1 or more columns using SHA-2 family of hash functions
Mild wrapper around pyspark.sql.functions.sha2
- https://spark.apache.org/docs/3.3.2/api/python/reference/api/pyspark.sql.functions.sha2.html
Returns the hex string result of SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512).
Note This function allows concatenating the values of multiple columns together prior to hashing.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to hash. Alias: column
required delimiter
Optional[str]
Optional separator for the string that will eventually be hashed. Defaults to '|'
|
num_bits
Optional[HASH_ALGORITHM]
Algorithm to use for sha2 hash. Defaults to 256. Should be one of 224, 256, 384, 512
256
target_column
str
The generated hash will be written to the column name specified here
required"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.Sha2Hash.delimiter","title":"delimiter class-attribute
instance-attribute
","text":"delimiter: Optional[str] = Field(default='|', description=\"Optional separator for the string that will eventually be hashed. Defaults to '|'\")\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.Sha2Hash.num_bits","title":"num_bits class-attribute
instance-attribute
","text":"num_bits: Optional[HASH_ALGORITHM] = Field(default=256, description='Algorithm to use for sha2 hash. Defaults to 256. Should be one of 224, 256, 384, 512')\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.Sha2Hash.target_column","title":"target_column class-attribute
instance-attribute
","text":"target_column: str = Field(default=..., description='The generated hash will be written to the column name specified here')\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.Sha2Hash.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/transformations/hash.py
def execute(self):\n columns = list(self.get_columns())\n self.output.df = (\n self.df.withColumn(\n self.target_column, sha2_hash(columns=columns, delimiter=self.delimiter, num_bits=self.num_bits)\n )\n if columns\n else self.df\n )\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.sha2_hash","title":"koheesio.spark.transformations.hash.sha2_hash","text":"sha2_hash(columns: List[str], delimiter: Optional[str] = '|', num_bits: Optional[HASH_ALGORITHM] = 256)\n
hash the value of 1 or more columns using SHA-2 family of hash functions
Mild wrapper around pyspark.sql.functions.sha2
- https://spark.apache.org/docs/3.3.2/api/python/reference/api/pyspark.sql.functions.sha2.html
Returns the hex string result of SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512). This function allows concatenating the values of multiple columns together prior to hashing.
If a null is passed, the result will also be null.
Parameters:
Name Type Description Default columns
List[str]
The columns to hash
required delimiter
Optional[str]
Optional separator for the string that will eventually be hashed. Defaults to '|'
|
num_bits
Optional[HASH_ALGORITHM]
Algorithm to use for sha2 hash. Defaults to 256. Should be one of 224, 256, 384, 512
256
Source code in src/koheesio/spark/transformations/hash.py
def sha2_hash(columns: List[str], delimiter: Optional[str] = \"|\", num_bits: Optional[HASH_ALGORITHM] = 256):\n \"\"\"\n hash the value of 1 or more columns using SHA-2 family of hash functions\n\n Mild wrapper around pyspark.sql.functions.sha2\n\n - https://spark.apache.org/docs/3.3.2/api/python/reference/api/pyspark.sql.functions.sha2.html\n\n Returns the hex string result of SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512).\n This function allows concatenating the values of multiple columns together prior to hashing.\n\n If a null is passed, the result will also be null.\n\n Parameters\n ----------\n columns : List[str]\n The columns to hash\n delimiter : Optional[str], optional, default=|\n Optional separator for the string that will eventually be hashed. Defaults to '|'\n num_bits : Optional[HASH_ALGORITHM], optional, default=256\n Algorithm to use for sha2 hash. Defaults to 256. Should be one of 224, 256, 384, 512\n \"\"\"\n # make sure all columns are of type pyspark.sql.Column and cast to string\n _columns = []\n for c in columns:\n if isinstance(c, str):\n c: Column = col(c)\n _columns.append(c.cast(STRING.spark_type()))\n\n # concatenate columns if more than 1 column is provided\n if len(_columns) > 1:\n column = concat_ws(delimiter, *_columns)\n else:\n column = _columns[0]\n\n return sha2(column, num_bits)\n
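As a usage sketch only (the column names and input_df are hypothetical), the Sha2Hash transformation built on top of this function could be applied like this: from koheesio.spark.transformations.hash import Sha2Hash\n\n# concatenate first_name and last_name with '|' and store the SHA-256 hex digest in name_hash\noutput_df = Sha2Hash(\n    columns=[\"first_name\", \"last_name\"],\n    delimiter=\"|\",\n    num_bits=256,\n    target_column=\"name_hash\",\n).transform(input_df)\n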
"},{"location":"api_reference/spark/transformations/lookup.html","title":"Lookup","text":"Lookup transformation for joining two dataframes together
Classes:
Name Description JoinMapping
TargetColumn
JoinType
JoinHint
DataframeLookup
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup","title":"koheesio.spark.transformations.lookup.DataframeLookup","text":"Lookup transformation for joining two dataframes together
Parameters:
Name Type Description Default df
DataFrame
The left Spark DataFrame
required other
DataFrame
The right Spark DataFrame
required on
List[JoinMapping] | JoinMapping
List of join mappings. If only one mapping is passed, it can be passed as a single object.
required targets
List[TargetColumn] | TargetColumn
List of target columns. If only one target is passed, it can be passed as a single object.
required how
JoinType
What type of join to perform. Defaults to left. See JoinType for more information.
required hint
JoinHint
What type of join hint to use. Defaults to None. See JoinHint for more information.
required Example from pyspark.sql import SparkSession\nfrom koheesio.spark.transformations.lookup import (\n DataframeLookup,\n JoinMapping,\n TargetColumn,\n JoinType,\n)\n\nspark = SparkSession.builder.getOrCreate()\n\n# create the dataframes\nleft_df = spark.createDataFrame([(1, \"A\"), (2, \"B\")], [\"id\", \"value\"])\nright_df = spark.createDataFrame([(1, \"A\"), (3, \"C\")], [\"id\", \"value\"])\n\n# perform the lookup\nlookup = DataframeLookup(\n df=left_df,\n other=right_df,\n on=JoinMapping(source_column=\"id\", joined_column=\"id\"),\n targets=TargetColumn(target_column=\"value\", target_column_alias=\"right_value\"),\n how=JoinType.LEFT,\n)\n\noutput_df = lookup.transform()\n
output_df:
id value right_value 1 A A 2 B null In this example, the left_df
and right_df
dataframes are joined together using the id
column. The value
column from the right_df
is aliased as right_value
in the output dataframe.
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.df","title":"df class-attribute
instance-attribute
","text":"df: DataFrame = Field(default=None, description='The left Spark DataFrame')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.hint","title":"hint class-attribute
instance-attribute
","text":"hint: Optional[JoinHint] = Field(default=None, description='What type of join hint to use. Defaults to None. ' + __doc__)\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.how","title":"how class-attribute
instance-attribute
","text":"how: Optional[JoinType] = Field(default=LEFT, description='What type of join to perform. Defaults to left. ' + __doc__)\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.on","title":"on class-attribute
instance-attribute
","text":"on: Union[List[JoinMapping], JoinMapping] = Field(default=..., alias='join_mapping', description='List of join mappings. If only one mapping is passed, it can be passed as a single object.')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.other","title":"other class-attribute
instance-attribute
","text":"other: DataFrame = Field(default=None, description='The right Spark DataFrame')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.targets","title":"targets class-attribute
instance-attribute
","text":"targets: Union[List[TargetColumn], TargetColumn] = Field(default=..., alias='target_columns', description='List of target columns. If only one target is passed, it can be passed as a single object.')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.Output","title":"Output","text":"Output for the lookup transformation
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.Output.left_df","title":"left_df class-attribute
instance-attribute
","text":"left_df: DataFrame = Field(default=..., description='The left Spark DataFrame')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.Output.right_df","title":"right_df class-attribute
instance-attribute
","text":"right_df: DataFrame = Field(default=..., description='The right Spark DataFrame')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.execute","title":"execute","text":"execute() -> Output\n
Execute the lookup transformation
Source code in src/koheesio/spark/transformations/lookup.py
def execute(self) -> Output:\n \"\"\"Execute the lookup transformation\"\"\"\n # prepare the right dataframe\n prepared_right_df = self.get_right_df().select(\n *[join_mapping.column for join_mapping in self.on],\n *[target.column for target in self.targets],\n )\n if self.hint:\n prepared_right_df = prepared_right_df.hint(self.hint)\n\n # generate the output\n self.output.left_df = self.df\n self.output.right_df = prepared_right_df\n self.output.df = self.df.join(\n prepared_right_df,\n on=[join_mapping.source_column for join_mapping in self.on],\n how=self.how,\n )\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.get_right_df","title":"get_right_df","text":"get_right_df() -> DataFrame\n
Get the right side dataframe
Source code in src/koheesio/spark/transformations/lookup.py
def get_right_df(self) -> DataFrame:\n \"\"\"Get the right side dataframe\"\"\"\n return self.other\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.set_list","title":"set_list","text":"set_list(value)\n
Ensure that we can pass either a single object, or a list of objects
Source code in src/koheesio/spark/transformations/lookup.py
@field_validator(\"on\", \"targets\")\ndef set_list(cls, value):\n \"\"\"Ensure that we can pass either a single object, or a list of objects\"\"\"\n return [value] if not isinstance(value, list) else value\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinHint","title":"koheesio.spark.transformations.lookup.JoinHint","text":"Supported join hints
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinHint.BROADCAST","title":"BROADCAST class-attribute
instance-attribute
","text":"BROADCAST = 'broadcast'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinHint.MERGE","title":"MERGE class-attribute
instance-attribute
","text":"MERGE = 'merge'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinMapping","title":"koheesio.spark.transformations.lookup.JoinMapping","text":"Mapping for joining two dataframes together
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinMapping.column","title":"column property
","text":"column: Column\n
Get the join mapping as a pyspark.sql.Column object
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinMapping.other_column","title":"other_column instance-attribute
","text":"other_column: str\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinMapping.source_column","title":"source_column instance-attribute
","text":"source_column: str\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType","title":"koheesio.spark.transformations.lookup.JoinType","text":"Supported join types
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.ANTI","title":"ANTI class-attribute
instance-attribute
","text":"ANTI = 'anti'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.CROSS","title":"CROSS class-attribute
instance-attribute
","text":"CROSS = 'cross'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.FULL","title":"FULL class-attribute
instance-attribute
","text":"FULL = 'full'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.INNER","title":"INNER class-attribute
instance-attribute
","text":"INNER = 'inner'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.LEFT","title":"LEFT class-attribute
instance-attribute
","text":"LEFT = 'left'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.RIGHT","title":"RIGHT class-attribute
instance-attribute
","text":"RIGHT = 'right'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.SEMI","title":"SEMI class-attribute
instance-attribute
","text":"SEMI = 'semi'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.TargetColumn","title":"koheesio.spark.transformations.lookup.TargetColumn","text":"Target column for the joined dataframe
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.TargetColumn.column","title":"column property
","text":"column: Column\n
Get the target column as a pyspark.sql.Column object
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.TargetColumn.target_column","title":"target_column instance-attribute
","text":"target_column: str\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.TargetColumn.target_column_alias","title":"target_column_alias instance-attribute
","text":"target_column_alias: str\n
"},{"location":"api_reference/spark/transformations/repartition.html","title":"Repartition","text":"Repartition Transformation
"},{"location":"api_reference/spark/transformations/repartition.html#koheesio.spark.transformations.repartition.Repartition","title":"koheesio.spark.transformations.repartition.Repartition","text":"Wrapper around DataFrame.repartition
With repartition, the number of partitions can be given as an optional value. If this is not provided, a default value is used. The default number of partitions is defined by the spark config 'spark.sql.shuffle.partitions', for which the default value is 200; the resulting number of partitions will never exceed the number of rows in the DataFrame (whichever value is lower).
If columns are omitted, the entire DataFrame is repartitioned without considering the particular values in the columns.
Parameters:
Name Type Description Default column
Optional[Union[str, List[str]]]
Name of the source column(s). If omitted, the entire DataFrame is repartitioned without considering the particular values in the columns. Alias: columns
None
num_partitions
Optional[int]
The number of partitions to repartition to. If omitted, the default number of partitions is used as defined by the spark config 'spark.sql.shuffle.partitions'.
None
Example Repartition(column=[\"c1\", \"c2\"], num_partitions=3) # results in 3 partitions\nRepartition(column=\"c1\", num_partitions=2) # results in 2 partitions\nRepartition(column=[\"c1\", \"c2\"]) # results in <= 200 partitions\nRepartition(num_partitions=5) # results in 5 partitions\n
"},{"location":"api_reference/spark/transformations/repartition.html#koheesio.spark.transformations.repartition.Repartition.columns","title":"columns class-attribute
instance-attribute
","text":"columns: Optional[ListOfColumns] = Field(default='', alias='column', description='Name of the source column(s)')\n
"},{"location":"api_reference/spark/transformations/repartition.html#koheesio.spark.transformations.repartition.Repartition.numPartitions","title":"numPartitions class-attribute
instance-attribute
","text":"numPartitions: Optional[int] = Field(default=None, alias='num_partitions', description=\"The number of partitions to repartition to. If omitted, the default number of partitions is used as defined by the spark config 'spark.sql.shuffle.partitions'.\")\n
"},{"location":"api_reference/spark/transformations/repartition.html#koheesio.spark.transformations.repartition.Repartition.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/transformations/repartition.py
def execute(self):\n # Prepare columns input:\n columns = self.df.columns if self.columns == [\"*\"] else self.columns\n # Prepare repartition input:\n # num_partitions comes first, but if it is not provided it should not be included as None.\n repartition_inputs = [i for i in [self.numPartitions, *columns] if i]\n self.output.df = self.df.repartition(*repartition_inputs)\n
"},{"location":"api_reference/spark/transformations/replace.html","title":"Replace","text":"Transformation to replace a particular value in a column with another one
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace","title":"koheesio.spark.transformations.replace.Replace","text":"Replace a particular value in a column with another one
Can handle empty strings (\"\") as well as NULL / None values.
Unsupported datatypes: Following casts are not supported
will raise an error in Spark:
- binary
- boolean
- array<...>
- map<...,...>
Supported datatypes: The following casts are supported:
- byte
- short
- integer
- long
- float
- double
- decimal
- timestamp
- date
- string
- void skipped by default
Any supported non-string datatype will be cast to string before the replacement is done.
Example input_df:
id string 1 hello 2 world 3 output_df = Replace(\n column=\"string\",\n from_value=\"hello\",\n to_value=\"programmer\",\n).transform(input_df)\n
output_df:
id string 1 programmer 2 world 3 In this example, the value \"hello\" in the column \"string\" is replaced with \"programmer\".
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.from_value","title":"from_value class-attribute
instance-attribute
","text":"from_value: Optional[str] = Field(default=None, alias='from', description=\"The original value that needs to be replaced. If no value is given, all 'null' values will be replaced with the to_value\")\n
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.to_value","title":"to_value class-attribute
instance-attribute
","text":"to_value: str = Field(default=..., alias='to', description='The new value to replace this with')\n
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.ColumnConfig","title":"ColumnConfig","text":"Column type configurations for the column to be replaced
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, VOID]\n
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, STRING, TIMESTAMP, DATE]\n
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/replace.py
def func(self, column: Column) -> Column:\n return replace(column=column, from_value=self.from_value, to_value=self.to_value)\n
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.replace","title":"koheesio.spark.transformations.replace.replace","text":"replace(column: Union[Column, str], to_value: str, from_value: Optional[str] = None)\n
Function to replace a particular value in a column with another one
Source code in src/koheesio/spark/transformations/replace.py
def replace(column: Union[Column, str], to_value: str, from_value: Optional[str] = None):\n \"\"\"Function to replace a particular value in a column with another one\"\"\"\n # make sure we have a Column object\n if isinstance(column, str):\n column = col(column)\n\n if not from_value:\n condition = column.isNull()\n else:\n condition = column == from_value\n\n return when(condition, lit(to_value)).otherwise(column)\n
"},{"location":"api_reference/spark/transformations/row_number_dedup.html","title":"Row number dedup","text":"This module contains the RowNumberDedup class, which performs a row_number deduplication operation on a DataFrame.
See the docstring of the RowNumberDedup class for more information.
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup","title":"koheesio.spark.transformations.row_number_dedup.RowNumberDedup","text":"A class used to perform a row_number deduplication operation on a DataFrame.
This class is a specialized transformation that extends the ColumnsTransformation class. It sorts the DataFrame based on the provided sort columns and assigns a row_number to each row. It then filters the DataFrame to keep only the top-row_number row for each group of duplicates. The row_number of each row can be stored in a specified target column or a default column named \"meta_row_number_column\". The class also provides an option to preserve meta columns (like the row_numberk column) in the output DataFrame.
Attributes:
Name Type Description columns
list
List of columns to apply the transformation to. If a single '*' is passed as a column name or if the columns parameter is omitted, the transformation will be applied to all columns of the data types specified in run_for_all_data_type
of the ColumnConfig. (inherited from ColumnsTransformation)
sort_columns
list
List of columns that the DataFrame will be sorted by.
target_column
(str, optional)
Column where the row_number of each row will be stored.
preserve_meta
(bool, optional)
Flag that determines whether the meta columns should be kept in the output DataFrame.
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.preserve_meta","title":"preserve_meta class-attribute
instance-attribute
","text":"preserve_meta: bool = Field(default=False, description=\"If true, meta columns are kept in output dataframe. Defaults to 'False'\")\n
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.sort_columns","title":"sort_columns class-attribute
instance-attribute
","text":"sort_columns: conlist(Union[str, Column], min_length=0) = Field(default_factory=list, alias='sort_column', description='List of orderBy columns. If only one column is passed, it can be passed as a single object.')\n
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.target_column","title":"target_column class-attribute
instance-attribute
","text":"target_column: Optional[Union[str, Column]] = Field(default='meta_row_number_column', alias='target_suffix', description='The column to store the result in. If not provided, the result will be stored in the sourcecolumn. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix')\n
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.window_spec","title":"window_spec property
","text":"window_spec: WindowSpec\n
Builds a WindowSpec object based on the columns defined in the configuration.
The WindowSpec object is used to define a window frame over which functions are applied in Spark. This method partitions the data by the columns returned by the get_columns
method and then orders the partitions by the columns specified in sort_columns
.
Notes The order of the columns in the WindowSpec object is preserved. If a column is passed as a string, it is converted to a Column object with DESC ordering.
Returns:
Type Description WindowSpec
A WindowSpec object that can be used to define a window frame in Spark.
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.execute","title":"execute","text":"execute() -> Output\n
Performs the row_number deduplication operation on the DataFrame.
This method sorts the DataFrame based on the provided sort columns, assigns a row_number to each row, and then filters the DataFrame to keep only the top-row_number row for each group of duplicates. The row_number of each row is stored in the target column. If preserve_meta is False, the method also drops the target column from the DataFrame.
Source code in src/koheesio/spark/transformations/row_number_dedup.py
def execute(self) -> RowNumberDedup.Output:\n \"\"\"\n Performs the row_number deduplication operation on the DataFrame.\n\n This method sorts the DataFrame based on the provided sort columns, assigns a row_number to each row,\n and then filters the DataFrame to keep only the top-row_number row for each group of duplicates.\n The row_number of each row is stored in the target column. If preserve_meta is False,\n the method also drops the target column from the DataFrame.\n \"\"\"\n df = self.df\n window_spec = self.window_spec\n\n # if target_column is a string, convert it to a Column object\n if isinstance((target_column := self.target_column), str):\n target_column = col(target_column)\n\n # dedup the dataframe based on the window spec\n df = df.withColumn(self.target_column, row_number().over(window_spec)).filter(target_column == 1).select(\"*\")\n\n if not self.preserve_meta:\n df = df.drop(target_column)\n\n self.output.df = df\n
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.set_sort_columns","title":"set_sort_columns","text":"set_sort_columns(columns_value)\n
Validates and optimizes the sort_columns parameter.
This method ensures that sort_columns is a list (or single object) of unique strings or Column objects. It removes any empty strings or None values from the list and deduplicates the columns.
Parameters:
Name Type Description Default columns_value
Union[str, Column, List[Union[str, Column]]]
The value of the sort_columns parameter.
required Returns:
Type Description List[Union[str, Column]]
The optimized and deduplicated list of sort columns.
Source code in src/koheesio/spark/transformations/row_number_dedup.py
@field_validator(\"sort_columns\", mode=\"before\")\ndef set_sort_columns(cls, columns_value):\n \"\"\"\n Validates and optimizes the sort_columns parameter.\n\n This method ensures that sort_columns is a list (or single object) of unique strings or Column objects.\n It removes any empty strings or None values from the list and deduplicates the columns.\n\n Parameters\n ----------\n columns_value : Union[str, Column, List[Union[str, Column]]]\n The value of the sort_columns parameter.\n\n Returns\n -------\n List[Union[str, Column]]\n The optimized and deduplicated list of sort columns.\n \"\"\"\n # Convert single string or Column object to a list\n columns = [columns_value] if isinstance(columns_value, (str, Column)) else [*columns_value]\n\n # Remove empty strings, None, etc.\n columns = [c for c in columns if (isinstance(c, Column) and c is not None) or (isinstance(c, str) and c)]\n\n dedup_columns = []\n seen = set()\n\n # Deduplicate the columns while preserving the order\n for column in columns:\n if str(column) not in seen:\n dedup_columns.append(column)\n seen.add(str(column))\n\n return dedup_columns\n
"},{"location":"api_reference/spark/transformations/sql_transform.html","title":"Sql transform","text":"SQL Transform module
SQL Transform module provides an easy interface to transform a dataframe using SQL. This SQL can originate from a string or a file and may contain placeholders for templating.
"},{"location":"api_reference/spark/transformations/sql_transform.html#koheesio.spark.transformations.sql_transform.SqlTransform","title":"koheesio.spark.transformations.sql_transform.SqlTransform","text":"SQL Transform module provides an easy interface to transform a dataframe using SQL.
This SQL can originate from a string or a file and may contain placeholder (parameters) for templating.
- Placeholders are identified with
${placeholder}
. - Placeholders can be passed as explicit params (params) or as implicit params (kwargs).
Example sql script:
SELECT id, id + 1 AS incremented_id, ${dynamic_column} AS extra_column\nFROM ${table_name}\n
"},{"location":"api_reference/spark/transformations/sql_transform.html#koheesio.spark.transformations.sql_transform.SqlTransform.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/transformations/sql_transform.py
def execute(self):\n table_name = get_random_string(prefix=\"sql_transform\")\n self.params = {**self.params, \"table_name\": table_name}\n\n df = self.df\n df.createOrReplaceTempView(table_name)\n query = self.query\n\n self.output.df = self.spark.sql(query)\n
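As a sketch only: ${table_name} is filled in by execute() with the temporary view it creates, while other placeholders can be passed as keyword arguments. The sql parameter name and the column expression below are assumptions, not part of this page: from koheesio.spark.transformations.sql_transform import SqlTransform\n\n# hypothetical example; placeholders are resolved before the query runs against the temp view\noutput_df = SqlTransform(\n    sql=\"SELECT id, ${dynamic_column} AS extra_column FROM ${table_name}\",\n    dynamic_column=\"id + 1\",\n).transform(input_df)\n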
"},{"location":"api_reference/spark/transformations/transform.html","title":"Transform","text":"Transform module
Transform aims to provide an easy interface for calling transformations on a Spark DataFrame, where the transformation is a function that accepts a DataFrame (df) and any number of keyword args.
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform","title":"koheesio.spark.transformations.transform.Transform","text":"Transform(func: Callable, params: Dict = None, df: DataFrame = None, **kwargs)\n
Transform aims to provide an easy interface for calling transformations on a Spark DataFrame, where the transformation is a function that accepts a DataFrame (df) and any number of keyword args.
The implementation is inspired by and based upon: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.transform.html
Parameters:
Name Type Description Default func
Callable
The function to be called on the DataFrame.
required params
Dict
The keyword arguments to be passed to the function. Defaults to None. Alternatively, keyword arguments can be passed directly as keyword arguments - they will be merged with the params
dictionary.
None
Example Source code in src/koheesio/spark/transformations/transform.py
def __init__(self, func: Callable, params: Dict = None, df: DataFrame = None, **kwargs):\n params = {**(params or {}), **kwargs}\n super().__init__(func=func, params=params, df=df)\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform--a-function-compatible-with-transform","title":"a function compatible with Transform:","text":"def some_func(df, a: str, b: str):\n return df.withColumn(a, f.lit(b))\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform--verbose-style-input-in-transform","title":"verbose style input in Transform","text":"Transform(func=some_func, params={\"a\": \"foo\", \"b\": \"bar\"})\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform--shortened-style-notation-easier-to-read","title":"shortened style notation (easier to read)","text":"Transform(some_func, a=\"foo\", b=\"bar\")\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform--when-too-much-input-is-given-transform-will-ignore-extra-input","title":"when too much input is given, Transform will ignore extra input","text":"Transform(\n some_func,\n a=\"foo\",\n # ignored input\n c=\"baz\",\n title=42,\n author=\"Adams\",\n # order of params input should not matter\n b=\"bar\",\n)\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform--using-the-from_func-classmethod","title":"using the from_func classmethod","text":"SomeFunc = Transform.from_func(some_func, a=\"foo\")\nsome_func = SomeFunc(b=\"bar\")\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform.func","title":"func class-attribute
instance-attribute
","text":"func: Callable = Field(default=None, description='The function to be called on the DataFrame.')\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform.execute","title":"execute","text":"execute()\n
Call the function on the DataFrame with the given keyword arguments.
Source code in src/koheesio/spark/transformations/transform.py
def execute(self):\n \"\"\"Call the function on the DataFrame with the given keyword arguments.\"\"\"\n func, kwargs = get_args_for_func(self.func, self.params)\n self.output.df = self.df.transform(func=func, **kwargs)\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform.from_func","title":"from_func classmethod
","text":"from_func(func: Callable, **kwargs) -> Callable[..., Transform]\n
Create a Transform class from a function. Useful for creating a new class with a different name.
This method uses the functools.partial
function to create a new class with the given function and keyword arguments. This way you can pre-define some of the keyword arguments for the function that might be needed for the specific use case.
Example CustomTransform = Transform.from_func(some_func, a=\"foo\")\nsome_func = CustomTransform(b=\"bar\")\n
In this example, CustomTransform
is a Transform class with the function some_func
and the keyword argument a
set to \"foo\". When calling some_func(b=\"bar\")
, the function some_func
will be called with the keyword arguments a=\"foo\"
and b=\"bar\"
.
Source code in src/koheesio/spark/transformations/transform.py
@classmethod\ndef from_func(cls, func: Callable, **kwargs) -> Callable[..., Transform]:\n \"\"\"Create a Transform class from a function. Useful for creating a new class with a different name.\n\n This method uses the `functools.partial` function to create a new class with the given function and keyword\n arguments. This way you can pre-define some of the keyword arguments for the function that might be needed for\n the specific use case.\n\n Example\n -------\n ```python\n CustomTransform = Transform.from_func(some_func, a=\"foo\")\n some_func = CustomTransform(b=\"bar\")\n ```\n\n In this example, `CustomTransform` is a Transform class with the function `some_func` and the keyword argument\n `a` set to \"foo\". When calling `some_func(b=\"bar\")`, the function `some_func` will be called with the keyword\n arguments `a=\"foo\"` and `b=\"bar\"`.\n \"\"\"\n return partial(cls, func=func, **kwargs)\n
"},{"location":"api_reference/spark/transformations/uuid5.html","title":"Uuid5","text":"Ability to generate UUID5 using native pyspark (no udf)
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5","title":"koheesio.spark.transformations.uuid5.HashUUID5","text":"Generate a UUID with the UUID5 algorithm
Spark does not provide inbuilt API to generate version 5 UUID, hence we have to use a custom implementation to provide this capability.
Prerequisites: this function has no side effects. But be aware that in most cases, the expectation is that your data is clean (e.g. trimmed of leading and trailing spaces)
Concept UUID5 is based on the SHA-1 hash of a namespace identifier (which is a UUID) and a name (which is a string). https://docs.python.org/3/library/uuid.html#uuid.uuid5
Based on https://github.com/MrPowers/quinn/pull/96 with the difference that since Spark 3.0.0 an OVERLAY function from ANSI SQL 2016 is available which saves coding space and string allocation(s) in place of CONCAT + SUBSTRING.
For more info on OVERLAY, see: https://docs.databricks.com/sql/language-manual/functions/overlay.html
Example Input is a DataFrame with two columns:
id string 1 hello 2 world 3 Input parameters:
- source_columns = [\"id\", \"string\"]
- target_column = \"uuid5\"
Result:
id string uuid5 1 hello f3e99bbd-85ae-5dc3-bf6e-cd0022a0ebe6 2 world b48e880f-c289-5c94-b51f-b9d21f9616c0 3 2193a99d-222e-5a0c-a7d6-48fbe78d2708 In code:
HashUUID5(source_columns=[\"id\", \"string\"], target_column=\"uuid5\").transform(input_df)\n
In this example, the id
and string
columns are concatenated and hashed using the UUID5 algorithm. The result is stored in the uuid5
column.
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.delimiter","title":"delimiter class-attribute
instance-attribute
","text":"delimiter: Optional[str] = Field(default='|', description='Separator for the string that will eventually be hashed')\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.description","title":"description class-attribute
instance-attribute
","text":"description: str = 'Generate a UUID with the UUID5 algorithm'\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.extra_string","title":"extra_string class-attribute
instance-attribute
","text":"extra_string: Optional[str] = Field(default='', description='In case of collisions, one can pass an extra string to hash on.')\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.namespace","title":"namespace class-attribute
instance-attribute
","text":"namespace: Optional[Union[str, UUID]] = Field(default='', description='Namespace DNS')\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.source_columns","title":"source_columns class-attribute
instance-attribute
","text":"source_columns: ListOfColumns = Field(default=..., description=\"List of columns that should be hashed. Should contain the name of at least 1 column. A list of columns or a single column can be specified. For example: `['column1', 'column2']` or `'column1'`\")\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.target_column","title":"target_column class-attribute
instance-attribute
","text":"target_column: str = Field(default=..., description='The generated UUID will be written to the column name specified here')\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.execute","title":"execute","text":"execute() -> None\n
Source code in src/koheesio/spark/transformations/uuid5.py
def execute(self) -> None:\n ns = f.lit(uuid5_namespace(self.namespace).bytes)\n self.log.info(f\"UUID5 namespace '{ns}' derived from '{self.namespace}'\")\n cols_to_hash = f.concat_ws(self.delimiter, *self.source_columns)\n cols_to_hash = f.concat(f.lit(self.extra_string), cols_to_hash)\n cols_to_hash = f.encode(cols_to_hash, \"utf-8\")\n cols_to_hash = f.concat(ns, cols_to_hash)\n source_columns_sha1 = f.sha1(cols_to_hash)\n variant_part = f.substring(source_columns_sha1, 17, 4)\n variant_part = f.conv(variant_part, 16, 2)\n variant_part = f.lpad(variant_part, 16, \"0\")\n variant_part = f.overlay(variant_part, f.lit(\"10\"), 1, 2) # RFC 4122 variant.\n variant_part = f.lower(f.conv(variant_part, 2, 16))\n target_col_uuid = f.concat_ws(\n \"-\",\n f.substring(source_columns_sha1, 1, 8),\n f.substring(source_columns_sha1, 9, 4),\n f.concat(f.lit(\"5\"), f.substring(source_columns_sha1, 14, 3)), # Set version.\n variant_part,\n f.substring(source_columns_sha1, 21, 12),\n )\n # Applying the transformation to the input df, storing the result in the column specified in `target_column`.\n self.output.df = self.df.withColumn(self.target_column, target_col_uuid)\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.hash_uuid5","title":"koheesio.spark.transformations.uuid5.hash_uuid5","text":"hash_uuid5(input_value: str, namespace: Optional[Union[str, UUID]] = '', extra_string: Optional[str] = '')\n
pure python implementation of HashUUID5
See: https://docs.python.org/3/library/uuid.html#uuid.uuid5
Parameters:
Name Type Description Default input_value
str
value that will be hashed
required namespace
Optional[str | UUID]
namespace DNS
''
extra_string
Optional[str]
optional extra string that will be prepended to the input_value
''
Returns:
Type Description str
uuid.UUID (uuid5) cast to string
Source code in src/koheesio/spark/transformations/uuid5.py
def hash_uuid5(\n input_value: str,\n namespace: Optional[Union[str, uuid.UUID]] = \"\",\n extra_string: Optional[str] = \"\",\n):\n \"\"\"pure python implementation of HashUUID5\n\n See: https://docs.python.org/3/library/uuid.html#uuid.uuid5\n\n Parameters\n ----------\n input_value : str\n value that will be hashed\n namespace : Optional[str | uuid.UUID]\n namespace DNS\n extra_string : Optional[str]\n optional extra string that will be prepended to the input_value\n\n Returns\n -------\n str\n uuid.UUID (uuid5) cast to string\n \"\"\"\n if not isinstance(namespace, uuid.UUID):\n hashed_namespace = uuid5_namespace(namespace)\n else:\n hashed_namespace = namespace\n return str(uuid.uuid5(hashed_namespace, (extra_string + input_value)))\n
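For illustration, calling the helper directly (the input value and namespace strings are arbitrary): from koheesio.spark.transformations.uuid5 import hash_uuid5\n\n# returns a deterministic UUID5 string for the given input\nuuid_str = hash_uuid5(input_value=\"hello|world\", namespace=\"my-namespace\")\n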
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.uuid5_namespace","title":"koheesio.spark.transformations.uuid5.uuid5_namespace","text":"uuid5_namespace(ns: Optional[Union[str, UUID]]) -> UUID\n
Helper function used to provide a UUID5 hashed namespace based on the passed str
Parameters:
Name Type Description Default ns
Optional[Union[str, UUID]]
A str, an empty string (or None), or an existing UUID can be passed
required Returns:
Type Description UUID
UUID5 hashed namespace
Source code in src/koheesio/spark/transformations/uuid5.py
def uuid5_namespace(ns: Optional[Union[str, uuid.UUID]]) -> uuid.UUID:\n \"\"\"Helper function used to provide a UUID5 hashed namespace based on the passed str\n\n Parameters\n ----------\n ns : Optional[Union[str, uuid.UUID]]\n A str, an empty string (or None), or an existing UUID can be passed\n\n Returns\n -------\n uuid.UUID\n UUID5 hashed namespace\n \"\"\"\n # if we already have a UUID, we just return it\n if isinstance(ns, uuid.UUID):\n return ns\n\n # if ns is empty or none, we simply return the default NAMESPACE_DNS\n if not ns:\n ns = uuid.NAMESPACE_DNS\n return ns\n\n # else we hash the string against the NAMESPACE_DNS\n ns = uuid.uuid5(uuid.NAMESPACE_DNS, ns)\n return ns\n
"},{"location":"api_reference/spark/transformations/date_time/index.html","title":"Date time","text":"Module that holds the transformations that can be used for date and time related operations.
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone","title":"koheesio.spark.transformations.date_time.ChangeTimeZone","text":"Allows for the value of a column to be changed from one timezone to another
Adding useful metadata When add_target_timezone
is enabled (default), an additional column is created documenting which timezone a field has been converted to. Additionally, the suffix added to this column can be customized (default value is _timezone
).
Example Input:
target_column = \"some_column_name\"\ntarget_timezone = \"EST\"\nadd_target_timezone = True # default value\ntimezone_column_suffix = \"_timezone\" # default value\n
Output:
column name = \"some_column_name_timezone\" # notice the suffix\ncolumn value = \"EST\"\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.add_target_timezone","title":"add_target_timezone class-attribute
instance-attribute
","text":"add_target_timezone: bool = Field(default=True, description='Toggles whether the target timezone is added as a column. True by default.')\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.from_timezone","title":"from_timezone class-attribute
instance-attribute
","text":"from_timezone: str = Field(default=..., alias='source_timezone', description='Timezone of the source_column value. Timezone fields are validated against the `TZ database name` column in this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones')\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.target_timezone_column_suffix","title":"target_timezone_column_suffix class-attribute
instance-attribute
","text":"target_timezone_column_suffix: Optional[str] = Field(default='_timezone', alias='suffix', description=\"Allows to customize the suffix that is added to the target_timezone column. Defaults to '_timezone'. Note: this will be ignored if 'add_target_timezone' is set to False\")\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.to_timezone","title":"to_timezone class-attribute
instance-attribute
","text":"to_timezone: str = Field(default=..., alias='target_timezone', description='Target timezone. Timezone fields are validated against the `TZ database name` column in this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones')\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/transformations/date_time/__init__.py
def execute(self):\n df = self.df\n\n for target_column, column in self.get_columns_with_target():\n func = self.func # select the applicable function\n df = df.withColumn(\n target_column,\n func(f.col(column)),\n )\n\n # document which timezone a field has been converted to\n if self.add_target_timezone:\n df = df.withColumn(f\"{target_column}{self.target_timezone_column_suffix}\", f.lit(self.to_timezone))\n\n self.output.df = df\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/date_time/__init__.py
def func(self, column: Column) -> Column:\n return change_timezone(column=column, source_timezone=self.from_timezone, target_timezone=self.to_timezone)\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.validate_no_duplicate_timezones","title":"validate_no_duplicate_timezones","text":"validate_no_duplicate_timezones(values)\n
Validate that source and target timezone are not the same
Source code in src/koheesio/spark/transformations/date_time/__init__.py
@model_validator(mode=\"before\")\ndef validate_no_duplicate_timezones(cls, values):\n    \"\"\"Validate that source and target timezone are not the same\"\"\"\n    from_timezone_value = values.get(\"from_timezone\")\n    to_timezone_value = values.get(\"to_timezone\")\n\n    if from_timezone_value == to_timezone_value:\n        raise ValueError(\"Timezone conversions from and to the same timezones are not valid.\")\n\n    return values\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.validate_timezone","title":"validate_timezone","text":"validate_timezone(timezone_value)\n
Validate that the timezone is a valid timezone.
Source code in src/koheesio/spark/transformations/date_time/__init__.py
@field_validator(\"from_timezone\", \"to_timezone\")\ndef validate_timezone(cls, timezone_value):\n \"\"\"Validate that the timezone is a valid timezone.\"\"\"\n if timezone_value not in all_timezones_set:\n raise ValueError(\n \"Not a valid timezone. Refer to the `TZ database name` column here: \"\n \"https://en.wikipedia.org/wiki/List_of_tz_database_time_zones\"\n )\n return timezone_value\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.DateFormat","title":"koheesio.spark.transformations.date_time.DateFormat","text":"wrapper around pyspark.sql.functions.date_format
See Also - https://spark.apache.org/docs/3.3.2/api/python/reference/pyspark.sql/api/pyspark.sql.functions.date_format.html
- https://spark.apache.org/docs/3.3.2/sql-ref-datetime-pattern.html
Concept This Transformation allows to convert a date/timestamp/string to a value of string in the format specified by the date format given.
A pattern could be for instance dd.MM.yyyy
and could return a string like \u201818.03.1993\u2019. All pattern letters of datetime pattern can be used, see: https://spark.apache.org/docs/3.3.2/sql-ref-datetime-pattern.html
How to use If more than one column is passed, the behavior of the Class changes this way
- the transformation will be run in a loop against all the given columns
- the target_column will be used as a suffix. Leaving this blank will result in the original columns being renamed.
Example source_column value: datetime.date(2020, 1, 1)\ntarget: \"yyyyMMdd HH:mm\"\noutput: \"20200101 00:00\"\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.DateFormat.format","title":"format class-attribute
instance-attribute
","text":"format: str = Field(..., description='The format for the resulting string. See https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html')\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.DateFormat.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/date_time/__init__.py
def func(self, column: Column) -> Column:\n return date_format(column, self.format)\n
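For illustration (the column names and input_df are hypothetical), DateFormat could be used as follows: from koheesio.spark.transformations.date_time import DateFormat\n\n# render created_at as a 'yyyyMMdd HH:mm' string in created_at_formatted\noutput_df = DateFormat(\n    column=\"created_at\",\n    target_column=\"created_at_formatted\",\n    format=\"yyyyMMdd HH:mm\",\n).transform(input_df)\n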
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ToTimestamp","title":"koheesio.spark.transformations.date_time.ToTimestamp","text":"wrapper around pyspark.sql.functions.to_timestamp
Converts a Column (or set of Columns) into pyspark.sql.types.TimestampType
using the specified format. Specify formats according to datetime pattern https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
_.
Functionally equivalent to col.cast(\"timestamp\").
See Also Related Koheesio classes:
- koheesio.spark.transformations.ColumnsTransformation : Base class for ColumnsTransformation. Defines column / columns field + recursive logic
- koheesio.spark.transformations.ColumnsTransformationWithTarget : Defines target_column / target_suffix field
pyspark.sql.functions:
- datetime pattern : https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
Example"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ToTimestamp--basic-usage-example","title":"Basic usage example:","text":"input_df:
t \"1997-02-28 10:30:00\" t
is a string
tts = ToTimestamp(\n # since the source column is the same as the target in this example, 't' will be overwritten\n column=\"t\",\n target_column=\"t\",\n format=\"yyyy-MM-dd HH:mm:ss\",\n)\noutput_df = tts.transform(input_df)\n
output_df:
t datetime.datetime(1997, 2, 28, 10, 30) Now t
is a timestamp
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ToTimestamp--multiple-columns-at-once","title":"Multiple columns at once:","text":"input_df:
t1 t2 \"1997-02-28 10:30:00\" \"2007-03-31 11:40:10\" t1
and t2
are strings
tts = ToTimestamp(\n columns=[\"t1\", \"t2\"],\n # 'target_suffix' is synonymous with 'target_column'\n target_suffix=\"new\",\n format=\"yyyy-MM-dd HH:mm:ss\",\n)\noutput_df = tts.transform(input_df).select(\"t1_new\", \"t2_new\")\n
output_df:
t1_new t2_new datetime.datetime(1997, 2, 28, 10, 30) datetime.datetime(2007, 3, 31, 11, 40) Now t1_new
and t2_new
are both timestamps
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ToTimestamp.format","title":"format class-attribute
instance-attribute
","text":"format: str = Field(default=..., description='The date format for of the timestamp field. See https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html')\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ToTimestamp.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/date_time/__init__.py
def func(self, column: Column) -> Column:\n # convert string to timestamp\n converted_col = to_timestamp(column, self.format)\n return when(column.isNull(), lit(None)).otherwise(converted_col)\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.change_timezone","title":"koheesio.spark.transformations.date_time.change_timezone","text":"change_timezone(column: Union[str, Column], source_timezone: str, target_timezone: str)\n
Helper function to change from one timezone to another
wrapper around pyspark.sql.functions.from_utc_timestamp
and to_utc_timestamp
Parameters:
Name Type Description Default column
Union[str, Column]
The column to change the timezone of
required source_timezone
str
The timezone of the source_column value. Timezone fields are validated against the TZ database name
column in this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones
required target_timezone
str
The target timezone. Timezone fields are validated against the TZ database name
column in this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones
required Source code in src/koheesio/spark/transformations/date_time/__init__.py
def change_timezone(column: Union[str, Column], source_timezone: str, target_timezone: str):\n \"\"\"Helper function to change from one timezone to another\n\n wrapper around `pyspark.sql.functions.from_utc_timestamp` and `to_utc_timestamp`\n\n Parameters\n ----------\n column : Union[str, Column]\n The column to change the timezone of\n source_timezone : str\n The timezone of the source_column value. Timezone fields are validated against the `TZ database name` column in\n this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones\n target_timezone : str\n The target timezone. Timezone fields are validated against the `TZ database name` column in this list:\n https://en.wikipedia.org/wiki/List_of_tz_database_time_zones\n\n \"\"\"\n column = col(column) if isinstance(column, str) else column\n return from_utc_timestamp((to_utc_timestamp(column, source_timezone)), target_timezone)\n
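A one-line sketch of the helper itself (the ts_utc and ts_local column names are hypothetical): from koheesio.spark.transformations.date_time import change_timezone\n\n# shift ts_utc from UTC to America/New_York and store the result as ts_local\ndf = df.withColumn(\"ts_local\", change_timezone(\"ts_utc\", \"UTC\", \"America/New_York\"))\n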
"},{"location":"api_reference/spark/transformations/date_time/interval.html","title":"Interval","text":"This module provides a DateTimeColumn
class that extends the Column
class from PySpark. It allows for adding or subtracting an interval value from a datetime column.
This can be used to reflect a change in a given date / time column in a more human-readable way.
Please refer to the Spark SQL documentation for a list of valid interval values: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#interval-literal
Background The aim is to easily add an 'interval' value to, or subtract one from, a datetime column. An interval value is a string that represents a time interval. For example, '1 day', '1 month', '5 years', '1 minute 30 seconds', '10 milliseconds', etc. These can be used to reflect a change in a given date / time column in a more human-readable way.
Typically, this can be done using the date_add()
and date_sub()
functions in Spark SQL. However, these functions only support adding or subtracting a single unit of time measured in days. Using an interval gives us much more flexibility; however, Spark SQL does not provide a function to add or subtract an interval value from a datetime column through the python API directly, so we have to use the expr()
function to express the operation in SQL directly.
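As a sketch of the point above (not part of this module), the same interval arithmetic can be written against the SQL API directly with expr(); the column name 'my_column' is assumed: from pyspark.sql.functions import expr\n\n# add an interval to a timestamp column using Spark SQL's try_add, mirroring what adjust_time() builds internally\ndf = df.withColumn(\"one_day_later\", expr(\"try_add(my_column, interval '1 day')\"))\n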
This module provides a DateTimeColumn
class that extends the Column
class from PySpark. It allows for adding or subtracting an interval value from a datetime column using the +
and -
operators.
Additionally, this module provides two transformation classes that can be used as a transformation step in a pipeline:
DateTimeAddInterval
: adds an interval value to a datetime column DateTimeSubtractInterval
: subtracts an interval value from a datetime column
These classes are subclasses of ColumnsTransformationWithTarget
and hence can be used to perform transformations on multiple columns at once.
The above transformations both use the provided adjust_time()
function to perform the actual transformation.
See also: Related Koheesio classes:
- koheesio.spark.transformations.ColumnsTransformation : Base class for ColumnsTransformation. Defines column / columns field + recursive logic
- koheesio.spark.transformations.ColumnsTransformationWithTarget : Defines target_column / target_suffix field
pyspark.sql.functions:
- https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#interval-literal
- https://spark.apache.org/docs/latest/api/sql/index.html
- https://spark.apache.org/docs/latest/api/sql/#try_add
- https://spark.apache.org/docs/latest/api/sql/#try_subtract
Classes:
Name Description DateTimeColumn
A datetime column that can be adjusted by adding or subtracting an interval value using the +
and -
operators.
DateTimeAddInterval
A transformation that adds an interval value to a datetime column. This class is a subclass of ColumnsTransformationWithTarget
and hence can be used as a transformation step in a pipeline. See ColumnsTransformationWithTarget
for more information.
DateTimeSubtractInterval
A transformation that subtracts an interval value from a datetime column. This class is a subclass of ColumnsTransformationWithTarget
and hence can be used as a transformation step in a pipeline. See ColumnsTransformationWithTarget
for more information.
Note the DateTimeAddInterval
and DateTimeSubtractInterval
classes are very similar. The only difference is that one adds an interval value to a datetime column, while the other subtracts an interval value from a datetime column.
Functions:
Name Description dt_column
Converts a column to a DateTimeColumn
. This function aims to be a drop-in replacement for pyspark.sql.functions.col
that returns a DateTimeColumn
instead of a Column
.
adjust_time
Adjusts a datetime column by adding or subtracting an interval value.
validate_interval
Validates a given interval string.
Example"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval--various-ways-to-create-and-interact-with-datetimecolumn","title":"Various ways to create and interact with DateTimeColumn
:","text":" - Create a
DateTimeColumn
from a string: dt_column(\"my_column\")
- Create a
DateTimeColumn
from a Column
: dt_column(df.my_column)
- Use the
+
and -
operators to add or subtract an interval value from a DateTimeColumn
: dt_column(\"my_column\") + \"1 day\"
dt_column(\"my_column\") - \"1 month\"
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval--functional-examples-using-adjust_time","title":"Functional examples using adjust_time()
:","text":" - Add 1 day to a column:
adjust_time(\"my_column\", operation=\"add\", interval=\"1 day\")
- Subtract 1 month from a column:
adjust_time(\"my_column\", operation=\"subtract\", interval=\"1 month\")
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval--as-a-transformation-step","title":"As a transformation step:","text":"from koheesio.spark.transformations.date_time.interval import (\n DateTimeAddInterval,\n)\n\ninput_df = spark.createDataFrame([(1, \"2022-01-01 00:00:00\")], [\"id\", \"my_column\"])\n\n# add 1 day to my_column and store the result in a new column called 'one_day_later'\noutput_df = DateTimeAddInterval(column=\"my_column\", target_column=\"one_day_later\", interval=\"1 day\").transform(input_df)\n
output_df: id my_column one_day_later 1 2022-01-01 00:00:00 2022-01-02 00:00:00 DateTimeSubtractInterval
works in a similar way, but subtracts an interval value from a datetime column.
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.Operations","title":"koheesio.spark.transformations.date_time.interval.Operations module-attribute
","text":"Operations = Literal['add', 'subtract']\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval","title":"koheesio.spark.transformations.date_time.interval.DateTimeAddInterval","text":"A transformation that adds or subtracts a specified interval from a datetime column.
See also: pyspark.sql.functions:
- https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#interval-literal
- https://spark.apache.org/docs/latest/api/sql/index.html#interval
Parameters:
Name Type Description Default interval
str
The interval to add to the datetime column.
required operation
Operations
The operation to perform. Must be either 'add' or 'subtract'.
add
Example"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval--add-1-day-to-a-column","title":"add 1 day to a column","text":"DateTimeAddInterval(\n column=\"my_column\",\n interval=\"1 day\",\n).transform(df)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval--subtract-1-month-from-my_column-and-store-the-result-in-a-new-column-called-one_month_earlier","title":"subtract 1 month from my_column
and store the result in a new column called one_month_earlier
","text":"DateTimeSubtractInterval(\n column=\"my_column\",\n target_column=\"one_month_earlier\",\n interval=\"1 month\",\n)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval.interval","title":"interval class-attribute
instance-attribute
","text":"interval: str = Field(default=..., description='The interval to add to the datetime column.', examples=['1 day', '5 years', '3 months'])\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval.operation","title":"operation class-attribute
instance-attribute
","text":"operation: Operations = Field(default='add', description=\"The operation to perform. Must be either 'add' or 'subtract'.\")\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval.validate_interval","title":"validate_interval class-attribute
instance-attribute
","text":"validate_interval = field_validator('interval')(validate_interval)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/date_time/interval.py
def func(self, column: Column):\n return adjust_time(column, operation=self.operation, interval=self.interval)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeColumn","title":"koheesio.spark.transformations.date_time.interval.DateTimeColumn","text":"A datetime column that can be adjusted by adding or subtracting an interval value using the +
and -
operators.
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeColumn.from_column","title":"from_column classmethod
","text":"from_column(column: Column)\n
Create a DateTimeColumn from an existing Column
Source code in src/koheesio/spark/transformations/date_time/interval.py
@classmethod\ndef from_column(cls, column: Column):\n \"\"\"Create a DateTimeColumn from an existing Column\"\"\"\n return cls(column._jc)\n
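A brief usage sketch (assuming a DataFrame df with a timestamp column 'my_column'); dt_column is the usual way to obtain a DateTimeColumn and calls from_column under the hood, and the result of the + operator is assumed to be usable like any other Column: from koheesio.spark.transformations.date_time.interval import dt_column\n\n# add one day to 'my_column' using the overloaded + operator\ndf = df.withColumn(\"one_day_later\", dt_column(\"my_column\") + \"1 day\")\n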
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeSubtractInterval","title":"koheesio.spark.transformations.date_time.interval.DateTimeSubtractInterval","text":"Subtracts a specified interval from a datetime column.
Works in the same way as DateTimeAddInterval
, but subtracts the specified interval from the datetime column. See DateTimeAddInterval
for more information.
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeSubtractInterval.operation","title":"operation class-attribute
instance-attribute
","text":"operation: Operations = Field(default='subtract', description=\"The operation to perform. Must be either 'add' or 'subtract'.\")\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.adjust_time","title":"koheesio.spark.transformations.date_time.interval.adjust_time","text":"adjust_time(column: Column, operation: Operations, interval: str) -> Column\n
Adjusts a datetime column by adding or subtracting an interval value.
This can be used to reflect a change in a given date / time column in a more human-readable way.
See also Please refer to the Spark SQL documentation for a list of valid interval values: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#interval-literal
Example Parameters:
Name Type Description Default column
Column
The datetime column to adjust.
required operation
Operations
The operation to perform. Must be either 'add' or 'subtract'.
required interval
str
The value to add or subtract. Must be a valid interval string.
required Returns:
Type Description Column
The adjusted datetime column.
Source code in src/koheesio/spark/transformations/date_time/interval.py
def adjust_time(column: Column, operation: Operations, interval: str) -> Column:\n \"\"\"\n Adjusts a datetime column by adding or subtracting an interval value.\n\n This can be used to reflect a change in a given date / time column in a more human-readable way.\n\n\n See also\n --------\n Please refer to the Spark SQL documentation for a list of valid interval values:\n https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#interval-literal\n\n ### pyspark.sql.functions:\n\n * https://spark.apache.org/docs/latest/api/sql/index.html#interval\n * https://spark.apache.org/docs/latest/api/sql/#try_add\n * https://spark.apache.org/docs/latest/api/sql/#try_subtract\n\n Example\n --------\n ### add 1 day to a column\n ```python\n adjust_time(\"my_column\", operation=\"add\", interval=\"1 day\")\n ```\n\n ### subtract 1 month from a column\n ```python\n adjust_time(\"my_column\", operation=\"subtract\", interval=\"1 month\")\n ```\n\n ### or, a much more complicated example\n\n In this example, we add 5 days, 3 hours, 7 minutes, 30 seconds, and 1 millisecond to a column called `my_column`.\n ```python\n adjust_time(\n \"my_column\",\n operation=\"add\",\n interval=\"5 days 3 hours 7 minutes 30 seconds 1 millisecond\",\n )\n ```\n\n Parameters\n ----------\n column : Column\n The datetime column to adjust.\n operation : Operations\n The operation to perform. Must be either 'add' or 'subtract'.\n interval : str\n The value to add or subtract. Must be a valid interval string.\n\n Returns\n -------\n Column\n The adjusted datetime column.\n \"\"\"\n\n # check that value is a valid interval\n interval = validate_interval(interval)\n\n column_name = column._jc.toString()\n\n # determine the operation to perform\n try:\n operation = {\n \"add\": \"try_add\",\n \"subtract\": \"try_subtract\",\n }[operation]\n except KeyError as e:\n raise ValueError(f\"Operation '{operation}' is not valid. Must be either 'add' or 'subtract'.\") from e\n\n # perform the operation\n _expression = f\"{operation}({column_name}, interval '{interval}')\"\n column = expr(_expression)\n\n return column\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.adjust_time--pysparksqlfunctions","title":"pyspark.sql.functions:","text":" - https://spark.apache.org/docs/latest/api/sql/index.html#interval
- https://spark.apache.org/docs/latest/api/sql/#try_add
- https://spark.apache.org/docs/latest/api/sql/#try_subtract
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.adjust_time--add-1-day-to-a-column","title":"add 1 day to a column","text":"adjust_time(\"my_column\", operation=\"add\", interval=\"1 day\")\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.adjust_time--subtract-1-month-from-a-column","title":"subtract 1 month from a column","text":"adjust_time(\"my_column\", operation=\"subtract\", interval=\"1 month\")\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.adjust_time--or-a-much-more-complicated-example","title":"or, a much more complicated example","text":"In this example, we add 5 days, 3 hours, 7 minutes, 30 seconds, and 1 millisecond to a column called my_column
.
adjust_time(\n \"my_column\",\n operation=\"add\",\n interval=\"5 days 3 hours 7 minutes 30 seconds 1 millisecond\",\n)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.dt_column","title":"koheesio.spark.transformations.date_time.interval.dt_column","text":"dt_column(column: Union[str, Column]) -> DateTimeColumn\n
Convert a column to a DateTimeColumn
Aims to be a drop-in replacement for pyspark.sql.functions.col
that returns a DateTimeColumn instead of a Column.
Example Parameters:
Name Type Description Default column
Union[str, Column]
The column (or name of the column) to convert to a DateTimeColumn
required Source code in src/koheesio/spark/transformations/date_time/interval.py
def dt_column(column: Union[str, Column]) -> DateTimeColumn:\n \"\"\"Convert a column to a DateTimeColumn\n\n Aims to be a drop-in replacement for `pyspark.sql.functions.col` that returns a DateTimeColumn instead of a Column.\n\n Example\n --------\n ### create a DateTimeColumn from a string\n ```python\n dt_column(\"my_column\")\n ```\n\n ### create a DateTimeColumn from a Column\n ```python\n dt_column(df.my_column)\n ```\n\n Parameters\n ----------\n column : Union[str, Column]\n The column (or name of the column) to convert to a DateTimeColumn\n \"\"\"\n if isinstance(column, str):\n column = col(column)\n elif not isinstance(column, Column):\n raise TypeError(f\"Expected column to be of type str or Column, got {type(column)} instead.\")\n return DateTimeColumn.from_column(column)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.dt_column--create-a-datetimecolumn-from-a-string","title":"create a DateTimeColumn from a string","text":"dt_column(\"my_column\")\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.dt_column--create-a-datetimecolumn-from-a-column","title":"create a DateTimeColumn from a Column","text":"dt_column(df.my_column)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.validate_interval","title":"koheesio.spark.transformations.date_time.interval.validate_interval","text":"validate_interval(interval: str)\n
Validate an interval string
Parameters:
Name Type Description Default interval
str
The interval string to validate
required Raises:
Type Description ValueError
If the interval string is invalid
Source code in src/koheesio/spark/transformations/date_time/interval.py
def validate_interval(interval: str):\n \"\"\"Validate an interval string\n\n Parameters\n ----------\n interval : str\n The interval string to validate\n\n Raises\n ------\n ValueError\n If the interval string is invalid\n \"\"\"\n try:\n expr(f\"interval '{interval}'\")\n except ParseException as e:\n raise ValueError(f\"Value '{interval}' is not a valid interval.\") from e\n return interval\n
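A small usage sketch (assuming an active SparkSession, since the check is delegated to expr()); the invalid unit below is made up for illustration: from koheesio.spark.transformations.date_time.interval import validate_interval\n\nvalidate_interval(\"5 days 3 hours\")  # returns the string unchanged\ntry:\n    validate_interval(\"1 fortnight\")  # made-up unit, expected to raise\nexcept ValueError as err:\n    print(err)\n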
"},{"location":"api_reference/spark/transformations/strings/index.html","title":"Strings","text":"Adds a number of Transformations that are intended to be used with StringType column input. Some will work with other types however, but will output StringType or an array of StringType.
These Transformations take full advantage of Koheesio's ColumnsTransformationWithTarget class, allowing a user to apply column transformations to multiple columns at once. See the class docstrings for more information.
The following Transformations are included:
change_case:
Lower
Converts a string column to lower case. Upper
Converts a string column to upper case. TitleCase
or InitCap
Converts a string column to title case, where each word starts with a capital letter.
concat:
Concat
Concatenates multiple input columns together into a single column, optionally using the given separator.
pad:
Pad
Pads the values of source_column
with the character
up until it reaches length
of characters LPad
Pad with a character on the left side of the string. RPad
Pad with a character on the right side of the string.
regexp:
RegexpExtract
Extract a specific group matched by a Java regexp from the specified string column. RegexpReplace
Searches for the given regexp and replaces all instances with what is in 'replacement'.
replace:
Replace
Replace all instances of a string in a column with another string.
split:
SplitAll
Splits the contents of a column on basis of a split_pattern. SplitAtFirstMatch
Like SplitAll, but only splits the string once. You can specify whether you want the first or second part.
substring:
Substring
Extracts a substring from a string column starting at the given position.
trim:
Trim
Trim whitespace from the beginning and/or end of a string. LTrim
Trim whitespace from the beginning of a string. RTrim
Trim whitespace from the end of a string.
"},{"location":"api_reference/spark/transformations/strings/change_case.html","title":"Change case","text":"Convert the case of a string column to upper case, lower case, or title case
Classes:
Name Description `Lower`
Converts a string column to lower case.
`Upper`
Converts a string column to upper case.
`TitleCase` or `InitCap`
Converts a string column to title case, where each word starts with a capital letter.
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.InitCap","title":"koheesio.spark.transformations.strings.change_case.InitCap module-attribute
","text":"InitCap = TitleCase\n
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.LowerCase","title":"koheesio.spark.transformations.strings.change_case.LowerCase","text":"This function makes the contents of a column lower case.
Wraps the pyspark.sql.functions.lower
function.
Warnings If the type of the column is not string, LowerCase
will not be run. A Warning will be thrown indicating this.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The name of the column or columns to convert to lower case. Alias: column. Lower case will be applied to all columns in the list. Column is required to be of string type.
required target_column
The name of the column to store the result in. If None, the result will be stored in the same column as the input.
required Example input_df:
product amount country Banana lemon orange 1000 USA Carrots Blueberries 1500 USA Beans 1600 USA output_df = LowerCase(column=\"product\", target_column=\"product_lower\").transform(df)\n
output_df:
product amount country product_lower Banana lemon orange 1000 USA banana lemon orange Carrots Blueberries 1500 USA carrots blueberries Beans 1600 USA beans In this example, the column product
is converted to product_lower
and the contents of this column are converted to lower case.
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.LowerCase.ColumnConfig","title":"ColumnConfig","text":"Limit data type to string
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.LowerCase.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [STRING]\n
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.LowerCase.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [STRING]\n
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.LowerCase.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/change_case.py
def func(self, column: Column):\n return lower(column)\n
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.TitleCase","title":"koheesio.spark.transformations.strings.change_case.TitleCase","text":"This function makes the contents of a column title case. This means that every word starts with an upper case.
Wraps the pyspark.sql.functions.initcap
function.
Warnings If the type of the column is not string, TitleCase will not be run. A Warning will be thrown indicating this.
Parameters:
Name Type Description Default columns
The name of the column or columns to convert to title case. Alias: column. Title case will be applied to all columns in the list. Column is required to be of string type.
required target_column
The name of the column to store the result in. If None, the result will be stored in the same column as the input.
required Example input_df:
product amount country Banana lemon orange 1000 USA Carrots blueberries 1500 USA Beans 1600 USA output_df = TitleCase(column=\"product\", target_column=\"product_title\").transform(df)\n
output_df:
product amount country product_title Banana lemon orange 1000 USA Banana Lemon Orange Carrots blueberries 1500 USA Carrots Blueberries Beans 1600 USA Beans In this example, the column product
is converted to product_title
and the contents of this column are converted to title case (each word now starts with an upper case).
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.TitleCase.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/change_case.py
def func(self, column: Column):\n return initcap(column)\n
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.UpperCase","title":"koheesio.spark.transformations.strings.change_case.UpperCase","text":"This function makes the contents of a column upper case.
Wraps the pyspark.sql.functions.upper
function.
Warnings If the type of the column is not string, UpperCase
will not be run. A Warning will be thrown indicating this.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The name of the column or columns to convert to upper case. Alias: column. Upper case will be applied to all columns in the list. Column is required to be of string type.
required target_column
The name of the column to store the result in. If None, the result will be stored in the same column as the input.
required Examples:
input_df:
product amount country Banana lemon orange 1000 USA Carrots Blueberries 1500 USA Beans 1600 USA output_df = UpperCase(column=\"product\", target_column=\"product_upper\").transform(df)\n
output_df:
product amount country product_upper Banana lemon orange 1000 USA BANANA LEMON ORANGE Carrots Blueberries 1500 USA CARROTS BLUEBERRIES Beans 1600 USA BEANS In this example, the column product
is converted to product_upper
and the contents of this column are converted to upper case.
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.UpperCase.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/change_case.py
def func(self, column: Column):\n return upper(column)\n
"},{"location":"api_reference/spark/transformations/strings/concat.html","title":"Concat","text":"Concatenates multiple input columns together into a single column, optionally using a given separator.
"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat","title":"koheesio.spark.transformations.strings.concat.Concat","text":"This is a wrapper around PySpark concat() and concat_ws() functions
Concatenates multiple input columns together into a single column, optionally using a given separator. The function works with strings, date/timestamps, binary, and compatible array columns.
Concept When working with arrays, the function will return the result of the concatenation of the elements in the array.
- If a spacer is used, the resulting output will be a string with the elements of the array separated by the spacer.
- If no spacer is used, the resulting output will be an array with the elements of the array concatenated together.
When working with date/timestamps, the function will return the result of the concatenation of the elements in the array. The timestamp is converted to a string using the default format of yyyy-MM-dd HH:mm:ss
.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to concatenate. Alias: column. If at least one of the values is None or null, the resulting string will also be None/null (except for when using arrays). Columns can be of any type, but must ideally be of the same type. Different types can be used, but the function will convert them to string values first.
required target_column
Optional[str]
Target column name. When this is left blank, a name will be generated by concatenating the names of the source columns with an '_'.
None
spacer
Optional[str]
Optional spacer / separator symbol. Defaults to None. When left blank, no spacer is used
None
Example"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat--example-using-a-string-column-and-a-timestamp-column","title":"Example using a string column and a timestamp column","text":"input_df:
column_a column_b text 1997-02-28 10:30:00 output_df = Concat(\n columns=[\"column_a\", \"column_b\"],\n target_column=\"concatenated_column\",\n spacer=\"--\",\n).transform(input_df)\n
output_df:
column_a column_b concatenated_column text 1997-02-28 10:30:00 text--1997-02-28 10:30:00 In the example above, the resulting column is a string column.
If we had left out the spacer, the resulting column would have had the value of text1997-02-28 10:30:00
(a string). Note that the timestamp is converted to a string using the default format of yyyy-MM-dd HH:mm:ss
.
"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat--example-using-two-array-columns","title":"Example using two array columns","text":"input_df:
array_col_1 array_col_2 [text1, text2] [text3, null] output_df = Concat(\n    columns=[\"array_col_1\", \"array_col_2\"],\n    target_column=\"concatenated_column\",\n    spacer=\"--\",\n).transform(input_df)\n
output_df:
array_col_1 array_col_2 concatenated_column [text1, text2] [text3, null] \"text1--text2--text3\" Note that the null value in the second array is ignored. If we had left out the spacer, the resulting column would have been an array with the values of [\"text1\", \"text2\", \"text3\"]
.
Array columns can only be concatenated with another array column. If you want to concatenate an array column with a none-array value, you will have to convert said column to an array first.
"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat.spacer","title":"spacer class-attribute
instance-attribute
","text":"spacer: Optional[str] = Field(default=None, description='Optional spacer / separator symbol. Defaults to None. When left blank, no spacer is used', alias='sep')\n
"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat.target_column","title":"target_column class-attribute
instance-attribute
","text":"target_column: Optional[str] = Field(default=None, description=\"Target column name. When this is left blank, a name will be generated by concatenating the names of the source columns with an '_'.\")\n
"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat.execute","title":"execute","text":"execute() -> DataFrame\n
Source code in src/koheesio/spark/transformations/strings/concat.py
def execute(self) -> DataFrame:\n columns = [col(s) for s in self.get_columns()]\n self.output.df = self.df.withColumn(\n self.target_column, concat_ws(self.spacer, *columns) if self.spacer else concat(*columns)\n )\n
"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat.get_target_column","title":"get_target_column","text":"get_target_column(target_column_value, values)\n
Get the target column name if it is not provided.
If not provided, a name will be generated by concatenating the names of the source columns with an '_'.
Source code in src/koheesio/spark/transformations/strings/concat.py
@field_validator(\"target_column\")\ndef get_target_column(cls, target_column_value, values):\n \"\"\"Get the target column name if it is not provided.\n\n If not provided, a name will be generated by concatenating the names of the source columns with an '_'.\"\"\"\n if not target_column_value:\n columns_value: List = values[\"columns\"]\n columns = list(dict.fromkeys(columns_value)) # dict.fromkeys is used to dedup while maintaining order\n return \"_\".join(columns)\n\n return target_column_value\n
"},{"location":"api_reference/spark/transformations/strings/pad.html","title":"Pad","text":"Pad the values of a column with a character up until it reaches a certain length.
Classes:
Name Description Pad
Pads the values of source_column
with the character
up until it reaches length
of characters
LPad
Pad with a character on the left side of the string.
RPad
Pad with a character on the right side of the string.
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.LPad","title":"koheesio.spark.transformations.strings.pad.LPad module-attribute
","text":"LPad = Pad\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.pad_directions","title":"koheesio.spark.transformations.strings.pad.pad_directions module-attribute
","text":"pad_directions = Literal['left', 'right']\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.Pad","title":"koheesio.spark.transformations.strings.pad.Pad","text":"Pads the values of source_column
with the character
up until it reaches length
of characters The direction
param can be changed to apply either a left or a right pad. Defaults to left pad.
Wraps the lpad
and rpad
functions from PySpark.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to pad. Alias: column
required target_column
Optional[str]
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
None
character
constr(min_length=1)
The character to use for padding
required length
PositiveInt
Positive integer to indicate the intended length
required direction
Optional[pad_directions]
On which side to add the characters. Either \"left\" or \"right\". Defaults to \"left\"
left
Example input_df:
column hello world output_df = Pad(\n column=\"column\",\n target_column=\"padded_column\",\n character=\"*\",\n length=10,\n direction=\"right\",\n).transform(input_df)\n
output_df:
column padded_column hello hello***** world world***** Note: in the example above, we could have used the RPad class instead of Pad with direction=\"right\" to achieve the same result.
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.Pad.character","title":"character class-attribute
instance-attribute
","text":"character: constr(min_length=1) = Field(default=..., description='The character to use for padding')\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.Pad.direction","title":"direction class-attribute
instance-attribute
","text":"direction: Optional[pad_directions] = Field(default='left', description='On which side to add the characters . Either \"left\" or \"right\". Defaults to \"left\"')\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.Pad.length","title":"length class-attribute
instance-attribute
","text":"length: PositiveInt = Field(default=..., description='Positive integer to indicate the intended length')\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.Pad.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/pad.py
def func(self, column: Column):\n func = lpad if self.direction == \"left\" else rpad\n return func(column, self.length, self.character)\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.RPad","title":"koheesio.spark.transformations.strings.pad.RPad","text":"Pad with a character on the right side of the string.
See Pad class docstring for more information.
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.RPad.direction","title":"direction class-attribute
instance-attribute
","text":"direction: Optional[pad_directions] = 'right'\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html","title":"Regexp","text":"String transformations using regular expressions.
This module contains transformations that use regular expressions to transform strings.
Classes:
Name Description RegexpExtract
Extract a specific group matched by a Java regexp from the specified string column.
RegexpReplace
Searches for the given regexp and replaces all instances with what is in 'replacement'.
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract","title":"koheesio.spark.transformations.strings.regexp.RegexpExtract","text":"Extract a specific group matched by a Java regexp from the specified string column. If the regexp did not match, or the specified group did not match, an empty string is returned.
A wrapper around the pyspark regexp_extract function
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to extract from. Alias: column
required target_column
Optional[str]
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
None
regexp
str
The Java regular expression to extract
required index
Optional[int]
When there are more groups in the match, you can indicate which one you want. 0 means the whole match. 1 and above are groups within that match.
0
Example"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract--extracting-the-year-and-week-number-from-a-string","title":"Extracting the year and week number from a string","text":"Let's say we have a column containing the year and week in a format like Y## W#
and we would like to extract the week numbers.
input_df:
YWK 2020 W1 2021 WK2 output_df = RegexpExtract(\n column=\"YWK\",\n target_column=\"week_number\",\n regexp=\"Y([0-9]+) ?WK?([0-9]+)\",\n index=2, # remember that this is 1-indexed! So 2 will get the week number in this example.\n).transform(input_df)\n
output_df:
YWK week_number 2020 W1 1 2021 WK2 2"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract--using-the-same-example-but-extracting-the-year-instead","title":"Using the same example, but extracting the year instead","text":"If you want to extract the year, you can use index=1.
output_df = RegexpExtract(\n column=\"YWK\",\n target_column=\"year\",\n regexp=\"Y([0-9]+) ?WK?([0-9]+)\",\n index=1, # remember that this is 1-indexed! So 1 will get the year in this example.\n).transform(input_df)\n
output_df:
YWK year 2020 W1 2020 2021 WK2 2021"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract.index","title":"index class-attribute
instance-attribute
","text":"index: Optional[int] = Field(default=0, description='When there are more groups in the match, you can indicate which one you want. 0 means the whole match. 1 and above are groups within that match.')\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract.regexp","title":"regexp class-attribute
instance-attribute
","text":"regexp: str = Field(default=..., description='The Java regular expression to extract')\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/regexp.py
def func(self, column: Column):\n return regexp_extract(column, self.regexp, self.index)\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpReplace","title":"koheesio.spark.transformations.strings.regexp.RegexpReplace","text":"Searches for the given regexp and replaces all instances with what is in 'replacement'.
A wrapper around the pyspark regexp_replace function
Parameters:
Name Type Description Default columns
The column (or list of columns) to replace in. Alias: column
required target_column
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
required regexp
The regular expression to replace
required replacement
String to replace matched pattern with.
required Examples:
input_df: | content | |------------| | hello world|
Let's say you want to replace 'hello'.
output_df = RegexpReplace(\n column=\"content\",\n target_column=\"replaced\",\n regexp=\"hello\",\n replacement=\"gutentag\",\n).transform(input_df)\n
output_df: | content | replaced | |------------|---------------| | hello world| gutentag world|
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpReplace.regexp","title":"regexp class-attribute
instance-attribute
","text":"regexp: str = Field(default=..., description='The regular expression to replace')\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpReplace.replacement","title":"replacement class-attribute
instance-attribute
","text":"replacement: str = Field(default=..., description='String to replace matched pattern with.')\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpReplace.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/regexp.py
def func(self, column: Column):\n return regexp_replace(column, self.regexp, self.replacement)\n
"},{"location":"api_reference/spark/transformations/strings/replace.html","title":"Replace","text":"String replacements without using regular expressions.
"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace","title":"koheesio.spark.transformations.strings.replace.Replace","text":"Replace all instances of a string in a column with another string.
This transformation uses PySpark when().otherwise() functions.
Notes - If original_value is not set, the transformation will replace all null values with new_value
- If original_value is set, the transformation will replace all values matching original_value with new_value
- Numeric values are supported, but will be cast to string in the process
- Replace is meant for simple string replacements. If more advanced replacements are needed, use the
RegexpReplace
transformation instead.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to replace values in. Alias: column
required target_column
Optional[str]
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
None
original_value
Optional[str]
The original value that needs to be replaced. Alias: from
None
new_value
str
The new value to replace this with. Alias: to
required Examples:
input_df:
column hello world None"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace--replace-all-null-values-with-a-new-value","title":"Replace all null values with a new value","text":"output_df = Replace(\n column=\"column\",\n target_column=\"replaced_column\",\n original_value=None, # This is the default value, so it can be omitted\n new_value=\"programmer\",\n).transform(input_df)\n
output_df:
column replaced_column hello hello world world None programmer"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace--replace-all-instances-of-a-string-in-a-column-with-another-string","title":"Replace all instances of a string in a column with another string","text":"output_df = Replace(\n column=\"column\",\n target_column=\"replaced_column\",\n original_value=\"world\",\n new_value=\"programmer\",\n).transform(input_df)\n
output_df:
column replaced_column hello hello world programmer None None"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace.new_value","title":"new_value class-attribute
instance-attribute
","text":"new_value: str = Field(default=..., alias='to', description='The new value to replace this with')\n
"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace.original_value","title":"original_value class-attribute
instance-attribute
","text":"original_value: Optional[str] = Field(default=None, alias='from', description='The original value that needs to be replaced')\n
"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace.cast_values_to_str","title":"cast_values_to_str","text":"cast_values_to_str(value)\n
Cast values to string if they are not None
Source code in src/koheesio/spark/transformations/strings/replace.py
@field_validator(\"original_value\", \"new_value\", mode=\"before\")\ndef cast_values_to_str(cls, value):\n \"\"\"Cast values to string if they are not None\"\"\"\n if value:\n return str(value)\n
"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/replace.py
def func(self, column: Column):\n when_statement = (\n when(column.isNull(), lit(self.new_value))\n if not self.original_value\n else when(\n column == self.original_value,\n lit(self.new_value),\n )\n )\n return when_statement.otherwise(column)\n
"},{"location":"api_reference/spark/transformations/strings/split.html","title":"Split","text":"Splits the contents of a column on basis of a split_pattern
Classes:
Name Description SplitAll
Splits the contents of a column on basis of a split_pattern.
SplitAtFirstMatch
Like SplitAll, but only splits the string once. You can specify whether you want the first or second part.
"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAll","title":"koheesio.spark.transformations.strings.split.SplitAll","text":"This function splits the contents of a column on basis of a split_pattern.
It splits at al the locations the pattern is found. The new column will be of ArrayType.
Wraps the pyspark.sql.functions.split function.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to split. Alias: column
required target_column
Optional[str]
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
None
split_pattern
str
This is the pattern that will be used to split the column contents.
required Example"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAll--splitting-with-a-space-as-a-pattern","title":"Splitting with a space as a pattern:","text":"input_df:
product amount country Banana lemon orange 1000 USA Carrots Blueberries 1500 USA Beans 1600 USA output_df = SplitAll(column=\"product\", target_column=\"split\", split_pattern=\" \").transform(input_df)\n
output_df:
product amount country split Banana lemon orange 1000 USA [\"Banana\", \"lemon\" \"orange\"] Carrots Blueberries 1500 USA [\"Carrots\", \"Blueberries\"] Beans 1600 USA [\"Beans\"]"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAll.split_pattern","title":"split_pattern class-attribute
instance-attribute
","text":"split_pattern: str = Field(default=..., description='The pattern to split the column contents.')\n
"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAll.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/split.py
def func(self, column: Column):\n return split(column, pattern=self.split_pattern)\n
"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAtFirstMatch","title":"koheesio.spark.transformations.strings.split.SplitAtFirstMatch","text":"Like SplitAll, but only splits the string once. You can specify whether you want the first or second part..
Note - SplitAtFirstMatch splits the string only once. To specify whether you want the first part or the second you can toggle the parameter retrieve_first_part.
- The new column will be of StringType.
- If you want to split a column more than once, you should call this function multiple times.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to split. Alias: column
required target_column
Optional[str]
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
None
split_pattern
str
This is the pattern that will be used to split the column contents.
required retrieve_first_part
Optional[bool]
Takes the first part of the split when true, the second part when False. Other parts are ignored. Defaults to True.
True
Example"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAtFirstMatch--splitting-with-a-space-as-a-pattern","title":"Splitting with a space as a pattern:","text":"input_df:
product amount country Banana lemon orange 1000 USA Carrots Blueberries 1500 USA Beans 1600 USA output_df = SplitColumn(column=\"product\", target_column=\"split_first\", split_pattern=\"an\").transform(input_df)\n
output_df:
product amount country split_first Banana lemon orange 1000 USA B Carrots Blueberries 1500 USA Carrots Blueberries Beans 1600 USA Be"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAtFirstMatch.retrieve_first_part","title":"retrieve_first_part class-attribute
instance-attribute
","text":"retrieve_first_part: Optional[bool] = Field(default=True, description='Takes the first part of the split when true, the second part when False. Other parts are ignored.')\n
"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAtFirstMatch.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/split.py
def func(self, column: Column):\n split_func = split(column, pattern=self.split_pattern)\n\n # first part\n if self.retrieve_first_part:\n return split_func.getItem(0)\n\n # or, second part\n return coalesce(split_func.getItem(1), lit(\"\"))\n
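To retrieve the second part instead of the first (a sketch reusing the example above), set retrieve_first_part to False: output_df = SplitAtFirstMatch(\n    column=\"product\",\n    target_column=\"split_second\",\n    split_pattern=\"an\",\n    retrieve_first_part=False,\n).transform(input_df)\n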
"},{"location":"api_reference/spark/transformations/strings/substring.html","title":"Substring","text":"Extracts a substring from a string column starting at the given position.
"},{"location":"api_reference/spark/transformations/strings/substring.html#koheesio.spark.transformations.strings.substring.Substring","title":"koheesio.spark.transformations.strings.substring.Substring","text":"Extracts a substring from a string column starting at the given position.
This is a wrapper around PySpark substring() function
Notes - Numeric columns will be cast to string
- start is 1-indexed, not 0-indexed!
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to substring. Alias: column
required target_column
Optional[str]
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
None
start
PositiveInt
Positive int. Defines where to begin the substring from. The first character of the field has index 1!
required length
Optional[int]
Optional. If not provided, the substring will go until end of string.
-1
Example"},{"location":"api_reference/spark/transformations/strings/substring.html#koheesio.spark.transformations.strings.substring.Substring--extract-a-substring-from-a-string-column-starting-at-the-given-position","title":"Extract a substring from a string column starting at the given position.","text":"input_df:
column skyscraper output_df = Substring(\n column=\"column\",\n target_column=\"substring_column\",\n start=3, # 1-indexed! So this will start at the 3rd character\n length=4,\n).transform(input_df)\n
output_df:
column substring_column skyscraper yscr"},{"location":"api_reference/spark/transformations/strings/substring.html#koheesio.spark.transformations.strings.substring.Substring.length","title":"length class-attribute
instance-attribute
","text":"length: Optional[int] = Field(default=-1, description='The target length for the string. use -1 to perform until end')\n
"},{"location":"api_reference/spark/transformations/strings/substring.html#koheesio.spark.transformations.strings.substring.Substring.start","title":"start class-attribute
instance-attribute
","text":"start: PositiveInt = Field(default=..., description='The starting position')\n
"},{"location":"api_reference/spark/transformations/strings/substring.html#koheesio.spark.transformations.strings.substring.Substring.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/substring.py
def func(self, column: Column):\n return when(column.isNull(), None).otherwise(substring(column, self.start, self.length)).cast(StringType())\n
"},{"location":"api_reference/spark/transformations/strings/trim.html","title":"Trim","text":"Trim whitespace from the beginning and/or end of a string.
Classes:
Name Description - `Trim`
Trim whitespace from the beginning and/or end of a string.
- `LTrim`
Trim whitespace from the beginning of a string.
- `RTrim`
Trim whitespace from the end of a string.
See class docstrings for more information.
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.trim_type","title":"koheesio.spark.transformations.strings.trim.trim_type module-attribute
","text":"trim_type = Literal['left', 'right', 'left-right']\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.LTrim","title":"koheesio.spark.transformations.strings.trim.LTrim","text":"Trim whitespace from the beginning of a string. Alias: LeftTrim
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.LTrim.direction","title":"direction class-attribute
instance-attribute
","text":"direction: trim_type = 'left'\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.RTrim","title":"koheesio.spark.transformations.strings.trim.RTrim","text":"Trim whitespace from the end of a string. Alias: RightTrim
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.RTrim.direction","title":"direction class-attribute
instance-attribute
","text":"direction: trim_type = 'right'\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim","title":"koheesio.spark.transformations.strings.trim.Trim","text":"Trim whitespace from the beginning and/or end of a string.
This is a wrapper around PySpark ltrim() and rtrim() functions
The direction
parameter can be changed to apply either a left or a right trim. Defaults to left AND right trim.
Note: If the type of the column is not string, Trim will not be run. A Warning will be thrown indicating this
Parameters:
Name Type Description Default columns
The column (or list of columns) to trim. Alias: column If no columns are provided, all string columns will be trimmed.
required target_column
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
required direction
On which side to remove the spaces. Either \"left\", \"right\" or \"left-right\". Defaults to \"left-right\"
required Examples:
input_df: | column | |-----------| | \" hello \" |
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim--trim-whitespace-from-the-beginning-of-a-string","title":"Trim whitespace from the beginning of a string","text":"output_df = Trim(column=\"column\", target_column=\"trimmed_column\", direction=\"left\").transform(input_df)\n
output_df: | column | trimmed_column | |-----------|----------------| | \" hello \" | \"hello \" |
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim--trim-whitespace-from-both-sides-of-a-string","title":"Trim whitespace from both sides of a string","text":"output_df = Trim(\n column=\"column\",\n target_column=\"trimmed_column\",\n direction=\"left-right\", # default value\n).transform(input_df)\n
output_df: | column | trimmed_column | |-----------|----------------| | \" hello \" | \"hello\" |
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim--trim-whitespace-from-the-end-of-a-string","title":"Trim whitespace from the end of a string","text":"output_df = Trim(column=\"column\", target_column=\"trimmed_column\", direction=\"right\").transform(input_df)\n
output_df: | column | trimmed_column | |-----------|----------------| | \" hello \" | \" hello\" |
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.columns","title":"columns class-attribute
instance-attribute
","text":"columns: ListOfColumns = Field(default='*', alias='column', description='The column (or list of columns) to trim. Alias: column. If no columns are provided, all stringcolumns will be trimmed.')\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.direction","title":"direction class-attribute
instance-attribute
","text":"direction: trim_type = Field(default='left-right', description=\"On which side to remove the spaces. Either 'left', 'right' or 'left-right'\")\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.ColumnConfig","title":"ColumnConfig","text":"Limit data types to string only.
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [STRING]\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [STRING]\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/trim.py
def func(self, column: Column):\n if self.direction == \"left\":\n return f.ltrim(column)\n\n if self.direction == \"right\":\n return f.rtrim(column)\n\n # both (left-right)\n return f.rtrim(f.ltrim(column))\n
"},{"location":"api_reference/spark/writers/index.html","title":"Writers","text":"The Writer class is used to write the DataFrame to a target.
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode","title":"koheesio.spark.writers.BatchOutputMode","text":"For Batch:
- append: Append the contents of the DataFrame to the output table, default option in Koheesio.
- overwrite: overwrite the existing data.
- ignore: ignore the operation (i.e. no-op).
- error or errorifexists: throw an exception at runtime.
- merge: update matching data in the table and insert rows that do not exist.
- merge_all: update matching data in the table and insert rows that do not exist.
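Example (a minimal sketch, assuming BatchOutputMode behaves like a standard string-valued Enum, as the class attributes below suggest): resolving a batch output mode from a plain string, e.g. one read from configuration.
from koheesio.spark.writers import BatchOutputMode\n\n# look the mode up by its string value (assumption: regular Enum lookup by value)\nmode = BatchOutputMode(\"overwrite\")\nassert mode == BatchOutputMode.OVERWRITE\n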
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.APPEND","title":"APPEND class-attribute
instance-attribute
","text":"APPEND = 'append'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.ERROR","title":"ERROR class-attribute
instance-attribute
","text":"ERROR = 'error'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.ERRORIFEXISTS","title":"ERRORIFEXISTS class-attribute
instance-attribute
","text":"ERRORIFEXISTS = 'error'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.IGNORE","title":"IGNORE class-attribute
instance-attribute
","text":"IGNORE = 'ignore'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.MERGE","title":"MERGE class-attribute
instance-attribute
","text":"MERGE = 'merge'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.MERGEALL","title":"MERGEALL class-attribute
instance-attribute
","text":"MERGEALL = 'merge_all'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.MERGE_ALL","title":"MERGE_ALL class-attribute
instance-attribute
","text":"MERGE_ALL = 'merge_all'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.OVERWRITE","title":"OVERWRITE class-attribute
instance-attribute
","text":"OVERWRITE = 'overwrite'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.StreamingOutputMode","title":"koheesio.spark.writers.StreamingOutputMode","text":"For Streaming:
- append: only the new rows in the streaming DataFrame will be written to the sink.
- complete: all the rows in the streaming DataFrame/Dataset will be written to the sink every time there are some updates.
- update: only the rows that were updated in the streaming DataFrame/Dataset will be written to the sink every time there are some updates. If the query doesn't contain aggregations, it will be equivalent to append mode.
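Example (a minimal sketch; the DeltaTableStreamWriter arguments such as the table name and checkpoint location are illustrative assumptions): selecting a streaming output mode and passing it to a stream writer through the outputMode alias.
from koheesio.spark.writers import StreamingOutputMode\nfrom koheesio.spark.writers.delta import DeltaTableStreamWriter\n\n# outputMode is the alias of the StreamWriter.output_mode field (see StreamWriter further below)\nwriter = DeltaTableStreamWriter(\n    table=\"my_table\",                      # illustrative table name\n    checkpointLocation=\"/tmp/checkpoint\",  # illustrative checkpoint location\n    outputMode=StreamingOutputMode.COMPLETE,\n)\n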
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.StreamingOutputMode.APPEND","title":"APPEND class-attribute
instance-attribute
","text":"APPEND = 'append'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.StreamingOutputMode.COMPLETE","title":"COMPLETE class-attribute
instance-attribute
","text":"COMPLETE = 'complete'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.StreamingOutputMode.UPDATE","title":"UPDATE class-attribute
instance-attribute
","text":"UPDATE = 'update'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer","title":"koheesio.spark.writers.Writer","text":"The Writer class is used to write the DataFrame to a target.
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer.df","title":"df class-attribute
instance-attribute
","text":"df: Optional[DataFrame] = Field(default=None, description='The Spark DataFrame')\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer.format","title":"format class-attribute
instance-attribute
","text":"format: str = Field(default='delta', description='The format of the output')\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer.streaming","title":"streaming property
","text":"streaming: bool\n
Check if the DataFrame is a streaming DataFrame or not.
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer.execute","title":"execute abstractmethod
","text":"execute()\n
Execute on a Writer should handle writing of the self.df (input) as a minimum
Source code in src/koheesio/spark/writers/__init__.py
@abstractmethod\ndef execute(self):\n \"\"\"Execute on a Writer should handle writing of the self.df (input) as a minimum\"\"\"\n # self.df # input dataframe\n ...\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer.write","title":"write","text":"write(df: Optional[DataFrame] = None) -> Output\n
Write the DataFrame to the output using execute() and return the output.
If no DataFrame is passed, the self.df will be used. If no self.df is set, a RuntimeError will be thrown.
Source code in src/koheesio/spark/writers/__init__.py
def write(self, df: Optional[DataFrame] = None) -> SparkStep.Output:\n \"\"\"Write the DataFrame to the output using execute() and return the output.\n\n If no DataFrame is passed, the self.df will be used.\n If no self.df is set, a RuntimeError will be thrown.\n \"\"\"\n self.df = df or self.df\n if not self.df:\n raise RuntimeError(\"No valid Dataframe was passed\")\n self.execute()\n return self.output\n
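Example (a minimal sketch; ConsoleWriter is a hypothetical subclass and df is assumed to be an existing Spark DataFrame): the Writer contract in its smallest form - implement execute(), then call write().
from koheesio.spark.writers import Writer\n\nclass ConsoleWriter(Writer):\n    \"\"\"Hypothetical writer that only shows the DataFrame instead of persisting it.\"\"\"\n\n    def execute(self):\n        # write() guarantees self.df is set before execute() runs\n        self.df.show()\n\nConsoleWriter().write(df)  # raises RuntimeError when no DataFrame is available\n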
"},{"location":"api_reference/spark/writers/buffer.html","title":"Buffer","text":"This module contains classes for writing data to a buffer before writing to the final destination.
The BufferWriter
class is a base class for writers that write to a buffer first. It provides methods for writing, reading, and resetting the buffer, as well as checking if the buffer is compressed and compressing the buffer.
The PandasCsvBufferWriter
class is a subclass of BufferWriter
that writes a Spark DataFrame to CSV file(s) using Pandas. It is not meant to be used for writing huge amounts of data, but rather for writing smaller amounts of data to more arbitrary file systems (e.g., SFTP).
The PandasJsonBufferWriter
class is a subclass of BufferWriter
that writes a Spark DataFrame to JSON file(s) using Pandas. It is not meant to be used for writing huge amounts of data, but rather for writing smaller amounts of data to more arbitrary file systems (e.g., SFTP).
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter","title":"koheesio.spark.writers.buffer.BufferWriter","text":"Base class for writers that write to a buffer first, before writing to the final destination.
execute()
method should implement how the incoming DataFrame is written to the buffer object (e.g. BytesIO) in the output.
The default implementation uses a SpooledTemporaryFile
as the buffer. This is a file-like object that starts off stored in memory and automatically rolls over to a temporary file on disk if it exceeds a certain size. A SpooledTemporaryFile
behaves similar to BytesIO
, but with the added benefit of being able to handle larger amounts of data.
This approach provides a balance between speed and memory usage, allowing for fast in-memory operations for smaller amounts of data while still being able to handle larger amounts of data that would not otherwise fit in memory.
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output","title":"Output","text":"Output class for BufferWriter
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.buffer","title":"buffer class-attribute
instance-attribute
","text":"buffer: InstanceOf[SpooledTemporaryFile] = Field(default_factory=partial(SpooledTemporaryFile, mode='w+b', max_size=0), exclude=True)\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.compress","title":"compress","text":"compress()\n
Compress the file_buffer in place using GZIP
Source code in src/koheesio/spark/writers/buffer.py
def compress(self):\n \"\"\"Compress the file_buffer in place using GZIP\"\"\"\n # check if the buffer is already compressed\n if self.is_compressed():\n self.logger.warn(\"Buffer is already compressed. Nothing to compress...\")\n return self\n\n # compress the file_buffer\n file_buffer = self.buffer\n compressed = gzip.compress(file_buffer.read())\n\n # write the compressed content back to the buffer\n self.reset_buffer()\n self.buffer.write(compressed)\n\n return self # to allow for chaining\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.is_compressed","title":"is_compressed","text":"is_compressed()\n
Check if the buffer is compressed.
Source code in src/koheesio/spark/writers/buffer.py
def is_compressed(self):\n \"\"\"Check if the buffer is compressed.\"\"\"\n self.rewind_buffer()\n magic_number_present = self.buffer.read(2) == b\"\\x1f\\x8b\"\n self.rewind_buffer()\n return magic_number_present\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.read","title":"read","text":"read()\n
Read the buffer
Source code in src/koheesio/spark/writers/buffer.py
def read(self):\n \"\"\"Read the buffer\"\"\"\n self.rewind_buffer()\n data = self.buffer.read()\n self.rewind_buffer()\n return data\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.reset_buffer","title":"reset_buffer","text":"reset_buffer()\n
Reset the buffer
Source code in src/koheesio/spark/writers/buffer.py
def reset_buffer(self):\n \"\"\"Reset the buffer\"\"\"\n self.buffer.truncate(0)\n self.rewind_buffer()\n return self\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.rewind_buffer","title":"rewind_buffer","text":"rewind_buffer()\n
Rewind the buffer
Source code in src/koheesio/spark/writers/buffer.py
def rewind_buffer(self):\n \"\"\"Rewind the buffer\"\"\"\n self.buffer.seek(0)\n return self\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.write","title":"write","text":"write(df=None) -> Output\n
Write the DataFrame to the buffer
Source code in src/koheesio/spark/writers/buffer.py
def write(self, df=None) -> Output:\n \"\"\"Write the DataFrame to the buffer\"\"\"\n self.df = df or self.df\n if not self.df:\n raise RuntimeError(\"No valid Dataframe was passed\")\n self.output.reset_buffer()\n self.execute()\n return self.output\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter","title":"koheesio.spark.writers.buffer.PandasCsvBufferWriter","text":"Write a Spark DataFrame to CSV file(s) using Pandas.
Takes inspiration from https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
See also: https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option
Note This class is not meant to be used for writing huge amounts of data. It is meant to be used for writing smaller amounts of data to more arbitrary file systems (e.g. SFTP).
Pyspark vs Pandas The following table shows the mapping between Pyspark, Pandas, and Koheesio properties. Note that the default values are mostly the same as Pyspark's DataFrameWriter
implementation, with some exceptions (see below).
This class implements the most commonly used properties. If a property is not explicitly implemented, it can be accessed through params
.
PySpark Property Default PySpark Pandas Property Default Pandas Koheesio Property Default Koheesio Notes maxRecordsPerFile ... chunksize None max_records_per_file ... Spark property name: spark.sql.files.maxRecordsPerFile sep , sep , sep , lineSep \\n
line_terminator os.linesep lineSep (alias=line_terminator) \\n N/A ... index True index False Determines whether row labels (index) are included in the output header False header True header True quote \" quotechar \" quote (alias=quotechar) \" quoteAll False doublequote True quoteAll (alias=doublequote) False escape \\
escapechar None escapechar (alias=escape) \\ escapeQuotes True N/A N/A N/A ... Not available in Pandas ignoreLeadingWhiteSpace True N/A N/A N/A ... Not available in Pandas ignoreTrailingWhiteSpace True N/A N/A N/A ... Not available in Pandas charToEscapeQuoteEscaping escape or \u0000
N/A N/A N/A ... Not available in Pandas dateFormat yyyy-MM-dd
N/A N/A N/A ... Pandas implements Timestamp, not Date timestampFormat yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]
date_format N/A timestampFormat (alias=date_format) yyyy-MM-dd'T'HHss.SSS Follows PySpark defaults timestampNTZFormat yyyy-MM-dd'T'HH:mm:ss[.SSS]
N/A N/A N/A ... Pandas implements Timestamp, see above compression None compression infer compression None encoding utf-8 encoding utf-8 N/A ... Not explicitly implemented nullValue \"\" na_rep \"\" N/A \"\" Not explicitly implemented emptyValue \"\" na_rep \"\" N/A \"\" Not explicitly implemented N/A ... float_format N/A N/A ... Not explicitly implemented N/A ... decimal N/A N/A ... Not explicitly implemented N/A ... index_label None N/A ... Not explicitly implemented N/A ... columns N/A N/A ... Not explicitly implemented N/A ... mode N/A N/A ... Not explicitly implemented N/A ... quoting N/A N/A ... Not explicitly implemented N/A ... errors N/A N/A ... Not explicitly implemented N/A ... storage_options N/A N/A ... Not explicitly implemented differences with Pyspark: - dateFormat -> Pandas implements Timestamp, not just Date. Hence, Koheesio sets the default to the python equivalent of PySpark's default.
- compression -> Spark does not compress by default, hence Koheesio does not compress by default. Compression can be provided though.
Parameters:
Name Type Description Default header
bool
Whether to write the names of columns as the first line. In Pandas a list of strings can be given assumed to be aliases for the column names - this is not supported in this class. Instead, the column names as used in the dataframe are used as the header. Koheesio sets this default to True to match Pandas' default.
True
sep
str
Field delimiter for the output file. Default is ','.
,
quote
str
String of length 1. Character used to quote fields. In PySpark, this is called 'quote', in Pandas this is called 'quotechar'. Default is '\"'.
\"
quoteAll
bool
A flag indicating whether all values should always be enclosed in quotes in a field. Koheesio sets the default (False) to only escape values containing a quote character - this is Pyspark's default behavior. In Pandas, this is called 'doublequote'. Default is False.
False
escape
str
String of length 1. Character used to escape sep and quotechar when appropriate. Koheesio sets this default to \\
to match Pyspark's default behavior. In Pandas, this field is called 'escapechar', and defaults to None. Default is '\\'.
\\
timestampFormat
str
Sets the string that indicates a date format for datetime objects. Koheesio sets this default to a close equivalent of Pyspark's default (excluding timezone information). In Pandas, this field is called 'date_format'. Note: Pandas does not support Java Timestamps, only Python Timestamps. The Pyspark default is yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]
which mimics the iso8601 format (datetime.isoformat()
). Default is '%Y-%m-%dT%H:%M:%S.%f'.
yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]
lineSep
str, optional, default=
String of length 1. Defines the character used as line separator that should be used for writing. Default is os.linesep.
required compression
Optional[Literal['infer', 'gzip', 'bz2', 'zip', 'xz', 'zstd', 'tar']]
A string representing the compression to use for on-the-fly compression of the output data. Note: Pandas sets this default to 'infer', Koheesio sets this default to 'None' leaving the data uncompressed by default. Can be set to one of 'infer', 'gzip', 'bz2', 'zip', 'xz', 'zstd', or 'tar'. See Pandas documentation for more details.
None
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.compression","title":"compression class-attribute
instance-attribute
","text":"compression: Optional[CompressionOptions] = Field(default=None, description=\"A string representing the compression to use for on-the-fly compression of the output data.Note: Pandas sets this default to 'infer', Koheesio sets this default to 'None' leaving the data uncompressed by default. Can be set to one of 'infer', 'gzip', 'bz2', 'zip', 'xz', 'zstd', or 'tar'. See Pandas documentation for more details.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.escape","title":"escape class-attribute
instance-attribute
","text":"escape: constr(max_length=1) = Field(default='\\\\', description=\"String of length 1. Character used to escape sep and quotechar when appropriate. Koheesio sets this default to `\\\\` to match Pyspark's default behavior. In Pandas, this is called 'escapechar', and defaults to None.\", alias='escapechar')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.header","title":"header class-attribute
instance-attribute
","text":"header: bool = Field(default=True, description=\"Whether to write the names of columns as the first line. In Pandas a list of strings can be given assumed to be aliases for the column names - this is not supported in this class. Instead, the column names as used in the dataframe are used as the header. Koheesio sets this default to True to match Pandas' default.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.index","title":"index class-attribute
instance-attribute
","text":"index: bool = Field(default=False, description='Toggles whether to write row names (index). Default False in Koheesio - pandas default is True.')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.lineSep","title":"lineSep class-attribute
instance-attribute
","text":"lineSep: Optional[constr(max_length=1)] = Field(default=linesep, description='String of length 1. Defines the character used as line separator that should be used for writing.', alias='line_terminator')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.quote","title":"quote class-attribute
instance-attribute
","text":"quote: constr(max_length=1) = Field(default='\"', description=\"String of length 1. Character used to quote fields. In PySpark, this is called 'quote', in Pandas this is called 'quotechar'.\", alias='quotechar')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.quoteAll","title":"quoteAll class-attribute
instance-attribute
","text":"quoteAll: bool = Field(default=False, description=\"A flag indicating whether all values should always be enclosed in quotes in a field. Koheesio set the default (False) to only escape values containing a quote character - this is Pyspark's default behavior. In Pandas, this is called 'doublequote'.\", alias='doublequote')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.sep","title":"sep class-attribute
instance-attribute
","text":"sep: constr(max_length=1) = Field(default=',', description='Field delimiter for the output file')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.timestampFormat","title":"timestampFormat class-attribute
instance-attribute
","text":"timestampFormat: str = Field(default='%Y-%m-%dT%H:%M:%S.%f', description=\"Sets the string that indicates a date format for datetime objects. Koheesio sets this default to a close equivalent of Pyspark's default (excluding timezone information). In Pandas, this field is called 'date_format'. Note: Pandas does not support Java Timestamps, only Python Timestamps. The Pyspark default is `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]` which mimics the iso8601 format (`datetime.isoformat()`).\", alias='date_format')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.Output","title":"Output","text":"Output class for PandasCsvBufferWriter
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.Output.pandas_df","title":"pandas_df class-attribute
instance-attribute
","text":"pandas_df: Optional[DataFrame] = Field(None, description='The Pandas DataFrame that was written')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.execute","title":"execute","text":"execute()\n
Write the DataFrame to the buffer using Pandas to_csv() method. Compression is handled by pandas to_csv() method.
Source code in src/koheesio/spark/writers/buffer.py
def execute(self):\n \"\"\"Write the DataFrame to the buffer using Pandas to_csv() method.\n Compression is handled by pandas to_csv() method.\n \"\"\"\n # convert the Spark DataFrame to a Pandas DataFrame\n self.output.pandas_df = self.df.toPandas()\n\n # create csv file in memory\n file_buffer = self.output.buffer\n self.output.pandas_df.to_csv(file_buffer, **self.get_options(options_type=\"spark\"))\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.get_options","title":"get_options","text":"get_options(options_type: str = 'csv')\n
Returns the options to pass to Pandas' to_csv() method.
Source code in src/koheesio/spark/writers/buffer.py
def get_options(self, options_type: str = \"csv\"):\n \"\"\"Returns the options to pass to Pandas' to_csv() method.\"\"\"\n try:\n import pandas as _pd\n\n # Get the pandas version as a tuple of integers\n pandas_version = tuple(int(i) for i in _pd.__version__.split(\".\"))\n except ImportError:\n raise ImportError(\"Pandas is required to use this writer\")\n\n # Use line_separator for pandas 2.0.0 and later\n line_sep_option_naming = \"line_separator\" if pandas_version >= (2, 0, 0) else \"line_terminator\"\n\n csv_options = {\n \"header\": self.header,\n \"sep\": self.sep,\n \"quotechar\": self.quote,\n \"doublequote\": self.quoteAll,\n \"escapechar\": self.escape,\n \"na_rep\": self.emptyValue or self.nullValue,\n line_sep_option_naming: self.lineSep,\n \"index\": self.index,\n \"date_format\": self.timestampFormat,\n \"compression\": self.compression,\n **self.params,\n }\n\n if options_type == \"spark\":\n csv_options[\"lineterminator\"] = csv_options.pop(line_sep_option_naming)\n elif options_type == \"kohesio_pandas_buffer_writer\":\n csv_options[\"line_terminator\"] = csv_options.pop(line_sep_option_naming)\n\n return csv_options\n
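Example (a minimal sketch; df is assumed to be an existing Spark DataFrame and the option values are illustrative): writing a small DataFrame to an in-memory CSV buffer.
from koheesio.spark.writers.buffer import PandasCsvBufferWriter\n\nwriter = PandasCsvBufferWriter(\n    df=df,\n    sep=\";\",          # illustrative: semicolon-separated output\n    compression=None,  # keep the output uncompressed (the Koheesio default)\n)\noutput = writer.write()\ncsv_content = output.read()  # the CSV content held by the buffer\n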
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter","title":"koheesio.spark.writers.buffer.PandasJsonBufferWriter","text":"Write a Spark DataFrame to JSON file(s) using Pandas.
Takes inspiration from https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html
Note This class is not meant to be used for writing huge amounts of data. It is meant to be used for writing smaller amounts of data to more arbitrary file systems (e.g. SFTP).
Parameters:
Name Type Description Default orient
Format of the resulting JSON string. Default is 'records'.
required lines
Format output as one JSON object per line. Only used when orient='records'. Default is True. - If true, the output will be formatted as one JSON object per line. - If false, the output will be written as a single JSON object. Note: this value is only used when orient='records' and will be ignored otherwise.
required date_format
Type of date conversion. Default is 'iso'. See Date and Timestamp Formats
for a detailed description and more information.
required double_precision
Number of decimal places for encoding floating point values. Default is 10.
required force_ascii
Force encoded string to be ASCII. Default is True.
required compression
A string representing the compression to use for on-the-fly compression of the output data. Koheesio sets this default to 'None', leaving the data uncompressed. Can be set to 'gzip' optionally. Other compression options are currently not supported by Koheesio for JSON output.
required"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.columns","title":"columns class-attribute
instance-attribute
","text":"columns: Optional[list[str]] = Field(default=None, description='The columns to write. If None, all columns will be written.')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.compression","title":"compression class-attribute
instance-attribute
","text":"compression: Optional[Literal['gzip']] = Field(default=None, description=\"A string representing the compression to use for on-the-fly compression of the output data.Koheesio sets this default to 'None' leaving the data uncompressed by default. Can be set to 'gzip' optionally.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.date_format","title":"date_format class-attribute
instance-attribute
","text":"date_format: Literal['iso', 'epoch'] = Field(default='iso', description=\"Type of date conversion. Default is 'iso'.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.double_precision","title":"double_precision class-attribute
instance-attribute
","text":"double_precision: int = Field(default=10, description='Number of decimal places for encoding floating point values. Default is 10.')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.force_ascii","title":"force_ascii class-attribute
instance-attribute
","text":"force_ascii: bool = Field(default=True, description='Force encoded string to be ASCII. Default is True.')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.lines","title":"lines class-attribute
instance-attribute
","text":"lines: bool = Field(default=True, description=\"Format output as one JSON object per line. Only used when orient='records'. Default is True.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.orient","title":"orient class-attribute
instance-attribute
","text":"orient: Literal['split', 'records', 'index', 'columns', 'values', 'table'] = Field(default='records', description=\"Format of the resulting JSON string. Default is 'records'.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.Output","title":"Output","text":"Output class for PandasJsonBufferWriter
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.Output.pandas_df","title":"pandas_df class-attribute
instance-attribute
","text":"pandas_df: Optional[DataFrame] = Field(None, description='The Pandas DataFrame that was written')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.execute","title":"execute","text":"execute()\n
Write the DataFrame to the buffer using Pandas to_json() method.
Source code in src/koheesio/spark/writers/buffer.py
def execute(self):\n \"\"\"Write the DataFrame to the buffer using Pandas to_json() method.\"\"\"\n df = self.df\n if self.columns:\n df = df[self.columns]\n\n # convert the Spark DataFrame to a Pandas DataFrame\n self.output.pandas_df = df.toPandas()\n\n # create json file in memory\n file_buffer = self.output.buffer\n self.output.pandas_df.to_json(file_buffer, **self.get_options())\n\n # compress the buffer if compression is set\n if self.compression:\n self.output.compress()\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.get_options","title":"get_options","text":"get_options()\n
Returns the options to pass to Pandas' to_json() method.
Source code in src/koheesio/spark/writers/buffer.py
def get_options(self):\n \"\"\"Returns the options to pass to Pandas' to_json() method.\"\"\"\n json_options = {\n \"orient\": self.orient,\n \"date_format\": self.date_format,\n \"double_precision\": self.double_precision,\n \"force_ascii\": self.force_ascii,\n \"lines\": self.lines,\n **self.params,\n }\n\n # ignore the 'lines' parameter if orient is not 'records'\n if self.orient != \"records\":\n del json_options[\"lines\"]\n\n return json_options\n
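Example (a minimal sketch; df is assumed to be an existing Spark DataFrame): writing a small DataFrame as newline-delimited JSON into the buffer.
from koheesio.spark.writers.buffer import PandasJsonBufferWriter\n\nwriter = PandasJsonBufferWriter(\n    df=df,\n    orient=\"records\",  # one JSON object per row\n    lines=True,        # newline-delimited output; only honoured with orient='records'\n)\noutput = writer.write()\njson_content = output.read()  # the JSON content held by the buffer\n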
"},{"location":"api_reference/spark/writers/dummy.html","title":"Dummy","text":"Module for the DummyWriter class.
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter","title":"koheesio.spark.writers.dummy.DummyWriter","text":"A simple DummyWriter that performs the equivalent of a df.show() on the given DataFrame and returns the first row of data as a dict.
This Writer does not actually write anything to a source/destination, but is useful for debugging or testing purposes.
Parameters:
Name Type Description Default n
PositiveInt
Number of rows to show.
20
truncate
bool | PositiveInt
If set to True
, truncate strings longer than 20 chars by default. If set to a number greater than one, truncates long strings to length truncate
and align cells right.
True
vertical
bool
If set to True
, print output rows vertically (one line per column value).
False
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.n","title":"n class-attribute
instance-attribute
","text":"n: PositiveInt = Field(default=20, description='Number of rows to show.', gt=0)\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.truncate","title":"truncate class-attribute
instance-attribute
","text":"truncate: Union[bool, PositiveInt] = Field(default=True, description='If set to ``True``, truncate strings longer than 20 chars by default.If set to a number greater than one, truncates long strings to length ``truncate`` and align cells right.')\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.vertical","title":"vertical class-attribute
instance-attribute
","text":"vertical: bool = Field(default=False, description='If set to ``True``, print output rows vertically (one line per column value).')\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.Output","title":"Output","text":"DummyWriter output
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.Output.df_content","title":"df_content class-attribute
instance-attribute
","text":"df_content: str = Field(default=..., description='The content of the DataFrame as a string')\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.Output.head","title":"head class-attribute
instance-attribute
","text":"head: Dict[str, Any] = Field(default=..., description='The first row of the DataFrame as a dict')\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.execute","title":"execute","text":"execute() -> Output\n
Execute the DummyWriter
Source code in src/koheesio/spark/writers/dummy.py
def execute(self) -> Output:\n \"\"\"Execute the DummyWriter\"\"\"\n df: DataFrame = self.df\n\n # noinspection PyProtectedMember\n df_content = df._jdf.showString(self.n, self.truncate, self.vertical)\n\n # logs the equivalent of doing df.show()\n self.log.info(f\"content of df that was passed to DummyWriter:\\n{df_content}\")\n\n self.output.head = self.df.head().asDict()\n self.output.df_content = df_content\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.int_truncate","title":"int_truncate","text":"int_truncate(truncate_value) -> int\n
Truncate is either a bool or an int.
Parameters: truncate_value : int | bool, optional, default=True If int, specifies the maximum length of the string. If bool and True, defaults to a maximum length of 20 characters.
Returns: int The maximum length of the string.
Source code in src/koheesio/spark/writers/dummy.py
@field_validator(\"truncate\")\ndef int_truncate(cls, truncate_value) -> int:\n \"\"\"\n Truncate is either a bool or an int.\n\n Parameters:\n -----------\n truncate_value : int | bool, optional, default=True\n If int, specifies the maximum length of the string.\n If bool and True, defaults to a maximum length of 20 characters.\n\n Returns:\n --------\n int\n The maximum length of the string.\n\n \"\"\"\n # Same logic as what is inside DataFrame.show()\n if isinstance(truncate_value, bool) and truncate_value is True:\n return 20 # default is 20 chars\n return int(truncate_value) # otherwise 0, or whatever the user specified\n
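Example (a minimal sketch; df is assumed to be an existing Spark DataFrame): using DummyWriter to inspect a DataFrame while debugging a pipeline.
from koheesio.spark.writers.dummy import DummyWriter\n\noutput = DummyWriter(n=5, truncate=False).write(df)\nprint(output.head)        # first row of df as a dict\nprint(output.df_content)  # the df.show()-style string that was logged\n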
"},{"location":"api_reference/spark/writers/kafka.html","title":"Kafka","text":"Kafka writer to write batch or streaming data into kafka topics
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter","title":"koheesio.spark.writers.kafka.KafkaWriter","text":"Kafka writer to write batch or streaming data into kafka topics
All kafka specific options can be provided as additional init params
Parameters:
Name Type Description Default broker
str
broker url of the kafka cluster
required topic
str
full topic name to write the data to
required trigger
Optional[Union[Trigger, str, Dict]]
Indicates optionally how to stream the data into kafka, continuous or batch
required checkpoint_location
str
In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs.
required Example KafkaWriter(\n write_broker=\"broker.com:9500\",\n topic=\"test-topic\",\n trigger=Trigger(continuous=True)\n includeHeaders: \"true\",\n key.serializer: \"org.apache.kafka.common.serialization.StringSerializer\",\n value.serializer: \"org.apache.kafka.common.serialization.StringSerializer\",\n kafka.group.id: \"test-group\",\n checkpoint_location: \"s3://bucket/test-topic\"\n)\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.batch_writer","title":"batch_writer property
","text":"batch_writer: DataFrameWriter\n
returns a batch writer
Returns:
Type Description DataFrameWriter
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.broker","title":"broker class-attribute
instance-attribute
","text":"broker: str = Field(default=..., description='Kafka brokers to write to')\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.checkpoint_location","title":"checkpoint_location class-attribute
instance-attribute
","text":"checkpoint_location: str = Field(default=..., alias='checkpointLocation', description='In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs. You can configure a query with a checkpoint location, and the query will save all the progress information and the running aggregates to the checkpoint location. This checkpoint location has to be a path in an HDFS compatible file system that is accessible by Spark (e.g. Databricks Unity Catalog External Location)')\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.format","title":"format class-attribute
instance-attribute
","text":"format: str = 'kafka'\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.logged_option_keys","title":"logged_option_keys property
","text":"logged_option_keys\n
keys to be logged
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.options","title":"options property
","text":"options\n
retrieve the kafka options incl topic and broker.
Returns:
Type Description dict
Dict being the combination of kafka options + topic + broker
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.stream_writer","title":"stream_writer property
","text":"stream_writer: DataStreamWriter\n
returns a stream writer
Returns:
Type Description DataStreamWriter
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.streaming_query","title":"streaming_query property
","text":"streaming_query: Optional[Union[str, StreamingQuery]]\n
return the streaming query
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.topic","title":"topic class-attribute
instance-attribute
","text":"topic: str = Field(default=..., description='Kafka topic to write to')\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.trigger","title":"trigger class-attribute
instance-attribute
","text":"trigger: Optional[Union[Trigger, str, Dict]] = Field(Trigger(available_now=True), description='Set the trigger for the stream query. If not set data is processed in batch')\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.writer","title":"writer property
","text":"writer: Union[DataStreamWriter, DataFrameWriter]\n
function to get the writer of proper type according to whether the data to written is a stream or not This function will also set the trigger property in case of a datastream.
Returns:
Type Description Union[DataStreamWriter, DataFrameWriter]
In case of streaming data -> DataStreamWriter, else -> DataFrameWriter
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.Output","title":"Output","text":"Output of the KafkaWriter
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.Output.streaming_query","title":"streaming_query class-attribute
instance-attribute
","text":"streaming_query: Optional[Union[str, StreamingQuery]] = Field(default=None, description='Query ID of the stream query')\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.execute","title":"execute","text":"execute()\n
Effectively write the data from the dataframe (streaming of batch) to kafka topic.
Returns:
Type Description Output
streaming_query function can be used to gain insights on running write.
Source code in src/koheesio/spark/writers/kafka.py
def execute(self):\n \"\"\"Effectively write the data from the dataframe (streaming of batch) to kafka topic.\n\n Returns\n -------\n KafkaWriter.Output\n streaming_query function can be used to gain insights on running write.\n \"\"\"\n applied_options = {k: v for k, v in self.options.items() if k in self.logged_option_keys}\n self.log.debug(f\"Applying options {applied_options}\")\n\n self._validate_dataframe()\n\n _writer = self.writer.format(self.format).options(**self.options)\n self.output.streaming_query = _writer.start() if self.streaming else _writer.save()\n
"},{"location":"api_reference/spark/writers/snowflake.html","title":"Snowflake","text":"This module contains the SnowflakeWriter class, which is used to write data to Snowflake.
"},{"location":"api_reference/spark/writers/stream.html","title":"Stream","text":"Module that holds some classes and functions to be able to write to a stream
Classes:
Name Description Trigger
class to set the trigger for a stream query
StreamWriter
abstract class for stream writers
ForEachBatchStreamWriter
class to run a writer for each batch
Functions:
Name Description writer_to_foreachbatch
function to be used as batch_function for StreamWriter (sub)classes
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.ForEachBatchStreamWriter","title":"koheesio.spark.writers.stream.ForEachBatchStreamWriter","text":"Runnable ForEachBatchWriter
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.ForEachBatchStreamWriter.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/writers/stream.py
def execute(self):\n self.streaming_query = self.writer.start()\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter","title":"koheesio.spark.writers.stream.StreamWriter","text":"ABC Stream Writer
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.batch_function","title":"batch_function class-attribute
instance-attribute
","text":"batch_function: Optional[Callable] = Field(default=None, description='allows you to run custom batch functions for each micro batch', alias='batch_function_for_each_df')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.checkpoint_location","title":"checkpoint_location class-attribute
instance-attribute
","text":"checkpoint_location: str = Field(default=..., alias='checkpointLocation', description='In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs. You can configure a query with a checkpoint location, and the query will save all the progress information and the running aggregates to the checkpoint location. This checkpoint location has to be a path in an HDFS compatible file system that is accessible by Spark (e.g. Databricks Unity Catalog External Location)')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.output_mode","title":"output_mode class-attribute
instance-attribute
","text":"output_mode: StreamingOutputMode = Field(default=APPEND, alias='outputMode', description=__doc__)\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.stream_writer","title":"stream_writer property
","text":"stream_writer: DataStreamWriter\n
Returns the stream writer for the given DataFrame and settings
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.streaming_query","title":"streaming_query class-attribute
instance-attribute
","text":"streaming_query: Optional[Union[str, StreamingQuery]] = Field(default=None, description='Query ID of the stream query')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.trigger","title":"trigger class-attribute
instance-attribute
","text":"trigger: Optional[Union[Trigger, str, Dict]] = Field(default=Trigger(available_now=True), description='Set the trigger for the stream query. If this is not set it process data as batch')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.writer","title":"writer property
","text":"writer\n
Returns the stream writer since we don't have a batch mode for streams
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.await_termination","title":"await_termination","text":"await_termination(timeout: Optional[int] = None)\n
Await termination of the stream query
Source code in src/koheesio/spark/writers/stream.py
def await_termination(self, timeout: Optional[int] = None):\n \"\"\"Await termination of the stream query\"\"\"\n self.streaming_query.awaitTermination(timeout=timeout)\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.execute","title":"execute abstractmethod
","text":"execute()\n
Source code in src/koheesio/spark/writers/stream.py
@abstractmethod\ndef execute(self):\n raise NotImplementedError\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger","title":"koheesio.spark.writers.stream.Trigger","text":"Trigger types for a stream query.
Only one trigger can be set!
Example - processingTime='5 seconds'
- continuous='5 seconds'
- availableNow=True
- once=True
See Also - https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.available_now","title":"available_now class-attribute
instance-attribute
","text":"available_now: Optional[bool] = Field(default=None, alias='availableNow', description='if set to True, set a trigger that processes all available data in multiple batches then terminates the query.')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.continuous","title":"continuous class-attribute
instance-attribute
","text":"continuous: Optional[str] = Field(default=None, description=\"a time interval as a string, e.g. '5 seconds', '1 minute'.Set a trigger that runs a continuous query with a given checkpoint interval.\")\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(validate_default=False, extra='forbid')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.once","title":"once class-attribute
instance-attribute
","text":"once: Optional[bool] = Field(default=None, deprecated=True, description='if set to True, set a trigger that processes only one batch of data in a streaming query then terminates the query. use `available_now` instead of `once`.')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.processing_time","title":"processing_time class-attribute
instance-attribute
","text":"processing_time: Optional[str] = Field(default=None, alias='processingTime', description=\"a processing time interval as a string, e.g. '5 seconds', '1 minute'.Set a trigger that runs a microbatch query periodically based on the processing time.\")\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.triggers","title":"triggers property
","text":"triggers\n
Returns a list of tuples with the value for each trigger
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.value","title":"value property
","text":"value: Dict[str, str]\n
Returns the trigger value as a dictionary
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.execute","title":"execute","text":"execute()\n
Returns the trigger value as a dictionary This method can be skipped, as the value can be accessed directly from the value
property
Source code in src/koheesio/spark/writers/stream.py
def execute(self):\n \"\"\"Returns the trigger value as a dictionary\n This method can be skipped, as the value can be accessed directly from the `value` property\n \"\"\"\n self.log.warning(\"Trigger.execute is deprecated. Use Trigger.value directly instead\")\n self.output.value = self.value\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.from_any","title":"from_any classmethod
","text":"from_any(value)\n
Dynamically creates a Trigger class based on either another Trigger class, a passed string value, or a dictionary
This way Trigger.from_any can be used as part of a validator, without needing to worry about supported types
Source code in src/koheesio/spark/writers/stream.py
@classmethod\ndef from_any(cls, value):\n \"\"\"Dynamically creates a Trigger class based on either another Trigger class, a passed string value, or a\n dictionary\n\n This way Trigger.from_any can be used as part of a validator, without needing to worry about supported types\n \"\"\"\n if isinstance(value, Trigger):\n return value\n\n if isinstance(value, str):\n return cls.from_string(value)\n\n if isinstance(value, dict):\n return cls.from_dict(value)\n\n raise RuntimeError(f\"Unable to create Trigger based on the given value: {value}\")\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.from_dict","title":"from_dict classmethod
","text":"from_dict(_dict)\n
Creates a Trigger class based on a dictionary
Source code in src/koheesio/spark/writers/stream.py
@classmethod\ndef from_dict(cls, _dict):\n \"\"\"Creates a Trigger class based on a dictionary\"\"\"\n return cls(**_dict)\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.from_string","title":"from_string classmethod
","text":"from_string(trigger: str)\n
Creates a Trigger class based on a string
Example Source code in src/koheesio/spark/writers/stream.py
@classmethod\ndef from_string(cls, trigger: str):\n \"\"\"Creates a Trigger class based on a string\n\n Example\n -------\n ### happy flow\n\n * processingTime='5 seconds'\n * processing_time=\"5 hours\"\n * processingTime=4 minutes\n * once=True\n * once=true\n * available_now=true\n * continuous='3 hours'\n * once=TrUe\n * once=TRUE\n\n ### unhappy flow\n valid values, but should fail the validation check of the class\n\n * availableNow=False\n * continuous=True\n * once=false\n \"\"\"\n import re\n\n trigger_from_string = re.compile(r\"(?P<triggerType>\\w+)=[\\'\\\"]?(?P<value>.+)[\\'\\\"]?\")\n _match = trigger_from_string.match(trigger)\n\n if _match is None:\n raise ValueError(\n f\"Cannot parse value for Trigger: '{trigger}'. \\n\"\n f\"Valid types are {', '.join(cls._all_triggers_with_alias())}\"\n )\n\n trigger_type, value = _match.groups()\n\n # strip the value of any quotes\n value = value.strip(\"'\").strip('\"')\n\n # making value a boolean when given\n value = convert_str_to_bool(value)\n\n return cls.from_dict({trigger_type: value})\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.from_string--happy-flow","title":"happy flow","text":" - processingTime='5 seconds'
- processing_time=\"5 hours\"
- processingTime=4 minutes
- once=True
- once=true
- available_now=true
- continuous='3 hours'
- once=TrUe
- once=TRUE
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.from_string--unhappy-flow","title":"unhappy flow","text":"valid values, but should fail the validation check of the class
- availableNow=False
- continuous=True
- once=false
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.validate_available_now","title":"validate_available_now","text":"validate_available_now(available_now)\n
Validate the available_now trigger value
Source code in src/koheesio/spark/writers/stream.py
@field_validator(\"available_now\", mode=\"before\")\ndef validate_available_now(cls, available_now):\n \"\"\"Validate the available_now trigger value\"\"\"\n # making value a boolean when given\n available_now = convert_str_to_bool(available_now)\n\n # adapted from `pyspark.sql.streaming.readwriter.DataStreamWriter.trigger`\n if available_now is not True:\n raise ValueError(f\"Value for availableNow must be True. Got:{available_now}\")\n return available_now\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.validate_continuous","title":"validate_continuous","text":"validate_continuous(continuous)\n
Validate the continuous trigger value
Source code in src/koheesio/spark/writers/stream.py
@field_validator(\"continuous\", mode=\"before\")\ndef validate_continuous(cls, continuous):\n \"\"\"Validate the continuous trigger value\"\"\"\n # adapted from `pyspark.sql.streaming.readwriter.DataStreamWriter.trigger` except that the if statement is not\n # split in two parts\n if not isinstance(continuous, str):\n raise ValueError(f\"Value for continuous must be a string. Got: {continuous}\")\n\n if len(continuous.strip()) == 0:\n raise ValueError(f\"Value for continuous must be a non empty string. Got: {continuous}\")\n return continuous\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.validate_once","title":"validate_once","text":"validate_once(once)\n
Validate the once trigger value
Source code in src/koheesio/spark/writers/stream.py
@field_validator(\"once\", mode=\"before\")\ndef validate_once(cls, once):\n \"\"\"Validate the once trigger value\"\"\"\n # making value a boolean when given\n once = convert_str_to_bool(once)\n\n # adapted from `pyspark.sql.streaming.readwriter.DataStreamWriter.trigger`\n if once is not True:\n raise ValueError(f\"Value for once must be True. Got: {once}\")\n return once\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.validate_processing_time","title":"validate_processing_time","text":"validate_processing_time(processing_time)\n
Validate the processing time trigger value
Source code in src/koheesio/spark/writers/stream.py
@field_validator(\"processing_time\", mode=\"before\")\ndef validate_processing_time(cls, processing_time):\n \"\"\"Validate the processing time trigger value\"\"\"\n # adapted from `pyspark.sql.streaming.readwriter.DataStreamWriter.trigger`\n if not isinstance(processing_time, str):\n raise ValueError(f\"Value for processing_time must be a string. Got: {processing_time}\")\n\n if len(processing_time.strip()) == 0:\n raise ValueError(f\"Value for processingTime must be a non empty string. Got: {processing_time}\")\n return processing_time\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.validate_triggers","title":"validate_triggers","text":"validate_triggers(triggers: Dict)\n
Validate the trigger value
Source code in src/koheesio/spark/writers/stream.py
@model_validator(mode=\"before\")\ndef validate_triggers(cls, triggers: Dict):\n \"\"\"Validate the trigger value\"\"\"\n params = [*triggers.values()]\n\n # adapted from `pyspark.sql.streaming.readwriter.DataStreamWriter.trigger`; modified to work with pydantic v2\n if not triggers:\n raise ValueError(\"No trigger provided\")\n if len(params) > 1:\n raise ValueError(\"Multiple triggers not allowed.\")\n\n return triggers\n
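Taken together, the validators above enforce that exactly one trigger is set and that its value is well formed. A short illustrative sketch (constructor field names taken from the validators; the exact error types are an assumption):
```python
from koheesio.spark.writers.stream import Trigger

Trigger(processing_time="5 seconds")   # OK: exactly one trigger provided
Trigger(available_now="true")          # OK: the string is coerced to the boolean True

# Rejected by validate_triggers: "Multiple triggers not allowed."
# Trigger(once=True, continuous="1 hour")
```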
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.writer_to_foreachbatch","title":"koheesio.spark.writers.stream.writer_to_foreachbatch","text":"writer_to_foreachbatch(writer: Writer)\n
Call writer.execute
on each batch
To be passed as batch_function for StreamWriter (sub)classes.
Example Source code in src/koheesio/spark/writers/stream.py
def writer_to_foreachbatch(writer: Writer):\n \"\"\"Call `writer.execute` on each batch\n\n To be passed as batch_function for StreamWriter (sub)classes.\n\n Example\n -------\n ### Writing to a Delta table and a Snowflake table\n ```python\n DeltaTableStreamWriter(\n table=\"my_table\",\n checkpointLocation=\"my_checkpointlocation\",\n batch_function=writer_to_foreachbatch(\n SnowflakeWriter(\n **sfOptions,\n table=\"snowflake_table\",\n insert_type=SnowflakeWriter.InsertType.APPEND,\n )\n ),\n )\n ```\n \"\"\"\n\n def inner(df, batch_id: int):\n \"\"\"Inner method\n\n As per the Spark documentation:\n In every micro-batch, the provided function will be called in every micro-batch with (i) the output rows as a\n DataFrame and (ii) the batch identifier. The batchId can be used deduplicate and transactionally write the\n output (that is, the provided Dataset) to external systems. The output DataFrame is guaranteed to exactly\n same for the same batchId (assuming all operations are deterministic in the query).\n \"\"\"\n writer.log.debug(f\"Running batch function for batch {batch_id}\")\n writer.write(df)\n\n return inner\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.writer_to_foreachbatch--writing-to-a-delta-table-and-a-snowflake-table","title":"Writing to a Delta table and a Snowflake table","text":"DeltaTableStreamWriter(\n table=\"my_table\",\n checkpointLocation=\"my_checkpointlocation\",\n batch_function=writer_to_foreachbatch(\n SnowflakeWriter(\n **sfOptions,\n table=\"snowflake_table\",\n insert_type=SnowflakeWriter.InsertType.APPEND,\n )\n ),\n)\n
"},{"location":"api_reference/spark/writers/delta/index.html","title":"Delta","text":"This module is the entry point for the koheesio.spark.writers.delta package.
It imports and exposes the DeltaTableWriter and DeltaTableStreamWriter classes for external use.
Classes: DeltaTableWriter: Class to write data in batch mode to a Delta table. DeltaTableStreamWriter: Class to write data in streaming mode to a Delta table.
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode","title":"koheesio.spark.writers.delta.BatchOutputMode","text":"For Batch:
- append: Append the contents of the DataFrame to the output table, default option in Koheesio.
- overwrite: overwrite the existing data.
- ignore: ignore the operation (i.e. no-op).
- error or errorifexists: throw an exception at runtime.
- merge: update matching data in the table and insert rows that do not exist; the merge behavior can be customized via a merge_builder (a DeltaMergeBuilder or a list of merge clauses) in output_mode_params.
- merge_all: update matching data in the table and insert rows that do not exist, driven by the merge_cond, update_cond and insert_cond entries in output_mode_params.
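A brief sketch of selecting one of these modes when configuring a writer (the table name is a placeholder):
```python
from koheesio.spark.writers.delta import BatchOutputMode, DeltaTableWriter

# append is the Koheesio default; overwrite replaces the existing data
DeltaTableWriter(table="my_table", output_mode=BatchOutputMode.APPEND)
DeltaTableWriter(table="my_table", output_mode=BatchOutputMode.OVERWRITE)
```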
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.APPEND","title":"APPEND class-attribute
instance-attribute
","text":"APPEND = 'append'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.ERROR","title":"ERROR class-attribute
instance-attribute
","text":"ERROR = 'error'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.ERRORIFEXISTS","title":"ERRORIFEXISTS class-attribute
instance-attribute
","text":"ERRORIFEXISTS = 'error'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.IGNORE","title":"IGNORE class-attribute
instance-attribute
","text":"IGNORE = 'ignore'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.MERGE","title":"MERGE class-attribute
instance-attribute
","text":"MERGE = 'merge'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.MERGEALL","title":"MERGEALL class-attribute
instance-attribute
","text":"MERGEALL = 'merge_all'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.MERGE_ALL","title":"MERGE_ALL class-attribute
instance-attribute
","text":"MERGE_ALL = 'merge_all'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.OVERWRITE","title":"OVERWRITE class-attribute
instance-attribute
","text":"OVERWRITE = 'overwrite'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter","title":"koheesio.spark.writers.delta.DeltaTableStreamWriter","text":"Delta table stream writer
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter.Options","title":"Options","text":"Options for DeltaTableStreamWriter
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter.Options.allow_population_by_field_name","title":"allow_population_by_field_name class-attribute
instance-attribute
","text":"allow_population_by_field_name: bool = Field(default=True, description=' To do convert to Field and pass as .options(**config)')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter.Options.maxBytesPerTrigger","title":"maxBytesPerTrigger class-attribute
instance-attribute
","text":"maxBytesPerTrigger: Optional[str] = Field(default=None, description='How much data to be processed per trigger. The default is 1GB')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter.Options.maxFilesPerTrigger","title":"maxFilesPerTrigger class-attribute
instance-attribute
","text":"maxFilesPerTrigger: int = Field(default == 1000, description='The maximum number of new files to be considered in every trigger (default: 1000).')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/writers/delta/stream.py
def execute(self):\n if self.batch_function:\n self.streaming_query = self.writer.start()\n else:\n self.streaming_query = self.writer.toTable(tableName=self.table.table_name)\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter","title":"koheesio.spark.writers.delta.DeltaTableWriter","text":"Delta table Writer for both batch and streaming dataframes.
Example Parameters:
Name Type Description Default table
Union[DeltaTableStep, str]
The table to write to
required output_mode
Optional[Union[str, BatchOutputMode, StreamingOutputMode]]
The output mode to use. Default is BatchOutputMode.APPEND. For streaming, use StreamingOutputMode.
required params
Optional[dict]
Additional parameters to use for specific mode
required"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter--example-for-mergeall","title":"Example for MERGEALL
","text":"DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.MERGEALL,\n output_mode_params={\n \"merge_cond\": \"target.id=source.id\",\n \"update_cond\": \"target.col1_val>=source.col1_val\",\n \"insert_cond\": \"source.col_bk IS NOT NULL\",\n \"target_alias\": \"target\", # <------ DEFAULT, can be changed by providing custom value\n \"source_alias\": \"source\", # <------ DEFAULT, can be changed by providing custom value\n },\n)\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter--example-for-merge","title":"Example for MERGE
","text":"DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.MERGE,\n output_mode_params={\n 'merge_builder': (\n DeltaTable\n .forName(sparkSession=spark, tableOrViewName=<target_table_name>)\n .alias(target_alias)\n .merge(source=df, condition=merge_cond)\n .whenMatchedUpdateAll(condition=update_cond)\n .whenNotMatchedInsertAll(condition=insert_cond)\n )\n }\n )\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter--example-for-merge_1","title":"Example for MERGE
","text":"in case the table isn't created yet, first run will execute an APPEND operation
DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.MERGE,\n output_mode_params={\n \"merge_builder\": [\n {\n \"clause\": \"whenMatchedUpdate\",\n \"set\": {\"value\": \"source.value\"},\n \"condition\": \"<update_condition>\",\n },\n {\n \"clause\": \"whenNotMatchedInsert\",\n \"values\": {\"id\": \"source.id\", \"value\": \"source.value\"},\n \"condition\": \"<insert_condition>\",\n },\n ],\n \"merge_cond\": \"<merge_condition>\",\n },\n)\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter--example-for-append","title":"Example for APPEND","text":"dataframe writer options can be passed as keyword arguments
DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.APPEND,\n partitionOverwriteMode=\"dynamic\",\n mergeSchema=\"false\",\n)\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.format","title":"format class-attribute
instance-attribute
","text":"format: str = 'delta'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.output_mode","title":"output_mode class-attribute
instance-attribute
","text":"output_mode: Optional[Union[BatchOutputMode, StreamingOutputMode]] = Field(default=APPEND, alias='outputMode', description=f'{__doc__}\n{__doc__}')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.params","title":"params class-attribute
instance-attribute
","text":"params: Optional[dict] = Field(default_factory=dict, alias='output_mode_params', description='Additional parameters to use for specific mode')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.partition_by","title":"partition_by class-attribute
instance-attribute
","text":"partition_by: Optional[List[str]] = Field(default=None, alias='partitionBy', description='The list of fields to partition the Delta table on')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.table","title":"table class-attribute
instance-attribute
","text":"table: Union[DeltaTableStep, str] = Field(default=..., description='The table to write to')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.writer","title":"writer property
","text":"writer: Union[DeltaMergeBuilder, DataFrameWriter]\n
Specify DeltaTableWriter
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/writers/delta/batch.py
def execute(self):\n _writer = self.writer\n\n if self.table.create_if_not_exists and not self.table.exists:\n _writer = _writer.options(**self.table.default_create_properties)\n\n if isinstance(_writer, DeltaMergeBuilder):\n _writer.execute()\n else:\n if options := self.params:\n # should we add options only if mode is not merge?\n _writer = _writer.options(**options)\n _writer.saveAsTable(self.table.table_name)\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.get_output_mode","title":"get_output_mode classmethod
","text":"get_output_mode(choice: str, options: Set[Type]) -> Union[BatchOutputMode, StreamingOutputMode]\n
Retrieve an OutputMode by validating choice
against a set of option OutputModes.
Currently supported output modes can be found in:
- BatchOutputMode
- StreamingOutputMode
Source code in src/koheesio/spark/writers/delta/batch.py
@classmethod\ndef get_output_mode(cls, choice: str, options: Set[Type]) -> Union[BatchOutputMode, StreamingOutputMode]:\n \"\"\"Retrieve an OutputMode by validating `choice` against a set of option OutputModes.\n\n Currently supported output modes can be found in:\n\n - BatchOutputMode\n - StreamingOutputMode\n \"\"\"\n for enum_type in options:\n if choice.upper() in [om.value.upper() for om in enum_type]:\n return getattr(enum_type, choice.upper())\n raise AttributeError(\n f\"\"\"\n Invalid outputMode specified '{choice}'. Allowed values are:\n Batch Mode - {BatchOutputMode.__doc__}\n Streaming Mode - {StreamingOutputMode.__doc__}\n \"\"\"\n )\n
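A usage sketch of the classmethod above; the caller decides which enums to validate against:
```python
from koheesio.spark.writers.delta import BatchOutputMode, DeltaTableWriter

mode = DeltaTableWriter.get_output_mode("merge_all", options={BatchOutputMode})
assert mode == BatchOutputMode.MERGE_ALL

# An unsupported choice raises an AttributeError listing the allowed values:
# DeltaTableWriter.get_output_mode("upsert", options={BatchOutputMode})
```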
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter","title":"koheesio.spark.writers.delta.SCD2DeltaTableWriter","text":"A class used to write Slowly Changing Dimension (SCD) Type 2 data to a Delta table.
Attributes:
Name Type Description table
InstanceOf[DeltaTableStep]
The table to merge to.
merge_key
str
The key used for merging data.
include_columns
List[str]
Columns to be merged. Will be selected from DataFrame. Default is all columns.
exclude_columns
List[str]
Columns to be excluded from DataFrame.
scd2_columns
List[str]
List of attributes for SCD2 type (track changes).
scd2_timestamp_col
Optional[Column]
Timestamp column for SCD2 type (track changes). Defaults to current_timestamp.
scd1_columns
List[str]
List of attributes for SCD1 type (just update).
meta_scd2_struct_col_name
str
SCD2 struct name.
meta_scd2_effective_time_col_name
str
Effective col name.
meta_scd2_is_current_col_name
str
Current col name.
meta_scd2_end_time_col_name
str
End time col name.
target_auto_generated_columns
List[str]
Auto generated columns from target Delta table. Will be used to exclude from merge logic.
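A minimal, illustrative sketch of configuring this writer. The table, merge key and column names are placeholders, the DeltaTableStep import path is an assumption, and the final write call follows the generic Writer convention:
```python
from koheesio.spark.delta import DeltaTableStep  # import path assumed
from koheesio.spark.writers.delta import SCD2DeltaTableWriter

writer = SCD2DeltaTableWriter(
    table=DeltaTableStep(table="customer_dim"),  # target Delta table (placeholder name)
    merge_key="customer_id",                     # business key used for matching
    scd2_columns=["address", "segment"],         # tracked with history (SCD2)
    scd1_columns=["email"],                      # updated in place (SCD1)
)
writer.write(df)  # df: source DataFrame holding the latest snapshot (placeholder)
```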
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.exclude_columns","title":"exclude_columns class-attribute
instance-attribute
","text":"exclude_columns: List[str] = Field(default_factory=list, description='Columns to be excluded from DataFrame')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.include_columns","title":"include_columns class-attribute
instance-attribute
","text":"include_columns: List[str] = Field(default_factory=list, description='Columns to be merged. Will be selected from DataFrame.Default is all columns')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.merge_key","title":"merge_key class-attribute
instance-attribute
","text":"merge_key: str = Field(..., description='Merge key')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.meta_scd2_effective_time_col_name","title":"meta_scd2_effective_time_col_name class-attribute
instance-attribute
","text":"meta_scd2_effective_time_col_name: str = Field(default='effective_time', description='Effective col name')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.meta_scd2_end_time_col_name","title":"meta_scd2_end_time_col_name class-attribute
instance-attribute
","text":"meta_scd2_end_time_col_name: str = Field(default='end_time', description='End time col name')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.meta_scd2_is_current_col_name","title":"meta_scd2_is_current_col_name class-attribute
instance-attribute
","text":"meta_scd2_is_current_col_name: str = Field(default='is_current', description='Current col name')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.meta_scd2_struct_col_name","title":"meta_scd2_struct_col_name class-attribute
instance-attribute
","text":"meta_scd2_struct_col_name: str = Field(default='_scd2', description='SCD2 struct name')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.scd1_columns","title":"scd1_columns class-attribute
instance-attribute
","text":"scd1_columns: List[str] = Field(default_factory=list, description='List of attributes for scd1 type (just update)')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.scd2_columns","title":"scd2_columns class-attribute
instance-attribute
","text":"scd2_columns: List[str] = Field(default_factory=list, description='List of attributes for scd2 type (track changes)')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.scd2_timestamp_col","title":"scd2_timestamp_col class-attribute
instance-attribute
","text":"scd2_timestamp_col: Optional[Column] = Field(default=None, description='Timestamp column for SCD2 type (track changes). Default to current_timestamp')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.table","title":"table class-attribute
instance-attribute
","text":"table: InstanceOf[DeltaTableStep] = Field(..., description='The table to merge to')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.target_auto_generated_columns","title":"target_auto_generated_columns class-attribute
instance-attribute
","text":"target_auto_generated_columns: List[str] = Field(default_factory=list, description='Auto generated columns from target Delta table. Will be used to exclude from merge logic')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.execute","title":"execute","text":"execute() -> None\n
Execute the SCD Type 2 operation.
This method executes the SCD Type 2 operation on the DataFrame. It validates the existing Delta table, prepares the merge conditions, stages the data, and then performs the merge operation.
Raises:
Type Description TypeError
If the scd2_timestamp_col is not of date or timestamp type. If the source DataFrame is missing any of the required merge columns.
Source code in src/koheesio/spark/writers/delta/scd.py
def execute(self) -> None:\n \"\"\"\n Execute the SCD Type 2 operation.\n\n This method executes the SCD Type 2 operation on the DataFrame.\n It validates the existing Delta table, prepares the merge conditions, stages the data,\n and then performs the merge operation.\n\n Raises\n ------\n TypeError\n If the scd2_timestamp_col is not of date or timestamp type.\n If the source DataFrame is missing any of the required merge columns.\n\n \"\"\"\n self.df: DataFrame\n self.spark: SparkSession\n delta_table = DeltaTable.forName(sparkSession=self.spark, tableOrViewName=self.table.table_name)\n src_alias, cross_alias, dest_alias = \"src\", \"cross\", \"tgt\"\n\n # Prepare required merge columns\n required_merge_columns = [self.merge_key]\n\n if self.scd2_columns:\n required_merge_columns += self.scd2_columns\n\n if self.scd1_columns:\n required_merge_columns += self.scd1_columns\n\n if not all(c in self.df.columns for c in required_merge_columns):\n missing_columns = [c for c in required_merge_columns if c not in self.df.columns]\n raise TypeError(f\"The source DataFrame is missing the columns: {missing_columns!r}\")\n\n # Check that required columns are present in the source DataFrame\n if self.scd2_timestamp_col is not None:\n timestamp_col_type = self.df.select(self.scd2_timestamp_col).schema.fields[0].dataType\n\n if not isinstance(timestamp_col_type, (DateType, TimestampType)):\n raise TypeError(\n f\"The scd2_timestamp_col '{self.scd2_timestamp_col}' must be of date \"\n f\"or timestamp type.Current type is {timestamp_col_type}\"\n )\n\n # Prepare columns to process\n include_columns = self.include_columns if self.include_columns else self.df.columns\n exclude_columns = self.exclude_columns\n columns_to_process = [c for c in include_columns if c not in exclude_columns]\n\n # Constructing column names for SCD2 attributes\n meta_scd2_is_current_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_is_current_col_name}\"\n meta_scd2_effective_time_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_effective_time_col_name}\"\n meta_scd2_end_time_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_end_time_col_name}\"\n\n # Constructing system merge action logic\n system_merge_action = f\"CASE WHEN tgt.{self.merge_key} is NULL THEN 'I' \"\n\n if updates_attrs_scd2 := self._prepare_attr_clause(\n attrs=self.scd2_columns, src_alias=src_alias, dest_alias=dest_alias\n ):\n system_merge_action += f\" WHEN {updates_attrs_scd2} THEN 'UC' \"\n\n if updates_attrs_scd1 := self._prepare_attr_clause(\n attrs=self.scd1_columns, src_alias=src_alias, dest_alias=dest_alias\n ):\n system_merge_action += f\" WHEN {updates_attrs_scd1} THEN 'U' \"\n\n system_merge_action += \" ELSE NULL END\"\n\n # Prepare the staged DataFrame\n staged = (\n self.df.withColumn(\n \"__meta_scd2_timestamp\",\n self._scd2_timestamp(scd2_timestamp_col=self.scd2_timestamp_col, spark=self.spark),\n )\n .transform(\n func=self._prepare_staging,\n delta_table=delta_table,\n merge_action_logic=F.expr(system_merge_action),\n meta_scd2_is_current_col=meta_scd2_is_current_col,\n columns_to_process=columns_to_process,\n src_alias=src_alias,\n dest_alias=dest_alias,\n cross_alias=cross_alias,\n )\n .transform(\n func=self._preserve_existing_target_values,\n meta_scd2_struct_col_name=self.meta_scd2_struct_col_name,\n target_auto_generated_columns=self.target_auto_generated_columns,\n src_alias=src_alias,\n cross_alias=cross_alias,\n dest_alias=dest_alias,\n logger=self.log,\n )\n .withColumn(\"__meta_scd2_end_time\", 
self._scd2_end_time(meta_scd2_end_time_col=meta_scd2_end_time_col))\n .withColumn(\"__meta_scd2_is_current\", self._scd2_is_current())\n .withColumn(\n \"__meta_scd2_effective_time\",\n self._scd2_effective_time(meta_scd2_effective_time_col=meta_scd2_effective_time_col),\n )\n .transform(\n func=self._add_scd2_columns,\n meta_scd2_struct_col_name=self.meta_scd2_struct_col_name,\n meta_scd2_effective_time_col_name=self.meta_scd2_effective_time_col_name,\n meta_scd2_end_time_col_name=self.meta_scd2_end_time_col_name,\n meta_scd2_is_current_col_name=self.meta_scd2_is_current_col_name,\n )\n )\n\n self._prepare_merge_builder(\n delta_table=delta_table,\n dest_alias=dest_alias,\n staged=staged,\n merge_key=self.merge_key,\n columns_to_process=columns_to_process,\n meta_scd2_effective_time_col=meta_scd2_effective_time_col,\n ).execute()\n
"},{"location":"api_reference/spark/writers/delta/batch.html","title":"Batch","text":"This module defines the DeltaTableWriter class, which is used to write both batch and streaming dataframes to Delta tables.
DeltaTableWriter supports two output modes: MERGEALL
and MERGE
.
- The
MERGEALL
mode merges all incoming data with existing data in the table based on certain conditions. - The
MERGE
mode allows for more custom merging behavior using the DeltaMergeBuilder class from the delta.tables
library.
The output_mode_params
dictionary is used to specify conditions for merging, updating, and inserting data. The target_alias
and source_alias
keys are used to specify the aliases for the target and source dataframes in the merge conditions.
Classes:
Name Description DeltaTableWriter
A class for writing data to Delta tables.
DeltaTableStreamWriter
A class for writing streaming data to Delta tables.
Example DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.MERGEALL,\n output_mode_params={\n \"merge_cond\": \"target.id=source.id\",\n \"update_cond\": \"target.col1_val>=source.col1_val\",\n \"insert_cond\": \"source.col_bk IS NOT NULL\",\n },\n)\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter","title":"koheesio.spark.writers.delta.batch.DeltaTableWriter","text":"Delta table Writer for both batch and streaming dataframes.
Example Parameters:
Name Type Description Default table
Union[DeltaTableStep, str]
The table to write to
required output_mode
Optional[Union[str, BatchOutputMode, StreamingOutputMode]]
The output mode to use. Default is BatchOutputMode.APPEND. For streaming, use StreamingOutputMode.
required params
Optional[dict]
Additional parameters to use for specific mode
required"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter--example-for-mergeall","title":"Example for MERGEALL
","text":"DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.MERGEALL,\n output_mode_params={\n \"merge_cond\": \"target.id=source.id\",\n \"update_cond\": \"target.col1_val>=source.col1_val\",\n \"insert_cond\": \"source.col_bk IS NOT NULL\",\n \"target_alias\": \"target\", # <------ DEFAULT, can be changed by providing custom value\n \"source_alias\": \"source\", # <------ DEFAULT, can be changed by providing custom value\n },\n)\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter--example-for-merge","title":"Example for MERGE
","text":"DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.MERGE,\n output_mode_params={\n 'merge_builder': (\n DeltaTable\n .forName(sparkSession=spark, tableOrViewName=<target_table_name>)\n .alias(target_alias)\n .merge(source=df, condition=merge_cond)\n .whenMatchedUpdateAll(condition=update_cond)\n .whenNotMatchedInsertAll(condition=insert_cond)\n )\n }\n )\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter--example-for-merge_1","title":"Example for MERGE
","text":"in case the table isn't created yet, first run will execute an APPEND operation
DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.MERGE,\n output_mode_params={\n \"merge_builder\": [\n {\n \"clause\": \"whenMatchedUpdate\",\n \"set\": {\"value\": \"source.value\"},\n \"condition\": \"<update_condition>\",\n },\n {\n \"clause\": \"whenNotMatchedInsert\",\n \"values\": {\"id\": \"source.id\", \"value\": \"source.value\"},\n \"condition\": \"<insert_condition>\",\n },\n ],\n \"merge_cond\": \"<merge_condition>\",\n },\n)\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter--example-for-append","title":"Example for APPEND","text":"dataframe writer options can be passed as keyword arguments
DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.APPEND,\n partitionOverwriteMode=\"dynamic\",\n mergeSchema=\"false\",\n)\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.format","title":"format class-attribute
instance-attribute
","text":"format: str = 'delta'\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.output_mode","title":"output_mode class-attribute
instance-attribute
","text":"output_mode: Optional[Union[BatchOutputMode, StreamingOutputMode]] = Field(default=APPEND, alias='outputMode', description=f'{__doc__}\n{__doc__}')\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.params","title":"params class-attribute
instance-attribute
","text":"params: Optional[dict] = Field(default_factory=dict, alias='output_mode_params', description='Additional parameters to use for specific mode')\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.partition_by","title":"partition_by class-attribute
instance-attribute
","text":"partition_by: Optional[List[str]] = Field(default=None, alias='partitionBy', description='The list of fields to partition the Delta table on')\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.table","title":"table class-attribute
instance-attribute
","text":"table: Union[DeltaTableStep, str] = Field(default=..., description='The table to write to')\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.writer","title":"writer property
","text":"writer: Union[DeltaMergeBuilder, DataFrameWriter]\n
Specify DeltaTableWriter
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/writers/delta/batch.py
def execute(self):\n _writer = self.writer\n\n if self.table.create_if_not_exists and not self.table.exists:\n _writer = _writer.options(**self.table.default_create_properties)\n\n if isinstance(_writer, DeltaMergeBuilder):\n _writer.execute()\n else:\n if options := self.params:\n # should we add options only if mode is not merge?\n _writer = _writer.options(**options)\n _writer.saveAsTable(self.table.table_name)\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.get_output_mode","title":"get_output_mode classmethod
","text":"get_output_mode(choice: str, options: Set[Type]) -> Union[BatchOutputMode, StreamingOutputMode]\n
Retrieve an OutputMode by validating choice
against a set of option OutputModes.
Currently supported output modes can be found in:
- BatchOutputMode
- StreamingOutputMode
Source code in src/koheesio/spark/writers/delta/batch.py
@classmethod\ndef get_output_mode(cls, choice: str, options: Set[Type]) -> Union[BatchOutputMode, StreamingOutputMode]:\n \"\"\"Retrieve an OutputMode by validating `choice` against a set of option OutputModes.\n\n Currently supported output modes can be found in:\n\n - BatchOutputMode\n - StreamingOutputMode\n \"\"\"\n for enum_type in options:\n if choice.upper() in [om.value.upper() for om in enum_type]:\n return getattr(enum_type, choice.upper())\n raise AttributeError(\n f\"\"\"\n Invalid outputMode specified '{choice}'. Allowed values are:\n Batch Mode - {BatchOutputMode.__doc__}\n Streaming Mode - {StreamingOutputMode.__doc__}\n \"\"\"\n )\n
"},{"location":"api_reference/spark/writers/delta/scd.html","title":"Scd","text":"This module defines writers to write Slowly Changing Dimension (SCD) Type 2 data to a Delta table.
Slowly Changing Dimension (SCD) is a technique used in data warehousing to handle changes to dimension data over time. SCD Type 2 is one of the most common types of SCD, where historical changes are tracked by creating new records for each change.
Koheesio is a powerful data processing framework that provides advanced capabilities for working with Delta tables in Apache Spark. It offers a convenient and efficient way to handle SCD Type 2 operations on Delta tables.
To learn more about Slowly Changing Dimension and SCD Type 2, you can refer to the following resources: - Slowly Changing Dimension (SCD) - Wikipedia
By using Koheesio, you can benefit from its efficient merge logic, support for SCD Type 2 and SCD Type 1 attributes, and seamless integration with Delta tables in Spark.
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter","title":"koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter","text":"A class used to write Slowly Changing Dimension (SCD) Type 2 data to a Delta table.
Attributes:
Name Type Description table
InstanceOf[DeltaTableStep]
The table to merge to.
merge_key
str
The key used for merging data.
include_columns
List[str]
Columns to be merged. Will be selected from DataFrame. Default is all columns.
exclude_columns
List[str]
Columns to be excluded from DataFrame.
scd2_columns
List[str]
List of attributes for SCD2 type (track changes).
scd2_timestamp_col
Optional[Column]
Timestamp column for SCD2 type (track changes). Default to current_timestamp.
scd1_columns
List[str]
List of attributes for SCD1 type (just update).
meta_scd2_struct_col_name
str
SCD2 struct name.
meta_scd2_effective_time_col_name
str
Effective col name.
meta_scd2_is_current_col_name
str
Current col name.
meta_scd2_end_time_col_name
str
End time col name.
target_auto_generated_columns
List[str]
Auto generated columns from target Delta table. Will be used to exclude from merge logic.
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.exclude_columns","title":"exclude_columns class-attribute
instance-attribute
","text":"exclude_columns: List[str] = Field(default_factory=list, description='Columns to be excluded from DataFrame')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.include_columns","title":"include_columns class-attribute
instance-attribute
","text":"include_columns: List[str] = Field(default_factory=list, description='Columns to be merged. Will be selected from DataFrame.Default is all columns')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.merge_key","title":"merge_key class-attribute
instance-attribute
","text":"merge_key: str = Field(..., description='Merge key')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.meta_scd2_effective_time_col_name","title":"meta_scd2_effective_time_col_name class-attribute
instance-attribute
","text":"meta_scd2_effective_time_col_name: str = Field(default='effective_time', description='Effective col name')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.meta_scd2_end_time_col_name","title":"meta_scd2_end_time_col_name class-attribute
instance-attribute
","text":"meta_scd2_end_time_col_name: str = Field(default='end_time', description='End time col name')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.meta_scd2_is_current_col_name","title":"meta_scd2_is_current_col_name class-attribute
instance-attribute
","text":"meta_scd2_is_current_col_name: str = Field(default='is_current', description='Current col name')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.meta_scd2_struct_col_name","title":"meta_scd2_struct_col_name class-attribute
instance-attribute
","text":"meta_scd2_struct_col_name: str = Field(default='_scd2', description='SCD2 struct name')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.scd1_columns","title":"scd1_columns class-attribute
instance-attribute
","text":"scd1_columns: List[str] = Field(default_factory=list, description='List of attributes for scd1 type (just update)')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.scd2_columns","title":"scd2_columns class-attribute
instance-attribute
","text":"scd2_columns: List[str] = Field(default_factory=list, description='List of attributes for scd2 type (track changes)')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.scd2_timestamp_col","title":"scd2_timestamp_col class-attribute
instance-attribute
","text":"scd2_timestamp_col: Optional[Column] = Field(default=None, description='Timestamp column for SCD2 type (track changes). Default to current_timestamp')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.table","title":"table class-attribute
instance-attribute
","text":"table: InstanceOf[DeltaTableStep] = Field(..., description='The table to merge to')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.target_auto_generated_columns","title":"target_auto_generated_columns class-attribute
instance-attribute
","text":"target_auto_generated_columns: List[str] = Field(default_factory=list, description='Auto generated columns from target Delta table. Will be used to exclude from merge logic')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.execute","title":"execute","text":"execute() -> None\n
Execute the SCD Type 2 operation.
This method executes the SCD Type 2 operation on the DataFrame. It validates the existing Delta table, prepares the merge conditions, stages the data, and then performs the merge operation.
Raises:
Type Description TypeError
If the scd2_timestamp_col is not of date or timestamp type. If the source DataFrame is missing any of the required merge columns.
Source code in src/koheesio/spark/writers/delta/scd.py
def execute(self) -> None:\n \"\"\"\n Execute the SCD Type 2 operation.\n\n This method executes the SCD Type 2 operation on the DataFrame.\n It validates the existing Delta table, prepares the merge conditions, stages the data,\n and then performs the merge operation.\n\n Raises\n ------\n TypeError\n If the scd2_timestamp_col is not of date or timestamp type.\n If the source DataFrame is missing any of the required merge columns.\n\n \"\"\"\n self.df: DataFrame\n self.spark: SparkSession\n delta_table = DeltaTable.forName(sparkSession=self.spark, tableOrViewName=self.table.table_name)\n src_alias, cross_alias, dest_alias = \"src\", \"cross\", \"tgt\"\n\n # Prepare required merge columns\n required_merge_columns = [self.merge_key]\n\n if self.scd2_columns:\n required_merge_columns += self.scd2_columns\n\n if self.scd1_columns:\n required_merge_columns += self.scd1_columns\n\n if not all(c in self.df.columns for c in required_merge_columns):\n missing_columns = [c for c in required_merge_columns if c not in self.df.columns]\n raise TypeError(f\"The source DataFrame is missing the columns: {missing_columns!r}\")\n\n # Check that required columns are present in the source DataFrame\n if self.scd2_timestamp_col is not None:\n timestamp_col_type = self.df.select(self.scd2_timestamp_col).schema.fields[0].dataType\n\n if not isinstance(timestamp_col_type, (DateType, TimestampType)):\n raise TypeError(\n f\"The scd2_timestamp_col '{self.scd2_timestamp_col}' must be of date \"\n f\"or timestamp type.Current type is {timestamp_col_type}\"\n )\n\n # Prepare columns to process\n include_columns = self.include_columns if self.include_columns else self.df.columns\n exclude_columns = self.exclude_columns\n columns_to_process = [c for c in include_columns if c not in exclude_columns]\n\n # Constructing column names for SCD2 attributes\n meta_scd2_is_current_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_is_current_col_name}\"\n meta_scd2_effective_time_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_effective_time_col_name}\"\n meta_scd2_end_time_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_end_time_col_name}\"\n\n # Constructing system merge action logic\n system_merge_action = f\"CASE WHEN tgt.{self.merge_key} is NULL THEN 'I' \"\n\n if updates_attrs_scd2 := self._prepare_attr_clause(\n attrs=self.scd2_columns, src_alias=src_alias, dest_alias=dest_alias\n ):\n system_merge_action += f\" WHEN {updates_attrs_scd2} THEN 'UC' \"\n\n if updates_attrs_scd1 := self._prepare_attr_clause(\n attrs=self.scd1_columns, src_alias=src_alias, dest_alias=dest_alias\n ):\n system_merge_action += f\" WHEN {updates_attrs_scd1} THEN 'U' \"\n\n system_merge_action += \" ELSE NULL END\"\n\n # Prepare the staged DataFrame\n staged = (\n self.df.withColumn(\n \"__meta_scd2_timestamp\",\n self._scd2_timestamp(scd2_timestamp_col=self.scd2_timestamp_col, spark=self.spark),\n )\n .transform(\n func=self._prepare_staging,\n delta_table=delta_table,\n merge_action_logic=F.expr(system_merge_action),\n meta_scd2_is_current_col=meta_scd2_is_current_col,\n columns_to_process=columns_to_process,\n src_alias=src_alias,\n dest_alias=dest_alias,\n cross_alias=cross_alias,\n )\n .transform(\n func=self._preserve_existing_target_values,\n meta_scd2_struct_col_name=self.meta_scd2_struct_col_name,\n target_auto_generated_columns=self.target_auto_generated_columns,\n src_alias=src_alias,\n cross_alias=cross_alias,\n dest_alias=dest_alias,\n logger=self.log,\n )\n .withColumn(\"__meta_scd2_end_time\", 
self._scd2_end_time(meta_scd2_end_time_col=meta_scd2_end_time_col))\n .withColumn(\"__meta_scd2_is_current\", self._scd2_is_current())\n .withColumn(\n \"__meta_scd2_effective_time\",\n self._scd2_effective_time(meta_scd2_effective_time_col=meta_scd2_effective_time_col),\n )\n .transform(\n func=self._add_scd2_columns,\n meta_scd2_struct_col_name=self.meta_scd2_struct_col_name,\n meta_scd2_effective_time_col_name=self.meta_scd2_effective_time_col_name,\n meta_scd2_end_time_col_name=self.meta_scd2_end_time_col_name,\n meta_scd2_is_current_col_name=self.meta_scd2_is_current_col_name,\n )\n )\n\n self._prepare_merge_builder(\n delta_table=delta_table,\n dest_alias=dest_alias,\n staged=staged,\n merge_key=self.merge_key,\n columns_to_process=columns_to_process,\n meta_scd2_effective_time_col=meta_scd2_effective_time_col,\n ).execute()\n
"},{"location":"api_reference/spark/writers/delta/stream.html","title":"Stream","text":"This module defines the DeltaTableStreamWriter class, which is used to write streaming dataframes to Delta tables.
"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter","title":"koheesio.spark.writers.delta.stream.DeltaTableStreamWriter","text":"Delta table stream writer
"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter.Options","title":"Options","text":"Options for DeltaTableStreamWriter
"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter.Options.allow_population_by_field_name","title":"allow_population_by_field_name class-attribute
instance-attribute
","text":"allow_population_by_field_name: bool = Field(default=True, description=' To do convert to Field and pass as .options(**config)')\n
"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter.Options.maxBytesPerTrigger","title":"maxBytesPerTrigger class-attribute
instance-attribute
","text":"maxBytesPerTrigger: Optional[str] = Field(default=None, description='How much data to be processed per trigger. The default is 1GB')\n
"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter.Options.maxFilesPerTrigger","title":"maxFilesPerTrigger class-attribute
instance-attribute
","text":"maxFilesPerTrigger: int = Field(default == 1000, description='The maximum number of new files to be considered in every trigger (default: 1000).')\n
"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/writers/delta/stream.py
def execute(self):\n if self.batch_function:\n self.streaming_query = self.writer.start()\n else:\n self.streaming_query = self.writer.toTable(tableName=self.table.table_name)\n
"},{"location":"api_reference/spark/writers/delta/utils.html","title":"Utils","text":"This module provides utility functions while working with delta framework.
"},{"location":"api_reference/spark/writers/delta/utils.html#koheesio.spark.writers.delta.utils.log_clauses","title":"koheesio.spark.writers.delta.utils.log_clauses","text":"log_clauses(clauses: JavaObject, source_alias: str, target_alias: str) -> Optional[str]\n
Prepare log message for clauses of DeltaMergePlan statement.
Parameters:
Name Type Description Default clauses
JavaObject
The clauses of the DeltaMergePlan statement.
required source_alias
str
The source alias.
required target_alias
str
The target alias.
required Returns:
Type Description Optional[str]
The log message if there are clauses, otherwise None.
Notes This function prepares a log message for the clauses of a DeltaMergePlan statement. It iterates over the clauses, processes the conditions, and constructs the log message based on the clause type and columns.
If the condition is a value, it replaces the source and target aliases in the condition string. If the condition is None, it sets the condition_clause to \"No conditions required\".
The log message includes the clauses type, the clause type, the columns, and the condition.
Source code in src/koheesio/spark/writers/delta/utils.py
def log_clauses(clauses: JavaObject, source_alias: str, target_alias: str) -> Optional[str]:\n \"\"\"\n Prepare log message for clauses of DeltaMergePlan statement.\n\n Parameters\n ----------\n clauses : JavaObject\n The clauses of the DeltaMergePlan statement.\n source_alias : str\n The source alias.\n target_alias : str\n The target alias.\n\n Returns\n -------\n Optional[str]\n The log message if there are clauses, otherwise None.\n\n Notes\n -----\n This function prepares a log message for the clauses of a DeltaMergePlan statement. It iterates over the clauses,\n processes the conditions, and constructs the log message based on the clause type and columns.\n\n If the condition is a value, it replaces the source and target aliases in the condition string. If the condition is\n None, it sets the condition_clause to \"No conditions required\".\n\n The log message includes the clauses type, the clause type, the columns, and the condition.\n \"\"\"\n log_message = None\n\n if not clauses.isEmpty():\n clauses_type = clauses.last().nodeName().replace(\"DeltaMergeInto\", \"\")\n _processed_clauses = {}\n\n for i in range(0, clauses.length()):\n clause = clauses.apply(i)\n condition = clause.condition()\n\n if \"value\" in dir(condition):\n condition_clause = (\n condition.value()\n .toString()\n .replace(f\"'{source_alias}\", source_alias)\n .replace(f\"'{target_alias}\", target_alias)\n )\n elif condition.toString() == \"None\":\n condition_clause = \"No conditions required\"\n\n clause_type: str = clause.clauseType().capitalize()\n columns = \"ALL\" if clause_type == \"Delete\" else clause.actions().toList().apply(0).toString()\n\n if clause_type.lower() not in _processed_clauses:\n _processed_clauses[clause_type.lower()] = []\n\n log_message = (\n f\"{clauses_type} will perform action:{clause_type} columns ({columns}) if `{condition_clause}`\"\n )\n\n return log_message\n
"},{"location":"api_reference/sso/index.html","title":"Sso","text":""},{"location":"api_reference/sso/okta.html","title":"Okta","text":"This module contains Okta integration steps.
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.LoggerOktaTokenFilter","title":"koheesio.sso.okta.LoggerOktaTokenFilter","text":"LoggerOktaTokenFilter(okta_object: OktaAccessToken, name: str = 'OktaToken')\n
Filter which hides token value from log.
Source code in src/koheesio/sso/okta.py
def __init__(self, okta_object: OktaAccessToken, name: str = \"OktaToken\"):\n self.__okta_object = okta_object\n super().__init__(name=name)\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.LoggerOktaTokenFilter.filter","title":"filter","text":"filter(record)\n
Source code in src/koheesio/sso/okta.py
def filter(self, record):\n # noinspection PyUnresolvedReferences\n if token := self.__okta_object.output.token:\n token_value = token.get_secret_value()\n record.msg = record.msg.replace(token_value, \"<SECRET_TOKEN>\")\n\n return True\n
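OktaAccessToken attaches this filter to its own logger (see its __init__ further down); a sketch, assuming you also want to scrub the token from an application logger of your own:
```python
import logging

from koheesio.sso.okta import LoggerOktaTokenFilter, OktaAccessToken

okta = OktaAccessToken(url="https://org.okta.com", client_id="client", client_secret="secret")

app_logger = logging.getLogger("my_app")
app_logger.addFilter(LoggerOktaTokenFilter(okta_object=okta, name="OktaToken"))
# any record containing the token value is now rewritten to "<SECRET_TOKEN>"
```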
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.Okta","title":"koheesio.sso.okta.Okta","text":"Base Okta class
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.Okta.client_id","title":"client_id class-attribute
instance-attribute
","text":"client_id: str = Field(default=..., alias='okta_id', description='Okta account ID')\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.Okta.client_secret","title":"client_secret class-attribute
instance-attribute
","text":"client_secret: SecretStr = Field(default=..., alias='okta_secret', description='Okta account secret', repr=False)\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.Okta.data","title":"data class-attribute
instance-attribute
","text":"data: Optional[Union[Dict[str, str], str]] = Field(default={'grant_type': 'client_credentials'}, description='Data to be sent along with the token request')\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.OktaAccessToken","title":"koheesio.sso.okta.OktaAccessToken","text":"OktaAccessToken(**kwargs)\n
Get Okta authorization token
Example:
token = (\n OktaAccessToken(\n url=\"https://org.okta.com\",\n client_id=\"client\",\n client_secret=SecretStr(\"secret\"),\n params={\n \"p1\": \"foo\",\n \"p2\": \"bar\",\n },\n )\n .execute()\n .token\n)\n
Source code in src/koheesio/sso/okta.py
def __init__(self, **kwargs):\n _logger = LoggingFactory.get_logger(name=self.__class__.__name__, inherit_from_koheesio=True)\n logger_filter = LoggerOktaTokenFilter(okta_object=self)\n _logger.addFilter(logger_filter)\n super().__init__(**kwargs)\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.OktaAccessToken.Output","title":"Output","text":"Output class for OktaAccessToken.
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.OktaAccessToken.Output.token","title":"token class-attribute
instance-attribute
","text":"token: Optional[SecretStr] = Field(default=None, description='Okta authentication token')\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.OktaAccessToken.execute","title":"execute","text":"execute()\n
Execute an HTTP Post call to Okta service and retrieve the access token.
Source code in src/koheesio/sso/okta.py
def execute(self):\n \"\"\"\n Execute an HTTP Post call to Okta service and retrieve the access token.\n \"\"\"\n HttpPostStep.execute(self)\n\n # noinspection PyUnresolvedReferences\n status_code = self.output.status_code\n # noinspection PyUnresolvedReferences\n raw_payload = self.output.raw_payload\n\n if status_code != 200:\n raise HTTPError(f\"Request failed with '{status_code}' code. Payload: {raw_payload}\")\n\n # noinspection PyUnresolvedReferences\n json_payload = self.output.json_payload\n\n if token := json_payload.get(\"access_token\"):\n self.output.token = SecretStr(token)\n else:\n raise ValueError(f\"No 'access_token' found in the Okta response: {json_payload}\")\n
"},{"location":"api_reference/steps/index.html","title":"Steps","text":"Steps Module
This module contains the definition of the Step
class, which serves as the base class for custom units of logic that can be executed. It also includes the StepOutput
class, which defines the output data model for a Step
.
The Step
class is designed to be subclassed for creating new steps in a data pipeline. Each subclass should implement the execute
method, specifying the expected inputs and outputs.
This module also exports the SparkStep
class for steps that interact with Spark
Classes: - Step: Base class for a custom unit of logic that can be executed.
- StepOutput: Defines the output data model for a
Step
.
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step","title":"koheesio.steps.Step","text":"Base class for a step
A custom unit of logic that can be executed.
The Step class is designed to be subclassed. To create a new step, one would subclass Step and implement the def execute(self)
method, specifying the expected inputs and outputs.
Note: since the Step class is meta-classed, the execute method is wrapped with the do_execute
function, making it always return the Step's output. Hence, an explicit return is not needed when implementing execute.
Methods and Attributes The Step class has several attributes and methods.
Background A Step is an atomic operation and serves as the building block of data pipelines built with the framework. Tasks typically consist of a series of Steps.
A step can be seen as an operation on a set of inputs, that returns a set of outputs. This however does not imply that steps are stateless (e.g. data writes)!
The diagram serves to illustrate the concept of a Step:
\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 1 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 1 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u2502 \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 2 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Step \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 2 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u2502 \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 3 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 3 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n
Steps are built on top of Pydantic, which is a data validation and settings management using python type annotations. This allows for the automatic validation of the inputs and outputs of a Step.
- Step inherits from BaseModel, which is a Pydantic class used to define data models. This allows Step to automatically validate data against the defined fields and their types.
- Step is metaclassed by StepMetaClass, which is a custom metaclass that wraps the
execute
method of the Step class with the _execute_wrapper
function. This ensures that the execute
method always returns the output of the Step along with providing logging and validation of the output. - Step has an
Output
class, which is a subclass of StepOutput
. This class is used to validate the output of the Step. The Output
class is defined as an inner class of the Step class. The Output
class can be accessed through the Step.Output
attribute. - The
Output
class can be extended to add additional fields to the output of the Step.
Examples:
class MyStep(Step):\n a: str # input\n\n class Output(StepOutput): # output\n b: str\n\n def execute(self) -> MyStep.Output:\n self.output.b = f\"{self.a}-some-suffix\"\n
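And a short usage sketch of the step defined above (output values are illustrative):
```python
step = MyStep(a="foo")
step.execute()                 # returns MyStep.Output thanks to the metaclass wrapping
print(step.output.b)           # "foo-some-suffix"
print(step.to_yaml())          # YAML dump of the step
```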
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step--input","title":"INPUT","text":"The following fields are available by default on the Step class: - name
: Name of the Step. If not set, the name of the class will be used. - description
: Description of the Step. If not set, the docstring of the class will be used. If the docstring contains multiple lines, only the first line will be used.
When subclassing a Step, any additional pydantic field will be treated as input
to the Step. See also the explanation on the .execute()
method below.
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step--output","title":"OUTPUT","text":"Every Step has an Output
class, which is a subclass of StepOutput
. This class is used to validate the output of the Step. The Output
class is defined as an inner class of the Step class. The Output
class can be accessed through the Step.Output
attribute. The Output
class can be extended to add additional fields to the output of the Step. See also the explanation on the .execute()
method.
Output
: A nested class representing the output of the Step; it is used to validate the output and is based on the StepOutput class. output
: Allows you to interact with the Output of the Step lazily (see above and StepOutput)
When subclassing a Step, any additional pydantic field added to the nested Output
class will be treated as output
of the Step. See also the description of StepOutput
for more information.
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step--methods","title":"Methods:","text":" execute
: Abstract method to implement for new steps. - The Inputs of the step can be accessed using
self.input_name
. - The output of the step can be accessed using
self.output.output_name
.
run
: Alias to .execute() method. You can use this to run the step, but execute is preferred. to_yaml
: YAML dump the step get_description
: Get the description of the Step
When subclassing a Step, execute
is the only method that needs to be implemented. Any additional method added to the class will be treated as a method of the Step.
Note: since the Step class is meta-classed, the execute method is automatically wrapped with the do_execute
function making it always return a StepOutput. See also the explanation on the do_execute
function.
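To make this concrete, here is a hedged, self-contained sketch (an illustration only, not taken verbatim from the library documentation) that defines and runs a Step, showing where inputs, outputs, and the default name and description come from:
from koheesio.steps import Step, StepOutput\n\nclass MyStep(Step):\n    \"\"\"Adds a suffix to the input.\"\"\"  # the first docstring line becomes the default description\n\n    a: str  # input\n\n    class Output(StepOutput):  # output\n        b: str\n\n    def execute(self):\n        self.output.b = f\"{self.a}-some-suffix\"\n\nstep = MyStep(a=\"foo\")\noutput = step.execute()      # the wrapped execute always returns the (validated) Output\nprint(step.name)             # MyStep -- defaults to the class name\nprint(step.description)      # Adds a suffix to the input.\nprint(step.a)                # foo -- inputs are accessible as attributes\nprint(step.output.b)         # foo-some-suffix\nprint(output.b)              # foo-some-suffix\n
Calling run() instead of execute() would behave the same here, as run is simply an alias.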
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step--class-methods","title":"class methods:","text":" from_step
: Returns a new Step instance based on the data of another Step instance. for example: MyStep.from_step(other_step, a=\"foo\")
get_description
: Get the description of the Step
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step--dunder-methods","title":"dunder methods:","text":" __getattr__
: Allows input to be accessed through self.input_name
__repr__
and __str__
: String representation of a step
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.output","title":"output property
writable
","text":"output: Output\n
Interact with the output of the Step
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.Output","title":"Output","text":"Output class for Step
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.execute","title":"execute abstractmethod
","text":"execute()\n
Abstract method to implement for new steps.
The Inputs of the step can be accessed, using self.input_name
Note: since the Step class is meta-classed, the execute method is wrapped with the do_execute
function making it always return the Steps output
Source code in src/koheesio/steps/__init__.py
@abstractmethod\ndef execute(self):\n \"\"\"Abstract method to implement for new steps.\n\n The Inputs of the step can be accessed, using `self.input_name`\n\n Note: since the Step class is meta-classed, the execute method is wrapped with the `do_execute` function making\n it always return the Steps output\n \"\"\"\n raise NotImplementedError\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.from_step","title":"from_step classmethod
","text":"from_step(step: Step, **kwargs)\n
Returns a new Step instance based on the data of another Step or BaseModel instance
Source code in src/koheesio/steps/__init__.py
@classmethod\ndef from_step(cls, step: Step, **kwargs):\n \"\"\"Returns a new Step instance based on the data of another Step or BaseModel instance\"\"\"\n return cls.from_basemodel(step, **kwargs)\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.repr_json","title":"repr_json","text":"repr_json(simple=False) -> str\n
dump the step to json, meant for representation
Note: use to_json if you want to dump the step to json for serialization. This method is meant for representation purposes only!
Examples:
>>> step = MyStep(a=\"foo\")\n>>> print(step.repr_json())\n{\"input\": {\"a\": \"foo\"}}\n
Parameters:
Name Type Description Default simple
When toggled to True, a briefer output will be produced. This is friendlier for logging purposes
False
Returns:
Type Description str
A string, which is valid json
Source code in src/koheesio/steps/__init__.py
def repr_json(self, simple=False) -> str:\n \"\"\"dump the step to json, meant for representation\n\n Note: use to_json if you want to dump the step to json for serialization\n This method is meant for representation purposes only!\n\n Examples\n --------\n ```python\n >>> step = MyStep(a=\"foo\")\n >>> print(step.repr_json())\n {\"input\": {\"a\": \"foo\"}}\n ```\n\n Parameters\n ----------\n simple: bool\n When toggled to True, a briefer output will be produced. This is friendlier for logging purposes\n\n Returns\n -------\n str\n A string, which is valid json\n \"\"\"\n model_dump_options = dict(warnings=\"none\", exclude_unset=True)\n\n _result = {}\n\n # extract input\n _input = self.model_dump(**model_dump_options)\n\n # remove name and description from input and add to result if simple is not set\n name = _input.pop(\"name\", None)\n description = _input.pop(\"description\", None)\n if not simple:\n if name:\n _result[\"name\"] = name\n if description:\n _result[\"description\"] = description\n else:\n model_dump_options[\"exclude\"] = {\"name\", \"description\"}\n\n # extract output\n _output = self.output.model_dump(**model_dump_options)\n\n # add output to result\n if _output:\n _result[\"output\"] = _output\n\n # add input to result\n _result[\"input\"] = _input\n\n class MyEncoder(json.JSONEncoder):\n \"\"\"Custom JSON Encoder to handle non-serializable types\"\"\"\n\n def default(self, o: Any) -> Any:\n try:\n return super().default(o)\n except TypeError:\n return o.__class__.__name__\n\n # Use MyEncoder when converting the dictionary to a JSON string\n json_str = json.dumps(_result, cls=MyEncoder)\n\n return json_str\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.repr_yaml","title":"repr_yaml","text":"repr_yaml(simple=False) -> str\n
dump the step to yaml, meant for representation
Note: use to_yaml if you want to dump the step to yaml for serialization. This method is meant for representation purposes only!
Examples:
>>> step = MyStep(a=\"foo\")\n>>> print(step.repr_yaml())\ninput:\n a: foo\n
Parameters:
Name Type Description Default simple
When toggled to True, a briefer output will be produced. This is friendlier for logging purposes
False
Returns:
Type Description str
A string, which is valid yaml
Source code in src/koheesio/steps/__init__.py
def repr_yaml(self, simple=False) -> str:\n \"\"\"dump the step to yaml, meant for representation\n\n Note: use to_yaml if you want to dump the step to yaml for serialization\n This method is meant for representation purposes only!\n\n Examples\n --------\n ```python\n >>> step = MyStep(a=\"foo\")\n >>> print(step.repr_yaml())\n input:\n a: foo\n ```\n\n Parameters\n ----------\n simple: bool\n When toggled to True, a briefer output will be produced. This is friendlier for logging purposes\n\n Returns\n -------\n str\n A string, which is valid yaml\n \"\"\"\n json_str = self.repr_json(simple=simple)\n\n # Parse the JSON string back into a dictionary\n _result = json.loads(json_str)\n\n return yaml.dump(_result)\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.run","title":"run","text":"run()\n
Alias to .execute()
Source code in src/koheesio/steps/__init__.py
def run(self):\n \"\"\"Alias to .execute()\"\"\"\n return self.execute()\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.StepMetaClass","title":"koheesio.steps.StepMetaClass","text":"StepMetaClass has to be set up as a Metaclass extending ModelMetaclass to allow Pydantic to be unaffected while allowing for the execute method to be auto-decorated with do_execute
"},{"location":"api_reference/steps/index.html#koheesio.steps.StepOutput","title":"koheesio.steps.StepOutput","text":"Class for the StepOutput model
Usage Setting up your own StepOutput class is done like this:
class YourOwnOutput(StepOutput):\n a: str\n b: int\n
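A hedged usage sketch of the class above (assuming plain pydantic keyword construction; the values are placeholders):
out = YourOwnOutput(a=\"foo\", b=42)  # hedged sketch: assumes regular pydantic keyword construction\nout.validate_output()                # wraps BaseModel validation, see validate_output below\nprint(out.a, out.b)                  # foo 42\n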
"},{"location":"api_reference/steps/index.html#koheesio.steps.StepOutput.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(validate_default=False, defer_build=True)\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.StepOutput.validate_output","title":"validate_output","text":"validate_output() -> StepOutput\n
Validate the output of the Step
Essentially, this method is a wrapper around the validate method of the BaseModel class
Source code in src/koheesio/steps/__init__.py
def validate_output(self) -> StepOutput:\n \"\"\"Validate the output of the Step\n\n Essentially, this method is a wrapper around the validate method of the BaseModel class\n \"\"\"\n validated_model = self.validate()\n return StepOutput.from_basemodel(validated_model)\n
"},{"location":"api_reference/steps/dummy.html","title":"Dummy","text":"Dummy step for testing purposes.
This module contains a dummy step for testing purposes. It is used to test the Koheesio framework or to provide a simple example of how to create a new step.
Example s = DummyStep(a=\"a\", b=2)\ns.execute()\n
In this case, s.output
will be equivalent to the following dictionary: {\"a\": \"a\", \"b\": 2, \"c\": \"aa\"}\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyOutput","title":"koheesio.steps.dummy.DummyOutput","text":"Dummy output for testing purposes.
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyOutput.a","title":"a instance-attribute
","text":"a: str\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyOutput.b","title":"b instance-attribute
","text":"b: int\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep","title":"koheesio.steps.dummy.DummyStep","text":"Dummy step for testing purposes.
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep.a","title":"a instance-attribute
","text":"a: str\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep.b","title":"b instance-attribute
","text":"b: int\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep.Output","title":"Output","text":"Dummy output for testing purposes.
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep.Output.c","title":"c instance-attribute
","text":"c: str\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep.execute","title":"execute","text":"execute()\n
Dummy execute for testing purposes.
Source code in src/koheesio/steps/dummy.py
def execute(self):\n \"\"\"Dummy execute for testing purposes.\"\"\"\n self.output.a = self.a\n self.output.b = self.b\n self.output.c = self.a * self.b\n
"},{"location":"api_reference/steps/http.html","title":"Http","text":"This module contains a few simple HTTP Steps that can be used to perform API Calls to HTTP endpoints
Example from koheesio.steps.http import HttpGetStep\n\nresponse = HttpGetStep(url=\"https://google.com\").execute().json_payload\n
In the above example, the response
variable will contain the JSON response from the HTTP request.
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpDeleteStep","title":"koheesio.steps.http.HttpDeleteStep","text":"send DELETE requests
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpDeleteStep.method","title":"method class-attribute
instance-attribute
","text":"method: HttpMethod = DELETE\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpGetStep","title":"koheesio.steps.http.HttpGetStep","text":"send GET requests
Example response = HttpGetStep(url=\"https://google.com\").execute().json_payload\n
In the above example, the response
variable will contain the JSON response from the HTTP request."},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpGetStep.method","title":"method class-attribute
instance-attribute
","text":"method: HttpMethod = GET\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod","title":"koheesio.steps.http.HttpMethod","text":"Enumeration of allowed http methods
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod.DELETE","title":"DELETE class-attribute
instance-attribute
","text":"DELETE = 'delete'\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod.GET","title":"GET class-attribute
instance-attribute
","text":"GET = 'get'\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod.POST","title":"POST class-attribute
instance-attribute
","text":"POST = 'post'\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod.PUT","title":"PUT class-attribute
instance-attribute
","text":"PUT = 'put'\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod.from_string","title":"from_string classmethod
","text":"from_string(value: str)\n
Allows for getting the right Method Enum by simply passing a string value This method is not case-sensitive
Source code in src/koheesio/steps/http.py
@classmethod\ndef from_string(cls, value: str):\n \"\"\"Allows for getting the right Method Enum by simply passing a string value\n This method is not case-sensitive\n \"\"\"\n return getattr(cls, value.upper())\n
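For illustration, the lookup is case-insensitive, so the following calls (a small sketch, not from the source docs) resolve to the same member:
from koheesio.steps.http import HttpMethod\n\nassert HttpMethod.from_string(\"get\") == HttpMethod.GET\nassert HttpMethod.from_string(\"GET\") == HttpMethod.GET\n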
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpPostStep","title":"koheesio.steps.http.HttpPostStep","text":"send POST requests
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpPostStep.method","title":"method class-attribute
instance-attribute
","text":"method: HttpMethod = POST\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpPutStep","title":"koheesio.steps.http.HttpPutStep","text":"send PUT requests
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpPutStep.method","title":"method class-attribute
instance-attribute
","text":"method: HttpMethod = PUT\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep","title":"koheesio.steps.http.HttpStep","text":"Can be used to perform API Calls to HTTP endpoints
Understanding Retries This class includes a built-in retry mechanism for handling temporary issues, such as network errors or server downtime, that might cause the HTTP request to fail. The retry mechanism is controlled by three parameters: max_retries
, initial_delay
, and backoff
.
-
max_retries
determines the number of retries after the initial request. For example, if max_retries
is set to 4, the request will be attempted a total of 5 times (1 initial attempt + 4 retries). If max_retries
is set to 0, no retries will be attempted, and the request will be tried only once.
-
initial_delay
sets the waiting period before the first retry. If initial_delay
is set to 3, the delay before the first retry will be 3 seconds. Changing the initial_delay
value directly affects the amount of delay before each retry.
-
backoff
controls the rate at which the delay increases for each subsequent retry. If backoff
is set to 2 (the default), the delay will double with each retry. If backoff
is set to 1, the delay between retries will remain constant. Changing the backoff
value affects how quickly the delay increases.
Given the default values of max_retries=3
, initial_delay=2
, and backoff=2
, the delays between retries would be 2 seconds, 4 seconds, and 8 seconds, respectively. This results in a total delay of 14 seconds before all retries are exhausted.
For example, if you set initial_delay=3
and backoff=2
, the delays before the retries would be 3 seconds
, 6 seconds
, and 12 seconds
. If you set initial_delay=2
and backoff=3
, the delays before the retries would be 2 seconds
, 6 seconds
, and 18 seconds
. If you set initial_delay=2
and backoff=1
, the delays before the retries would be 2 seconds
, 2 seconds
, and 2 seconds
.
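To make the arithmetic concrete, here is a small standalone sketch (plain Python, not part of the library) that reproduces the delay schedules described above:
def retry_delays(max_retries: int = 3, initial_delay: int = 2, backoff: int = 2) -> list:\n    # the delay before retry i is initial_delay * (backoff ** i)\n    return [initial_delay * backoff**i for i in range(max_retries)]\n\nprint(retry_delays())                  # [2, 4, 8]  -> 14 seconds in total\nprint(retry_delays(initial_delay=3))   # [3, 6, 12]\nprint(retry_delays(backoff=1))         # [2, 2, 2]\n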
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.data","title":"data class-attribute
instance-attribute
","text":"data: Optional[Union[Dict[str, str], str]] = Field(default_factory=dict, description='[Optional] Data to be sent along with the request', alias='body')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.headers","title":"headers class-attribute
instance-attribute
","text":"headers: Optional[Dict[str, Union[str, SecretStr]]] = Field(default_factory=dict, description='Request headers', alias='header')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.method","title":"method class-attribute
instance-attribute
","text":"method: Union[str, HttpMethod] = Field(default=GET, description=\"What type of Http call to perform. One of 'get', 'post', 'put', 'delete'. Defaults to 'get'.\")\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.params","title":"params class-attribute
instance-attribute
","text":"params: Optional[Dict[str, Any]] = Field(default_factory=dict, description='[Optional] Set of extra parameters that should be passed to HTTP request')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.session","title":"session class-attribute
instance-attribute
","text":"session: Session = Field(default_factory=Session, description='Requests session object to be used for making HTTP requests', exclude=True, repr=False)\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.timeout","title":"timeout class-attribute
instance-attribute
","text":"timeout: Optional[int] = Field(default=3, description='[Optional] Request timeout')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.url","title":"url class-attribute
instance-attribute
","text":"url: str = Field(default=..., description='API endpoint URL', alias='uri')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output","title":"Output","text":"Output class for HttpStep
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output.json_payload","title":"json_payload property
","text":"json_payload\n
Alias for response_json
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output.raw_payload","title":"raw_payload class-attribute
instance-attribute
","text":"raw_payload: Optional[str] = Field(default=None, alias='response_text', description='The raw response for the request')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output.response_json","title":"response_json class-attribute
instance-attribute
","text":"response_json: Optional[Union[Dict, List]] = Field(default=None, alias='json_payload', description='The JSON response for the request')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output.response_raw","title":"response_raw class-attribute
instance-attribute
","text":"response_raw: Optional[Response] = Field(default=None, alias='response', description='The raw requests.Response object returned by the appropriate requests.request() call')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output.status_code","title":"status_code class-attribute
instance-attribute
","text":"status_code: Optional[int] = Field(default=None, description='The status return code of the request')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.decode_sensitive_headers","title":"decode_sensitive_headers","text":"decode_sensitive_headers(headers)\n
Authorization headers are being converted into SecretStr under the hood to avoid dumping any sensitive content into logs by the encode_sensitive_headers
method.
However, when calling the get_headers
method, the SecretStr should be converted back to string, otherwise sensitive info would have looked like '**********'.
This method decodes values of the headers
dictionary that are of type SecretStr into plain text.
Source code in src/koheesio/steps/http.py
@field_serializer(\"headers\", when_used=\"json\")\ndef decode_sensitive_headers(self, headers):\n \"\"\"\n Authorization headers are being converted into SecretStr under the hood to avoid dumping any\n sensitive content into logs by the `encode_sensitive_headers` method.\n\n However, when calling the `get_headers` method, the SecretStr should be converted back to\n string, otherwise sensitive info would have looked like '**********'.\n\n This method decodes values of the `headers` dictionary that are of type SecretStr into plain text.\n \"\"\"\n for k, v in headers.items():\n headers[k] = v.get_secret_value() if isinstance(v, SecretStr) else v\n return headers\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.delete","title":"delete","text":"delete() -> Response\n
Execute an HTTP DELETE call
Source code in src/koheesio/steps/http.py
def delete(self) -> requests.Response:\n \"\"\"Execute an HTTP DELETE call\"\"\"\n self.method = HttpMethod.DELETE\n return self.request()\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.encode_sensitive_headers","title":"encode_sensitive_headers","text":"encode_sensitive_headers(headers)\n
Encode potentially sensitive data into pydantic.SecretStr class to prevent them being displayed as plain text in logs.
Source code in src/koheesio/steps/http.py
@field_validator(\"headers\", mode=\"before\")\ndef encode_sensitive_headers(cls, headers):\n \"\"\"\n Encode potentially sensitive data into pydantic.SecretStr class to prevent them\n being displayed as plain text in logs.\n \"\"\"\n if auth := headers.get(\"Authorization\"):\n headers[\"Authorization\"] = auth if isinstance(auth, SecretStr) else SecretStr(auth)\n return headers\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.execute","title":"execute","text":"execute() -> Output\n
Executes the HTTP request.
This method simply calls self.request()
, which includes the retry logic. If self.request()
raises an exception, it will be propagated to the caller of this method.
Raises:
Type Description (RequestException, HTTPError)
The last exception that was caught if self.request()
fails after self.max_retries
attempts.
Source code in src/koheesio/steps/http.py
def execute(self) -> Output:\n \"\"\"\n Executes the HTTP request.\n\n This method simply calls `self.request()`, which includes the retry logic. If `self.request()` raises an\n exception, it will be propagated to the caller of this method.\n\n Raises\n ------\n requests.RequestException, requests.HTTPError\n The last exception that was caught if `self.request()` fails after `self.max_retries` attempts.\n \"\"\"\n self.request()\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.get","title":"get","text":"get() -> Response\n
Execute an HTTP GET call
Source code in src/koheesio/steps/http.py
def get(self) -> requests.Response:\n \"\"\"Execute an HTTP GET call\"\"\"\n self.method = HttpMethod.GET\n return self.request()\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.get_headers","title":"get_headers","text":"get_headers()\n
Dump headers into JSON without SecretStr masking.
Source code in src/koheesio/steps/http.py
def get_headers(self):\n \"\"\"\n Dump headers into JSON without SecretStr masking.\n \"\"\"\n return json.loads(self.model_dump_json()).get(\"headers\")\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.get_options","title":"get_options","text":"get_options()\n
options to be passed to requests.request()
Source code in src/koheesio/steps/http.py
def get_options(self):\n \"\"\"options to be passed to requests.request()\"\"\"\n return {\n \"url\": self.url,\n \"headers\": self.get_headers(),\n \"data\": self.data,\n \"timeout\": self.timeout,\n **self.params, # type: ignore\n }\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.get_proper_http_method_from_str_value","title":"get_proper_http_method_from_str_value","text":"get_proper_http_method_from_str_value(method_value)\n
Converts string value to HttpMethod enum value
Source code in src/koheesio/steps/http.py
@field_validator(\"method\")\ndef get_proper_http_method_from_str_value(cls, method_value):\n \"\"\"Converts string value to HttpMethod enum value\"\"\"\n if isinstance(method_value, str):\n try:\n method_value = HttpMethod.from_string(method_value)\n except AttributeError as e:\n raise AttributeError(\n \"Only values from HttpMethod class are allowed! \"\n f\"Provided value: '{method_value}', allowed values: {', '.join(HttpMethod.__members__.keys())}\"\n ) from e\n\n return method_value\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.post","title":"post","text":"post() -> Response\n
Execute an HTTP POST call
Source code in src/koheesio/steps/http.py
def post(self) -> requests.Response:\n \"\"\"Execute an HTTP POST call\"\"\"\n self.method = HttpMethod.POST\n return self.request()\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.put","title":"put","text":"put() -> Response\n
Execute an HTTP PUT call
Source code in src/koheesio/steps/http.py
def put(self) -> requests.Response:\n \"\"\"Execute an HTTP PUT call\"\"\"\n self.method = HttpMethod.PUT\n return self.request()\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.request","title":"request","text":"request(method: Optional[HttpMethod] = None) -> Response\n
Executes the HTTP request with retry logic.
Actual http_method execution is abstracted into this method to avoid unnecessary code duplication. It also allows logging, output setting, and validation to be handled centrally.
This method will try to execute requests.request
up to self.max_retries
times. If self.request()
raises an exception, it logs a warning message and the error message, then waits for self.initial_delay * (self.backoff ** i)
seconds before retrying. The delay increases exponentially after each failed attempt due to the self.backoff ** i
term.
If self.request()
still fails after self.max_retries
attempts, it logs an error message and re-raises the last exception that was caught.
This is a good way to handle temporary issues that might cause self.request()
to fail, such as network errors or server downtime. The exponential backoff ensures that you're not constantly bombarding a server with requests if it's struggling to respond.
Parameters:
Name Type Description Default method
HttpMethod
Optional parameter that allows calls to different HTTP methods and bypassing class level method
parameter.
None
Raises:
Type Description (RequestException, HTTPError)
The last exception that was caught if requests.request()
fails after self.max_retries
attempts.
Source code in src/koheesio/steps/http.py
def request(self, method: Optional[HttpMethod] = None) -> requests.Response:\n \"\"\"\n Executes the HTTP request with retry logic.\n\n Actual http_method execution is abstracted into this method.\n This is to avoid unnecessary code duplication. Allows to centrally log, set outputs, and validated.\n\n This method will try to execute `requests.request` up to `self.max_retries` times. If `self.request()` raises\n an exception, it logs a warning message and the error message, then waits for\n `self.initial_delay * (self.backoff ** i)` seconds before retrying. The delay increases exponentially\n after each failed attempt due to the `self.backoff ** i` term.\n\n If `self.request()` still fails after `self.max_retries` attempts, it logs an error message and re-raises the\n last exception that was caught.\n\n This is a good way to handle temporary issues that might cause `self.request()` to fail, such as network errors\n or server downtime. The exponential backoff ensures that you're not constantly bombarding a server with\n requests if it's struggling to respond.\n\n Parameters\n ----------\n method : HttpMethod\n Optional parameter that allows calls to different HTTP methods and bypassing class level `method`\n parameter.\n\n Raises\n ------\n requests.RequestException, requests.HTTPError\n The last exception that was caught if `requests.request()` fails after `self.max_retries` attempts.\n \"\"\"\n _method = (method or self.method).value.upper()\n options = self.get_options()\n\n self.log.debug(f\"Making {_method} request to {options['url']} with headers {options['headers']}\")\n\n response = self.session.request(method=_method, **options)\n response.raise_for_status()\n\n self.log.debug(f\"Received response with status code {response.status_code} and body {response.text}\")\n self.set_outputs(response)\n\n return response\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.set_outputs","title":"set_outputs","text":"set_outputs(response)\n
Types of response output
Source code in src/koheesio/steps/http.py
def set_outputs(self, response):\n \"\"\"\n Types of response output\n \"\"\"\n self.output.response_raw = response\n self.output.raw_payload = response.text\n self.output.status_code = response.status_code\n\n # Only decode non empty payloads to avoid triggering decoding error unnecessarily.\n if self.output.raw_payload:\n try:\n self.output.response_json = response.json()\n\n except json.decoder.JSONDecodeError as e:\n self.log.info(f\"An error occurred while processing the JSON payload. Error message:\\n{e.msg}\")\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep","title":"koheesio.steps.http.PaginatedHtppGetStep","text":"Represents a paginated HTTP GET step.
Parameters:
Name Type Description Default paginate
bool
Whether to paginate the API response. Defaults to False.
required pages
int
Number of pages to paginate. Defaults to 1.
required offset
int
Offset for paginated API calls. Offset determines the starting page. Defaults to 1.
required limit
int
Limit for paginated API calls. Defaults to 100.
required"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.limit","title":"limit class-attribute
instance-attribute
","text":"limit: Optional[int] = Field(default=100, description='Limit for paginated API calls. The url should (optionally) contain a named limit parameter, for example: api.example.com/data?limit={limit}')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.offset","title":"offset class-attribute
instance-attribute
","text":"offset: Optional[int] = Field(default=1, description=\"Offset for paginated API calls. Offset determines the starting page. Defaults to 1. The url can (optionally) contain a named 'offset' parameter, for example: api.example.com/data?offset={offset}\")\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.pages","title":"pages class-attribute
instance-attribute
","text":"pages: Optional[int] = Field(default=1, description='Number of pages to paginate. Defaults to 1')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.paginate","title":"paginate class-attribute
instance-attribute
","text":"paginate: Optional[bool] = Field(default=False, description=\"Whether to paginate the API response. Defaults to False. When set to True, the API response will be paginated. The url should contain a named 'page' parameter for example: api.example.com/data?page={page}\")\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.execute","title":"execute","text":"execute() -> Output\n
Executes the HTTP GET request and handles pagination.
Returns:
Type Description Output
The output of the HTTP GET request.
Source code in src/koheesio/steps/http.py
def execute(self) -> HttpGetStep.Output:\n \"\"\"\n Executes the HTTP GET request and handles pagination.\n\n Returns\n -------\n HttpGetStep.Output\n The output of the HTTP GET request.\n \"\"\"\n # Set up pagination parameters\n offset, pages = (self.offset, self.pages + 1) if self.paginate else (1, 1) # type: ignore\n data = []\n _basic_url = self.url\n\n for page in range(offset, pages):\n if self.paginate:\n self.log.info(f\"Fetching page {page} of {pages - 1}\")\n\n self.url = self._url(basic_url=_basic_url, page=page)\n self.request()\n\n if isinstance(self.output.response_json, list):\n data += self.output.response_json\n else:\n data.append(self.output.response_json)\n\n self.url = _basic_url\n self.output.response_json = data\n self.output.response_raw = None\n self.output.raw_payload = None\n self.output.status_code = None\n
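A hedged usage sketch (the endpoint is a placeholder; note the named page parameter in the URL, as described above):
from koheesio.steps.http import PaginatedHtppGetStep\n\nstep = PaginatedHtppGetStep(\n    url=\"https://api.example.com/data?page={page}\",  # placeholder endpoint with a named 'page' parameter\n    paginate=True,\n    pages=3,\n)\noutput = step.execute()\nprint(output.response_json)  # combined results of pages 1 through 3\n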
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.get_options","title":"get_options","text":"get_options()\n
Returns the options to be passed to the requests.request() function.
Returns:
Type Description dict
The options.
Source code in src/koheesio/steps/http.py
def get_options(self):\n \"\"\"\n Returns the options to be passed to the requests.request() function.\n\n Returns\n -------\n dict\n The options.\n \"\"\"\n options = {\n \"url\": self.url,\n \"headers\": self.get_headers(),\n \"data\": self.data,\n \"timeout\": self.timeout,\n **self._adjust_params(), # type: ignore\n }\n\n return options\n
"},{"location":"community/approach-documentation.html","title":"Approach documentation","text":"","tags":["doctype/explanation"]},{"location":"community/approach-documentation.html#scope","title":"Scope","text":"","tags":["doctype/explanation"]},{"location":"community/approach-documentation.html#the-system","title":"The System","text":"We will be adopting \"The Documentation System\".
From documentation.divio.com:
There is a secret that needs to be understood in order to write good software documentation: there isn\u2019t one thing called documentation, there are four.
They are: tutorials, how-to guides, technical reference and explanation. They represent four different purposes or functions, and require four different approaches to their creation. Understanding the implications of this will help improve most documentation - often immensely.
About the system The documentation system outlined here is a simple, comprehensive and nearly universally-applicable scheme. It is proven in practice across a wide variety of fields and applications.
There are some very simple principles that govern documentation that are very rarely if ever spelled out. They seem to be a secret, though they shouldn\u2019t be.
If you can put these principles into practice, it will make your documentation better and your project, product or team more successful - that\u2019s a promise.
The system is widely adopted for large and small, open and proprietary documentation projects.
Video Presentation on YouTube:
","tags":["doctype/explanation"]},{"location":"community/contribute.html","title":"Contribute","text":""},{"location":"community/contribute.html#how-to-contribute","title":"How to contribute","text":"There are a few guidelines that we need contributors to follow so that we are able to process requests as efficiently as possible. If you have any questions or concerns please feel free to contact us at opensource@nike.com.
"},{"location":"community/contribute.html#getting-started","title":"Getting Started","text":" - Review our Code of Conduct
- Make sure you have a GitHub account
- Submit a ticket for your issue, assuming one does not already exist.
- Clearly describe the issue including steps to reproduce when it is a bug.
- Make sure you fill in the earliest version that you know has the issue.
- Fork the repository on GitHub
"},{"location":"community/contribute.html#making-changes","title":"Making Changes","text":" - Create a feature branch off of
main
before you start your work. - Please avoid working directly on the
main
branch.
- Setup the required package manager hatch
- Setup the dev environment see below
- Make commits of logical units.
- You may be asked to squash unnecessary commits down to logical units.
- Check for unnecessary whitespace with
git diff --check
before committing. - Write meaningful, descriptive commit messages.
- Please follow existing code conventions when working on a file
- Make sure to check the standards on the code, see below
- Make sure to test the code before you push changes see below
"},{"location":"community/contribute.html#submitting-changes","title":"\ud83e\udd1d Submitting Changes","text":" - Push your changes to a topic branch in your fork of the repository.
- Submit a pull request to the repository in the Nike-Inc organization.
- After feedback has been given we expect responses within two weeks. After two weeks we may close the pull request if it isn't showing any activity.
- Bug fixes or features that lack appropriate tests may not be considered for merge.
- Changes that lower test coverage may not be considered for merge.
"},{"location":"community/contribute.html#make-commands","title":"\ud83d\udd28 Make commands","text":"We use make
for managing different steps of setup and maintenance in the project. You can install make by following the instructions here
For a full list of available make commands, you can run:
make help\n
"},{"location":"community/contribute.html#package-manager","title":"\ud83d\udce6 Package manager","text":"We use hatch
as our package manager.
Note: Please DO NOT use pip or conda to install the dependencies. Instead, use hatch.
To install hatch, run the following command:
make init\n
or,
make hatch-install\n
This will install hatch using brew if you are on a Mac.
If you are on a different OS, you can follow the instructions here
"},{"location":"community/contribute.html#dev-environment-setup","title":"\ud83d\udccc Dev Environment Setup","text":"To ensure our standards, make sure to install the required packages.
make dev\n
This will install all the required packages for development in the project under the .venv
directory. Use this virtual environment to run the code and tests during local development.
"},{"location":"community/contribute.html#linting-and-standards","title":"\ud83e\uddf9 Linting and Standards","text":"We use ruff
, pylint
, isort
, black
and mypy
to maintain standards in the codebase.
Run the following two commands to check the codebase for any issues:
make check\n
This will run all the checks including pylint and mypy. make fmt\n
This will format the codebase using black, isort, and ruff. Make sure that the linters and formatters do not report any errors or warnings before submitting a pull request.
"},{"location":"community/contribute.html#testing","title":"\ud83e\uddea Testing","text":"We use pytest
to test our code.
You can run the tests by running one of the following commands:
make cov # to run the tests and check the coverage\nmake all-tests # to run all the tests\nmake spark-tests # to run the spark tests\nmake non-spark-tests # to run the non-spark tests\n
Make sure that all tests pass and that you have adequate coverage before submitting a pull request.
"},{"location":"community/contribute.html#additional-resources","title":"Additional Resources","text":" - General GitHub documentation
- GitHub pull request documentation
- Nike's Code of Conduct
- Nike's Individual Contributor License Agreement
- Nike OSS
"},{"location":"includes/glossary.html","title":"Glossary","text":""},{"location":"includes/glossary.html#pydantic","title":"Pydantic","text":"Pydantic is a Python library for data validation and settings management using Python type annotations. It allows Koheesio to bring in strong typing and a high level of type safety. Essentially, it allows Koheesio to consider configurations of a pipeline (i.e. the settings used inside Steps, Tasks, etc.) as data that can be validated and structured.
"},{"location":"includes/glossary.html#pyspark","title":"PySpark","text":"PySpark is a Python library for Apache Spark, a powerful open-source data processing engine. It allows Koheesio to handle large-scale data processing tasks efficiently.
"},{"location":"misc/info.html","title":"Info","text":"{{ macros_info() }}
"},{"location":"reference/concepts/concepts.html","title":"Concepts","text":"The framework architecture is built from a set of core components. Each of the implementations that the framework provides out of the box, can be swapped out for custom implementations as long as they match the API.
The core components are the following:
Note: click on the 'Concept' to take you to the corresponding module. The module documentation will have greater detail on the specifics of the implementation
"},{"location":"reference/concepts/concepts.html#step","title":"Step","text":"A custom unit of logic that can be executed. A Step is an atomic operation and serves as the building block of data pipelines built with the framework. A step can be seen as an operation on a set of inputs, and returns a set of outputs. This does not imply that steps are stateless (e.g. data writes)! This concept is visualized in the figure below.
\nflowchart LR\n\n%% Should render like this\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 1 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 1 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n%% \u2502 \u2502\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 2 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Step \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 2 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n%% \u2502 \u2502\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 3 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 3 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n\n%% is for increasing the box size without having to mess with CSS settings\nStep[\"\n \n \n \nStep\n \n \n \n\"]\n\nI1[\"Input 1\"] ---> Step\nI2[\"Input 2\"] ---> Step\nI3[\"Input 3\"] ---> Step\n\nStep ---> O1[\"Output 1\"]\nStep ---> O2[\"Output 2\"]\nStep ---> O3[\"Output 3\"]\n
Step is the core abstraction of the framework. Meaning, that it is the core building block of the framework and is used to define all the operations that can be executed.
Please see the Step documentation for more details.
"},{"location":"reference/concepts/concepts.html#task","title":"Task","text":"The unit of work of one execution of the framework.
An execution usually consists of an Extract - Transform - Load
approach of one data object. Tasks typically consist of a series of Steps.
Please see the Task documentation for more details.
"},{"location":"reference/concepts/concepts.html#context","title":"Context","text":"The Context is used to configure the environment where a Task or Step runs.
It is often based on configuration files and can be used to adapt behaviour of a Task or Step based on the environment it runs in.
Please see the Context documentation for more details.
"},{"location":"reference/concepts/concepts.html#logger","title":"logger","text":"A logger object to log messages with different levels.
Please see the Logging documentation for more details.
The interactions between the base concepts of the model is visible in the below diagram:
---\ntitle: Koheesio Class Diagram\n---\nclassDiagram\n Step .. Task\n Step .. Transformation\n Step .. Reader\n Step .. Writer\n\n class Context\n\n class LoggingFactory\n\n class Task{\n <<abstract>>\n + List~Step~ steps\n ...\n + execute() Output\n }\n\n class Step{\n <<abstract>>\n ...\n Output: ...\n + execute() Output\n }\n\n class Transformation{\n <<abstract>>\n + df: DataFrame\n ...\n Output:\n + df: DataFrame\n + transform(df: DataFrame) DataFrame\n }\n\n class Reader{\n <<abstract>>\n ...\n Output:\n + df: DataFrame\n + read() DataFrame\n }\n\n class Writer{\n <<abstract>>\n + df: DataFrame\n ...\n + write(df: DataFrame)\n }
"},{"location":"reference/concepts/context.html","title":"Context in Koheesio","text":"In the Koheesio framework, the Context
class plays a pivotal role. It serves as a flexible and powerful tool for managing configuration data and shared variables across tasks and steps in your application.
Context
behaves much like a Python dictionary, but with additional features that enhance its usability and flexibility. It allows you to store and retrieve values, including complex Python objects, with ease. You can access these values using dictionary-like methods or as class attributes, providing a simple and intuitive interface.
Moreover, Context
supports nested keys and recursive merging of contexts, making it a versatile tool for managing complex configurations. It also provides serialization and deserialization capabilities, allowing you to easily save and load configurations in JSON, YAML, or TOML formats.
Whether you're setting up the environment for a Task or Step, or managing variables shared across multiple tasks, Context
provides a robust and efficient solution.
This document will guide you through its key features and show you how to leverage its capabilities in your Koheesio applications.
"},{"location":"reference/concepts/context.html#api-reference","title":"API Reference","text":"See API Reference for a detailed description of the Context
class and its methods.
"},{"location":"reference/concepts/context.html#key-features","title":"Key Features","text":" -
Accessing Values: Context
simplifies accessing configuration values. You can access them using dictionary-like methods or as class attributes. This allows for a more intuitive interaction with the Context
object. For example:
context = Context({\"bronze_table\": \"catalog.schema.table_name\"})\nprint(context.bronze_table) # Outputs: catalog.schema.table_name\n
-
Nested Keys: Context
supports nested keys, allowing you to access and add nested keys in a straightforward way. This is useful when dealing with complex configurations that require a hierarchical structure. For example:
context = Context({\"bronze\": {\"table\": \"catalog.schema.table_name\"}})\nprint(context.bronze.table) # Outputs: catalog.schema.table_name\n
-
Merging Contexts: You can merge two Contexts
together, with the incoming Context
having priority. Recursive merging is also supported. This is particularly useful when you want to update a Context
with new data without losing the existing values. For example:
context1 = Context({\"bronze_table\": \"catalog.schema.table_name\"})\ncontext2 = Context({\"silver_table\": \"catalog.schema.table_name\"})\ncontext1.merge(context2)\nprint(context1.silver_table) # Outputs: catalog.schema.table_name\n
-
Adding Keys: You can add keys to a Context by using the add
method. This allows you to dynamically update the Context
as needed. For example:
context.add(\"silver_table\", \"catalog.schema.table_name\")\n
-
Checking Key Existence: You can check if a key exists in a Context by using the contains
method. This is useful when you want to ensure a key is present before attempting to access its value. For example:
context.contains(\"silver_table\") # Returns: True\n
-
Getting Key-Value Pair: You can get a key-value pair from a Context by using the get_item
method. This can be useful when you want to extract a specific piece of data from the Context
. For example:
context.get_item(\"silver_table\") # Returns: {\"silver_table\": \"catalog.schema.table_name\"}\n
-
Converting to Dictionary: You can convert a Context to a dictionary by using the to_dict
method. This can be useful when you need to interact with code that expects a standard Python dictionary. For example:
context_dict = context.to_dict()\n
-
Creating from Dictionary: You can create a Context from a dictionary by using the from_dict
method. This allows you to easily convert existing data structures into a Context
. For example:
context = Context.from_dict({\"bronze_table\": \"catalog.schema.table_name\"})\n
"},{"location":"reference/concepts/context.html#advantages-over-a-dictionary","title":"Advantages over a Dictionary","text":"While a dictionary can be used to store configuration values, Context
provides several advantages:
-
Support for nested keys: Unlike a standard Python dictionary, Context
allows you to access nested keys as if they were attributes. This makes it easier to work with complex, hierarchical data.
-
Recursive merging of two Contexts
: Context
allows you to merge two Contexts
together, with the incoming Context
having priority. This is useful when you want to update a Context
with new data without losing the existing values.
-
Accessing keys as if they were class attributes: This provides a more intuitive way to interact with the Context
, as you can use dot notation to access values.
-
Code completion in IDEs: Because you can access keys as if they were attributes, IDEs can provide code completion for Context
keys. This can make your coding process more efficient and less error-prone.
-
Easy creation from a YAML, JSON, or TOML file: Context
provides methods to easily load data from YAML or JSON files, making it a great tool for managing configuration data.
"},{"location":"reference/concepts/context.html#data-formats-and-serialization","title":"Data Formats and Serialization","text":"Context
leverages JSON, YAML, and TOML for serialization and deserialization. These formats are widely used in the industry and provide a balance between readability and ease of use.
-
JSON: A lightweight data-interchange format that is easy for humans to read and write and easy for machines to parse and generate. It's widely used for APIs and web-based applications.
-
YAML: A human-friendly data serialization standard often used for configuration files. It's more readable than JSON and supports complex data structures.
-
TOML: A minimal configuration file format that's easy to read due to its clear and simple syntax. It's often used for configuration files in Python applications.
"},{"location":"reference/concepts/context.html#examples","title":"Examples","text":"In this section, we provide a variety of examples to demonstrate the capabilities of the Context
class in Koheesio.
"},{"location":"reference/concepts/context.html#basic-operations","title":"Basic Operations","text":"Here are some basic operations you can perform with Context
. These operations form the foundation of how you interact with a Context
object:
# Create a Context\ncontext = Context({\"bronze_table\": \"catalog.schema.table_name\"})\n\n# Access a value\nvalue = context.bronze_table\n\n# Add a key\ncontext.add(\"silver_table\", \"catalog.schema.table_name\")\n\n# Merge two Contexts\ncontext.merge(Context({\"silver_table\": \"catalog.schema.table_name\"}))\n
"},{"location":"reference/concepts/context.html#serialization-and-deserialization","title":"Serialization and Deserialization","text":"Context
supports serialization and deserialization to and from JSON, YAML, and TOML formats. This allows you to easily save and load Context
data:
# Load context from a JSON file\ncontext = Context.from_json(\"path/to/context.json\")\n\n# Save context to a JSON file\ncontext.to_json(\"path/to/context.json\")\n\n# Load context from a YAML file\ncontext = Context.from_yaml(\"path/to/context.yaml\")\n\n# Save context to a YAML file\ncontext.to_yaml(\"path/to/context.yaml\")\n\n# Load context from a TOML file\ncontext = Context.from_toml(\"path/to/context.toml\")\n\n# Save context to a TOML file\ncontext.to_toml(\"path/to/context.toml\")\n
"},{"location":"reference/concepts/context.html#nested-keys","title":"Nested Keys","text":"Context
supports nested keys, allowing you to create hierarchical configurations. This is useful when dealing with complex data structures:
# Create a Context with nested keys\ncontext = Context({\n \"database\": {\n \"bronze_table\": \"catalog.schema.bronze_table\",\n \"silver_table\": \"catalog.schema.silver_table\",\n \"gold_table\": \"catalog.schema.gold_table\"\n }\n})\n\n# Access a nested key\nprint(context.database.bronze_table) # Outputs: catalog.schema.bronze_table\n
"},{"location":"reference/concepts/context.html#recursive-merging","title":"Recursive Merging","text":"Context
also supports recursive merging, allowing you to merge two Contexts
together at all levels of their hierarchy. This is particularly useful when you want to update a Context
with new data without losing the existing values:
# Create two Contexts with nested keys\ncontext1 = Context({\n \"database\": {\n \"bronze_table\": \"catalog.schema.bronze_table\",\n \"silver_table\": \"catalog.schema.silver_table\"\n }\n})\n\ncontext2 = Context({\n \"database\": {\n \"silver_table\": \"catalog.schema.new_silver_table\",\n \"gold_table\": \"catalog.schema.gold_table\"\n }\n})\n\n# Merge the two Contexts\ncontext1.merge(context2)\n\n# Print the merged Context\nprint(context1.to_dict()) \n# Outputs: \n# {\n# \"database\": {\n# \"bronze_table\": \"catalog.schema.bronze_table\",\n# \"silver_table\": \"catalog.schema.new_silver_table\",\n# \"gold_table\": \"catalog.schema.gold_table\"\n# }\n# }\n
"},{"location":"reference/concepts/context.html#jsonpickle-and-complex-python-objects","title":"Jsonpickle and Complex Python Objects","text":"The Context
class in Koheesio also uses jsonpickle
for serialization and deserialization of complex Python objects to and from JSON. This allows you to convert complex Python objects, including custom classes, into a format that can be easily stored and transferred.
Here's an example of how this works:
# Import necessary modules\nfrom koheesio.context import Context\n\n# Initialize SnowflakeReader and store in a Context\nsnowflake_reader = SnowflakeReader(...) # fill in with necessary arguments\ncontext = Context({\"snowflake_reader\": snowflake_reader})\n\n# Serialize the Context to a JSON string\njson_str = context.to_json()\n\n# Print the serialized Context\nprint(json_str)\n\n# Deserialize the JSON string back into a Context\ndeserialized_context = Context.from_json(json_str)\n\n# Access the deserialized SnowflakeReader\ndeserialized_snowflake_reader = deserialized_context.snowflake_reader\n\n# Now you can use the deserialized SnowflakeReader as you would the original\n
This feature is particularly useful when you need to save the state of your application, transfer it over a network, or store it in a database. When you're ready to use the stored data, you can easily convert it back into the original Python objects.
However, there are a few things to keep in mind:
-
The classes you're serializing must be importable (i.e., they must be in the Python path) when you're deserializing the JSON. jsonpickle
needs to be able to import the class to reconstruct the object. This holds true for most Koheesio classes, as they are designed to be importable and reconstructible.
-
Not all Python objects can be serialized. For example, objects that hold a reference to a file or a network connection can't be serialized because their state can't be easily captured in a static file.
-
As mentioned in the code comments, jsonpickle
is not secure against malicious data. You should only deserialize data that you trust.
So, while the Context
class provides a powerful tool for handling complex Python objects, it's important to be aware of these limitations.
"},{"location":"reference/concepts/context.html#conclusion","title":"Conclusion","text":"In this document, we've covered the key features of the Context
class in the Koheesio framework, including its ability to handle complex Python objects, support for nested keys and recursive merging, and its serialization and deserialization capabilities.
Whether you're setting up the environment for a Task or Step, or managing variables shared across multiple tasks, Context
provides a robust and efficient solution.
"},{"location":"reference/concepts/context.html#further-reading","title":"Further Reading","text":"For more information, you can refer to the following resources:
- Python jsonpickle Documentation
- Python JSON Documentation
- Python YAML Documentation
- Python TOML Documentation
Refer to the API documentation for more details on the Context
class and its methods.
"},{"location":"reference/concepts/logger.html","title":"Python Logger Code Instructions","text":"Here you can find instructions on how to use the Koheesio Logging Factory.
"},{"location":"reference/concepts/logger.html#logging-factory","title":"Logging Factory","text":"The LoggingFactory
class is a factory for creating and configuring loggers. To use it, follow these steps:
-
Import the necessary modules:
from koheesio.logger import LoggingFactory\n
-
Initialize logging factory for koheesio modules:
factory = LoggingFactory(name=\"replace_koheesio_parent_name\", env=\"local\", logger_id=\"your_run_id\")\n# Or use default \nfactory = LoggingFactory()\n# Or just specify log level for koheesio modules\nfactory = LoggingFactory(level=\"DEBUG\")\n
-
Create a logger by calling the create_logger
method of the LoggingFactory
class; you can optionally inherit from the koheesio logger:
logger = LoggingFactory.get_logger(name=factory.LOGGER_NAME)\n# Or for koheesio modules\nlogger = LoggingFactory.get_logger(name=factory.LOGGER_NAME, inherit_from_koheesio=True)\n
-
You can now use the logger
object to log messages:
logger.debug(\"Debug message\")\nlogger.info(\"Info message\")\nlogger.warning(\"Warning message\")\nlogger.error(\"Error message\")\nlogger.critical(\"Critical message\")\n
-
(Optional) You can add additional handlers to the logger by calling the add_handlers
method of the LoggingFactory
class:
handlers = [\n (\"your_handler_module.YourHandlerClass\", {\"level\": \"INFO\"}),\n # Add more handlers if needed\n]\nfactory.add_handlers(handlers)\n
-
(Optional) You can create child loggers based on the parent logger by calling the get_logger
method of the LoggingFactory
class:
child_logger = factory.get_logger(name=\"your_child_logger_name\")\n
-
(Optional) Get an independent logger without inheritance
If you need an independent logger without inheriting from the LoggingFactory
logger, you can use the get_logger
method:
your_logger = factory.get_logger(name=\"your_logger_name\", inherit=False)\n
By setting inherit
to False
, you will obtain a logger that is not tied to the LoggingFactory
logger hierarchy; only the message format will be the same, and you can change that as well. This allows you to have an independent logger with its own configuration. You can use the your_logger
object to log messages:
your_logger.debug(\"Debug message\")\nyour_logger.info(\"Info message\")\nyour_logger.warning(\"Warning message\")\nyour_logger.error(\"Error message\")\nyour_logger.critical(\"Critical message\")\n
-
(Optional) You can use Masked types to mask secrets/tokens/passwords in output. The Masked types are special types provided by the koheesio library to handle sensitive data that should not be logged or printed in plain text. They wrap sensitive data and override its string representation to prevent accidental exposure. Here are some examples of how to use Masked types:
import logging\nfrom koheesio.logger import MaskedString, MaskedInt, MaskedFloat, MaskedDict\n\n# Set up logging\nlogger = logging.getLogger(__name__)\nlogging.basicConfig(level=logging.INFO)\n\n# Using MaskedString\nmasked_string = MaskedString(\"my secret string\")\nlogger.info(masked_string) # This will not log the actual string\n\n# Using MaskedInt\nmasked_int = MaskedInt(12345)\nlogger.info(masked_int) # This will not log the actual integer\n\n# Using MaskedFloat\nmasked_float = MaskedFloat(3.14159)\nlogger.info(masked_float) # This will not log the actual float\n\n# Using MaskedDict\nmasked_dict = MaskedDict({\"key\": \"value\"})\nlogger.info(masked_dict) # This will not log the actual dictionary\n
Please make sure to replace \"your_logger_name\", \"your_run_id\", \"your_handler_module.YourHandlerClass\", \"your_child_logger_name\", and other placeholders with your own values according to your application's requirements.
By following these steps, you can obtain an independent logger without inheriting from the LoggingFactory
logger. This allows you to customize the logger configuration and use it separately in your code.
Note: Ensure that you have imported the necessary modules, instantiated the LoggingFactory
class, and customized the logger name and other parameters according to your application's requirements.
"},{"location":"reference/concepts/logger.html#example","title":"Example","text":"import logging\n\n# Step 2: Instantiate the LoggingFactory class\nfactory = LoggingFactory(env=\"local\")\n\n# Step 3: Create an independent logger with a custom log level\nyour_logger = factory.get_logger(\"your_logger\", inherit_from_koheesio=False)\nyour_logger.setLevel(logging.DEBUG)\n\n# Step 4: Create a logger using the create_logger method from LoggingFactory with a different log level\nfactory_logger = LoggingFactory(level=\"WARNING\").get_logger(name=factory.LOGGER_NAME)\n\n# Step 5: Create a child logger with a debug level\nchild_logger = factory.get_logger(name=\"child\")\nchild_logger.setLevel(logging.DEBUG)\n\nchild2_logger = factory.get_logger(name=\"child2\")\nchild2_logger.setLevel(logging.INFO)\n\n# Step 6: Log messages at different levels for both loggers\nyour_logger.debug(\"Debug message\") # This message will be displayed\nyour_logger.info(\"Info message\") # This message will be displayed\nyour_logger.warning(\"Warning message\") # This message will be displayed\nyour_logger.error(\"Error message\") # This message will be displayed\nyour_logger.critical(\"Critical message\") # This message will be displayed\n\nfactory_logger.debug(\"Debug message\") # This message will not be displayed\nfactory_logger.info(\"Info message\") # This message will not be displayed\nfactory_logger.warning(\"Warning message\") # This message will be displayed\nfactory_logger.error(\"Error message\") # This message will be displayed\nfactory_logger.critical(\"Critical message\") # This message will be displayed\n\nchild_logger.debug(\"Debug message\") # This message will be displayed\nchild_logger.info(\"Info message\") # This message will be displayed\nchild_logger.warning(\"Warning message\") # This message will be displayed\nchild_logger.error(\"Error message\") # This message will be displayed\nchild_logger.critical(\"Critical message\") # This message will be displayed\n\nchild2_logger.debug(\"Debug message\") # This message will be displayed\nchild2_logger.info(\"Info message\") # This message will be displayed\nchild2_logger.warning(\"Warning message\") # This message will be displayed\nchild2_logger.error(\"Error message\") # This message will be displayed\nchild2_logger.critical(\"Critical message\") # This message will be displayed\n
Output:
[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [DEBUG] [your_logger] {__init__.py:<module>:118} - Debug message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [INFO] [your_logger] {__init__.py:<module>:119} - Info message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [WARNING] [your_logger] {__init__.py:<module>:120} - Warning message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [ERROR] [your_logger] {__init__.py:<module>:121} - Error message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [CRITICAL] [your_logger] {__init__.py:<module>:122} - Critical message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [WARNING] [koheesio] {__init__.py:<module>:126} - Warning message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [ERROR] [koheesio] {__init__.py:<module>:127} - Error message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [CRITICAL] [koheesio] {__init__.py:<module>:128} - Critical message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [DEBUG] [koheesio.child] {__init__.py:<module>:130} - Debug message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [INFO] [koheesio.child] {__init__.py:<module>:131} - Info message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [WARNING] [koheesio.child] {__init__.py:<module>:132} - Warning message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [ERROR] [koheesio.child] {__init__.py:<module>:133} - Error message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [CRITICAL] [koheesio.child] {__init__.py:<module>:134} - Critical message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [INFO] [koheesio.child2] {__init__.py:<module>:137} - Info message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [WARNING] [koheesio.child2] {__init__.py:<module>:138} - Warning message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [ERROR] [koheesio.child2] {__init__.py:<module>:139} - Error message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [CRITICAL] [koheesio.child2] {__init__.py:<module>:140} - Critical message\n
"},{"location":"reference/concepts/logger.html#loggeridfilter-class","title":"LoggerIDFilter Class","text":"The LoggerIDFilter
class is a filter that injects run_id
information into the log. To use it, follow these steps:
-
Import the necessary modules:
import logging\nfrom koheesio.logger import LoggerIDFilter  # assumed to be exposed by koheesio.logger, like LoggingFactory\n
-
Create an instance of the LoggerIDFilter
class:
logger_filter = LoggerIDFilter()\n
-
Set the LOGGER_ID
attribute of the LoggerIDFilter
class to the desired run ID:
LoggerIDFilter.LOGGER_ID = \"your_run_id\"\n
-
Add the logger_filter
to your logger or handler:
logger = logging.getLogger(\"your_logger_name\")\nlogger.addFilter(logger_filter)\n
"},{"location":"reference/concepts/logger.html#loggingfactory-set-up-optional","title":"LoggingFactory Set Up (Optional)","text":" -
Import the LoggingFactory
class in your application code.
-
Set the value for the LOGGER_FILTER
variable:
- If you want to assign a specific
logging.Filter
instance, replace None
with your desired filter instance. -
If you want to keep the default value of None
, leave it unchanged.
-
Set the value for the LOGGER_LEVEL
variable:
- If you want to use the value from the
\"KOHEESIO_LOGGING_LEVEL\"
environment variable, leave the code as is. -
If you want to use a different environment variable or a specific default value, modify the code accordingly.
-
Set the value for the LOGGER_ENV
variable:
-
Replace \"local\"
with your desired environment name.
-
Set the value for the LOGGER_FORMAT
variable:
- If you want to customize the log message format, modify the value within the double quotes.
-
The format should follow the desired log message format pattern.
-
Set the value for the LOGGER_FORMATTER
variable:
- If you want to assign a specific
Formatter
instance, replace Formatter(LOGGER_FORMAT)
with your desired formatter instance. -
If you want to keep the default formatter with the defined log message format, leave it unchanged.
-
Set the value for the CONSOLE_HANDLER
variable:
- If you want to assign a specific
logging.Handler
instance, replace None
with your desired handler instance. - If you want to keep the default value of
None
, leave it unchanged.
-
Set the value for the ENV
variable:
- Replace
None
with your desired environment value if applicable. - If you don't need to set this variable, leave it as
None
.
-
Save the changes to the file.
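Pulling the settings above together, a minimal sketch of such a set-up is shown below. It assumes the listed variables are class-level attributes on LoggingFactory; check the API reference for where these variables live in your version:
import logging\nimport os\nfrom logging import Formatter\n\nfrom koheesio.logger import LoggingFactory\n\n# Assumption: the variables described in the steps above are class-level attributes of LoggingFactory\nLoggingFactory.LOGGER_LEVEL = os.environ.get(\"KOHEESIO_LOGGING_LEVEL\", \"INFO\")\nLoggingFactory.LOGGER_ENV = \"dev\"\nLoggingFactory.LOGGER_FORMAT = \"[%(asctime)s] [%(levelname)s] [%(name)s] - %(message)s\"\nLoggingFactory.LOGGER_FORMATTER = Formatter(LoggingFactory.LOGGER_FORMAT)\nLoggingFactory.CONSOLE_HANDLER = logging.StreamHandler()\n\nfactory = LoggingFactory(name=\"my_app\", env=LoggingFactory.LOGGER_ENV)\nlogger = LoggingFactory.get_logger(name=factory.LOGGER_NAME)\n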
"},{"location":"reference/concepts/step.html","title":"Steps in Koheesio","text":"In the Koheesio framework, the Step
class and its derivatives play a crucial role. They serve as the building blocks for creating data pipelines, allowing you to define custom units of logic that can be executed. This document will guide you through its key features and show you how to leverage its capabilities in your Koheesio applications.
Several types of Steps are available in Koheesio, including Reader
, Transformation
, Writer
, and Task
.
"},{"location":"reference/concepts/step.html#what-is-a-step","title":"What is a Step?","text":"A Step
is an atomic operation serving as the building block of data pipelines built with the Koheesio framework. Tasks typically consist of a series of Steps.
A step can be seen as an operation on a set of inputs that returns a set of outputs. This does not imply that steps are stateless (data writes, for example, have side effects)! This concept is visualized in the figure below.
\nflowchart LR\n\n%% Should render like this\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 1 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 1 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n%% \u2502 \u2502\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 2 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Step \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 2 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n%% \u2502 \u2502\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 3 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 3 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n\n%% is for increasing the box size without having to mess with CSS settings\nStep[\"\n \n \n \nStep\n \n \n \n\"]\n\nI1[\"Input 1\"] ---> Step\nI2[\"Input 2\"] ---> Step\nI3[\"Input 3\"] ---> Step\n\nStep ---> O1[\"Output 1\"]\nStep ---> O2[\"Output 2\"]\nStep ---> O3[\"Output 3\"]\n
"},{"location":"reference/concepts/step.html#how-to-read-a-step","title":"How to Read a Step?","text":"A Step
in Koheesio is a class that represents a unit of work in a data pipeline. It's similar to a Python built-in data class, but with additional features for execution, validation, and logging.
When you look at a Step
, you'll typically see the following components:
-
Class Definition: The Step
is defined as a class that inherits from the base Step
class in Koheesio. For example, class MyStep(Step):
.
-
Input Fields: These are defined as class attributes with type annotations, similar to attributes in a Python data class. These fields represent the inputs to the Step
. For example, a: str
defines an input field a
of type str
. Additionally, you will often see these fields defined using Pydantic's Field
class, which allows for more detailed validation and documentation as well as default values and aliasing.
-
Output Fields: These are defined in a nested class called Output
that inherits from StepOutput
. This class represents the output of the Step
. For example, class Output(StepOutput): b: str
defines an output field b
of type str
.
-
Execute Method: This is a method that you need to implement when you create a new Step
. It contains the logic of the Step
and is where you use the input fields and populate the output fields. For example, def execute(self): self.output.b = f\"{self.a}-some-suffix\"
.
Here's an example of a Step
:
class MyStep(Step):\n    a: str  # input\n\n    class Output(StepOutput):  # output\n        b: str\n\n    def execute(self) -> \"MyStep.Output\":\n        self.output.b = f\"{self.a}-some-suffix\"\n
In this Step
, a
is an input field of type str
, b
is an output field of type str
, and the execute
method appends -some-suffix
to the input a
and assigns it to the output b
.
When you see a Step
, you can think of it as a function where the class attributes are the inputs, the Output
class defines the outputs, and the execute
method is the function body. The main difference is that a Step
also includes automatic validation of inputs and outputs (thanks to Pydantic), logging, and error handling.
"},{"location":"reference/concepts/step.html#understanding-inheritance-in-steps","title":"Understanding Inheritance in Steps","text":"Inheritance is a core concept in object-oriented programming where a class (child or subclass) inherits properties and methods from another class (parent or superclass). In the context of Koheesio, when you create a new Step
, you're creating a subclass that inherits from the base Step
class.
When a new Step is defined (like class MyStep(Step):
), it inherits all the properties and methods from the Step
class. This includes the execute
method, which is then overridden to provide the specific functionality for that Step.
Here's a simple breakdown:
-
Parent Class (Superclass): This is the Step
class in Koheesio. It provides the basic structure and functionalities of a Step, including input and output validation, logging, and error handling.
-
Child Class (Subclass): This is the new Step you define, like MyStep
. It inherits all the properties and methods from the Step
class and can add or override them as needed.
-
Inheritance: This is the process where MyStep
inherits the properties and methods from the Step
class. In Python, this is done by mentioning the parent class in parentheses when defining the child class, like class MyStep(Step):
.
-
Overriding: This is when you provide a new implementation of a method in the child class that is already defined in the parent class. In the case of Steps, you override the execute
method to define the specific logic of your Step.
Understanding inheritance is key to understanding how Steps work in Koheesio. It allows you to leverage the functionalities provided by the Step
class and focus on implementing the specific logic of your Step.
"},{"location":"reference/concepts/step.html#benefits-of-using-steps-in-data-pipelines","title":"Benefits of Using Steps in Data Pipelines","text":"The concept of a Step
is beneficial when creating Data Pipelines or Data Products for several reasons:
-
Modularity: Each Step
represents a self-contained unit of work, which makes the pipeline modular. This makes it easier to understand, test, and maintain the pipeline. If a problem arises, you can pinpoint which step is causing the issue.
-
Reusability: Steps can be reused across different pipelines. Once a Step
is defined, it can be used in any number of pipelines. This promotes code reuse and consistency across projects.
-
Readability: Steps make the pipeline code more readable. Each Step
has a clear input, output, and execution logic, which makes it easier to understand what each part of the pipeline is doing.
-
Validation: Steps automatically validate their inputs and outputs. This ensures that the data flowing into and out of each step is of the expected type and format, which can help catch errors early.
-
Logging: Steps automatically log the start and end of their execution, along with the input and output data. This can be very useful for debugging and understanding the flow of data through the pipeline.
-
Error Handling: Steps provide built-in error handling. If an error occurs during the execution of a step, it is caught, logged, and then re-raised. This provides a clear indication of where the error occurred.
-
Scalability: Steps can be easily parallelized or distributed, which is crucial for processing large datasets. This is especially true for steps that are designed to work with distributed computing frameworks like Apache Spark.
By using the concept of a Step
, you can create data pipelines that are modular, reusable, readable, and robust, while also being easier to debug and scale.
"},{"location":"reference/concepts/step.html#compared-to-a-regular-pydantic-basemodel","title":"Compared to a regular Pydantic Basemodel","text":"A Step
in Koheesio, while built on top of Pydantic's BaseModel
, provides additional features specifically designed for creating data pipelines. Here are some key differences:
-
Execution Method: A Step
includes an execute
method that needs to be implemented. This method contains the logic of the step and is automatically decorated with functionalities such as logging and output validation.
-
Input and Output Validation: A Step
uses Pydantic models to define and validate its inputs and outputs. This ensures that the data flowing into and out of the step is of the expected type and format.
-
Automatic Logging: A Step
automatically logs the start and end of its execution, along with the input and output data. This is done through the do_execute
decorator applied to the execute
method.
-
Error Handling: A Step
provides built-in error handling. If an error occurs during the execution of the step, it is caught, logged, and then re-raised. This should help in debugging and understanding the flow of data.
-
Serialization: A Step
can be serialized to a YAML string using the to_yaml
method. This can be useful for saving and loading steps.
-
Lazy Mode Support: The StepOutput
class in a Step
supports lazy mode, which allows validation of the items stored in the class to be called at will instead of being forced to run it upfront.
In contrast, a regular Pydantic BaseModel
is a simple data validation model that doesn't include these additional features. It's used for data parsing and validation, but doesn't include methods for execution, automatic logging, error handling, or serialization to YAML.
"},{"location":"reference/concepts/step.html#key-features-of-a-step","title":"Key Features of a Step","text":""},{"location":"reference/concepts/step.html#defining-a-step","title":"Defining a Step","text":"To define a new step, you subclass the Step
class and implement the execute
method. The inputs of the step can be accessed using self.input_name
. The output of the step can be accessed using self.output.output_name
. For example:
class MyStep(Step):\n input1: str = Field(...)\n input2: int = Field(...)\n\n class Output(StepOutput):\n output1: str = Field(...)\n\n def execute(self):\n # Your logic here\n self.output.output1 = \"result\"\n
"},{"location":"reference/concepts/step.html#running-a-step","title":"Running a Step","text":"To run a step, you can call the execute
method. You can also use the run
method, which is an alias to execute
. For example:
step = MyStep(input1=\"value1\", input2=2)\nstep.execute()\n
"},{"location":"reference/concepts/step.html#accessing-step-output","title":"Accessing Step Output","text":"The output of a step can be accessed using self.output.output_name
. For example:
step = MyStep(input1=\"value1\", input2=2)\nstep.execute()\nprint(step.output.output1) # Outputs: \"result\"\n
"},{"location":"reference/concepts/step.html#serializing-a-step","title":"Serializing a Step","text":"You can serialize a step to a YAML string using the to_yaml
method. For example:
step = MyStep(input1=\"value1\", input2=2)\nyaml_str = step.to_yaml()\n
"},{"location":"reference/concepts/step.html#getting-step-description","title":"Getting Step Description","text":"You can get the description of a step using the get_description
method. For example:
step = MyStep(input1=\"value1\", input2=2)\ndescription = step.get_description()\n
"},{"location":"reference/concepts/step.html#defining-a-step-with-multiple-inputs-and-outputs","title":"Defining a Step with Multiple Inputs and Outputs","text":"Here's an example of how to define a new step with multiple inputs and outputs:
class MyStep(Step):\n input1: str = Field(...)\n input2: int = Field(...)\n input3: int = Field(...)\n\n class Output(StepOutput):\n output1: str = Field(...)\n output2: int = Field(...)\n\n def execute(self):\n # Your logic here\n self.output.output1 = \"result\"\n self.output.output2 = self.input2 + self.input3\n
"},{"location":"reference/concepts/step.html#running-a-step-with-multiple-inputs","title":"Running a Step with Multiple Inputs","text":"To run a step with multiple inputs, you can do the following:
step = MyStep(input1=\"value1\", input2=2, input3=3)\nstep.execute()\n
"},{"location":"reference/concepts/step.html#accessing-multiple-step-outputs","title":"Accessing Multiple Step Outputs","text":"The outputs of a step can be accessed using self.output.output_name
. For example:
step = MyStep(input1=\"value1\", input2=2, input3=3)\nstep.execute()\nprint(step.output.output1) # Outputs: \"result\"\nprint(step.output.output2) # Outputs: 5\n
"},{"location":"reference/concepts/step.html#special-features","title":"Special Features","text":""},{"location":"reference/concepts/step.html#the-execute-method","title":"The Execute method","text":"The execute
method in the Step
class is automatically decorated with the StepMetaClass._execute_wrapper
function due to the metaclass StepMetaClass
. This provides several advantages:
-
Automatic Output Validation: The decorator ensures that the output of the execute
method is always a StepOutput
instance. This means that the output is automatically validated against the defined output model, ensuring data integrity and consistency.
-
Logging: The decorator provides automatic logging at the start and end of the execute
method. This includes logging the input and output of the step, which can be useful for debugging and understanding the flow of data.
-
Error Handling: If an error occurs during the execution of the Step
, the decorator catches the exception and logs an error message before re-raising the exception. This provides a clear indication of where the error occurred.
-
Simplifies Step Implementation: Since the decorator handles output validation, logging, and error handling, the user can focus on implementing the logic of the execute
method without worrying about these aspects.
-
Consistency: By automatically decorating the execute
method, the library ensures that these features are consistently applied across all steps, regardless of who implements them or how they are used. This makes the behavior of steps predictable and consistent.
-
Prevents Double Wrapping: The decorator checks if the function is already wrapped with StepMetaClass._execute_wrapper
and prevents double wrapping. This ensures that the decorator doesn't interfere with itself if execute
is overridden in subclasses.
Notice that you never have to explicitly return anything from the execute
method. The StepMetaClass._execute_wrapper
decorator takes care of that for you.
Below are example implementations of a custom metaclass that can be used to override the default behavior of the StepMetaClass._execute_wrapper
:
class MyMetaClass(StepMetaClass):\n    @classmethod\n    def _log_end_message(cls, step: Step, skip_logging: bool = False, *args, **kwargs):\n        print(\"It's me from custom meta class\")\n        super()._log_end_message(step, skip_logging, *args, **kwargs)\n\nclass MyMetaClass2(StepMetaClass):\n    @classmethod\n    def _validate_output(cls, step: Step, skip_validating: bool = False, *args, **kwargs):\n        # always add a dummy value to the output\n        step.output.dummy_value = \"dummy\"\n\nclass YourClassWithCustomMeta(Step, metaclass=MyMetaClass):\n    def execute(self):\n        self.log.info(f\"This is from the execute method of {self.__class__.__name__}\")\n\nclass YourClassWithCustomMeta2(Step, metaclass=MyMetaClass2):\n    def execute(self):\n        self.log.info(f\"This is from the execute method of {self.__class__.__name__}\")\n
"},{"location":"reference/concepts/step.html#sparkstep","title":"SparkStep","text":"The SparkStep
class is a subclass of Step
that is designed for steps that interact with Spark. It extends the Step
class with SparkSession support. Spark steps are expected to return a Spark DataFrame as output. The spark
property is available to access the active SparkSession instance. Output
in a SparkStep
is expected to be a DataFrame
although optional.
"},{"location":"reference/concepts/step.html#using-a-sparkstep","title":"Using a SparkStep","text":"Here's an example of how to use a SparkStep
:
class MySparkStep(SparkStep):\n input1: str = Field(...)\n\n class Output(StepOutput):\n output1: DataFrame = Field(...)\n\n def execute(self):\n # Your logic here\n df = self.spark.read.text(self.input1)\n self.output.output1 = df\n
To run a SparkStep
, you can do the following:
step = MySparkStep(input1=\"path/to/textfile\")\nstep.execute()\n
To access the output of a SparkStep
, you can do the following:
step = MySparkStep(input1=\"path/to/textfile\")\nstep.execute()\ndf = step.output.output1\ndf.show()\n
"},{"location":"reference/concepts/step.html#conclusion","title":"Conclusion","text":"In this document, we've covered the key features of the Step
class in the Koheesio framework, including its ability to define custom units of logic, manage inputs and outputs, and support for serialization. The automatic decoration of the execute
method provides several advantages that simplify step implementation and ensure consistency across all steps.
Whether you're defining a new operation in your data pipeline or managing the flow of data between steps, Step
provides a robust and efficient solution.
"},{"location":"reference/concepts/step.html#further-reading","title":"Further Reading","text":"For more information, you can refer to the following resources:
- Python Pydantic Documentation
- Python YAML Documentation
Refer to the API documentation for more details on the Step
class and its methods.
"},{"location":"reference/spark/readers.html","title":"Reader Module","text":"The Reader
module in Koheesio provides a set of classes for reading data from various sources. A Reader
is a type of SparkStep
that reads data from a source based on the input parameters and stores the result in self.output.df
for subsequent steps.
"},{"location":"reference/spark/readers.html#what-is-a-reader","title":"What is a Reader?","text":"A Reader
is a subclass of SparkStep
that reads data from a source and stores the result. The source could be a file, a database, a web API, or any other data source. The result is stored in a DataFrame, which is accessible through the df
property of the Reader
.
"},{"location":"reference/spark/readers.html#api-reference","title":"API Reference","text":"See API Reference for a detailed description of the Reader
class and its methods.
"},{"location":"reference/spark/readers.html#key-features-of-a-reader","title":"Key Features of a Reader","text":" - Read Method: The
Reader
class provides a read
method that calls the execute
method and returns the result. Essentially, calling .read()
is a shorthand for calling .execute().output.df
. This allows you to read data from a Reader
without having to call the execute
method directly. This is a convenience method that simplifies the usage of a Reader
.
Here's an example of how to use the .read()
method:
# Instantiate the Reader\nmy_reader = MyReader()\n\n# Use the .read() method to get the data as a DataFrame\ndf = my_reader.read()\n\n# Now df is a DataFrame with the data read by MyReader\n
In this example, MyReader
is a subclass of Reader
that you've defined. After creating an instance of MyReader
, you call the .read()
method to read the data and get it back as a DataFrame. The DataFrame df
now contains the data read by MyReader
.
- DataFrame Property: The
Reader
class provides a df
property as a shorthand for accessing self.output.df
. If self.output.df
is None
, the execute
method is run first. This property ensures that the data is loaded and ready to be used, even if the execute
method hasn't been explicitly called.
Here's an example of how to use the df
property:
# Instantiate the Reader\nmy_reader = MyReader()\n\n# Use the df property to get the data as a DataFrame\ndf = my_reader.df\n\n# Now df is a DataFrame with the data read by MyReader\n
In this example, MyReader
is a subclass of Reader
that you've defined. After creating an instance of MyReader
, you access the df
property to get the data as a DataFrame. The DataFrame df
now contains the data read by MyReader
.
- SparkSession: Every
Reader
has a SparkSession
available as self.spark
. This is the currently active SparkSession
, which can be used to perform distributed data processing tasks.
Here's an example of how to use the spark
property:
# Instantiate the Reader\nmy_reader = MyReader()\n\n# Use the spark property to get the SparkSession\nspark = my_reader.spark\n\n# Now spark is the SparkSession associated with MyReader\n
In this example, MyReader
is a subclass of Reader
that you've defined. After creating an instance of MyReader
, you access the spark
property to get the SparkSession
. The SparkSession
spark
can now be used to perform distributed data processing tasks.
"},{"location":"reference/spark/readers.html#how-to-define-a-reader","title":"How to Define a Reader?","text":"To define a Reader
, you create a subclass of the Reader
class and implement the execute
method. The execute
method should read from the source and store the result in self.output.df
. This is an abstract method, which means it must be implemented in any subclass of Reader
.
Here's an example of a Reader
:
class MyReader(Reader):\n def execute(self):\n # read data from source\n data = read_from_source()\n # store result in self.output.df\n self.output.df = data\n
"},{"location":"reference/spark/readers.html#understanding-inheritance-in-readers","title":"Understanding Inheritance in Readers","text":"Just like a Step
, a Reader
is defined as a subclass that inherits from the base Reader
class. This means it inherits all the properties and methods from the Reader
class and can add or override them as needed. The main method that needs to be overridden is the execute
method, which should implement the logic for reading data from the source and storing it in self.output.df
.
"},{"location":"reference/spark/readers.html#benefits-of-using-readers-in-data-pipelines","title":"Benefits of Using Readers in Data Pipelines","text":"Using Reader
classes in your data pipelines has several benefits:
-
Simplicity: Readers abstract away the details of reading data from various sources, allowing you to focus on the logic of your pipeline.
-
Consistency: By using Readers, you ensure that data is read in a consistent manner across different parts of your pipeline.
-
Flexibility: Readers can be easily swapped out for different data sources without changing the rest of your pipeline.
-
Efficiency: Readers automatically manage resources like connections and file handles, ensuring efficient use of resources.
By using the concept of a Reader
, you can create data pipelines that are simple, consistent, flexible, and efficient.
"},{"location":"reference/spark/readers.html#examples-of-reader-classes-in-koheesio","title":"Examples of Reader Classes in Koheesio","text":"Koheesio provides a variety of Reader
subclasses for reading data from different sources. Here are just a few examples:
-
Teradata Reader: A Reader
subclass for reading data from Teradata databases. It's defined in the koheesio/steps/readers/teradata.py
file.
-
Snowflake Reader: A Reader
subclass for reading data from Snowflake databases. It's defined in the koheesio/steps/readers/snowflake.py
file.
-
Box Reader: A Reader
subclass for reading data from Box. It's defined in the koheesio/steps/integrations/box.py
file.
These are just a few examples of the many Reader
subclasses available in Koheesio. Each Reader
subclass is designed to read data from a specific source. They all inherit from the base Reader
class and implement the execute
method to read data from their respective sources and store it in self.output.df
.
Please note that this is not an exhaustive list. Koheesio provides many more Reader
subclasses for a wide range of data sources. For a complete list, please refer to the Koheesio documentation or the source code.
More readers can be found in the koheesio/steps/readers
module.
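As a hedged usage sketch, reading with one of these classes typically looks like the following; the import path follows the file location listed above, and the constructor arguments are left as a placeholder because they depend on the specific reader:
from koheesio.steps.readers.snowflake import SnowflakeReader  # path per the file location listed above\n\n# fill in with the connection/query arguments required by your source\nreader = SnowflakeReader(...)\n\n# .read() is shorthand for .execute().output.df, as described earlier\ndf = reader.read()\ndf.show()\n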
"},{"location":"reference/spark/transformations.html","title":"Transformation Module","text":"The Transformation
module in Koheesio provides a set of classes for transforming data within a DataFrame. A Transformation
is a type of SparkStep
that takes a DataFrame as input, applies a transformation, and returns a DataFrame as output. The transformation logic is implemented in the execute
method of each Transformation
subclass.
"},{"location":"reference/spark/transformations.html#what-is-a-transformation","title":"What is a Transformation?","text":"A Transformation
is a subclass of SparkStep
that applies a transformation to a DataFrame and stores the result. The transformation could be any operation that modifies the data or structure of the DataFrame, such as adding a new column, filtering rows, or aggregating data.
Using Transformation
classes ensures that data is transformed in a consistent manner across different parts of your pipeline. This can help avoid errors and make your code easier to understand and maintain.
"},{"location":"reference/spark/transformations.html#api-reference","title":"API Reference","text":"See API Reference for a detailed description of the Transformation
classes and their methods.
"},{"location":"reference/spark/transformations.html#types-of-transformations","title":"Types of Transformations","text":"There are three main types of transformations in Koheesio:
-
Transformation
: This is the base class for all transformations. It takes a DataFrame as input and returns a DataFrame as output. The transformation logic is implemented in the execute
method.
-
ColumnsTransformation
: This is an extended Transformation
class with a preset validator for handling column(s) data. It standardizes the input for a single column or multiple columns. If more than one column is passed, the transformation will be run in a loop against all the given columns.
-
ColumnsTransformationWithTarget
: This is an extended ColumnsTransformation
class with an additional target_column
field. This field can be used to store the result of the transformation in a new column. If the target_column
is not provided, the result will be stored in the source column.
Each type of transformation has its own use cases and advantages. The right one to use depends on the specific requirements of your data pipeline.
"},{"location":"reference/spark/transformations.html#how-to-define-a-transformation","title":"How to Define a Transformation","text":"To define a Transformation
, you create a subclass of the Transformation
class and implement the execute
method. The execute
method should take a DataFrame from self.input.df
, apply a transformation, and store the result in self.output.df
.
Transformation
classes abstract away some of the details of transforming data, allowing you to focus on the logic of your pipeline. This can make your code cleaner and easier to read.
Here's an example of a Transformation
:
class MyTransformation(Transformation):\n def execute(self):\n # get data from self.input.df\n data = self.input.df\n # apply transformation\n transformed_data = apply_transformation(data)\n # store result in self.output.df\n self.output.df = transformed_data\n
In this example, MyTransformation
is a subclass of Transformation
that you've defined. The execute
method gets the data from self.input.df
, applies a transformation called apply_transformation
(undefined in this example), and stores the result in self.output.df
.
"},{"location":"reference/spark/transformations.html#how-to-define-a-columnstransformation","title":"How to Define a ColumnsTransformation","text":"To define a ColumnsTransformation
, you create a subclass of the ColumnsTransformation
class and implement the execute
method. The execute
method should apply a transformation to the specified columns of the DataFrame.
ColumnsTransformation
classes can be easily swapped out for different data transformations without changing the rest of your pipeline. This can make your pipeline more flexible and easier to modify or extend.
Here's an example of a ColumnsTransformation
:
from pyspark.sql import functions as f\nfrom koheesio.steps.transformations import ColumnsTransformation\n\nclass AddOne(ColumnsTransformation):\n def execute(self):\n for column in self.get_columns():\n self.output.df = self.df.withColumn(column, f.col(column) + 1)\n
In this example, AddOne
is a subclass of ColumnsTransformation
that you've defined. The execute
method adds 1 to each column in self.get_columns()
.
The ColumnsTransformation
class has a ColumnConfig
class that can be used to configure the behavior of the class. This class has the following fields:
run_for_all_data_type
: Allows to run the transformation for all columns of a given type. limit_data_type
: Allows to limit the transformation to a specific data type. data_type_strict_mode
: Toggles strict mode for data type validation. Will only work if limit_data_type
is set.
Note that data types need to be specified as a SparkDatatype
enum. Users should not have to interact with the ColumnConfig
class directly.
"},{"location":"reference/spark/transformations.html#how-to-define-a-columnstransformationwithtarget","title":"How to Define a ColumnsTransformationWithTarget","text":"To define a ColumnsTransformationWithTarget
, you create a subclass of the ColumnsTransformationWithTarget
class and implement the func
method. The func
method should return the transformation that will be applied to the column(s). The execute
method, which is already preset, will use the get_columns_with_target
method to loop over all the columns and apply this function to transform the DataFrame.
Here's an example of a ColumnsTransformationWithTarget
:
from pyspark.sql import Column\nfrom koheesio.steps.transformations import ColumnsTransformationWithTarget\n\nclass AddOneWithTarget(ColumnsTransformationWithTarget):\n def func(self, col: Column):\n return col + 1\n
In this example, AddOneWithTarget
is a subclass of ColumnsTransformationWithTarget
that you've defined. The func
method adds 1 to the values of a given column.
The ColumnsTransformationWithTarget
class has an additional target_column
field. This field can be used to store the result of the transformation in a new column. If the target_column
is not provided, the result will be stored in the source column. If more than one column is passed, the target_column
will be used as a suffix. Leaving this blank will result in the original columns being renamed.
The ColumnsTransformationWithTarget
class also has a get_columns_with_target
method. This method returns an iterator of the columns and handles the target_column
as well.
"},{"location":"reference/spark/transformations.html#key-features-of-a-transformation","title":"Key Features of a Transformation","text":" -
Execute Method: The Transformation
class provides an execute
method to implement in your subclass. This method should take a DataFrame from self.input.df
, apply a transformation, and store the result in self.output.df
.
For ColumnsTransformation
and ColumnsTransformationWithTarget
, the execute
method is already implemented in the base class. Instead of overriding execute
, you implement a func
method in your subclass. This func
method should return the transformation to be applied to each column. The execute
method will then apply this func to each column in a loop.
-
DataFrame Property: The Transformation
class provides a df
property as a shorthand for accessing self.input.df
. This property ensures that the data is ready to be transformed, even if the execute
method hasn't been explicitly called. This is useful for 'early validation' of the input data.
-
SparkSession: Every Transformation
has a SparkSession
available as self.spark
. This is the currently active SparkSession
, which can be used to perform distributed data processing tasks.
-
Columns Property: The ColumnsTransformation
and ColumnsTransformationWithTarget
classes provide a columns
property. This property standardizes the input for a single column or multiple columns. If more than one column is passed, the transformation will be run in a loop against all the given columns.
-
Target Column Property: The ColumnsTransformationWithTarget
class provides a target_column
property. This field can be used to store the result of the transformation in a new column. If the target_column
is not provided, the result will be stored in the source column.
"},{"location":"reference/spark/transformations.html#examples-of-transformation-classes-in-koheesio","title":"Examples of Transformation Classes in Koheesio","text":"Koheesio provides a variety of Transformation
subclasses for transforming data in different ways. Here are some examples:
-
DataframeLookup
: This transformation joins two dataframes together based on a list of join mappings. It allows you to specify the join type and join hint, and it supports selecting specific target columns from the right dataframe.
Here's an example of how to use the DataframeLookup
transformation:
from pyspark.sql import SparkSession\nfrom koheesio.steps.transformations import DataframeLookup, JoinMapping, TargetColumn, JoinType\n\nspark = SparkSession.builder.getOrCreate()\nleft_df = spark.createDataFrame([(1, \"A\"), (2, \"B\")], [\"id\", \"value\"])\nright_df = spark.createDataFrame([(1, \"A\"), (3, \"C\")], [\"id\", \"value\"])\n\nlookup = DataframeLookup(\n df=left_df,\n other=right_df,\n on=JoinMapping(source_column=\"id\", other_column=\"id\"),\n targets=TargetColumn(target_column=\"value\", target_column_alias=\"right_value\"),\n how=JoinType.LEFT,\n)\n\noutput_df = lookup.execute().df\n
-
HashUUID5
: This transformation is a subclass of Transformation
and provides an interface to generate a UUID5 hash for each row in the DataFrame. The hash is generated based on the values of the specified source columns.
Here's an example of how to use the HashUUID5
transformation:
from pyspark.sql import SparkSession\nfrom koheesio.steps.transformations import HashUUID5\n\nspark = SparkSession.builder.getOrCreate()\ndf = spark.createDataFrame([(1, \"A\"), (2, \"B\")], [\"id\", \"value\"])\n\nhash_transform = HashUUID5(\n df=df,\n source_columns=[\"id\", \"value\"],\n target_column=\"hash\"\n)\n\noutput_df = hash_transform.execute().df\n
In this example, HashUUID5
is a subclass of Transformation
. After creating an instance of HashUUID5
, you call the execute
method to apply the transformation. The execute
method generates a UUID5 hash for each row in the DataFrame based on the values of the id
and value
columns and stores the result in a new column named hash
.
"},{"location":"reference/spark/transformations.html#benefits-of-using-koheesio-transformations","title":"Benefits of using Koheesio Transformations","text":"Using a Koheesio Transformation
over plain Spark provides several benefits:
-
Consistency: By using Transformation
classes, you ensure that data is transformed in a consistent manner across different parts of your pipeline. This can help avoid errors and make your code easier to understand and maintain.
-
Abstraction: Transformation
classes abstract away the details of transforming data, allowing you to focus on the logic of your pipeline. This can make your code cleaner and easier to read.
-
Flexibility: Transformation
classes can be easily swapped out for different data transformations without changing the rest of your pipeline. This can make your pipeline more flexible and easier to modify or extend.
-
Early Input Validation: As a Transformation
is a type of SparkStep
, which in turn is a Step
and a type of Pydantic BaseModel
, all inputs are validated when an instance of a Transformation
class is created. This early validation helps catch errors related to invalid input, such as an invalid column name, before the PySpark pipeline starts executing. This can help avoid unnecessary computation and make your data pipelines more robust and reliable.
-
Ease of Testing: Transformation
classes are designed to be easily testable. This can make it easier to write unit tests for your data pipeline, helping to ensure its correctness and reliability.
-
Robustness: Koheesio has been extensively tested with hundreds of unit tests, ensuring that the Transformation
classes work as expected under a wide range of conditions. This makes your data pipelines more robust and less likely to fail due to unexpected inputs or edge cases.
By using the concept of a Transformation
, you can create data pipelines that are simple, consistent, flexible, efficient, and reliable.
"},{"location":"reference/spark/transformations.html#advanced-usage-of-transformations","title":"Advanced Usage of Transformations","text":"Transformations can be combined and chained together to create complex data processing pipelines. Here's an example of how to chain transformations:
from pyspark.sql import SparkSession\nfrom koheesio.steps.transformations import HashUUID5\nfrom koheesio.steps.transformations import DataframeLookup, JoinMapping, TargetColumn, JoinType\n\n# Create a SparkSession\nspark = SparkSession.builder.getOrCreate()\n\n# Define two DataFrames\ndf1 = spark.createDataFrame([(1, \"A\"), (2, \"B\")], [\"id\", \"value\"])\ndf2 = spark.createDataFrame([(1, \"C\"), (3, \"D\")], [\"id\", \"value\"])\n\n# Define the first transformation\nlookup = DataframeLookup(\n other=df2,\n on=JoinMapping(source_column=\"id\", other_column=\"id\"),\n targets=TargetColumn(target_column=\"value\", target_column_alias=\"right_value\"),\n how=JoinType.LEFT,\n)\n\n# Apply the first transformation\noutput_df = lookup.transform(df1)\n\n# Define the second transformation\nhash_transform = HashUUID5(\n source_columns=[\"id\", \"value\", \"right_value\"],\n target_column=\"hash\"\n)\n\n# Apply the second transformation\noutput_df2 = hash_transform.transform(output_df)\n
In this example, DataframeLookup
is a subclass of ColumnsTransformation
and HashUUID5
is a subclass of Transformation
. After creating instances of DataframeLookup
and HashUUID5
, you call the transform
method to apply each transformation. The transform
method of DataframeLookup
performs a left join with df2
on the id
column and adds the value
column from df2
to the result DataFrame as right_value
. The transform
method of HashUUID5
generates a UUID5 hash for each row in the DataFrame based on the values of the id
, value
, and right_value
columns and stores the result in a new column named hash
.
"},{"location":"reference/spark/transformations.html#troubleshooting-transformations","title":"Troubleshooting Transformations","text":"If you encounter an error when using a transformation, here are some steps you can take to troubleshoot:
-
Check the Input Data: Make sure the input DataFrame to the transformation is correct. You can use the show
method of the DataFrame to print the first few rows of the DataFrame.
-
Check the Transformation Parameters: Make sure the parameters passed to the transformation are correct. For example, if you're using a DataframeLookup
, make sure the join mappings and target columns are correctly specified.
-
Check the Transformation Logic: If the input data and parameters are correct, there might be an issue with the transformation logic. You can use PySpark's logging utilities to log intermediate results and debug the transformation logic.
-
Check the Output Data: If the transformation executes without errors but the output data is not as expected, you can use the show
method of the DataFrame to print the first few rows of the output DataFrame. This can help you identify any issues with the transformation logic.
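Most of these checks boil down to standard PySpark inspection calls; here is a short sketch, where my_transformation and input_df are placeholders for your own objects:
# 1. Inspect the input data\ninput_df.show(5)\ninput_df.printSchema()\n\n# 2./3. Run the transformation with known-good parameters and inspect intermediate results\noutput_df = my_transformation.transform(input_df)\n\n# 4. Inspect the output data\noutput_df.show(5)\noutput_df.printSchema()\n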
"},{"location":"reference/spark/transformations.html#conclusion","title":"Conclusion","text":"The Transformation
module in Koheesio provides a powerful and flexible way to transform data in a DataFrame. By using Transformation
classes, you can create data pipelines that are simple, consistent, flexible, efficient, and reliable. Whether you're performing simple transformations like adding a new column, or complex transformations like joining multiple DataFrames, the Transformation
module has you covered.
"},{"location":"reference/spark/writers.html","title":"Writer Module","text":"The Writer
module in Koheesio provides a set of classes for writing data to various destinations. A Writer
is a type of SparkStep
that takes data from self.input.df
and writes it to a destination based on the output parameters.
"},{"location":"reference/spark/writers.html#what-is-a-writer","title":"What is a Writer?","text":"A Writer
is a subclass of SparkStep
that writes data to a destination. The data to be written is taken from a DataFrame, which is accessible through the df
property of the Writer
.
"},{"location":"reference/spark/writers.html#how-to-define-a-writer","title":"How to Define a Writer?","text":"To define a Writer
, you create a subclass of the Writer
class and implement the execute
method. The execute
method should take data from self.input.df
and write it to the destination.
Here's an example of a Writer
:
class MyWriter(Writer):\n def execute(self):\n # get data from self.input.df\n data = self.input.df\n # write data to destination\n write_to_destination(data)\n
"},{"location":"reference/spark/writers.html#key-features-of-a-writer","title":"Key Features of a Writer","text":" -
Write Method: The Writer
class provides a write
method that calls the execute
method and writes the data to the destination. Essentially, calling .write()
is a shorthand for calling .execute() and letting the Writer handle the write. This allows you to write data with a Writer
without having to call the execute
method directly. This is a convenience method that simplifies the usage of a Writer
.
Here's an example of how to use the .write()
method:
# Instantiate the Writer\nmy_writer = MyWriter()\n\n# Use the .write() method to write the data\nmy_writer.write()\n\n# The data from MyWriter's DataFrame is now written to the destination\n
In this example, MyWriter
is a subclass of Writer
that you've defined. After creating an instance of MyWriter
, you call the .write()
method to write the data to the destination. The data from MyWriter
's DataFrame is now written to the destination.
-
DataFrame Property: The Writer
class provides a df
property as a shorthand for accessing self.input.df
. This property ensures that the data is ready to be written, even if the execute
method hasn't been explicitly called.
Here's an example of how to use the df
property:
# Instantiate the Writer\nmy_writer = MyWriter()\n\n# Use the df property to get the data as a DataFrame\ndf = my_writer.df\n\n# Now df is a DataFrame with the data that will be written by MyWriter\n
In this example, MyWriter
is a subclass of Writer
that you've defined. After creating an instance of MyWriter
, you access the df
property to get the data as a DataFrame. The DataFrame df
now contains the data that will be written by MyWriter
.
-
SparkSession: Every Writer
has a SparkSession
available as self.spark
. This is the currently active SparkSession
, which can be used to perform distributed data processing tasks.
Here's an example of how to use the spark
property:
# Instantiate the Writer\nmy_writer = MyWriter()\n\n# Use the spark property to get the SparkSession\nspark = my_writer.spark\n\n# Now spark is the SparkSession associated with MyWriter\n
In this example, MyWriter
is a subclass of Writer
that you've defined. After creating an instance of MyWriter
, you access the spark
property to get the SparkSession
. The SparkSession
spark
can now be used to perform distributed data processing tasks.
"},{"location":"reference/spark/writers.html#understanding-inheritance-in-writers","title":"Understanding Inheritance in Writers","text":"Just like a Step
, a Writer
is defined as a subclass that inherits from the base Writer
class. This means it inherits all the properties and methods from the Writer
class and can add or override them as needed. The main method that needs to be overridden is the execute
method, which should implement the logic for writing data from self.input.df
to the destination.
"},{"location":"reference/spark/writers.html#examples-of-writer-classes-in-koheesio","title":"Examples of Writer Classes in Koheesio","text":"Koheesio provides a variety of Writer
subclasses for writing data to different destinations. Here are just a few examples:
BoxFileWriter
DeltaTableStreamWriter
DeltaTableWriter
DummyWriter
ForEachBatchStreamWriter
KafkaWriter
SnowflakeWriter
StreamWriter
Please note that this is not an exhaustive list. Koheesio provides many more Writer
subclasses for a wide range of data destinations. For a complete list, please refer to the Koheesio documentation or the source code.
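As a hedged sketch of how one of these writers might be used: the import path matches the partitioning example later in this documentation, while passing the DataFrame through the df field and the table argument are assumptions to be checked against the API reference:
from koheesio.steps.writers.delta import DeltaTableWriter\n\n# some_df is a placeholder for an existing Spark DataFrame\nwriter = DeltaTableWriter(table=\"my_table\", df=some_df)  # supplying df as an input field is an assumption\nwriter.write()  # runs execute() and writes the DataFrame to the Delta table\n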
"},{"location":"reference/spark/writers.html#benefits-of-using-writers-in-data-pipelines","title":"Benefits of Using Writers in Data Pipelines","text":"Using Writer
classes in your data pipelines has several benefits:
- Simplicity: Writers abstract away the details of writing data to various destinations, allowing you to focus on the logic of your pipeline.
- Consistency: By using Writers, you ensure that data is written in a consistent manner across different parts of your pipeline.
- Flexibility: Writers can be easily swapped out for different data destinations without changing the rest of your pipeline.
- Efficiency: Writers automatically manage resources like connections and file handles, ensuring efficient use of resources.
- Early Input Validation: As a
Writer
is a type of SparkStep
, which in turn is a Step
and a type of Pydantic BaseModel
, all inputs are validated when an instance of a Writer
class is created. This early validation helps catch errors related to invalid input, such as an invalid URL for a database, before the PySpark pipeline starts executing. This can help avoid unnecessary computation and make your data pipelines more robust and reliable.
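A minimal sketch of this behaviour (the JdbcTableWriter class and its url field are hypothetical): constructing the Writer with a missing required field raises a Pydantic ValidationError immediately, before any Spark job runs:
from pydantic import ValidationError\nfrom koheesio.spark.writers import Writer\n\nclass JdbcTableWriter(Writer):\n \"\"\"Hypothetical Writer that needs a database url.\"\"\"\n url: str\n\n def execute(self):\n ... # write self.df to the database at self.url\n\ntry:\n JdbcTableWriter() # 'url' is missing\nexcept ValidationError as e:\n print(e) # validation fails here, before any Spark code is executed\n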
By using the concept of a Writer
, you can create data pipelines that are simple, consistent, flexible, efficient, and reliable.
"},{"location":"tutorials/advanced-data-processing.html","title":"Advanced Data Processing with Koheesio","text":"In this guide, we will explore some advanced data processing techniques using Koheesio. We will cover topics such as complex transformations, handling large datasets, and optimizing performance.
"},{"location":"tutorials/advanced-data-processing.html#complex-transformations","title":"Complex Transformations","text":"Koheesio provides a variety of built-in transformations, but sometimes you may need to perform more complex operations on your data. In such cases, you can create custom transformations.
Here's an example of a custom transformation that normalizes a column in a DataFrame:
from pyspark.sql import DataFrame\nfrom koheesio.spark.transformations.transform import Transform\n\ndef normalize_column(df: DataFrame, column: str) -> DataFrame:\n max_value = df.agg({column: \"max\"}).collect()[0][0]\n min_value = df.agg({column: \"min\"}).collect()[0][0]\n return df.withColumn(column, (df[column] - min_value) / (max_value - min_value))\n\n\nclass NormalizeColumnTransform(Transform):\n column: str\n\n def transform(self, df: DataFrame) -> DataFrame:\n return normalize_column(df, self.column)\n
"},{"location":"tutorials/advanced-data-processing.html#handling-large-datasets","title":"Handling Large Datasets","text":"When working with large datasets, it's important to manage resources effectively to ensure good performance. Koheesio provides several features to help with this.
"},{"location":"tutorials/advanced-data-processing.html#partitioning","title":"Partitioning","text":"Partitioning is a technique that divides your data into smaller, more manageable pieces, called partitions. Koheesio allows you to specify the partitioning scheme for your data when writing it to a target.
from koheesio.steps.writers.delta import DeltaTableWriter\nfrom koheesio.tasks.etl_task import EtlTask\n\nclass MyTask(EtlTask):\n target = DeltaTableWriter(table=\"my_table\", partitionBy=[\"column1\", \"column2\"])\n
"},{"location":"tutorials/getting-started.html","title":"Getting Started with Koheesio","text":""},{"location":"tutorials/getting-started.html#requirements","title":"Requirements","text":" - Python 3.9+
"},{"location":"tutorials/getting-started.html#installation","title":"Installation","text":""},{"location":"tutorials/getting-started.html#poetry","title":"Poetry","text":"If you're using Poetry, add the following entry to the pyproject.toml
file:
pyproject.toml[[tool.poetry.source]]\nname = \"nike\"\nurl = \"https://artifactory.nike.com/artifactory/api/pypi/python-virtual/simple\"\nsecondary = true\n
poetry add koheesio\n
"},{"location":"tutorials/getting-started.html#pip","title":"pip","text":"If you're using pip, run the following command to install Koheesio:
pip install koheesio\n
"},{"location":"tutorials/getting-started.html#basic-usage","title":"Basic Usage","text":"Once you've installed Koheesio, you can start using it in your Python scripts. Here's a basic example:
from koheesio import Step\n\n# Define a step\nclass MyStep(Step):\n def execute(self):\n # Your step logic here\n pass\n\n# Create an instance of the step\nstep = MyStep()\n\n# Run the step\nstep.execute()\n
"},{"location":"tutorials/getting-started.html#advanced-usage","title":"Advanced Usage","text":"from pyspark.sql.functions import lit\nfrom pyspark.sql import DataFrame, SparkSession\n\n# Step 1: import Koheesio dependencies\nfrom koheesio.context import Context\nfrom koheesio.steps.readers.dummy import DummyReader\nfrom koheesio.steps.transformations.camel_to_snake import CamelToSnakeTransformation\nfrom koheesio.steps.writers.dummy import DummyWriter\nfrom koheesio.tasks.etl_task import EtlTask\n\n# Step 2: Set up a SparkSession\nspark = SparkSession.builder.getOrCreate()\n\n# Step 3: Configure your Context\ncontext = Context({\n \"source\": DummyReader(),\n \"transformations\": [CamelToSnakeTransformation()],\n \"target\": DummyWriter(),\n \"my_favorite_movie\": \"inception\",\n})\n\n# Step 4: Create a Task\nclass MyFavoriteMovieTask(EtlTask):\n my_favorite_movie: str\n\n def transform(self, df: DataFrame = None) -> DataFrame:\n df = df.withColumn(\"MyFavoriteMovie\", lit(self.my_favorite_movie))\n return super().transform(df)\n\n# Step 5: Run your Task\ntask = MyFavoriteMovieTask(**context)\ntask.run()\n
"},{"location":"tutorials/getting-started.html#contributing","title":"Contributing","text":"If you want to contribute to Koheesio, check out the CONTRIBUTING.md
file in this repository. It contains guidelines for contributing, including how to submit issues and pull requests.
"},{"location":"tutorials/getting-started.html#testing","title":"Testing","text":"To run the tests for Koheesio, use the following command:
make dev-test\n
This will run all the tests in the tests
directory.
"},{"location":"tutorials/hello-world.html","title":"Simple Examples","text":""},{"location":"tutorials/hello-world.html#creating-a-custom-step","title":"Creating a Custom Step","text":"This example demonstrates how to use the SparkStep
class from the koheesio
library to create a custom step named HelloWorldStep
.
"},{"location":"tutorials/hello-world.html#code","title":"Code","text":"from koheesio.steps.step import SparkStep\n\nclass HelloWorldStep(SparkStep):\n message: str\n\n def execute(self) -> SparkStep.Output:\n # create a DataFrame with a single row containing the message\n self.output.df = self.spark.createDataFrame([(1, self.message)], [\"id\", \"message\"])\n
"},{"location":"tutorials/hello-world.html#usage","title":"Usage","text":"hello_world_step = HelloWorldStep(message=\"Hello, World!\")\nhello_world_step.execute()\n\nhello_world_step.output.df.show()\n
"},{"location":"tutorials/hello-world.html#understanding-the-code","title":"Understanding the Code","text":"The HelloWorldStep
class is a SparkStep
in Koheesio, designed to generate a DataFrame with a single row containing a custom message. Here's a more detailed overview:
HelloWorldStep
inherits from SparkStep
, a fundamental building block in Koheesio for creating data processing steps with Apache Spark. - It has a
message
attribute. When creating an instance of HelloWorldStep
, you can pass a custom message that will be used in the DataFrame. SparkStep
has a spark
attribute, which is the active SparkSession. This is the entry point for any Spark functionality, allowing the step to interact with the Spark cluster. SparkStep
also includes an Output
class, used to store the output of the step. In this case, Output
has a df
attribute to store the output DataFrame. - The
execute
method creates a DataFrame with the custom message and stores it in output.df
. It doesn't return a value explicitly; instead, the output DataFrame can be accessed via output.df
. - Koheesio uses pydantic for automatic validation of the step's input and output, ensuring they are correctly defined and of the correct types.
Note: Pydantic is a data validation library that provides a way to validate that the data (in this case, the input and output of the step) conforms to the expected format.
"},{"location":"tutorials/hello-world.html#creating-a-custom-task","title":"Creating a Custom Task","text":"This example demonstrates how to use the EtlTask
from the koheesio
library to create a custom task named MyFavoriteMovieTask
.
"},{"location":"tutorials/hello-world.html#code_1","title":"Code","text":"from typing import Any\nfrom pyspark.sql import DataFrame, functions as f\nfrom koheesio.steps.transformations import Transform\nfrom koheesio.tasks.etl_task import EtlTask\n\n\ndef add_column(df: DataFrame, target_column: str, value: Any):\n return df.withColumn(target_column, f.lit(value))\n\n\nclass MyFavoriteMovieTask(EtlTask):\n my_favorite_movie: str\n\n def transform(self, df: Optional[DataFrame] = None) -> DataFrame:\n df = df or self.extract()\n\n # pre-transformations specific to this class\n pre_transformations = [\n Transform(add_column, target_column=\"myFavoriteMovie\", value=self.my_favorite_movie)\n ]\n\n # execute transformations one by one\n for t in pre_transformations:\n df = t.transform(df)\n\n self.output.transform_df = df\n return df\n
"},{"location":"tutorials/hello-world.html#configuration","title":"Configuration","text":"Here is the sample.yaml
configuration file used in this example:
raw_layer:\n catalog: development\n schema: my_favorite_team\n table: some_random_table\nmovies:\n favorite: Office Space\nhash_settings:\n source_columns:\n - id\n - foo\n target_column: hash_uuid5\nsource:\n range: 4\n
"},{"location":"tutorials/hello-world.html#usage_1","title":"Usage","text":"from pyspark.sql import SparkSession\nfrom koheesio.context import Context\nfrom koheesio.steps.readers import DummyReader\nfrom koheesio.steps.writers.dummy import DummyWriter\n\ncontext = Context.from_yaml(\"sample.yaml\")\n\nSparkSession.builder.getOrCreate()\n\nmy_fav_mov_task = MyFavoriteMovieTask(\n source=DummyReader(**context.raw_layer),\n target=DummyWriter(truncate=False),\n my_favorite_movie=context.movies.favorite,\n)\nmy_fav_mov_task.execute()\n
"},{"location":"tutorials/hello-world.html#understanding-the-code_1","title":"Understanding the Code","text":"This example creates a MyFavoriteMovieTask
that adds a column named myFavoriteMovie
to the DataFrame. The value for this column is provided when the task is instantiated.
The MyFavoriteMovieTask
class is a custom task that extends the EtlTask
from the koheesio
library. It demonstrates how to add a custom transformation to a DataFrame. Here's a detailed breakdown:
-
MyFavoriteMovieTask
inherits from EtlTask
, a base class in Koheesio for creating Extract-Transform-Load (ETL) tasks with Apache Spark.
-
It has a my_favorite_movie
attribute. When creating an instance of MyFavoriteMovieTask
, you can pass a custom movie title that will be used in the DataFrame.
-
The transform
method is where the main logic of the task is implemented. It first extracts the data (if not already provided), then applies a series of transformations to the DataFrame.
-
In this case, the transformation is adding a new column to the DataFrame named myFavoriteMovie
, with the value set to the my_favorite_movie
attribute. This is done using the add_column
function and the Transform
class from Koheesio.
-
The transformed DataFrame is then stored in self.output.transform_df
.
-
The sample.yaml
configuration file is used to provide the context for the task, including the source data and the favorite movie title.
-
In the usage example, an instance of MyFavoriteMovieTask
is created with a DummyReader
as the source, a DummyWriter
as the target, and the favorite movie title from the context. The task is then executed, which runs the transformations and stores the result in self.output.transform_df
.
"},{"location":"tutorials/learn-koheesio.html","title":"Learn Koheesio","text":"Koheesio is designed to simplify the development of data engineering pipelines. It provides a structured way to define and execute data processing tasks, making it easier to build, test, and maintain complex data workflows.
"},{"location":"tutorials/learn-koheesio.html#core-concepts","title":"Core Concepts","text":"Koheesio is built around several core concepts:
- Step: The fundamental unit of work in Koheesio. It represents a single operation in a data pipeline, taking in inputs and producing outputs.
See the Step documentation for more information.
- Context: A configuration class used to set up the environment for a Task. It can be used to share variables across tasks and adapt the behavior of a Task based on its environment.
See the Context documentation for more information.
- Logger: A class for logging messages at different levels.
See the Logger documentation for more information.
The Logger and Context classes provide support, enabling detailed logging of the pipeline's execution and customization of the pipeline's behavior based on the environment, respectively.
"},{"location":"tutorials/learn-koheesio.html#implementations","title":"Implementations","text":"In the context of Koheesio, an implementation refers to a specific way of executing Steps, the fundamental units of work in Koheesio. Each implementation uses a different technology or approach to process data along with its own set of Steps, designed to work with the specific technology or approach used by the implementation.
For example, the Spark implementation includes Steps for reading data from a Spark DataFrame, transforming the data using Spark operations, and writing the data to a Spark-supported destination.
Currently, Koheesio supports two implementations: Spark, and AsyncIO.
"},{"location":"tutorials/learn-koheesio.html#spark","title":"Spark","text":"Requires: Apache Spark (pyspark) Installation: pip install koheesio[spark]
Module: koheesio.spark
This implementation uses Apache Spark, a powerful open-source unified analytics engine for large-scale data processing.
Steps that use this implementation can leverage Spark's capabilities for distributed data processing, making it suitable for handling large volumes of data. The Spark implementation includes the following types of Steps:
-
Reader: from koheesio.spark.readers import Reader
A type of Step that reads data from a source and stores the result (to make it available for subsequent steps). For more information, see the Reader documentation.
-
Writer: from koheesio.spark.writers import Writer
This controls how data is written to the output in both batch and streaming contexts. For more information, see the Writer documentation.
-
Transformation: from koheesio.spark.transformations import Transformation
A type of Step that takes a DataFrame as input and returns a DataFrame as output. For more information, see the Transformation documentation.
In any given pipeline, you can expect to use Readers, Writers, and Transformations to express the ETL logic. Readers are responsible for extracting data from various sources, such as databases, files, or APIs. Transformations then process this data, performing operations like filtering, aggregation, or conversion. Finally, Writers handle the loading of the transformed data to the desired destination, which could be a database, a file, or a data stream.
"},{"location":"tutorials/learn-koheesio.html#async","title":"Async","text":"Module: koheesio.asyncio
This implementation uses Python's asyncio library for writing single-threaded concurrent code using coroutines, multiplexing I/O access over sockets and other resources, running network clients and servers, and other related primitives. Steps that use this implementation can perform data processing tasks asynchronously, which can be beneficial for IO-bound tasks.
"},{"location":"tutorials/learn-koheesio.html#best-practices","title":"Best Practices","text":"Here are some best practices for using Koheesio:
-
Use Context: The Context
class in Koheesio is designed to behave like a dictionary, but with added features. It's a good practice to use Context
to customize the behavior of a task. This allows you to share variables across tasks and adapt the behavior of a task based on its environment; for example, by changing the source or target of the data between development and production environments.
-
Modular Design: Each step in the pipeline (reading, transformation, writing) should be encapsulated in its own class, making the code easier to understand and maintain. This also promotes re-usability as steps can be reused across different tasks.
-
Error Handling: Koheesio provides a consistent way to handle errors and exceptions in data processing tasks. Make sure to leverage this feature to make your pipelines robust and fault-tolerant.
-
Logging: Use the built-in logging feature in Koheesio to log information and errors in data processing tasks. This can be very helpful for debugging and monitoring the pipeline. Koheesio sets the log level to WARNING
by default, but you can change it to INFO
or DEBUG
as needed; see the sketch after this list.
-
Testing: Each step can be tested independently, making it easier to write unit tests. It's a good practice to write tests for your steps to ensure they are working as expected.
-
Use Transformations: The Transform
class in Koheesio allows you to define transformations on your data. It's a good practice to encapsulate your transformation logic in Transform
classes for better readability and maintainability.
-
Consistent Structure: Koheesio enforces a consistent structure for data processing tasks. Stick to this structure to make your codebase easier to understand for new developers.
-
Use Readers and Writers: Use the built-in Reader
and Writer
classes in Koheesio to handle data extraction and loading. This not only simplifies your code but also makes it more robust and efficient.
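Sketch for the logging point above, assuming Koheesio's loggers are registered with Python's standard logging module under the \"koheesio\" name (an assumption in this sketch):
import logging\n\n# raise the log level for Koheesio's loggers while debugging\n# (both the use of standard logging and the \"koheesio\" logger name are assumptions here)\nlogging.getLogger(\"koheesio\").setLevel(logging.DEBUG)\n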
Remember, these are general best practices and might need to be adapted based on your specific use case and requirements.
"},{"location":"tutorials/learn-koheesio.html#pydantic","title":"Pydantic","text":"Koheesio Steps are Pydantic models, which means they can be validated and serialized. This makes it easy to define the inputs and outputs of a Step, and to validate them before running the Step. Pydantic models also provide a consistent way to define the schema of the data that a Step expects and produces, making it easier to understand and maintain the code.
Learn more about Pydantic here.
"},{"location":"tutorials/onboarding.html","title":"Onboarding","text":"tags: - doctype/how-to
"},{"location":"tutorials/onboarding.html#onboarding-to-koheesio","title":"Onboarding to Koheesio","text":"Koheesio is a Python library that simplifies the development of data engineering pipelines. It provides a structured way to define and execute data processing tasks, making it easier to build, test, and maintain complex data workflows.
This guide will walk you through the process of transforming a traditional Spark application into a Koheesio pipeline along with explaining the advantages of using Koheesio over raw Spark.
"},{"location":"tutorials/onboarding.html#traditional-spark-application","title":"Traditional Spark Application","text":"First let's create a simple Spark application that you might use to process data.
The following Spark application reads a CSV file, performs a transformation, and writes the result to a Delta table. The transformation includes filtering data where age is greater than 18 and performing an aggregation to calculate the average salary per country. The result is then written to a Delta table partitioned by country.
from pyspark.sql import SparkSession\nfrom pyspark.sql.functions import col, avg\n\nspark = SparkSession.builder.getOrCreate()\n\n# Read data from CSV file\ndf = spark.read.csv(\"input.csv\", header=True, inferSchema=True)\n\n# Filter data where age is greater than 18\ndf = df.filter(col(\"age\") > 18)\n\n# Perform aggregation\ndf = df.groupBy(\"country\").agg(avg(\"salary\").alias(\"average_salary\"))\n\n# Write data to Delta table with partitioning\ndf.write.format(\"delta\").partitionBy(\"country\").save(\"/path/to/delta_table\")\n
"},{"location":"tutorials/onboarding.html#transforming-to-koheesio","title":"Transforming to Koheesio","text":"The same pipeline can be rewritten using Koheesio's EtlTask
. In this version, each step (reading, transformations, writing) is encapsulated in its own class, making the code easier to understand and maintain.
First, a CsvReader
is defined to read the input CSV file. Then, a DeltaTableWriter
is defined to write the result to a Delta table partitioned by country.
Two transformations are defined: 1. one to filter data where age is greater than 18 2. and, another to calculate the average salary per country.
These transformations are then passed to an EtlTask
along with the reader and writer. Finally, the EtlTask
is executed to run the pipeline.
from koheesio.spark.etl_task import EtlTask\nfrom koheesio.spark.readers.file_loader import CsvReader\nfrom koheesio.spark.writers.delta.batch import DeltaTableWriter\nfrom koheesio.spark.transformations.transform import Transform\nfrom pyspark.sql.functions import col, avg\n\n# Define reader\nreader = CsvReader(path=\"input.csv\", header=True, inferSchema=True)\n\n# Define writer\nwriter = DeltaTableWriter(table=\"delta_table\", partition_by=[\"country\"])\n\n# Define transformations\nage_transformation = Transform(\n func=lambda df: df.filter(col(\"age\") > 18)\n)\navg_salary_per_country = Transform(\n func=lambda df: df.groupBy(\"country\").agg(avg(\"salary\").alias(\"average_salary\"))\n)\n\n# Define and execute EtlTask\ntask = EtlTask(\n source=reader, \n target=writer, \n transformations=[\n age_transformation,\n avg_salary_per_country\n ]\n)\ntask.execute()\n
This approach with Koheesio provides several advantages. It makes the code more modular and easier to test. Each step can be tested independently and reused across different tasks. It also makes the pipeline more readable and easier to maintain."},{"location":"tutorials/onboarding.html#advantages-of-koheesio","title":"Advantages of Koheesio","text":"Using Koheesio instead of raw Spark has several advantages:
- Modularity: Each step in the pipeline (reading, transformation, writing) is encapsulated in its own class, making the code easier to understand and maintain.
- Reusability: Steps can be reused across different tasks, reducing code duplication.
- Testability: Each step can be tested independently, making it easier to write unit tests.
- Flexibility: The behavior of a task can be customized using a
Context
class. - Consistency: Koheesio enforces a consistent structure for data processing tasks, making it easier for new developers to understand the codebase.
- Error Handling: Koheesio provides a consistent way to handle errors and exceptions in data processing tasks.
- Logging: Koheesio provides a consistent way to log information and errors in data processing tasks.
In contrast, using the plain PySpark API for transformations can lead to more verbose and less structured code, which can be harder to understand, maintain, and test. It also doesn't provide the same level of error handling, logging, and flexibility as the Koheesio Transform class.
"},{"location":"tutorials/onboarding.html#using-a-context-class","title":"Using a Context Class","text":"Here's a simple example of how to use a Context
class to customize the behavior of a task. The Context class in Koheesio is designed to behave like a dictionary, but with added features.
from koheesio import Context\nfrom koheesio.spark.etl_task import EtlTask\nfrom koheesio.spark.readers.file_loader import CsvReader\nfrom koheesio.spark.writers.delta import DeltaTableWriter\nfrom koheesio.spark.transformations.transform import Transform\n\ncontext = Context({ # this could be stored in a JSON or YAML\n \"age_threshold\": 18,\n \"reader_options\": {\n \"path\": \"input.csv\",\n \"header\": True,\n \"inferSchema\": True\n },\n \"writer_options\": {\n \"table\": \"delta_table\",\n \"partition_by\": [\"country\"]\n }\n})\n\ntask = EtlTask(\n source = CsvReader(**context.reader_options),\n target = DeltaTableWriter(**context.writer_options),\n transformations = [\n Transform(func=lambda df: df.filter(df[\"age\"] > context.age_threshold))\n ]\n)\n\ntask.execute()\n
In this example, we're using CsvReader
to read the input data, DeltaTableWriter
to write the output data, and a Transform
step to filter the data based on the age threshold. The options for the reader and writer are stored in a Context
object, which can be easily updated or loaded from a JSON or YAML file.
"},{"location":"tutorials/testing-koheesio-steps.html","title":"Testing Koheesio Tasks","text":"Testing is a crucial part of any software development process. Koheesio provides a structured way to define and execute data processing tasks, which makes it easier to build, test, and maintain complex data workflows. This guide will walk you through the process of testing Koheesio tasks.
"},{"location":"tutorials/testing-koheesio-steps.html#unit-testing","title":"Unit Testing","text":"Unit testing involves testing individual components of the software in isolation. In the context of Koheesio, this means testing individual tasks or steps.
Here's an example of how to unit test a Koheesio task:
from koheesio.tasks.etl_task import EtlTask\nfrom koheesio.steps.readers import DummyReader\nfrom koheesio.steps.writers.dummy import DummyWriter\nfrom koheesio.steps.transformations import Transform\nfrom pyspark.sql import SparkSession, DataFrame\nfrom pyspark.sql.functions import col\n\n\ndef filter_age(df: DataFrame) -> DataFrame:\n return df.filter(col(\"Age\") > 18)\n\n\ndef test_etl_task():\n # Initialize SparkSession\n spark = SparkSession.builder.getOrCreate()\n\n # Create a DataFrame for testing\n data = [(\"John\", 19), (\"Anna\", 20), (\"Tom\", 18)]\n df = spark.createDataFrame(data, [\"Name\", \"Age\"])\n\n # Define the task\n task = EtlTask(\n source=DummyReader(df=df),\n target=DummyWriter(),\n transformations=[\n Transform(filter_age)\n ]\n )\n\n # Execute the task\n task.execute()\n\n # Assert the result\n result_df = task.output.df\n assert result_df.count() == 2\n assert result_df.filter(\"Name == 'Tom'\").count() == 0\n
In this example, we're testing an EtlTask that reads data from a DataFrame, applies a filter transformation, and writes the result to another DataFrame. The test asserts that the task correctly filters out rows where the age is less than or equal to 18.
"},{"location":"tutorials/testing-koheesio-steps.html#integration-testing","title":"Integration Testing","text":"Integration testing involves testing the interactions between different components of the software. In the context of Koheesio, this means testing the entirety of data flowing through one or more tasks.
We'll create a simple test for a hypothetical EtlTask that uses DeltaReader and DeltaWriter. We'll use pytest and unittest.mock to mock the responses of the reader and writer. First, let's assume that you have an EtlTask defined in a module named my_module. This task reads data from a Delta table, applies some transformations, and writes the result to another Delta table.
Here's an example of how to write an integration test for this task:
# my_module.py\nfrom koheesio.tasks.etl_task import EtlTask\nfrom koheesio.spark.readers.delta import DeltaReader\nfrom koheesio.steps.writers.delta import DeltaWriter\nfrom koheesio.steps.transformations import Transform\nfrom koheesio.context import Context\nfrom pyspark.sql.functions import col\n\n\ndef filter_age(df):\n return df.filter(col(\"Age\") > 18)\n\n\ncontext = Context({\n \"reader_options\": {\n \"table\": \"input_table\"\n },\n \"writer_options\": {\n \"table\": \"output_table\"\n }\n})\n\ntask = EtlTask(\n source=DeltaReader(**context.reader_options),\n target=DeltaWriter(**context.writer_options),\n transformations=[\n Transform(filter_age)\n ]\n)\n
Now, let's create a test for this task. We'll use pytest and unittest.mock to mock the responses of the reader and writer. We'll also use a pytest fixture to create a test context and a test DataFrame.
# test_my_module.py\nimport pytest\nfrom unittest.mock import MagicMock, patch\nfrom pyspark.sql import SparkSession\nfrom koheesio.context import Context\nfrom koheesio.steps.readers import Reader\nfrom koheesio.steps.writers import Writer\n\nfrom my_module import task\n\n@pytest.fixture(scope=\"module\")\ndef spark():\n return SparkSession.builder.getOrCreate()\n\n@pytest.fixture(scope=\"module\")\ndef test_context():\n return Context({\n \"reader_options\": {\n \"table\": \"test_input_table\"\n },\n \"writer_options\": {\n \"table\": \"test_output_table\"\n }\n })\n\n@pytest.fixture(scope=\"module\")\ndef test_df(spark):\n data = [(\"John\", 19), (\"Anna\", 20), (\"Tom\", 18)]\n return spark.createDataFrame(data, [\"Name\", \"Age\"])\n\ndef test_etl_task(spark, test_context, test_df):\n # Mock the read method of the Reader class\n with patch.object(Reader, \"read\", return_value=test_df):\n # Mock the write method of the Writer class\n with patch.object(Writer, \"write\") as mock_write:\n # Execute the task\n task.execute()\n\n # Assert the result\n result_df = task.output.df\n assert result_df.count() == 2\n assert result_df.filter(\"Name == 'Tom'\").count() == 0\n\n # Assert that the reader and writer were called with the correct arguments\n Reader.read.assert_called_once_with(**test_context.reader_options)\n mock_write.assert_called_once_with(**test_context.writer_options)\n
In this test, we're mocking the DeltaReader and DeltaWriter to return a test DataFrame and check that they're called with the correct arguments. We're also asserting that the task correctly filters out rows where the age is less than or equal to 18.
"},{"location":"misc/tags.html","title":"{{ page.title }}","text":""},{"location":"misc/tags.html#doctypeexplanation","title":"doctype/explanation","text":" - Approach documentation
"},{"location":"misc/tags.html#doctypehow-to","title":"doctype/how-to","text":" - How to
"}]}
\ No newline at end of file
+{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"index.html","title":"Home","text":""},{"location":"index.html#koheesio","title":"Koheesio","text":"CI/CD Package Meta Koheesio, named after the Finnish word for cohesion, is a robust Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.
The framework is versatile, aiming to support multiple implementations and working seamlessly with various data processing libraries or frameworks. This ensures that Koheesio can handle any data processing task, regardless of the underlying technology or data scale.
Koheesio uses Pydantic for strong typing, data validation, and settings management, ensuring a high level of type safety and structured configurations within pipeline components.
Koheesio's goal is to ensure predictable pipeline execution through a solid foundation of well-tested code and a rich set of features, making it an excellent choice for developers and organizations seeking to build robust and adaptable Data Pipelines.
"},{"location":"index.html#what-sets-koheesio-apart-from-other-libraries","title":"What sets Koheesio apart from other libraries?\"","text":"Koheesio encapsulates years of data engineering expertise, fostering a collaborative and innovative community. While similar libraries exist, Koheesio's focus on data pipelines, integration with PySpark, and specific design for tasks like data transformation, ETL jobs, data validation, and large-scale data processing sets it apart.
Koheesio aims to provide a rich set of features including readers, writers, and transformations for any type of Data processing. Koheesio is not in competition with other libraries. Its aim is to offer wide-ranging support and focus on utility in a multitude of scenarios. Our preference is for integration, not competition...
We invite contributions from all, promoting collaboration and innovation in the data engineering community.
"},{"location":"index.html#koheesio-core-components","title":"Koheesio Core Components","text":"Here are the key components included in Koheesio:
- Step: This is the fundamental unit of work in Koheesio. It represents a single operation in a data pipeline, taking in inputs and producing outputs.
\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 1 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Output 1 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u2502 \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 2 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Step \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Output 2 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u2502 \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 3 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Output 3 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n
- Context: This is a configuration class used to set up the environment for a Task. It can be used to share variables across tasks and adapt the behavior of a Task based on its environment.
- Logger: This is a class for logging messages at different levels.
"},{"location":"index.html#installation","title":"Installation","text":"You can install Koheesio using either pip or poetry.
"},{"location":"index.html#using-pip","title":"Using Pip","text":"To install Koheesio using pip, run the following command in your terminal:
pip install koheesio\n
"},{"location":"index.html#using-hatch","title":"Using Hatch","text":"If you're using Hatch for package management, you can add Koheesio to your project by simply adding koheesio to your pyproject.toml
.
"},{"location":"index.html#using-poetry","title":"Using Poetry","text":"If you're using poetry for package management, you can add Koheesio to your project with the following command:
poetry add koheesio\n
or add the following line to your pyproject.toml
(under [tool.poetry.dependencies]
), making sure to replace ...
with the version you want to have installed:
koheesio = {version = \"...\"}\n
"},{"location":"index.html#extras","title":"Extras","text":"Koheesio also provides some additional features that can be useful in certain scenarios. These include:
-
Spark Expectations: Available through the koheesio.steps.integration.spark.dq.spark_expectations
module; installable through the se
extra.
- SE Provides Data Quality checks for Spark DataFrames.
- For more information, refer to the Spark Expectations docs.
-
Box: Available through the koheesio.steps.integration.box
module; installable through the box
extra.
- Box is a cloud content management and file sharing service for businesses.
-
SFTP: Available through the koheesio.steps.integration.spark.sftp
module; installable through the sftp
extra.
- SFTP is a network protocol used for secure file transfer over a secure shell.
Note: Some of the steps require extra dependencies. See the Extras section for additional info. Extras can be added to Poetry by adding extras=['name_of_the_extra']
to the toml entry mentioned above
"},{"location":"index.html#contributing","title":"Contributing","text":""},{"location":"index.html#how-to-contribute","title":"How to Contribute","text":"We welcome contributions to our project! Here's a brief overview of our development process:
-
Code Standards: We use pylint
, black
, and mypy
to maintain code standards. Please ensure your code passes these checks by running make check
. No errors or warnings should be reported by the linter before you submit a pull request.
-
Testing: We use pytest
for testing. Run the tests with make test
and ensure all tests pass before submitting a pull request.
-
Release Process: We aim for frequent releases. Typically when we have a new feature or bugfix, a developer with admin rights will create a new release on GitHub and publish the new version to PyPI.
For more detailed information, please refer to our contribution guidelines. We also adhere to Nike's Code of Conduct and Nike's Individual Contributor License Agreement.
"},{"location":"index.html#additional-resources","title":"Additional Resources","text":" - General GitHub documentation
- GitHub pull request documentation
- Nike OSS
"},{"location":"api_reference/index.html","title":"API Reference","text":""},{"location":"api_reference/index.html#koheesio.ABOUT","title":"koheesio.ABOUT module-attribute
","text":"ABOUT = _about()\n
"},{"location":"api_reference/index.html#koheesio.VERSION","title":"koheesio.VERSION module-attribute
","text":"VERSION = __version__\n
"},{"location":"api_reference/index.html#koheesio.BaseModel","title":"koheesio.BaseModel","text":"Base model for all models.
Extends pydantic BaseModel with some additional configuration. To be used as a base class for all models in Koheesio instead of pydantic.BaseModel.
Additional methods and properties: Different Modes This BaseModel class supports lazy mode. This means that validation of the items stored in the class can be called at will instead of being forced to run it upfront.
-
Normal mode: you need to know the values ahead of time
normal_mode = YourOwnModel(a=\"foo\", b=42)\n
-
Lazy mode: being able to defer the validation until later
lazy_mode = YourOwnModel.lazy()\nlazy_mode.a = \"foo\"\nlazy_mode.b = 42\nlazy_mode.validate_output()\n
The prime advantage of using lazy mode is that you don't have to know all your outputs up front, and can add them as they become available. All while still being able to validate that you have collected all your output at the end. -
With statements: With statements are also allowed. The validate_output
method from the earlier example will run upon exit of the with-statement.
with YourOwnModel.lazy() as with_output:\n with_output.a = \"foo\"\n with_output.b = 42\n
Note: that a lazy mode BaseModel object is required to work with a with-statement.
Examples:
from koheesio.models import BaseModel\n\n\nclass Person(BaseModel):\n name: str\n age: int\n\n\n# Using the lazy method to create an instance without immediate validation\nperson = Person.lazy()\n\n# Setting attributes\nperson.name = \"John Doe\"\nperson.age = 30\n\n# Now we validate the instance\nperson.validate_output()\n\nprint(person)\n
In this example, the Person instance is created without immediate validation. The attributes name and age are set afterward. The validate_output
method is then called to validate the instance.
Koheesio specific configuration: Koheesio models are configured differently from Pydantic defaults. The following configuration is used:
-
extra=\"allow\"
This setting allows for extra fields that are not specified in the model definition. If a field is present in the data but not in the model, it will not raise an error. Pydantic default is \"ignore\", which means that extra attributes are ignored.
-
arbitrary_types_allowed=True
This setting allows for fields in the model to be of any type. This is useful when you want to include fields in your model that are not standard Python types. Pydantic default is False, which means that fields must be of a standard Python type.
-
populate_by_name=True
This setting allows an aliased field to be populated by its name as given by the model attribute, as well as the alias. This was known as allow_population_by_field_name in pydantic v1. Pydantic default is False, which means that fields can only be populated by their alias.
-
validate_assignment=False
This setting determines whether the model should be revalidated when the data is changed. If set to True
, every time a field is assigned a new value, the entire model is validated again.
Pydantic default is (also) False
, which means that the model is not revalidated when the data is changed. The default behavior of Pydantic is to validate the data when the model is created. In case the user changes the data after the model is created, the model is not revalidated.
-
revalidate_instances=\"subclass-instances\"
This setting determines whether to revalidate models during validation if the instance is a subclass of the model. This is important as inheritance is used a lot in Koheesio. Pydantic default is never
, which means that the model and dataclass instances are not revalidated during validation.
-
validate_default=True
This setting determines whether to validate default values during validation. When set to True, default values are checked during the validation process. We opt to set this to True, as we are attempting to make the sure that the data is valid prior to running / executing any Step. Pydantic default is False, which means that default values are not validated during validation.
-
frozen=False
This setting determines whether the model is immutable. If set to True, once a model is created, its fields cannot be changed. Pydantic default is also False, which means that the model is mutable.
-
coerce_numbers_to_str=True
This setting determines whether to convert number fields to strings. When set to True, enables automatic coercion of any Number
type to str
. Pydantic doesn't allow number types (int
, float
, Decimal
) to be coerced as type str
by default.
-
use_enum_values=True
This setting determines whether to use the values of Enum fields. If set to True, the actual value of the Enum is used instead of the reference. Pydantic default is False, which means that the reference to the Enum is used.
"},{"location":"api_reference/index.html#koheesio.BaseModel--fields","title":"Fields","text":"Every Koheesio BaseModel has two fields: name
and description
. These fields are used to provide a name and a description to the model.
-
name
: This is the name of the Model. If not provided, it defaults to the class name.
-
description
: This is the description of the Model. It has several default behaviors:
- If not provided, it defaults to the docstring of the class.
- If the docstring is not provided, it defaults to the name of the class.
- For multi-line descriptions, it has the following behaviors:
- Only the first non-empty line is used.
- Empty lines are removed.
- Only the first 3 lines are considered.
- Only the first 120 characters are considered.
"},{"location":"api_reference/index.html#koheesio.BaseModel--validators","title":"Validators","text":" _set_name_and_description
: Set the name and description of the Model as per the rules mentioned above.
"},{"location":"api_reference/index.html#koheesio.BaseModel--properties","title":"Properties","text":" log
: Returns a logger with the name of the class.
"},{"location":"api_reference/index.html#koheesio.BaseModel--class-methods","title":"Class Methods","text":" from_basemodel
: Returns a new BaseModel instance based on the data of another BaseModel. from_context
: Creates BaseModel instance from a given Context. from_dict
: Creates BaseModel instance from a given dictionary. from_json
: Creates BaseModel instance from a given JSON string. from_toml
: Creates BaseModel object from a given toml file. from_yaml
: Creates BaseModel object from a given yaml file. lazy
: Constructs the model without doing validation.
"},{"location":"api_reference/index.html#koheesio.BaseModel--dunder-methods","title":"Dunder Methods","text":" __add__
: Allows to add two BaseModel instances together. __enter__
: Allows for using the model in a with-statement. __exit__
: Allows for using the model in a with-statement. __setitem__
: Set Item dunder method for BaseModel. __getitem__
: Get Item dunder method for BaseModel.
"},{"location":"api_reference/index.html#koheesio.BaseModel--instance-methods","title":"Instance Methods","text":" hasattr
: Check if given key is present in the model. get
: Get an attribute of the model, but don't fail if not present. merge
: Merge key,value map with self. set
: Allows for subscribing / assigning to class[key]
. to_context
: Converts the BaseModel instance to a Context object. to_dict
: Converts the BaseModel instance to a dictionary. to_json
: Converts the BaseModel instance to a JSON string. to_yaml
: Converts the BaseModel instance to a YAML string.
"},{"location":"api_reference/index.html#koheesio.BaseModel.description","title":"description class-attribute
instance-attribute
","text":"description: Optional[str] = Field(default=None, description='Description of the Model')\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.log","title":"log property
","text":"log: Logger\n
Returns a logger with the name of the class
"},{"location":"api_reference/index.html#koheesio.BaseModel.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(extra='allow', arbitrary_types_allowed=True, populate_by_name=True, validate_assignment=False, revalidate_instances='subclass-instances', validate_default=True, frozen=False, coerce_numbers_to_str=True, use_enum_values=True)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.name","title":"name class-attribute
instance-attribute
","text":"name: Optional[str] = Field(default=None, description='Name of the Model')\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_basemodel","title":"from_basemodel classmethod
","text":"from_basemodel(basemodel: BaseModel, **kwargs)\n
Returns a new BaseModel instance based on the data of another BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_basemodel(cls, basemodel: BaseModel, **kwargs):\n \"\"\"Returns a new BaseModel instance based on the data of another BaseModel\"\"\"\n kwargs = {**basemodel.model_dump(), **kwargs}\n return cls(**kwargs)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_context","title":"from_context classmethod
","text":"from_context(context: Context) -> BaseModel\n
Creates BaseModel instance from a given Context
You have to make sure that the Context object has the necessary attributes to create the model.
Examples:
class SomeStep(BaseModel):\n foo: str\n\n\ncontext = Context(foo=\"bar\")\nsome_step = SomeStep.from_context(context)\nprint(some_step.foo) # prints 'bar'\n
Parameters:
Name Type Description Default context
Context
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_context(cls, context: Context) -> BaseModel:\n \"\"\"Creates BaseModel instance from a given Context\n\n You have to make sure that the Context object has the necessary attributes to create the model.\n\n Examples\n --------\n ```python\n class SomeStep(BaseModel):\n foo: str\n\n\n context = Context(foo=\"bar\")\n some_step = SomeStep.from_context(context)\n print(some_step.foo) # prints 'bar'\n ```\n\n Parameters\n ----------\n context: Context\n\n Returns\n -------\n BaseModel\n \"\"\"\n return cls(**context)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_dict","title":"from_dict classmethod
","text":"from_dict(data: Dict[str, Any]) -> BaseModel\n
Creates BaseModel instance from a given dictionary
Parameters:
Name Type Description Default data
Dict[str, Any]
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_dict(cls, data: Dict[str, Any]) -> BaseModel:\n \"\"\"Creates BaseModel instance from a given dictionary\n\n Parameters\n ----------\n data: Dict[str, Any]\n\n Returns\n -------\n BaseModel\n \"\"\"\n return cls(**data)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_json","title":"from_json classmethod
","text":"from_json(json_file_or_str: Union[str, Path]) -> BaseModel\n
Creates BaseModel instance from a given JSON string
BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored in the BaseModel object, which is not possible with the standard json library.
See Also Context.from_json : Deserializes a JSON string to a Context object
Parameters:
Name Type Description Default json_file_or_str
Union[str, Path]
Pathlike string or Path that points to the json file or string containing json
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_json(cls, json_file_or_str: Union[str, Path]) -> BaseModel:\n \"\"\"Creates BaseModel instance from a given JSON string\n\n BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses\n jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored\n in the BaseModel object, which is not possible with the standard json library.\n\n See Also\n --------\n Context.from_json : Deserializes a JSON string to a Context object\n\n Parameters\n ----------\n json_file_or_str : Union[str, Path]\n Pathlike string or Path that points to the json file or string containing json\n\n Returns\n -------\n BaseModel\n \"\"\"\n _context = Context.from_json(json_file_or_str)\n return cls.from_context(_context)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_toml","title":"from_toml classmethod
","text":"from_toml(toml_file_or_str: Union[str, Path]) -> BaseModel\n
Creates BaseModel object from a given toml file
Note: BaseModel offloads the serialization and deserialization of the TOML string to Context class.
Parameters:
Name Type Description Default toml_file_or_str
Union[str, Path]
Pathlike string or Path that points to the toml file, or string containing toml
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_toml(cls, toml_file_or_str: Union[str, Path]) -> BaseModel:\n \"\"\"Creates BaseModel object from a given toml file\n\n Note: BaseModel offloads the serialization and deserialization of the TOML string to Context class.\n\n Parameters\n ----------\n toml_file_or_str: str or Path\n Pathlike string or Path that points to the toml file, or string containing toml\n\n Returns\n -------\n BaseModel\n \"\"\"\n _context = Context.from_toml(toml_file_or_str)\n return cls.from_context(_context)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_yaml","title":"from_yaml classmethod
","text":"from_yaml(yaml_file_or_str: str) -> BaseModel\n
Creates BaseModel object from a given yaml file
Note: BaseModel offloads the serialization and deserialization of the YAML string to Context class.
Parameters:
Name Type Description Default yaml_file_or_str
str
Pathlike string or Path that points to the yaml file, or string containing yaml
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_yaml(cls, yaml_file_or_str: str) -> BaseModel:\n \"\"\"Creates BaseModel object from a given yaml file\n\n Note: BaseModel offloads the serialization and deserialization of the YAML string to Context class.\n\n Parameters\n ----------\n yaml_file_or_str: str or Path\n Pathlike string or Path that points to the yaml file, or string containing yaml\n\n Returns\n -------\n BaseModel\n \"\"\"\n _context = Context.from_yaml(yaml_file_or_str)\n return cls.from_context(_context)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.get","title":"get","text":"get(key: str, default: Optional[Any] = None)\n
Get an attribute of the model, but don't fail if not present
Similar to dict.get()
Examples:
step_output = StepOutput(foo=\"bar\")\nstep_output.get(\"foo\") # returns 'bar'\nstep_output.get(\"non_existent_key\", \"oops\") # returns 'oops'\n
Parameters:
Name Type Description Default key
str
name of the key to get
required default
Optional[Any]
Default value in case the attribute does not exist
None
Returns:
Type Description Any
The value of the attribute
Source code in src/koheesio/models/__init__.py
def get(self, key: str, default: Optional[Any] = None):\n \"\"\"Get an attribute of the model, but don't fail if not present\n\n Similar to dict.get()\n\n Examples\n --------\n ```python\n step_output = StepOutput(foo=\"bar\")\n step_output.get(\"foo\") # returns 'bar'\n step_output.get(\"non_existent_key\", \"oops\") # returns 'oops'\n ```\n\n Parameters\n ----------\n key: str\n name of the key to get\n default: Optional[Any]\n Default value in case the attribute does not exist\n\n Returns\n -------\n Any\n The value of the attribute\n \"\"\"\n if self.hasattr(key):\n return self.__getitem__(key)\n return default\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.hasattr","title":"hasattr","text":"hasattr(key: str) -> bool\n
Check if given key is present in the model
Parameters:
Name Type Description Default key
str
required Returns:
Type Description bool
Source code in src/koheesio/models/__init__.py
def hasattr(self, key: str) -> bool:\n \"\"\"Check if given key is present in the model\n\n Parameters\n ----------\n key: str\n\n Returns\n -------\n bool\n \"\"\"\n return hasattr(self, key)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.lazy","title":"lazy classmethod
","text":"lazy()\n
Constructs the model without doing validation
Essentially an alias to BaseModel.construct()
Source code in src/koheesio/models/__init__.py
@classmethod\ndef lazy(cls):\n \"\"\"Constructs the model without doing validation\n\n Essentially an alias to BaseModel.construct()\n \"\"\"\n return cls.model_construct()\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.merge","title":"merge","text":"merge(other: Union[Dict, BaseModel])\n
Merge key,value map with self
Functionally similar to adding two dicts together; like running {**dict_a, **dict_b}
.
Examples:
step_output = StepOutput(foo=\"bar\")\nstep_output.merge({\"lorem\": \"ipsum\"}) # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n
Parameters:
Name Type Description Default other
Union[Dict, BaseModel]
Dict or another instance of a BaseModel class that will be added to self
required Source code in src/koheesio/models/__init__.py
def merge(self, other: Union[Dict, BaseModel]):\n \"\"\"Merge key,value map with self\n\n Functionally similar to adding two dicts together; like running `{**dict_a, **dict_b}`.\n\n Examples\n --------\n ```python\n step_output = StepOutput(foo=\"bar\")\n step_output.merge({\"lorem\": \"ipsum\"}) # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n ```\n\n Parameters\n ----------\n other: Union[Dict, BaseModel]\n Dict or another instance of a BaseModel class that will be added to self\n \"\"\"\n if isinstance(other, BaseModel):\n other = other.model_dump() # ensures we really have a dict\n\n for k, v in other.items():\n self.set(k, v)\n\n return self\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.set","title":"set","text":"set(key: str, value: Any)\n
Allows for subscribing / assigning to class[key]
.
Examples:
step_output = StepOutput(foo=\"bar\")\nstep_output.set(foo\", \"baz\") # overwrites 'foo' to be 'baz'\n
Parameters:
Name Type Description Default key
str
The key of the attribute to assign to
required value
Any
Value that should be assigned to the given key
required Source code in src/koheesio/models/__init__.py
def set(self, key: str, value: Any):\n \"\"\"Allows for subscribing / assigning to `class[key]`.\n\n Examples\n --------\n ```python\n step_output = StepOutput(foo=\"bar\")\n step_output.set(foo\", \"baz\") # overwrites 'foo' to be 'baz'\n ```\n\n Parameters\n ----------\n key: str\n The key of the attribute to assign to\n value: Any\n Value that should be assigned to the given key\n \"\"\"\n self.__setitem__(key, value)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.to_context","title":"to_context","text":"to_context() -> Context\n
Converts the BaseModel instance to a Context object
Returns:
Type Description Context
Source code in src/koheesio/models/__init__.py
def to_context(self) -> Context:\n \"\"\"Converts the BaseModel instance to a Context object\n\n Returns\n -------\n Context\n \"\"\"\n return Context(**self.to_dict())\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.to_dict","title":"to_dict","text":"to_dict() -> Dict[str, Any]\n
Converts the BaseModel instance to a dictionary
Returns:
Type Description Dict[str, Any]
Source code in src/koheesio/models/__init__.py
def to_dict(self) -> Dict[str, Any]:\n \"\"\"Converts the BaseModel instance to a dictionary\n\n Returns\n -------\n Dict[str, Any]\n \"\"\"\n return self.model_dump()\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.to_json","title":"to_json","text":"to_json(pretty: bool = False)\n
Converts the BaseModel instance to a JSON string
BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored in the BaseModel object, which is not possible with the standard json library.
See Also Context.to_json : Serializes a Context object to a JSON string
Parameters:
Name Type Description Default pretty
bool
Toggles whether to return a pretty json string or not
False
Returns:
Type Description str
containing all parameters of the BaseModel instance
Source code in src/koheesio/models/__init__.py
def to_json(self, pretty: bool = False):\n \"\"\"Converts the BaseModel instance to a JSON string\n\n BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses\n jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored\n in the BaseModel object, which is not possible with the standard json library.\n\n See Also\n --------\n Context.to_json : Serializes a Context object to a JSON string\n\n Parameters\n ----------\n pretty : bool, optional, default=False\n Toggles whether to return a pretty json string or not\n\n Returns\n -------\n str\n containing all parameters of the BaseModel instance\n \"\"\"\n _context = self.to_context()\n return _context.to_json(pretty=pretty)\n
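As a brief usage sketch (hypothetical field; the exact string depends on the fields present on the model and on jsonpickle):
step_output = StepOutput(foo=\"bar\")\nstep_output.to_json() # jsonpickle-encoded string containing all fields of the model\nstep_output.to_json(pretty=True) # same content, indented with 4 spaces\n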
"},{"location":"api_reference/index.html#koheesio.BaseModel.to_yaml","title":"to_yaml","text":"to_yaml(clean: bool = False) -> str\n
Converts the BaseModel instance to a YAML string
BaseModel offloads the serialization and deserialization of the YAML string to Context class.
Parameters:
Name Type Description Default clean
bool
Toggles whether to remove !!python/object:...
from yaml or not. Default: False
False
Returns:
Type Description str
containing all parameters of the BaseModel instance
Source code in src/koheesio/models/__init__.py
def to_yaml(self, clean: bool = False) -> str:\n \"\"\"Converts the BaseModel instance to a YAML string\n\n BaseModel offloads the serialization and deserialization of the YAML string to Context class.\n\n Parameters\n ----------\n clean: bool\n Toggles whether to remove `!!python/object:...` from yaml or not.\n Default: False\n\n Returns\n -------\n str\n containing all parameters of the BaseModel instance\n \"\"\"\n _context = self.to_context()\n return _context.to_yaml(clean=clean)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.validate","title":"validate","text":"validate() -> BaseModel\n
Validate the BaseModel instance
This method is used to validate the BaseModel instance. It is used in conjunction with the lazy method to validate the instance after all the attributes have been set.
This method is intended to be used with the lazy
method. The lazy
method is used to create an instance of the BaseModel without immediate validation. The validate
method is then used to validate the instance after.
Note: in the Pydantic BaseModel, the validate
method emits a deprecation warning. This is because Pydantic recommends using the validate_model
method instead. However, we are using the validate
method here in a different context and a slightly different way.
Examples:
class FooModel(BaseModel):\n foo: str\n lorem: str\n\n\nfoo_model = FooModel.lazy()\nfoo_model.foo = \"bar\"\nfoo_model.lorem = \"ipsum\"\nfoo_model.validate()\n
In this example, the foo_model
instance is created without immediate validation. The attributes foo and lorem are set afterward. The validate
method is then called to validate the instance. Returns:
Type Description BaseModel
The BaseModel instance
Source code in src/koheesio/models/__init__.py
def validate(self) -> BaseModel:\n \"\"\"Validate the BaseModel instance\n\n This method is used to validate the BaseModel instance. It is used in conjunction with the lazy method to\n validate the instance after all the attributes have been set.\n\n This method is intended to be used with the `lazy` method. The `lazy` method is used to create an instance of\n the BaseModel without immediate validation. The `validate` method is then used to validate the instance after.\n\n > Note: in the Pydantic BaseModel, the `validate` method throws a deprecated warning. This is because Pydantic\n recommends using the `validate_model` method instead. However, we are using the `validate` method here in a\n different context and a slightly different way.\n\n Examples\n --------\n ```python\n class FooModel(BaseModel):\n foo: str\n lorem: str\n\n\n foo_model = FooModel.lazy()\n foo_model.foo = \"bar\"\n foo_model.lorem = \"ipsum\"\n foo_model.validate()\n ```\n In this example, the `foo_model` instance is created without immediate validation. The attributes foo and lorem\n are set afterward. The `validate` method is then called to validate the instance.\n\n Returns\n -------\n BaseModel\n The BaseModel instance\n \"\"\"\n return self.model_validate(self.model_dump())\n
"},{"location":"api_reference/index.html#koheesio.Context","title":"koheesio.Context","text":"Context(*args, **kwargs)\n
The Context class is a key component of the Koheesio framework, designed to manage configuration data and shared variables across tasks and steps in your application. It behaves much like a dictionary, but with added functionalities.
Key Features - Nested keys: Supports accessing and adding nested keys similar to dictionary keys.
- Recursive merging: Merges two Contexts together, with the incoming Context having priority.
- Serialization/Deserialization: Easily created from a yaml, toml, or json file, or a dictionary, and can be converted back to a dictionary.
- Handling complex Python objects: Uses jsonpickle for serialization and deserialization of complex Python objects to and from JSON.
For a comprehensive guide on the usage, examples, and additional features of the Context class, please refer to the reference/concepts/context section of the Koheesio documentation.
Methods:
Name Description add
Add a key/value pair to the context.
get
Get value of a given key.
get_item
Acts just like .get
, except that it returns the key also.
contains
Check if the context contains a given key.
merge
Merge this context with the context of another, where the incoming context has priority.
to_dict
Returns all parameters of the context as a dict.
from_dict
Creates Context object from the given dict.
from_yaml
Creates Context object from a given yaml file.
from_json
Creates Context object from a given json file.
Dunder methods - __iter__()
: Allows for iteration across a Context. __len__()
: Returns the length of the Context. __getitem__(item)
: Makes class subscriptable.
Inherited from Mapping items()
: Returns all items of the Context. keys()
: Returns all keys of the Context. values()
: Returns all values of the Context.
Source code in src/koheesio/context.py
def __init__(self, *args, **kwargs):\n \"\"\"Initializes the Context object with given arguments.\"\"\"\n for arg in args:\n if isinstance(arg, dict):\n kwargs.update(arg)\n if isinstance(arg, Context):\n kwargs.update(arg.to_dict())\n\n for key, value in kwargs.items():\n self.__dict__[key] = self.process_value(value)\n
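As an illustrative sketch of typical usage (hypothetical values), a Context can be built from a plain dict, extended, and queried with dotted keys:
from koheesio.context import Context\n\ncontext = Context({\"env\": \"dev\", \"db\": {\"host\": \"localhost\", \"port\": 5432}})\ncontext.add(\"run_id\", \"abc123\") # add a top-level key\ncontext.get(\"db.host\") # nested keys use dotted notation -> 'localhost'\ncontext.to_dict() # convert back to a plain dictionary\n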
"},{"location":"api_reference/index.html#koheesio.Context.add","title":"add","text":"add(key: str, value: Any) -> Context\n
Add a key/value pair to the context
Source code in src/koheesio/context.py
def add(self, key: str, value: Any) -> Context:\n \"\"\"Add a key/value pair to the context\"\"\"\n self.__dict__[key] = value\n return self\n
"},{"location":"api_reference/index.html#koheesio.Context.contains","title":"contains","text":"contains(key: str) -> bool\n
Check if the context contains a given key
Parameters:
Name Type Description Default key
str
required Returns:
Type Description bool
Source code in src/koheesio/context.py
def contains(self, key: str) -> bool:\n \"\"\"Check if the context contains a given key\n\n Parameters\n ----------\n key: str\n\n Returns\n -------\n bool\n \"\"\"\n try:\n self.get(key, safe=False)\n return True\n except KeyError:\n return False\n
"},{"location":"api_reference/index.html#koheesio.Context.from_dict","title":"from_dict classmethod
","text":"from_dict(kwargs: dict) -> Context\n
Creates Context object from the given dict
Parameters:
Name Type Description Default kwargs
dict
required Returns:
Type Description Context
Source code in src/koheesio/context.py
@classmethod\ndef from_dict(cls, kwargs: dict) -> Context:\n \"\"\"Creates Context object from the given dict\n\n Parameters\n ----------\n kwargs: dict\n\n Returns\n -------\n Context\n \"\"\"\n return cls(kwargs)\n
"},{"location":"api_reference/index.html#koheesio.Context.from_json","title":"from_json classmethod
","text":"from_json(json_file_or_str: Union[str, Path]) -> Context\n
Creates Context object from a given json file
Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be stored in the Context object, which is not possible with the standard json library.
Why jsonpickle? (from https://jsonpickle.github.io/)
Data serialized with python\u2019s pickle (or cPickle or dill) is not easily readable outside of python. Using the json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports json.
Security (from https://jsonpickle.github.io/)
jsonpickle should be treated the same as the Python stdlib pickle module from a security perspective.
Parameters:
Name Type Description Default json_file_or_str
Union[str, Path]
Pathlike string or Path that points to the json file or string containing json
required Returns:
Type Description Context
Source code in src/koheesio/context.py
@classmethod\ndef from_json(cls, json_file_or_str: Union[str, Path]) -> Context:\n \"\"\"Creates Context object from a given json file\n\n Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be\n stored in the Context object, which is not possible with the standard json library.\n\n Why jsonpickle?\n ---------------\n (from https://jsonpickle.github.io/)\n\n > Data serialized with python\u2019s pickle (or cPickle or dill) is not easily readable outside of python. Using the\n json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex\n data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports\n json.\n\n Security\n --------\n (from https://jsonpickle.github.io/)\n\n > jsonpickle should be treated the same as the Python stdlib pickle module from a security perspective.\n\n ### ! Warning !\n > The jsonpickle module is not secure. Only unpickle data you trust.\n It is possible to construct malicious pickle data which will execute arbitrary code during unpickling.\n Never unpickle data that could have come from an untrusted source, or that could have been tampered with.\n Consider signing data with an HMAC if you need to ensure that it has not been tampered with.\n Safer deserialization approaches, such as reading JSON directly, may be more appropriate if you are processing\n untrusted data.\n\n Parameters\n ----------\n json_file_or_str : Union[str, Path]\n Pathlike string or Path that points to the json file or string containing json\n\n Returns\n -------\n Context\n \"\"\"\n json_str = json_file_or_str\n\n # check if json_str is pathlike\n if (json_file := Path(json_file_or_str)).exists():\n json_str = json_file.read_text(encoding=\"utf-8\")\n\n json_dict = jsonpickle.loads(json_str)\n return cls.from_dict(json_dict)\n
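A minimal sketch with a hypothetical, trusted JSON string (see the warning above); a path to a .json file is handled the same way:
context = Context.from_json('{\"env\": \"dev\", \"retries\": 3}')\ncontext.get(\"retries\") # -> 3\n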
"},{"location":"api_reference/index.html#koheesio.Context.from_json--warning","title":"! Warning !","text":"The jsonpickle module is not secure. Only unpickle data you trust. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with. Consider signing data with an HMAC if you need to ensure that it has not been tampered with. Safer deserialization approaches, such as reading JSON directly, may be more appropriate if you are processing untrusted data.
"},{"location":"api_reference/index.html#koheesio.Context.from_toml","title":"from_toml classmethod
","text":"from_toml(toml_file_or_str: Union[str, Path]) -> Context\n
Creates Context object from a given toml file
Parameters:
Name Type Description Default toml_file_or_str
Union[str, Path]
Pathlike string or Path that points to the toml file or string containing toml
required Returns:
Type Description Context
Source code in src/koheesio/context.py
@classmethod\ndef from_toml(cls, toml_file_or_str: Union[str, Path]) -> Context:\n \"\"\"Creates Context object from a given toml file\n\n Parameters\n ----------\n toml_file_or_str: Union[str, Path]\n Pathlike string or Path that points to the toml file or string containing toml\n\n Returns\n -------\n Context\n \"\"\"\n toml_str = toml_file_or_str\n\n # check if toml_str is pathlike\n if (toml_file := Path(toml_file_or_str)).exists():\n toml_str = toml_file.read_text(encoding=\"utf-8\")\n\n toml_dict = tomli.loads(toml_str)\n return cls.from_dict(toml_dict)\n
"},{"location":"api_reference/index.html#koheesio.Context.from_yaml","title":"from_yaml classmethod
","text":"from_yaml(yaml_file_or_str: str) -> Context\n
Creates Context object from a given yaml file
Parameters:
Name Type Description Default yaml_file_or_str
str
Pathlike string or Path that points to the yaml file, or string containing yaml
required Returns:
Type Description Context
Source code in src/koheesio/context.py
@classmethod\ndef from_yaml(cls, yaml_file_or_str: str) -> Context:\n \"\"\"Creates Context object from a given yaml file\n\n Parameters\n ----------\n yaml_file_or_str: str or Path\n Pathlike string or Path that points to the yaml file, or string containing yaml\n\n Returns\n -------\n Context\n \"\"\"\n yaml_str = yaml_file_or_str\n\n # check if yaml_str is pathlike\n if (yaml_file := Path(yaml_file_or_str)).exists():\n yaml_str = yaml_file.read_text(encoding=\"utf-8\")\n\n # Bandit: disable yaml.load warning\n yaml_dict = yaml.load(yaml_str, Loader=yaml.Loader) # nosec B506: yaml_load\n\n return cls.from_dict(yaml_dict)\n
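A minimal sketch with a hypothetical inline YAML string; a path to a .yml file works the same way:
context = Context.from_yaml(\"\"\"\nenv: dev\ndb:\n host: localhost\n\"\"\")\ncontext.get(\"db.host\") # -> 'localhost'\n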
"},{"location":"api_reference/index.html#koheesio.Context.get","title":"get","text":"get(key: str, default: Any = None, safe: bool = True) -> Any\n
Get value of a given key
The key can either be an actual key (top level) or the key of a nested value. Behaves a lot like a dict's .get()
method otherwise.
Parameters:
Name Type Description Default key
str
Can be a real key, or can be a dotted notation of a nested key
required default
Any
Default value to return
None
safe
bool
Toggles whether to fail or not when item cannot be found
True
Returns:
Type Description Any
Value of the requested item
Example Example of a nested call:
context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\ncontext.get(\"a.b\")\n
Returns c
Source code in src/koheesio/context.py
def get(self, key: str, default: Any = None, safe: bool = True) -> Any:\n \"\"\"Get value of a given key\n\n The key can either be an actual key (top level) or the key of a nested value.\n Behaves a lot like a dict's `.get()` method otherwise.\n\n Parameters\n ----------\n key:\n Can be a real key, or can be a dotted notation of a nested key\n default:\n Default value to return\n safe:\n Toggles whether to fail or not when item cannot be found\n\n Returns\n -------\n Any\n Value of the requested item\n\n Example\n -------\n Example of a nested call:\n\n ```python\n context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\n context.get(\"a.b\")\n ```\n\n Returns `c`\n \"\"\"\n try:\n if \".\" not in key:\n return self.__dict__[key]\n\n # handle nested keys\n nested_keys = key.split(\".\")\n value = self # parent object\n for k in nested_keys:\n value = value[k] # iterate through nested values\n return value\n\n except (AttributeError, KeyError, TypeError) as e:\n if not safe:\n raise KeyError(f\"requested key '{key}' does not exist in {self}\") from e\n return default\n
"},{"location":"api_reference/index.html#koheesio.Context.get_all","title":"get_all","text":"get_all() -> dict\n
alias to to_dict()
Source code in src/koheesio/context.py
def get_all(self) -> dict:\n \"\"\"alias to to_dict()\"\"\"\n return self.to_dict()\n
"},{"location":"api_reference/index.html#koheesio.Context.get_item","title":"get_item","text":"get_item(key: str, default: Any = None, safe: bool = True) -> Dict[str, Any]\n
Acts just like .get
, except that it returns the key also
Returns:
Type Description Dict[str, Any]
key/value-pair of the requested item
Example Example of a nested call:
context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\ncontext.get_item(\"a.b\")\n
Returns {'a.b': 'c'}
Source code in src/koheesio/context.py
def get_item(self, key: str, default: Any = None, safe: bool = True) -> Dict[str, Any]:\n \"\"\"Acts just like `.get`, except that it returns the key also\n\n Returns\n -------\n Dict[str, Any]\n key/value-pair of the requested item\n\n Example\n -------\n Example of a nested call:\n\n ```python\n context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\n context.get_item(\"a.b\")\n ```\n\n Returns `{'a.b': 'c'}`\n \"\"\"\n value = self.get(key, default, safe)\n return {key: value}\n
"},{"location":"api_reference/index.html#koheesio.Context.merge","title":"merge","text":"merge(context: Context, recursive: bool = False) -> Context\n
Merge this context with the context of another, where the incoming context has priority.
Parameters:
Name Type Description Default context
Context
Another Context class
required recursive
bool
Recursively merge two dictionaries to an arbitrary depth
False
Returns:
Type Description Context
updated context
Source code in src/koheesio/context.py
def merge(self, context: Context, recursive: bool = False) -> Context:\n \"\"\"Merge this context with the context of another, where the incoming context has priority.\n\n Parameters\n ----------\n context: Context\n Another Context class\n recursive: bool\n Recursively merge two dictionaries to an arbitrary depth\n\n Returns\n -------\n Context\n updated context\n \"\"\"\n if recursive:\n return Context.from_dict(self._recursive_merge(target_context=self, merge_context=context).to_dict())\n\n # just merge on the top level keys\n return Context.from_dict({**self.to_dict(), **context.to_dict()})\n
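A sketch of the difference between a plain and a recursive merge (hypothetical values):
base = Context({\"db\": {\"host\": \"localhost\", \"port\": 5432}})\noverride = Context({\"db\": {\"host\": \"prod-db\"}})\nbase.merge(override).to_dict() # top-level merge: the whole 'db' value is replaced -> {'db': {'host': 'prod-db'}}\nbase.merge(override, recursive=True).to_dict() # nested merge: 'port' is kept alongside the overridden 'host'\n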
"},{"location":"api_reference/index.html#koheesio.Context.process_value","title":"process_value","text":"process_value(value: Any) -> Any\n
Processes the given value, converting dictionaries to Context objects as needed.
Source code in src/koheesio/context.py
def process_value(self, value: Any) -> Any:\n \"\"\"Processes the given value, converting dictionaries to Context objects as needed.\"\"\"\n if isinstance(value, dict):\n return self.from_dict(value)\n\n if isinstance(value, (list, set)):\n return [self.from_dict(v) if isinstance(v, dict) else v for v in value]\n\n return value\n
"},{"location":"api_reference/index.html#koheesio.Context.to_dict","title":"to_dict","text":"to_dict() -> Dict[str, Any]\n
Returns all parameters of the context as a dict
Returns:
Type Description dict
containing all parameters of the context
Source code in src/koheesio/context.py
def to_dict(self) -> Dict[str, Any]:\n \"\"\"Returns all parameters of the context as a dict\n\n Returns\n -------\n dict\n containing all parameters of the context\n \"\"\"\n result = {}\n\n for key, value in self.__dict__.items():\n if isinstance(value, Context):\n result[key] = value.to_dict()\n elif isinstance(value, list):\n result[key] = [e.to_dict() if isinstance(e, Context) else e for e in value]\n else:\n result[key] = value\n\n return result\n
"},{"location":"api_reference/index.html#koheesio.Context.to_json","title":"to_json","text":"to_json(pretty: bool = False) -> str\n
Returns all parameters of the context as a json string
Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be stored in the Context object, which is not possible with the standard json library.
Why jsonpickle? (from https://jsonpickle.github.io/)
Data serialized with python's pickle (or cPickle or dill) is not easily readable outside of python. Using the json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports json.
Parameters:
Name Type Description Default pretty
bool
Toggles whether to return a pretty json string or not
False
Returns:
Type Description str
containing all parameters of the context
Source code in src/koheesio/context.py
def to_json(self, pretty: bool = False) -> str:\n \"\"\"Returns all parameters of the context as a json string\n\n Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be\n stored in the Context object, which is not possible with the standard json library.\n\n Why jsonpickle?\n ---------------\n (from https://jsonpickle.github.io/)\n\n > Data serialized with python's pickle (or cPickle or dill) is not easily readable outside of python. Using the\n json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex\n data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports\n json.\n\n Parameters\n ----------\n pretty : bool, optional, default=False\n Toggles whether to return a pretty json string or not\n\n Returns\n -------\n str\n containing all parameters of the context\n \"\"\"\n d = self.to_dict()\n return jsonpickle.dumps(d, indent=4) if pretty else jsonpickle.dumps(d)\n
"},{"location":"api_reference/index.html#koheesio.Context.to_yaml","title":"to_yaml","text":"to_yaml(clean: bool = False) -> str\n
Returns all parameters of the context as a yaml string
Parameters:
Name Type Description Default clean
bool
Toggles whether to remove !!python/object:...
from yaml or not. Default: False
False
Returns:
Type Description str
containing all parameters of the context
Source code in src/koheesio/context.py
def to_yaml(self, clean: bool = False) -> str:\n \"\"\"Returns all parameters of the context as a yaml string\n\n Parameters\n ----------\n clean: bool\n Toggles whether to remove `!!python/object:...` from yaml or not.\n Default: False\n\n Returns\n -------\n str\n containing all parameters of the context\n \"\"\"\n # sort_keys=False to preserve order of keys\n yaml_str = yaml.dump(self.to_dict(), sort_keys=False)\n\n # remove `!!python/object:...` from yaml\n if clean:\n remove_pattern = re.compile(r\"!!python/object:.*?\\n\")\n yaml_str = re.sub(remove_pattern, \"\\n\", yaml_str)\n\n return yaml_str\n
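A minimal usage sketch (hypothetical values); clean=True only changes the output when non-primitive Python objects are stored in the Context:
context = Context({\"env\": \"dev\", \"retries\": 3})\ncontext.to_yaml() # 'env: dev' and 'retries: 3', each on its own line\ncontext.to_yaml(clean=True) # identical here; strips '!!python/object:...' tags when complex objects are present\n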
"},{"location":"api_reference/index.html#koheesio.ExtraParamsMixin","title":"koheesio.ExtraParamsMixin","text":"Mixin class that adds support for arbitrary keyword arguments to Pydantic models.
The keyword arguments are extracted from the model's values
and moved to a params
dictionary.
"},{"location":"api_reference/index.html#koheesio.ExtraParamsMixin.extra_params","title":"extra_params cached
property
","text":"extra_params: Dict[str, Any]\n
Extract params (passed as arbitrary kwargs) from values and move them to params dict
"},{"location":"api_reference/index.html#koheesio.ExtraParamsMixin.params","title":"params class-attribute
instance-attribute
","text":"params: Dict[str, Any] = Field(default_factory=dict)\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory","title":"koheesio.LoggingFactory","text":"LoggingFactory(name: Optional[str] = None, env: Optional[str] = None, level: Optional[str] = None, logger_id: Optional[str] = None)\n
Logging factory to be used to generate logger instances.
Parameters:
Name Type Description Default name
Optional[str]
None
env
Optional[str]
None
logger_id
Optional[str]
None
Source code in src/koheesio/logger.py
def __init__(\n self,\n name: Optional[str] = None,\n env: Optional[str] = None,\n level: Optional[str] = None,\n logger_id: Optional[str] = None,\n):\n \"\"\"Logging factory to be used in pipeline.Prepare logger instance.\n\n Parameters\n ----------\n name logger name.\n env environment (\"local\", \"qa\", \"prod).\n logger_id unique identifier for the logger.\n \"\"\"\n\n LoggingFactory.LOGGER_NAME = name or LoggingFactory.LOGGER_NAME\n LoggerIDFilter.LOGGER_ID = logger_id or LoggerIDFilter.LOGGER_ID\n LoggingFactory.LOGGER_FILTER = LoggingFactory.LOGGER_FILTER or LoggerIDFilter()\n LoggingFactory.ENV = env or LoggingFactory.ENV\n\n console_handler = logging.StreamHandler(sys.stdout if LoggingFactory.ENV == \"local\" else sys.stderr)\n console_handler.setFormatter(LoggingFactory.LOGGER_FORMATTER)\n console_handler.addFilter(LoggingFactory.LOGGER_FILTER)\n # WARNING is default level for root logger in python\n logging.basicConfig(level=logging.WARNING, handlers=[console_handler], force=True)\n\n LoggingFactory.CONSOLE_HANDLER = console_handler\n\n logger = getLogger(LoggingFactory.LOGGER_NAME)\n logger.setLevel(level or LoggingFactory.LOGGER_LEVEL)\n LoggingFactory.LOGGER = logger\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.CONSOLE_HANDLER","title":"CONSOLE_HANDLER class-attribute
instance-attribute
","text":"CONSOLE_HANDLER: Optional[Handler] = None\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.ENV","title":"ENV class-attribute
instance-attribute
","text":"ENV: Optional[str] = None\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER","title":"LOGGER class-attribute
instance-attribute
","text":"LOGGER: Optional[Logger] = None\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_ENV","title":"LOGGER_ENV class-attribute
instance-attribute
","text":"LOGGER_ENV: str = 'local'\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_FILTER","title":"LOGGER_FILTER class-attribute
instance-attribute
","text":"LOGGER_FILTER: Optional[Filter] = None\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_FORMAT","title":"LOGGER_FORMAT class-attribute
instance-attribute
","text":"LOGGER_FORMAT: str = '[%(logger_id)s] [%(asctime)s] [%(levelname)s] [%(name)s] {%(module)s.py:%(funcName)s:%(lineno)d} - %(message)s'\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_FORMATTER","title":"LOGGER_FORMATTER class-attribute
instance-attribute
","text":"LOGGER_FORMATTER: Formatter = Formatter(LOGGER_FORMAT)\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_LEVEL","title":"LOGGER_LEVEL class-attribute
instance-attribute
","text":"LOGGER_LEVEL: str = get('KOHEESIO_LOGGING_LEVEL', 'WARNING')\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_NAME","title":"LOGGER_NAME class-attribute
instance-attribute
","text":"LOGGER_NAME: str = 'koheesio'\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.add_handlers","title":"add_handlers staticmethod
","text":"add_handlers(handlers: List[Tuple[str, Dict]]) -> None\n
Add handlers to existing root logger.
Parameters:
Name Type Description Default handler_class
required handlers_config
required Source code in src/koheesio/logger.py
@staticmethod\ndef add_handlers(handlers: List[Tuple[str, Dict]]) -> None:\n \"\"\"Add handlers to existing root logger.\n\n Parameters\n ----------\n handler_class handler module and class for importing.\n handlers_config configuration for handler.\n\n \"\"\"\n for handler_module_class, handler_conf in handlers:\n handler_class: logging.Handler = import_class(handler_module_class)\n handler_level = handler_conf.pop(\"level\") if \"level\" in handler_conf else \"WARNING\"\n # noinspection PyCallingNonCallable\n handler = handler_class(**handler_conf)\n handler.setLevel(handler_level)\n handler.addFilter(LoggingFactory.LOGGER_FILTER)\n handler.setFormatter(LoggingFactory.LOGGER_FORMATTER)\n LoggingFactory.LOGGER.addHandler(handler)\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.get_logger","title":"get_logger staticmethod
","text":"get_logger(name: str, inherit_from_koheesio: bool = False) -> Logger\n
Provide logger. If inherit_from_koheesio is set, the logger inherits from LoggingFactory.LOGGER_NAME.
Parameters:
Name Type Description Default name
str
required inherit_from_koheesio
bool
False
Returns:
Name Type Description logger
Logger
Source code in src/koheesio/logger.py
@staticmethod\ndef get_logger(name: str, inherit_from_koheesio: bool = False) -> Logger:\n \"\"\"Provide logger. If inherit_from_koheesio then inherit from LoggingFactory.PIPELINE_LOGGER_NAME.\n\n Parameters\n ----------\n name: Name of logger.\n inherit_from_koheesio: Inherit logger from koheesio\n\n Returns\n -------\n logger: Logger\n\n \"\"\"\n if inherit_from_koheesio:\n LoggingFactory.__check_koheesio_logger_initialized()\n name = f\"{LoggingFactory.LOGGER_NAME}.{name}\"\n\n return getLogger(name)\n
"},{"location":"api_reference/index.html#koheesio.Step","title":"koheesio.Step","text":"Base class for a step
A custom unit of logic that can be executed.
The Step class is designed to be subclassed. To create a new step, one would subclass Step and implement the def execute(self)
method, specifying the expected inputs and outputs.
Note: since the Step class is meta classed, the execute method is wrapped with the do_execute
function making it always return the Step's output. Hence, an explicit return is not needed when implementing execute.
Methods and Attributes The Step class has several attributes and methods.
Background A Step is an atomic operation and serves as the building block of data pipelines built with the framework. Tasks typically consist of a series of Steps.
A step can be seen as an operation on a set of inputs, that returns a set of outputs. This however does not imply that steps are stateless (e.g. data writes)!
The diagram serves to illustrate the concept of a Step:
\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 1 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 1 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u2502 \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 2 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Step \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 2 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u2502 \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 3 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 3 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n
Steps are built on top of Pydantic, which is a data validation and settings management using python type annotations. This allows for the automatic validation of the inputs and outputs of a Step.
- Step inherits from BaseModel, which is a Pydantic class used to define data models. This allows Step to automatically validate data against the defined fields and their types.
- Step is metaclassed by StepMetaClass, which is a custom metaclass that wraps the
execute
method of the Step class with the _execute_wrapper
function. This ensures that the execute
method always returns the output of the Step along with providing logging and validation of the output. - Step has an
Output
class, which is a subclass of StepOutput
. This class is used to validate the output of the Step. The Output
class is defined as an inner class of the Step class. The Output
class can be accessed through the Step.Output
attribute. - The
Output
class can be extended to add additional fields to the output of the Step.
Examples:
class MyStep(Step):\n a: str # input\n\n class Output(StepOutput): # output\n b: str\n\n def execute(self) -> MyStep.Output:\n self.output.b = f\"{self.a}-some-suffix\"\n
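Building on the example above, a brief sketch of running the step; execute returns the Output thanks to the metaclass wrapper, so no explicit return is needed:
step = MyStep(a=\"foo\")\nstep.execute() # returns MyStep.Output\nstep.output.b # -> 'foo-some-suffix'\n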
"},{"location":"api_reference/index.html#koheesio.Step--input","title":"INPUT","text":"The following fields are available by default on the Step class: - name
: Name of the Step. If not set, the name of the class will be used. - description
: Description of the Step. If not set, the docstring of the class will be used. If the docstring contains multiple lines, only the first line will be used.
When subclassing a Step, any additional pydantic field will be treated as input
to the Step. See also the explanation on the .execute()
method below.
"},{"location":"api_reference/index.html#koheesio.Step--output","title":"OUTPUT","text":"Every Step has an Output
class, which is a subclass of StepOutput
. This class is used to validate the output of the Step. The Output
class is defined as an inner class of the Step class. The Output
class can be accessed through the Step.Output
attribute. The Output
class can be extended to add additional fields to the output of the Step. See also the explanation on the .execute()
.
Output
: A nested class representing the output of the Step used to validate the output of the Step and based on the StepOutput class. output
: Allows you to interact with the Output of the Step lazily (see above and StepOutput)
When subclassing a Step, any additional pydantic field added to the nested Output
class will be treated as output
of the Step. See also the description of StepOutput
for more information.
"},{"location":"api_reference/index.html#koheesio.Step--methods","title":"Methods:","text":" execute
: Abstract method to implement for new steps. - The Inputs of the step can be accessed, using
self.input_name
. - The output of the step can be accessed, using
self.output.output_name
.
run
: Alias to .execute() method. You can use this to run the step, but execute is preferred. to_yaml
: YAML dump the step get_description
: Get the description of the Step
When subclassing a Step, execute
is the only method that needs to be implemented. Any additional method added to the class will be treated as a method of the Step.
Note: since the Step class is meta-classed, the execute method is automatically wrapped with the do_execute
function making it always return a StepOutput. See also the explanation on the do_execute
function.
"},{"location":"api_reference/index.html#koheesio.Step--class-methods","title":"class methods:","text":" from_step
: Returns a new Step instance based on the data of another Step instance. for example: MyStep.from_step(other_step, a=\"foo\")
get_description
: Get the description of the Step
"},{"location":"api_reference/index.html#koheesio.Step--dunder-methods","title":"dunder methods:","text":" __getattr__
: Allows input to be accessed through self.input_name
__repr__
and __str__
: String representation of a step
"},{"location":"api_reference/index.html#koheesio.Step.output","title":"output property
writable
","text":"output: Output\n
Interact with the output of the Step
"},{"location":"api_reference/index.html#koheesio.Step.Output","title":"Output","text":"Output class for Step
"},{"location":"api_reference/index.html#koheesio.Step.execute","title":"execute abstractmethod
","text":"execute()\n
Abstract method to implement for new steps.
The Inputs of the step can be accessed, using self.input_name
Note: since the Step class is meta-classed, the execute method is wrapped with the do_execute
function making it always return the Steps output
Source code in src/koheesio/steps/__init__.py
@abstractmethod\ndef execute(self):\n \"\"\"Abstract method to implement for new steps.\n\n The Inputs of the step can be accessed, using `self.input_name`\n\n Note: since the Step class is meta-classed, the execute method is wrapped with the `do_execute` function making\n it always return the Steps output\n \"\"\"\n raise NotImplementedError\n
"},{"location":"api_reference/index.html#koheesio.Step.from_step","title":"from_step classmethod
","text":"from_step(step: Step, **kwargs)\n
Returns a new Step instance based on the data of another Step or BaseModel instance
Source code in src/koheesio/steps/__init__.py
@classmethod\ndef from_step(cls, step: Step, **kwargs):\n \"\"\"Returns a new Step instance based on the data of another Step or BaseModel instance\"\"\"\n return cls.from_basemodel(step, **kwargs)\n
"},{"location":"api_reference/index.html#koheesio.Step.repr_json","title":"repr_json","text":"repr_json(simple=False) -> str\n
dump the step to json, meant for representation
Note: use to_json if you want to dump the step to json for serialization This method is meant for representation purposes only!
Examples:
>>> step = MyStep(a=\"foo\")\n>>> print(step.repr_json())\n{\"input\": {\"a\": \"foo\"}}\n
Parameters:
Name Type Description Default simple
When toggled to True, a briefer output will be produced. This is friendlier for logging purposes
False
Returns:
Type Description str
A string, which is valid json
Source code in src/koheesio/steps/__init__.py
def repr_json(self, simple=False) -> str:\n \"\"\"dump the step to json, meant for representation\n\n Note: use to_json if you want to dump the step to json for serialization\n This method is meant for representation purposes only!\n\n Examples\n --------\n ```python\n >>> step = MyStep(a=\"foo\")\n >>> print(step.repr_json())\n {\"input\": {\"a\": \"foo\"}}\n ```\n\n Parameters\n ----------\n simple: bool\n When toggled to True, a briefer output will be produced. This is friendlier for logging purposes\n\n Returns\n -------\n str\n A string, which is valid json\n \"\"\"\n model_dump_options = dict(warnings=\"none\", exclude_unset=True)\n\n _result = {}\n\n # extract input\n _input = self.model_dump(**model_dump_options)\n\n # remove name and description from input and add to result if simple is not set\n name = _input.pop(\"name\", None)\n description = _input.pop(\"description\", None)\n if not simple:\n if name:\n _result[\"name\"] = name\n if description:\n _result[\"description\"] = description\n else:\n model_dump_options[\"exclude\"] = {\"name\", \"description\"}\n\n # extract output\n _output = self.output.model_dump(**model_dump_options)\n\n # add output to result\n if _output:\n _result[\"output\"] = _output\n\n # add input to result\n _result[\"input\"] = _input\n\n class MyEncoder(json.JSONEncoder):\n \"\"\"Custom JSON Encoder to handle non-serializable types\"\"\"\n\n def default(self, o: Any) -> Any:\n try:\n return super().default(o)\n except TypeError:\n return o.__class__.__name__\n\n # Use MyEncoder when converting the dictionary to a JSON string\n json_str = json.dumps(_result, cls=MyEncoder)\n\n return json_str\n
"},{"location":"api_reference/index.html#koheesio.Step.repr_yaml","title":"repr_yaml","text":"repr_yaml(simple=False) -> str\n
dump the step to yaml, meant for representation
Note: use to_yaml if you want to dump the step to yaml for serialization This method is meant for representation purposes only!
Examples:
>>> step = MyStep(a=\"foo\")\n>>> print(step.repr_yaml())\ninput:\n a: foo\n
Parameters:
Name Type Description Default simple
When toggled to True, a briefer output will be produced. This is friendlier for logging purposes
False
Returns:
Type Description str
A string, which is valid yaml
Source code in src/koheesio/steps/__init__.py
def repr_yaml(self, simple=False) -> str:\n \"\"\"dump the step to yaml, meant for representation\n\n Note: use to_yaml if you want to dump the step to yaml for serialization\n This method is meant for representation purposes only!\n\n Examples\n --------\n ```python\n >>> step = MyStep(a=\"foo\")\n >>> print(step.repr_yaml())\n input:\n a: foo\n ```\n\n Parameters\n ----------\n simple: bool\n When toggled to True, a briefer output will be produced. This is friendlier for logging purposes\n\n Returns\n -------\n str\n A string, which is valid yaml\n \"\"\"\n json_str = self.repr_json(simple=simple)\n\n # Parse the JSON string back into a dictionary\n _result = json.loads(json_str)\n\n return yaml.dump(_result)\n
"},{"location":"api_reference/index.html#koheesio.Step.run","title":"run","text":"run()\n
Alias to .execute()
Source code in src/koheesio/steps/__init__.py
def run(self):\n \"\"\"Alias to .execute()\"\"\"\n return self.execute()\n
"},{"location":"api_reference/index.html#koheesio.StepOutput","title":"koheesio.StepOutput","text":"Class for the StepOutput model
Usage Setting up the StepOutputs class is done like this:
class YourOwnOutput(StepOutput):\n a: str\n b: int\n
"},{"location":"api_reference/index.html#koheesio.StepOutput.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(validate_default=False, defer_build=True)\n
"},{"location":"api_reference/index.html#koheesio.StepOutput.validate_output","title":"validate_output","text":"validate_output() -> StepOutput\n
Validate the output of the Step
Essentially, this method is a wrapper around the validate method of the BaseModel class
Source code in src/koheesio/steps/__init__.py
def validate_output(self) -> StepOutput:\n \"\"\"Validate the output of the Step\n\n Essentially, this method is a wrapper around the validate method of the BaseModel class\n \"\"\"\n validated_model = self.validate()\n return StepOutput.from_basemodel(validated_model)\n
"},{"location":"api_reference/index.html#koheesio.print_logo","title":"koheesio.print_logo","text":"print_logo()\n
Source code in src/koheesio/__init__.py
def print_logo():\n global _logo_printed\n global _koheesio_print_logo\n\n if not _logo_printed and _koheesio_print_logo:\n print(ABOUT)\n _logo_printed = True\n
"},{"location":"api_reference/context.html","title":"Context","text":"The Context module is a part of the Koheesio framework and is primarily used for managing the environment configuration where a Task or Step runs. It helps in adapting the behavior of a Task/Step based on the environment it operates in, thereby avoiding the repetition of configuration values across different tasks.
The Context class, which is a key component of this module, functions similarly to a dictionary but with additional features. It supports operations like handling nested keys, recursive merging of contexts, and serialization/deserialization to and from various formats like JSON, YAML, and TOML.
For a comprehensive guide on the usage, examples, and additional features of the Context class, please refer to the reference/concepts/context section of the Koheesio documentation.
"},{"location":"api_reference/context.html#koheesio.context.Context","title":"koheesio.context.Context","text":"Context(*args, **kwargs)\n
The Context class is a key component of the Koheesio framework, designed to manage configuration data and shared variables across tasks and steps in your application. It behaves much like a dictionary, but with added functionalities.
Key Features - Nested keys: Supports accessing and adding nested keys similar to dictionary keys.
- Recursive merging: Merges two Contexts together, with the incoming Context having priority.
- Serialization/Deserialization: Easily created from a yaml, toml, or json file, or a dictionary, and can be converted back to a dictionary.
- Handling complex Python objects: Uses jsonpickle for serialization and deserialization of complex Python objects to and from JSON.
For a comprehensive guide on the usage, examples, and additional features of the Context class, please refer to the reference/concepts/context section of the Koheesio documentation.
Methods:
Name Description add
Add a key/value pair to the context.
get
Get value of a given key.
get_item
Acts just like .get
, except that it returns the key also.
contains
Check if the context contains a given key.
merge
Merge this context with the context of another, where the incoming context has priority.
to_dict
Returns all parameters of the context as a dict.
from_dict
Creates Context object from the given dict.
from_yaml
Creates Context object from a given yaml file.
from_json
Creates Context object from a given json file.
Dunder methods - __iter__()
: Allows for iteration across a Context. __len__()
: Returns the length of the Context. __getitem__(item)
: Makes class subscriptable.
Inherited from Mapping items()
: Returns all items of the Context. keys()
: Returns all keys of the Context. values()
: Returns all values of the Context.
Source code in src/koheesio/context.py
def __init__(self, *args, **kwargs):\n \"\"\"Initializes the Context object with given arguments.\"\"\"\n for arg in args:\n if isinstance(arg, dict):\n kwargs.update(arg)\n if isinstance(arg, Context):\n kwargs.update(arg.to_dict())\n\n for key, value in kwargs.items():\n self.__dict__[key] = self.process_value(value)\n
"},{"location":"api_reference/context.html#koheesio.context.Context.add","title":"add","text":"add(key: str, value: Any) -> Context\n
Add a key/value pair to the context
Source code in src/koheesio/context.py
def add(self, key: str, value: Any) -> Context:\n \"\"\"Add a key/value pair to the context\"\"\"\n self.__dict__[key] = value\n return self\n
"},{"location":"api_reference/context.html#koheesio.context.Context.contains","title":"contains","text":"contains(key: str) -> bool\n
Check if the context contains a given key
Parameters:
Name Type Description Default key
str
required Returns:
Type Description bool
Source code in src/koheesio/context.py
def contains(self, key: str) -> bool:\n \"\"\"Check if the context contains a given key\n\n Parameters\n ----------\n key: str\n\n Returns\n -------\n bool\n \"\"\"\n try:\n self.get(key, safe=False)\n return True\n except KeyError:\n return False\n
"},{"location":"api_reference/context.html#koheesio.context.Context.from_dict","title":"from_dict classmethod
","text":"from_dict(kwargs: dict) -> Context\n
Creates Context object from the given dict
Parameters:
Name Type Description Default kwargs
dict
required Returns:
Type Description Context
Source code in src/koheesio/context.py
@classmethod\ndef from_dict(cls, kwargs: dict) -> Context:\n \"\"\"Creates Context object from the given dict\n\n Parameters\n ----------\n kwargs: dict\n\n Returns\n -------\n Context\n \"\"\"\n return cls(kwargs)\n
"},{"location":"api_reference/context.html#koheesio.context.Context.from_json","title":"from_json classmethod
","text":"from_json(json_file_or_str: Union[str, Path]) -> Context\n
Creates Context object from a given json file
Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be stored in the Context object, which is not possible with the standard json library.
Why jsonpickle? (from https://jsonpickle.github.io/)
Data serialized with python\u2019s pickle (or cPickle or dill) is not easily readable outside of python. Using the json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports json.
Security (from https://jsonpickle.github.io/)
jsonpickle should be treated the same as the Python stdlib pickle module from a security perspective.
Parameters:
Name Type Description Default json_file_or_str
Union[str, Path]
Pathlike string or Path that points to the json file or string containing json
required Returns:
Type Description Context
Source code in src/koheesio/context.py
@classmethod\ndef from_json(cls, json_file_or_str: Union[str, Path]) -> Context:\n \"\"\"Creates Context object from a given json file\n\n Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be\n stored in the Context object, which is not possible with the standard json library.\n\n Why jsonpickle?\n ---------------\n (from https://jsonpickle.github.io/)\n\n > Data serialized with python\u2019s pickle (or cPickle or dill) is not easily readable outside of python. Using the\n json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex\n data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports\n json.\n\n Security\n --------\n (from https://jsonpickle.github.io/)\n\n > jsonpickle should be treated the same as the Python stdlib pickle module from a security perspective.\n\n ### ! Warning !\n > The jsonpickle module is not secure. Only unpickle data you trust.\n It is possible to construct malicious pickle data which will execute arbitrary code during unpickling.\n Never unpickle data that could have come from an untrusted source, or that could have been tampered with.\n Consider signing data with an HMAC if you need to ensure that it has not been tampered with.\n Safer deserialization approaches, such as reading JSON directly, may be more appropriate if you are processing\n untrusted data.\n\n Parameters\n ----------\n json_file_or_str : Union[str, Path]\n Pathlike string or Path that points to the json file or string containing json\n\n Returns\n -------\n Context\n \"\"\"\n json_str = json_file_or_str\n\n # check if json_str is pathlike\n if (json_file := Path(json_file_or_str)).exists():\n json_str = json_file.read_text(encoding=\"utf-8\")\n\n json_dict = jsonpickle.loads(json_str)\n return cls.from_dict(json_dict)\n
"},{"location":"api_reference/context.html#koheesio.context.Context.from_json--warning","title":"! Warning !","text":"The jsonpickle module is not secure. Only unpickle data you trust. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with. Consider signing data with an HMAC if you need to ensure that it has not been tampered with. Safer deserialization approaches, such as reading JSON directly, may be more appropriate if you are processing untrusted data.
"},{"location":"api_reference/context.html#koheesio.context.Context.from_toml","title":"from_toml classmethod
","text":"from_toml(toml_file_or_str: Union[str, Path]) -> Context\n
Creates Context object from a given toml file
Parameters:
Name Type Description Default toml_file_or_str
Union[str, Path]
Pathlike string or Path that points to the toml file or string containing toml
required Returns:
Type Description Context
Source code in src/koheesio/context.py
@classmethod\ndef from_toml(cls, toml_file_or_str: Union[str, Path]) -> Context:\n \"\"\"Creates Context object from a given toml file\n\n Parameters\n ----------\n toml_file_or_str: Union[str, Path]\n Pathlike string or Path that points to the toml file or string containing toml\n\n Returns\n -------\n Context\n \"\"\"\n toml_str = toml_file_or_str\n\n # check if toml_str is pathlike\n if (toml_file := Path(toml_file_or_str)).exists():\n toml_str = toml_file.read_text(encoding=\"utf-8\")\n\n toml_dict = tomli.loads(toml_str)\n return cls.from_dict(toml_dict)\n
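A minimal sketch with a hypothetical inline TOML string; a path to a .toml file is handled the same way:
context = Context.from_toml(\"\"\"\n[db]\nhost = \"localhost\"\nport = 5432\n\"\"\")\ncontext.get(\"db.port\") # -> 5432\n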
"},{"location":"api_reference/context.html#koheesio.context.Context.from_yaml","title":"from_yaml classmethod
","text":"from_yaml(yaml_file_or_str: str) -> Context\n
Creates Context object from a given yaml file
Parameters:
Name Type Description Default yaml_file_or_str
str
Pathlike string or Path that points to the yaml file, or string containing yaml
required Returns:
Type Description Context
Source code in src/koheesio/context.py
@classmethod\ndef from_yaml(cls, yaml_file_or_str: str) -> Context:\n \"\"\"Creates Context object from a given yaml file\n\n Parameters\n ----------\n yaml_file_or_str: str or Path\n Pathlike string or Path that points to the yaml file, or string containing yaml\n\n Returns\n -------\n Context\n \"\"\"\n yaml_str = yaml_file_or_str\n\n # check if yaml_str is pathlike\n if (yaml_file := Path(yaml_file_or_str)).exists():\n yaml_str = yaml_file.read_text(encoding=\"utf-8\")\n\n # Bandit: disable yaml.load warning\n yaml_dict = yaml.load(yaml_str, Loader=yaml.Loader) # nosec B506: yaml_load\n\n return cls.from_dict(yaml_dict)\n
"},{"location":"api_reference/context.html#koheesio.context.Context.get","title":"get","text":"get(key: str, default: Any = None, safe: bool = True) -> Any\n
Get value of a given key
The key can either be an actual key (top level) or the key of a nested value. Behaves a lot like a dict's .get()
method otherwise.
Parameters:
Name Type Description Default key
str
Can be a real key, or can be a dotted notation of a nested key
required default
Any
Default value to return
None
safe
bool
Toggles whether to fail or not when item cannot be found
True
Returns:
Type Description Any
Value of the requested item
Example Example of a nested call:
context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\ncontext.get(\"a.b\")\n
Returns c
Source code in src/koheesio/context.py
def get(self, key: str, default: Any = None, safe: bool = True) -> Any:\n \"\"\"Get value of a given key\n\n The key can either be an actual key (top level) or the key of a nested value.\n Behaves a lot like a dict's `.get()` method otherwise.\n\n Parameters\n ----------\n key:\n Can be a real key, or can be a dotted notation of a nested key\n default:\n Default value to return\n safe:\n Toggles whether to fail or not when item cannot be found\n\n Returns\n -------\n Any\n Value of the requested item\n\n Example\n -------\n Example of a nested call:\n\n ```python\n context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\n context.get(\"a.b\")\n ```\n\n Returns `c`\n \"\"\"\n try:\n if \".\" not in key:\n return self.__dict__[key]\n\n # handle nested keys\n nested_keys = key.split(\".\")\n value = self # parent object\n for k in nested_keys:\n value = value[k] # iterate through nested values\n return value\n\n except (AttributeError, KeyError, TypeError) as e:\n if not safe:\n raise KeyError(f\"requested key '{key}' does not exist in {self}\") from e\n return default\n
"},{"location":"api_reference/context.html#koheesio.context.Context.get_all","title":"get_all","text":"get_all() -> dict\n
alias to to_dict()
Source code in src/koheesio/context.py
def get_all(self) -> dict:\n \"\"\"alias to to_dict()\"\"\"\n return self.to_dict()\n
"},{"location":"api_reference/context.html#koheesio.context.Context.get_item","title":"get_item","text":"get_item(key: str, default: Any = None, safe: bool = True) -> Dict[str, Any]\n
Acts just like .get
, except that it returns the key also
Returns:
Type Description Dict[str, Any]
key/value-pair of the requested item
Example Example of a nested call:
context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\ncontext.get_item(\"a.b\")\n
Returns {'a.b': 'c'}
Source code in src/koheesio/context.py
def get_item(self, key: str, default: Any = None, safe: bool = True) -> Dict[str, Any]:\n \"\"\"Acts just like `.get`, except that it returns the key also\n\n Returns\n -------\n Dict[str, Any]\n key/value-pair of the requested item\n\n Example\n -------\n Example of a nested call:\n\n ```python\n context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\n context.get_item(\"a.b\")\n ```\n\n Returns `{'a.b': 'c'}`\n \"\"\"\n value = self.get(key, default, safe)\n return {key: value}\n
"},{"location":"api_reference/context.html#koheesio.context.Context.merge","title":"merge","text":"merge(context: Context, recursive: bool = False) -> Context\n
Merge this context with the context of another, where the incoming context has priority.
Parameters:
Name Type Description Default context
Context
Another Context class
required recursive
bool
Recursively merge two dictionaries to an arbitrary depth
False
Returns:
Type Description Context
updated context
Source code in src/koheesio/context.py
def merge(self, context: Context, recursive: bool = False) -> Context:\n \"\"\"Merge this context with the context of another, where the incoming context has priority.\n\n Parameters\n ----------\n context: Context\n Another Context class\n recursive: bool\n Recursively merge two dictionaries to an arbitrary depth\n\n Returns\n -------\n Context\n updated context\n \"\"\"\n if recursive:\n return Context.from_dict(self._recursive_merge(target_context=self, merge_context=context).to_dict())\n\n # just merge on the top level keys\n return Context.from_dict({**self.to_dict(), **context.to_dict()})\n
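A short sketch with illustrative keys, contrasting the default top-level merge with recursive=True (incoming values are expected to win on conflicts, while unrelated nested keys are kept):
from koheesio.context import Context\n\nbase = Context({\"db\": {\"host\": \"localhost\", \"port\": 5432}})\noverride = Context({\"db\": {\"port\": 5433}})\n\n# top-level merge: the incoming 'db' mapping replaces the existing one entirely\nbase.merge(override).get(\"db.host\", default=\"missing\")  # -> 'missing'\n\n# recursive merge: nested keys are combined, so 'host' survives and 'port' is overridden\nbase.merge(override, recursive=True).get(\"db.host\")  # -> 'localhost'\n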
"},{"location":"api_reference/context.html#koheesio.context.Context.process_value","title":"process_value","text":"process_value(value: Any) -> Any\n
Processes the given value, converting dictionaries to Context objects as needed.
Source code in src/koheesio/context.py
def process_value(self, value: Any) -> Any:\n \"\"\"Processes the given value, converting dictionaries to Context objects as needed.\"\"\"\n if isinstance(value, dict):\n return self.from_dict(value)\n\n if isinstance(value, (list, set)):\n return [self.from_dict(v) if isinstance(v, dict) else v for v in value]\n\n return value\n
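A tiny illustration of the conversion behaviour (values are arbitrary):
from koheesio.context import Context\n\ncontext = Context({\"f\": \"g\"})\ncontext.process_value({\"a\": 1})       # returned as a Context instance\ncontext.process_value([{\"a\": 1}, 2])  # dicts inside lists/sets are converted too\ncontext.process_value(\"unchanged\")    # any other value passes through untouched\n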
"},{"location":"api_reference/context.html#koheesio.context.Context.to_dict","title":"to_dict","text":"to_dict() -> Dict[str, Any]\n
Returns all parameters of the context as a dict
Returns:
Type Description dict
containing all parameters of the context
Source code in src/koheesio/context.py
def to_dict(self) -> Dict[str, Any]:\n \"\"\"Returns all parameters of the context as a dict\n\n Returns\n -------\n dict\n containing all parameters of the context\n \"\"\"\n result = {}\n\n for key, value in self.__dict__.items():\n if isinstance(value, Context):\n result[key] = value.to_dict()\n elif isinstance(value, list):\n result[key] = [e.to_dict() if isinstance(e, Context) else e for e in value]\n else:\n result[key] = value\n\n return result\n
"},{"location":"api_reference/context.html#koheesio.context.Context.to_json","title":"to_json","text":"to_json(pretty: bool = False) -> str\n
Returns all parameters of the context as a json string
Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be stored in the Context object, which is not possible with the standard json library.
Why jsonpickle? (from https://jsonpickle.github.io/)
Data serialized with python's pickle (or cPickle or dill) is not easily readable outside of python. Using the json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports json.
Parameters:
Name Type Description Default pretty
bool
Toggles whether to return a pretty json string or not
False
Returns:
Type Description str
containing all parameters of the context
Source code in src/koheesio/context.py
def to_json(self, pretty: bool = False) -> str:\n \"\"\"Returns all parameters of the context as a json string\n\n Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be\n stored in the Context object, which is not possible with the standard json library.\n\n Why jsonpickle?\n ---------------\n (from https://jsonpickle.github.io/)\n\n > Data serialized with python's pickle (or cPickle or dill) is not easily readable outside of python. Using the\n json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex\n data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports\n json.\n\n Parameters\n ----------\n pretty : bool, optional, default=False\n Toggles whether to return a pretty json string or not\n\n Returns\n -------\n str\n containing all parameters of the context\n \"\"\"\n d = self.to_dict()\n return jsonpickle.dumps(d, indent=4) if pretty else jsonpickle.dumps(d)\n
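As a small illustration (made-up keys), jsonpickle lets values that plain json cannot handle, such as a datetime, survive serialization:
from datetime import datetime\n\nfrom koheesio.context import Context\n\ncontext = Context({\"name\": \"pipeline_a\", \"started_at\": datetime(2024, 1, 1)})\nprint(context.to_json(pretty=True))  # the datetime is encoded with jsonpickle's object markers\n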
"},{"location":"api_reference/context.html#koheesio.context.Context.to_yaml","title":"to_yaml","text":"to_yaml(clean: bool = False) -> str\n
Returns all parameters of the context as a yaml string
Parameters:
Name Type Description Default clean
bool
Toggles whether to remove !!python/object:...
from yaml or not. Default: False
False
Returns:
Type Description str
containing all parameters of the context
Source code in src/koheesio/context.py
def to_yaml(self, clean: bool = False) -> str:\n \"\"\"Returns all parameters of the context as a yaml string\n\n Parameters\n ----------\n clean: bool\n Toggles whether to remove `!!python/object:...` from yaml or not.\n Default: False\n\n Returns\n -------\n str\n containing all parameters of the context\n \"\"\"\n # sort_keys=False to preserve order of keys\n yaml_str = yaml.dump(self.to_dict(), sort_keys=False)\n\n # remove `!!python/object:...` from yaml\n if clean:\n remove_pattern = re.compile(r\"!!python/object:.*?\\n\")\n yaml_str = re.sub(remove_pattern, \"\\n\", yaml_str)\n\n return yaml_str\n
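A quick sketch of the clean flag (illustrative values); clean=True strips any !!python/object:... tags that yaml.dump may have emitted:
from koheesio.context import Context\n\ncontext = Context({\"a\": {\"b\": \"c\"}, \"f\": \"g\"})\nprint(context.to_yaml())            # key order is preserved (sort_keys=False)\nprint(context.to_yaml(clean=True))  # same output with any python object tags removed\n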
"},{"location":"api_reference/intro_api.html","title":"Intro api","text":""},{"location":"api_reference/intro_api.html#api-reference","title":"API Reference","text":"You can navigate the API by clicking on the modules listed on the left to access the documentation.
"},{"location":"api_reference/logger.html","title":"Logger","text":"Loggers are used to log messages from your application.
For a comprehensive guide on the usage, examples, and additional features of the logging classes, please refer to the reference/concepts/logging section of the Koheesio documentation.
Classes:
Name Description LoggingFactory
Logging factory to be used to generate logger instances.
Masked
Represents a masked value.
MaskedString
Represents a masked string value.
MaskedInt
Represents a masked integer value.
MaskedFloat
Represents a masked float value.
MaskedDict
Represents a masked dictionary value.
LoggerIDFilter
Filter which injects run_id information into the log.
Functions:
Name Description warn
Issue a warning.
"},{"location":"api_reference/logger.html#koheesio.logger.T","title":"koheesio.logger.T module-attribute
","text":"T = TypeVar('T')\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggerIDFilter","title":"koheesio.logger.LoggerIDFilter","text":"Filter which injects run_id information into the log.
"},{"location":"api_reference/logger.html#koheesio.logger.LoggerIDFilter.LOGGER_ID","title":"LOGGER_ID class-attribute
instance-attribute
","text":"LOGGER_ID: str = str(uuid4())\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggerIDFilter.filter","title":"filter","text":"filter(record)\n
Source code in src/koheesio/logger.py
def filter(self, record):\n record.logger_id = LoggerIDFilter.LOGGER_ID\n\n return True\n
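A minimal sketch (handler and format are illustrative) of attaching the filter so that %(logger_id)s becomes available to a formatter:
import logging\n\nfrom koheesio.logger import LoggerIDFilter\n\nhandler = logging.StreamHandler()\nhandler.addFilter(LoggerIDFilter())  # injects record.logger_id on every record\nhandler.setFormatter(logging.Formatter(\"[%(logger_id)s] %(levelname)s %(message)s\"))\n\nlogger = logging.getLogger(\"example\")\nlogger.addHandler(handler)\nlogger.warning(\"hello\")  # the record now carries the shared LOGGER_ID\n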
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory","title":"koheesio.logger.LoggingFactory","text":"LoggingFactory(name: Optional[str] = None, env: Optional[str] = None, level: Optional[str] = None, logger_id: Optional[str] = None)\n
Logging factory to be used to generate logger instances.
Parameters:
Name Type Description Default name
Optional[str]
None
env
Optional[str]
None
level
Optional[str]
None
logger_id
Optional[str]
None
Source code in src/koheesio/logger.py
def __init__(\n    self,\n    name: Optional[str] = None,\n    env: Optional[str] = None,\n    level: Optional[str] = None,\n    logger_id: Optional[str] = None,\n):\n    \"\"\"Logging factory to be used in pipeline. Prepare logger instance.\n\n    Parameters\n    ----------\n    name logger name.\n    env environment (\"local\", \"qa\", \"prod\").\n    level logging level.\n    logger_id unique identifier for the logger.\n    \"\"\"\n\n    LoggingFactory.LOGGER_NAME = name or LoggingFactory.LOGGER_NAME\n    LoggerIDFilter.LOGGER_ID = logger_id or LoggerIDFilter.LOGGER_ID\n    LoggingFactory.LOGGER_FILTER = LoggingFactory.LOGGER_FILTER or LoggerIDFilter()\n    LoggingFactory.ENV = env or LoggingFactory.ENV\n\n    console_handler = logging.StreamHandler(sys.stdout if LoggingFactory.ENV == \"local\" else sys.stderr)\n    console_handler.setFormatter(LoggingFactory.LOGGER_FORMATTER)\n    console_handler.addFilter(LoggingFactory.LOGGER_FILTER)\n    # WARNING is default level for root logger in python\n    logging.basicConfig(level=logging.WARNING, handlers=[console_handler], force=True)\n\n    LoggingFactory.CONSOLE_HANDLER = console_handler\n\n    logger = getLogger(LoggingFactory.LOGGER_NAME)\n    logger.setLevel(level or LoggingFactory.LOGGER_LEVEL)\n    LoggingFactory.LOGGER = logger\n
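A usage sketch, assuming the factory is constructed once per application (the name and level below are illustrative):
from koheesio.logger import LoggingFactory\n\nfactory = LoggingFactory(name=\"my_pipeline\", env=\"local\", level=\"INFO\")\nlogger = LoggingFactory.get_logger(name=\"ingest\", inherit_from_koheesio=True)\nlogger.info(\"reading source data\")  # emitted through the configured console handler\n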
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.CONSOLE_HANDLER","title":"CONSOLE_HANDLER class-attribute
instance-attribute
","text":"CONSOLE_HANDLER: Optional[Handler] = None\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.ENV","title":"ENV class-attribute
instance-attribute
","text":"ENV: Optional[str] = None\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER","title":"LOGGER class-attribute
instance-attribute
","text":"LOGGER: Optional[Logger] = None\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_ENV","title":"LOGGER_ENV class-attribute
instance-attribute
","text":"LOGGER_ENV: str = 'local'\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_FILTER","title":"LOGGER_FILTER class-attribute
instance-attribute
","text":"LOGGER_FILTER: Optional[Filter] = None\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_FORMAT","title":"LOGGER_FORMAT class-attribute
instance-attribute
","text":"LOGGER_FORMAT: str = '[%(logger_id)s] [%(asctime)s] [%(levelname)s] [%(name)s] {%(module)s.py:%(funcName)s:%(lineno)d} - %(message)s'\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_FORMATTER","title":"LOGGER_FORMATTER class-attribute
instance-attribute
","text":"LOGGER_FORMATTER: Formatter = Formatter(LOGGER_FORMAT)\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_LEVEL","title":"LOGGER_LEVEL class-attribute
instance-attribute
","text":"LOGGER_LEVEL: str = get('KOHEESIO_LOGGING_LEVEL', 'WARNING')\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_NAME","title":"LOGGER_NAME class-attribute
instance-attribute
","text":"LOGGER_NAME: str = 'koheesio'\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.add_handlers","title":"add_handlers staticmethod
","text":"add_handlers(handlers: List[Tuple[str, Dict]]) -> None\n
Add handlers to existing root logger.
Parameters:
Name Type Description Default handlers
List[Tuple[str, Dict]]
List of tuples pairing a handler's module and class path with its configuration dict.
required Source code in src/koheesio/logger.py
@staticmethod\ndef add_handlers(handlers: List[Tuple[str, Dict]]) -> None:\n    \"\"\"Add handlers to existing root logger.\n\n    Parameters\n    ----------\n    handlers list of tuples with the handler module and class for importing, and the configuration for that handler.\n\n    \"\"\"\n    for handler_module_class, handler_conf in handlers:\n        handler_class: logging.Handler = import_class(handler_module_class)\n        handler_level = handler_conf.pop(\"level\") if \"level\" in handler_conf else \"WARNING\"\n        # noinspection PyCallingNonCallable\n        handler = handler_class(**handler_conf)\n        handler.setLevel(handler_level)\n        handler.addFilter(LoggingFactory.LOGGER_FILTER)\n        handler.setFormatter(LoggingFactory.LOGGER_FORMATTER)\n        LoggingFactory.LOGGER.addHandler(handler)\n
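A sketch of the expected configuration shape (file name and level are illustrative): each entry pairs an importable handler class path with its constructor kwargs, plus an optional 'level' key:
from koheesio.logger import LoggingFactory\n\nLoggingFactory(name=\"my_pipeline\", env=\"local\")  # construct the factory first so LOGGER and LOGGER_FILTER are initialized\nLoggingFactory.add_handlers(\n    [\n        (\"logging.FileHandler\", {\"filename\": \"pipeline.log\", \"level\": \"INFO\"}),\n    ]\n)\n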
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.get_logger","title":"get_logger staticmethod
","text":"get_logger(name: str, inherit_from_koheesio: bool = False) -> Logger\n
Provide logger. If inherit_from_koheesio then inherit from LoggingFactory.LOGGER_NAME.
Parameters:
Name Type Description Default name
str
required inherit_from_koheesio
bool
False
Returns:
Name Type Description logger
Logger
Source code in src/koheesio/logger.py
@staticmethod\ndef get_logger(name: str, inherit_from_koheesio: bool = False) -> Logger:\n    \"\"\"Provide logger. If inherit_from_koheesio then inherit from LoggingFactory.LOGGER_NAME.\n\n    Parameters\n    ----------\n    name: Name of logger.\n    inherit_from_koheesio: Inherit logger from koheesio\n\n    Returns\n    -------\n    logger: Logger\n\n    \"\"\"\n    if inherit_from_koheesio:\n        LoggingFactory.__check_koheesio_logger_initialized()\n        name = f\"{LoggingFactory.LOGGER_NAME}.{name}\"\n\n    return getLogger(name)\n
"},{"location":"api_reference/logger.html#koheesio.logger.Masked","title":"koheesio.logger.Masked","text":"Masked(value: T)\n
Represents a masked value.
Parameters:
Name Type Description Default value
T
The value to be masked.
required Attributes:
Name Type Description _value
T
The original value.
Methods:
Name Description __repr__
Returns a string representation of the masked value.
__str__
Returns a string representation of the masked value.
__get_validators__
Returns a generator of validators for the masked value.
validate
Validates the masked value.
Source code in src/koheesio/logger.py
def __init__(self, value: T):\n self._value = value\n
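A rough sketch of the intent (the exact masked representation is not spelled out here): wrap a secret before it ends up in a log line so the raw value is not printed:
from koheesio.logger import MaskedString\n\ntoken = MaskedString(\"super-secret-token\")\nprint(f\"using token: {token}\")  # __str__ returns a masked representation instead of the raw secret\n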
"},{"location":"api_reference/logger.html#koheesio.logger.Masked.validate","title":"validate classmethod
","text":"validate(v: Any, _values)\n
Validate the input value and return an instance of the class.
Parameters:
Name Type Description Default v
Any
The input value to validate.
required _values
Any
Additional values used for validation.
required Returns:
Name Type Description instance
cls
An instance of the class.
Source code in src/koheesio/logger.py
@classmethod\ndef validate(cls, v: Any, _values):\n \"\"\"\n Validate the input value and return an instance of the class.\n\n Parameters\n ----------\n v : Any\n The input value to validate.\n _values : Any\n Additional values used for validation.\n\n Returns\n -------\n instance : cls\n An instance of the class.\n\n \"\"\"\n return cls(v)\n
"},{"location":"api_reference/logger.html#koheesio.logger.MaskedDict","title":"koheesio.logger.MaskedDict","text":"MaskedDict(value: T)\n
Represents a masked dictionary value.
Source code in src/koheesio/logger.py
def __init__(self, value: T):\n self._value = value\n
"},{"location":"api_reference/logger.html#koheesio.logger.MaskedFloat","title":"koheesio.logger.MaskedFloat","text":"MaskedFloat(value: T)\n
Represents a masked float value.
Source code in src/koheesio/logger.py
def __init__(self, value: T):\n self._value = value\n
"},{"location":"api_reference/logger.html#koheesio.logger.MaskedInt","title":"koheesio.logger.MaskedInt","text":"MaskedInt(value: T)\n
Represents a masked integer value.
Source code in src/koheesio/logger.py
def __init__(self, value: T):\n self._value = value\n
"},{"location":"api_reference/logger.html#koheesio.logger.MaskedString","title":"koheesio.logger.MaskedString","text":"MaskedString(value: T)\n
Represents a masked string value.
Source code in src/koheesio/logger.py
def __init__(self, value: T):\n self._value = value\n
"},{"location":"api_reference/utils.html","title":"Utils","text":"Utility functions
"},{"location":"api_reference/utils.html#koheesio.utils.convert_str_to_bool","title":"koheesio.utils.convert_str_to_bool","text":"convert_str_to_bool(value) -> Any\n
Converts a string to a boolean if the string is either 'true' or 'false'
Source code in src/koheesio/utils.py
def convert_str_to_bool(value) -> Any:\n \"\"\"Converts a string to a boolean if the string is either 'true' or 'false'\"\"\"\n if isinstance(value, str) and (v := value.lower()) in [\"true\", \"false\"]:\n value = v == \"true\"\n return value\n
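For illustration: only the literal strings 'true' and 'false' (case-insensitive) are converted, everything else is returned unchanged:
from koheesio.utils import convert_str_to_bool\n\nconvert_str_to_bool(\"True\")   # -> True\nconvert_str_to_bool(\"false\")  # -> False\nconvert_str_to_bool(\"yes\")    # -> 'yes' (unchanged)\nconvert_str_to_bool(1)        # -> 1 (unchanged)\n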
"},{"location":"api_reference/utils.html#koheesio.utils.get_args_for_func","title":"koheesio.utils.get_args_for_func","text":"get_args_for_func(func: Callable, params: Dict) -> Tuple[Callable, Dict[str, Any]]\n
Helper function that matches keyword arguments (params) on a given function
This function uses inspect to extract the signature on the passed Callable, and then uses functools.partial to construct a new Callable (partial) function on which the input was mapped.
Example input_dict = {\"a\": \"foo\", \"b\": \"bar\"}\n\n\ndef example_func(a: str):\n return a\n\n\nfunc, kwargs = get_args_for_func(example_func, input_dict)\n
In this example, - func
would be a callable with the input mapped toward it (i.e. can be called like any normal function) - kwargs
would be a dict holding just the output needed to be able to run the function (e.g. {\"a\": \"foo\"})
Parameters:
Name Type Description Default func
Callable
The function to inspect
required params
Dict
Dictionary with keyword values that will be mapped on the 'func'
required Returns:
Type Description Tuple[Callable, Dict[str, Any]]
- Callable a partial() func with the found keyword values mapped toward it
- Dict[str, Any] the keyword args that match the func
Source code in src/koheesio/utils.py
def get_args_for_func(func: Callable, params: Dict) -> Tuple[Callable, Dict[str, Any]]:\n \"\"\"Helper function that matches keyword arguments (params) on a given function\n\n This function uses inspect to extract the signature on the passed Callable, and then uses functools.partial to\n construct a new Callable (partial) function on which the input was mapped.\n\n Example\n -------\n ```python\n input_dict = {\"a\": \"foo\", \"b\": \"bar\"}\n\n\n def example_func(a: str):\n return a\n\n\n func, kwargs = get_args_for_func(example_func, input_dict)\n ```\n\n In this example,\n - `func` would be a callable with the input mapped toward it (i.e. can be called like any normal function)\n - `kwargs` would be a dict holding just the output needed to be able to run the function (e.g. {\"a\": \"foo\"})\n\n Parameters\n ----------\n func: Callable\n The function to inspect\n params: Dict\n Dictionary with keyword values that will be mapped on the 'func'\n\n Returns\n -------\n Tuple[Callable, Dict[str, Any]]\n - Callable\n a partial() func with the found keyword values mapped toward it\n - Dict[str, Any]\n the keyword args that match the func\n \"\"\"\n _kwargs = {k: v for k, v in params.items() if k in inspect.getfullargspec(func).args}\n return (\n partial(func, **_kwargs),\n _kwargs,\n )\n
"},{"location":"api_reference/utils.html#koheesio.utils.get_project_root","title":"koheesio.utils.get_project_root","text":"get_project_root() -> Path\n
Returns project root path.
Source code in src/koheesio/utils.py
def get_project_root() -> Path:\n \"\"\"Returns project root path.\"\"\"\n cmd = Path(__file__)\n return Path([i for i in cmd.parents if i.as_uri().endswith(\"src\")][0]).parent\n
"},{"location":"api_reference/utils.html#koheesio.utils.get_random_string","title":"koheesio.utils.get_random_string","text":"get_random_string(length: int = 64, prefix: Optional[str] = None) -> str\n
Generate a random string of specified length
Source code in src/koheesio/utils.py
def get_random_string(length: int = 64, prefix: Optional[str] = None) -> str:\n \"\"\"Generate a random string of specified length\"\"\"\n if prefix:\n return f\"{prefix}_{uuid.uuid4().hex}\"[0:length]\n return f\"{uuid.uuid4().hex}\"[0:length]\n
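A small illustration (outputs shown are examples only); the prefix is joined with an underscore and the result is truncated to length characters:
from koheesio.utils import get_random_string\n\nget_random_string(length=12)                # e.g. '1f3a9c0d2b4e'\nget_random_string(length=12, prefix=\"tmp\")  # e.g. 'tmp_1f3a9c0d'\n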
"},{"location":"api_reference/utils.html#koheesio.utils.import_class","title":"koheesio.utils.import_class","text":"import_class(module_class: str) -> Any\n
Import class and module based on provided string.
Parameters:
Name Type Description Default module_class
str
required Returns:
Type Description object Class from specified input string.
Source code in src/koheesio/utils.py
def import_class(module_class: str) -> Any:\n \"\"\"Import class and module based on provided string.\n\n Parameters\n ----------\n module_class module+class to be imported.\n\n Returns\n -------\n object Class from specified input string.\n\n \"\"\"\n module_path, class_name = module_class.rsplit(\".\", 1)\n module = import_module(module_path)\n\n return getattr(module, class_name)\n
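A quick sketch using a standard-library class as the target:
from koheesio.utils import import_class\n\nhandler_cls = import_class(\"logging.StreamHandler\")\nhandler = handler_cls()  # equivalent to logging.StreamHandler()\n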
"},{"location":"api_reference/asyncio/index.html","title":"Asyncio","text":"This module provides classes for asynchronous steps in the koheesio package.
"},{"location":"api_reference/asyncio/index.html#koheesio.asyncio.AsyncStep","title":"koheesio.asyncio.AsyncStep","text":"Asynchronous step class that inherits from Step and uses the AsyncStepMetaClass metaclass.
Attributes:
Name Type Description Output
AsyncStepOutput
The output class for the asynchronous step.
"},{"location":"api_reference/asyncio/index.html#koheesio.asyncio.AsyncStep.Output","title":"Output","text":"Output class for asyncio step.
This class represents the output of the asyncio step. It inherits from the AsyncStepOutput class.
"},{"location":"api_reference/asyncio/index.html#koheesio.asyncio.AsyncStepMetaClass","title":"koheesio.asyncio.AsyncStepMetaClass","text":"Metaclass for asynchronous steps.
This metaclass is used to define asynchronous steps in the Koheesio framework. It inherits from the StepMetaClass and provides additional functionality for executing asynchronous steps.
Attributes: None
Methods: _execute_wrapper: Wrapper method for executing asynchronous steps.
"},{"location":"api_reference/asyncio/index.html#koheesio.asyncio.AsyncStepOutput","title":"koheesio.asyncio.AsyncStepOutput","text":"Represents the output of an asynchronous step.
This class extends the base Step.Output
class and provides additional functionality for merging key-value maps.
Attributes:
Name Type Description ...
Methods:
Name Description merge
Merge key-value map with self.
"},{"location":"api_reference/asyncio/index.html#koheesio.asyncio.AsyncStepOutput.merge","title":"merge","text":"merge(other: Union[Dict, StepOutput])\n
Merge key,value map with self
Examples:
step_output = StepOutput(foo=\"bar\")\nstep_output.merge({\"lorem\": \"ipsum\"}) # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n
Functionally similar to adding two dicts together; like running {**dict_a, **dict_b}
.
Parameters:
Name Type Description Default other
Union[Dict, StepOutput]
Dict or another instance of a StepOutputs class that will be added to self
required Source code in src/koheesio/asyncio/__init__.py
def merge(self, other: Union[Dict, StepOutput]):\n \"\"\"Merge key,value map with self\n\n Examples\n --------\n ```python\n step_output = StepOutput(foo=\"bar\")\n step_output.merge({\"lorem\": \"ipsum\"}) # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n ```\n\n Functionally similar to adding two dicts together; like running `{**dict_a, **dict_b}`.\n\n Parameters\n ----------\n other: Union[Dict, StepOutput]\n Dict or another instance of a StepOutputs class that will be added to self\n \"\"\"\n if isinstance(other, StepOutput):\n other = other.model_dump() # ensures we really have a dict\n\n if not iscoroutine(other):\n for k, v in other.items():\n self.set(k, v)\n\n return self\n
"},{"location":"api_reference/asyncio/http.html","title":"Http","text":"This module contains async implementation of HTTP step.
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpGetStep","title":"koheesio.asyncio.http.AsyncHttpGetStep","text":"Represents an asynchronous HTTP GET step.
This class inherits from the AsyncHttpStep class and specifies the HTTP method as GET.
Attributes: method (HttpMethod): The HTTP method for the step, set to HttpMethod.GET.
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpGetStep.method","title":"method class-attribute
instance-attribute
","text":"method: HttpMethod = GET\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep","title":"koheesio.asyncio.http.AsyncHttpStep","text":"Asynchronous HTTP step for making HTTP requests using aiohttp.
Parameters:
Name Type Description Default client_session
Optional[ClientSession]
Aiohttp ClientSession.
required url
List[URL]
List of yarl.URL.
required retry_options
Optional[RetryOptionsBase]
Retry options for the request.
required connector
Optional[BaseConnector]
Connector for the aiohttp request.
required headers
Optional[Dict[str, Union[str, SecretStr]]]
Request headers.
required Output responses_urls : Optional[List[Tuple[Dict[str, Any], yarl.URL]]] List of responses from the API and request URL.
Examples:
>>> import asyncio\n>>> from aiohttp import ClientSession\n>>> from aiohttp.connector import TCPConnector\n>>> from aiohttp_retry import ExponentialRetry\n>>> from koheesio.asyncio.http import AsyncHttpStep\n>>> from yarl import URL\n>>> from typing import Dict, Any, Union, List, Tuple\n>>>\n>>> # Initialize the AsyncHttpStep\n>>> async def main():\n>>>     session = ClientSession()\n>>>     urls = [URL('https://example.com/api/1'), URL('https://example.com/api/2')]\n>>>     retry_options = ExponentialRetry()\n>>>     connector = TCPConnector(limit=10)\n>>>     headers = {'Content-Type': 'application/json'}\n>>>     step = AsyncHttpStep(\n>>>         client_session=session,\n>>>         url=urls,\n>>>         retry_options=retry_options,\n>>>         connector=connector,\n>>>         headers=headers\n>>>     )\n>>>\n>>>     # Execute the step\n>>>     responses_urls = await step.get()\n>>>\n>>>     return responses_urls\n>>>\n>>> # Run the main function\n>>> responses_urls = asyncio.run(main())\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.client_session","title":"client_session class-attribute
instance-attribute
","text":"client_session: Optional[ClientSession] = Field(default=None, description='Aiohttp ClientSession', exclude=True)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.connector","title":"connector class-attribute
instance-attribute
","text":"connector: Optional[BaseConnector] = Field(default=None, description='Connector for the aiohttp request', exclude=True)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.headers","title":"headers class-attribute
instance-attribute
","text":"headers: Dict[str, Union[str, SecretStr]] = Field(default_factory=dict, description='Request headers', alias='header', exclude=True)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.method","title":"method class-attribute
instance-attribute
","text":"method: Union[str, HttpMethod] = Field(default=GET, description=\"What type of Http call to perform. One of 'get', 'post', 'put', 'delete'. Defaults to 'get'.\")\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.retry_options","title":"retry_options class-attribute
instance-attribute
","text":"retry_options: Optional[RetryOptionsBase] = Field(default=None, description='Retry options for the request', exclude=True)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.timeout","title":"timeout class-attribute
instance-attribute
","text":"timeout: None = Field(default=None, description='[Optional] Request timeout')\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.url","title":"url class-attribute
instance-attribute
","text":"url: List[URL] = Field(default=None, alias='urls', description='Expecting list, as there is no value in executing async request for one value.\\n yarl.URL is preferable, because params/data can be injected into URL instance', exclude=True)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.Output","title":"Output","text":"Output class for Step
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.Output.responses_urls","title":"responses_urls class-attribute
instance-attribute
","text":"responses_urls: Optional[List[Tuple[Dict[str, Any], URL]]] = Field(default=None, description='List of responses from the API and request URL', repr=False)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.delete","title":"delete async
","text":"delete() -> List[Tuple[Dict[str, Any], URL]]\n
Make DELETE requests.
Returns:
Type Description List[Tuple[Dict[str, Any], URL]]
A list of response data and corresponding request URLs.
Source code in src/koheesio/asyncio/http.py
async def delete(self) -> List[Tuple[Dict[str, Any], yarl.URL]]:\n \"\"\"\n Make DELETE requests.\n\n Returns\n -------\n List[Tuple[Dict[str, Any], yarl.URL]]\n A list of response data and corresponding request URLs.\n \"\"\"\n tasks = self.__tasks_generator(method=HttpMethod.DELETE)\n responses_urls = await self._execute(tasks=tasks)\n\n return responses_urls\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.execute","title":"execute","text":"execute() -> Output\n
Execute the step.
Raises:
Type Description ValueError
If the specified HTTP method is not implemented in AsyncHttpStep.
Source code in src/koheesio/asyncio/http.py
def execute(self) -> AsyncHttpStep.Output:\n \"\"\"\n Execute the step.\n\n Raises\n ------\n ValueError\n If the specified HTTP method is not implemented in AsyncHttpStep.\n \"\"\"\n # By design asyncio does not allow its event loop to be nested. This presents a practical problem:\n # When in an environment where the event loop is already running\n # it\u2019s impossible to run tasks and wait for the result.\n # Trying to do so will give the error \u201cRuntimeError: This event loop is already running\u201d.\n # The issue pops up in various environments, such as web servers, GUI applications and in\n # Jupyter/DataBricks notebooks.\n nest_asyncio.apply()\n\n map_method_func = {\n HttpMethod.GET: self.get,\n HttpMethod.POST: self.post,\n HttpMethod.PUT: self.put,\n HttpMethod.DELETE: self.delete,\n }\n\n if self.method not in map_method_func:\n raise ValueError(f\"Method {self.method} not implemented in AsyncHttpStep.\")\n\n self.output.responses_urls = asyncio.run(map_method_func[self.method]())\n\n return self.output\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.get","title":"get async
","text":"get() -> List[Tuple[Dict[str, Any], URL]]\n
Make GET requests.
Returns:
Type Description List[Tuple[Dict[str, Any], URL]]
A list of response data and corresponding request URLs.
Source code in src/koheesio/asyncio/http.py
async def get(self) -> List[Tuple[Dict[str, Any], yarl.URL]]:\n \"\"\"\n Make GET requests.\n\n Returns\n -------\n List[Tuple[Dict[str, Any], yarl.URL]]\n A list of response data and corresponding request URLs.\n \"\"\"\n tasks = self.__tasks_generator(method=HttpMethod.GET)\n responses_urls = await self._execute(tasks=tasks)\n\n return responses_urls\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.get_headers","title":"get_headers","text":"get_headers()\n
Get the request headers.
Returns:
Type Description Optional[Dict[str, Union[str, SecretStr]]]
The request headers.
Source code in src/koheesio/asyncio/http.py
def get_headers(self):\n \"\"\"\n Get the request headers.\n\n Returns\n -------\n Optional[Dict[str, Union[str, SecretStr]]]\n The request headers.\n \"\"\"\n _headers = None\n\n if self.headers:\n _headers = {k: v.get_secret_value() if isinstance(v, SecretStr) else v for k, v in self.headers.items()}\n\n for k, v in self.headers.items():\n if isinstance(v, SecretStr):\n self.headers[k] = v.get_secret_value()\n\n return _headers or self.headers\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.get_options","title":"get_options","text":"get_options()\n
Get the options of the step.
Source code in src/koheesio/asyncio/http.py
def get_options(self):\n \"\"\"\n Get the options of the step.\n \"\"\"\n warnings.warn(\"get_options is not implemented in AsyncHttpStep.\")\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.post","title":"post async
","text":"post() -> List[Tuple[Dict[str, Any], URL]]\n
Make POST requests.
Returns:
Type Description List[Tuple[Dict[str, Any], URL]]
A list of response data and corresponding request URLs.
Source code in src/koheesio/asyncio/http.py
async def post(self) -> List[Tuple[Dict[str, Any], yarl.URL]]:\n \"\"\"\n Make POST requests.\n\n Returns\n -------\n List[Tuple[Dict[str, Any], yarl.URL]]\n A list of response data and corresponding request URLs.\n \"\"\"\n tasks = self.__tasks_generator(method=HttpMethod.POST)\n responses_urls = await self._execute(tasks=tasks)\n\n return responses_urls\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.put","title":"put async
","text":"put() -> List[Tuple[Dict[str, Any], URL]]\n
Make PUT requests.
Returns:
Type Description List[Tuple[Dict[str, Any], URL]]
A list of response data and corresponding request URLs.
Source code in src/koheesio/asyncio/http.py
async def put(self) -> List[Tuple[Dict[str, Any], yarl.URL]]:\n \"\"\"\n Make PUT requests.\n\n Returns\n -------\n List[Tuple[Dict[str, Any], yarl.URL]]\n A list of response data and corresponding request URLs.\n \"\"\"\n tasks = self.__tasks_generator(method=HttpMethod.PUT)\n responses_urls = await self._execute(tasks=tasks)\n\n return responses_urls\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.request","title":"request async
","text":"request(method: HttpMethod, url: URL, **kwargs) -> Tuple[Dict[str, Any], URL]\n
Make an HTTP request.
Parameters:
Name Type Description Default method
HttpMethod
The HTTP method to use for the request.
required url
URL
The URL to make the request to.
required kwargs
Any
Additional keyword arguments to pass to the request.
{}
Returns:
Type Description Tuple[Dict[str, Any], URL]
A tuple containing the response data and the request URL.
Source code in src/koheesio/asyncio/http.py
async def request(\n self,\n method: HttpMethod,\n url: yarl.URL,\n **kwargs,\n) -> Tuple[Dict[str, Any], yarl.URL]:\n \"\"\"\n Make an HTTP request.\n\n Parameters\n ----------\n method : HttpMethod\n The HTTP method to use for the request.\n url : yarl.URL\n The URL to make the request to.\n kwargs : Any\n Additional keyword arguments to pass to the request.\n\n Returns\n -------\n Tuple[Dict[str, Any], yarl.URL]\n A tuple containing the response data and the request URL.\n \"\"\"\n async with self.__retry_client.request(method=method, url=url, **kwargs) as response:\n res = await response.json()\n\n return (res, response.request_info.url)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.set_outputs","title":"set_outputs","text":"set_outputs(response)\n
Set the outputs of the step.
Parameters:
Name Type Description Default response
Any
The response data.
required Source code in src/koheesio/asyncio/http.py
def set_outputs(self, response):\n \"\"\"\n Set the outputs of the step.\n\n Parameters\n ----------\n response : Any\n The response data.\n \"\"\"\n warnings.warn(\"set outputs is not implemented in AsyncHttpStep.\")\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.validate_timeout","title":"validate_timeout","text":"validate_timeout(timeout)\n
Validate the 'timeout' field.
Parameters:
Name Type Description Default timeout
Any
The value of the 'timeout' field.
required Raises:
Type Description ValueError
If a 'timeout' value is provided; timeouts are not allowed in AsyncHttpStep and should be set through retry_options.
Source code in src/koheesio/asyncio/http.py
@field_validator(\"timeout\")\ndef validate_timeout(cls, timeout):\n    \"\"\"\n    Validate the 'timeout' field.\n\n    Parameters\n    ----------\n    timeout : Any\n        The value of the 'timeout' field.\n\n    Raises\n    ------\n    ValueError\n        If a 'timeout' value is provided; timeouts are not allowed in AsyncHttpStep and should be set through retry_options.\n    \"\"\"\n    if timeout:\n        raise ValueError(\"timeout is not allowed in AsyncHttpStep. Provide timeout through retry_options.\")\n
"},{"location":"api_reference/integrations/index.html","title":"Integrations","text":"Nothing to see here, move along.
"},{"location":"api_reference/integrations/box.html","title":"Box","text":"Box Module
The module is used to facilitate various interactions with Box service. The implementation is based on the functionalities available in Box Python SDK: https://github.com/box/box-python-sdk
Prerequisites - Box Application is created in the developer portal using the JWT auth method (Developer Portal - My Apps - Create)
- Application is authorized for the enterprise (Developer Portal - MyApp - Authorization)
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box","title":"koheesio.integrations.box.Box","text":"Box(**data)\n
Configuration details required for the authentication can be obtained in the Box Developer Portal by generating the Public / Private key pair in \"Application Name -> Configuration -> Add and Manage Public Keys\".
The downloaded JSON file will look like this:
{\n \"boxAppSettings\": {\n \"clientID\": \"client_id\",\n \"clientSecret\": \"client_secret\",\n \"appAuth\": {\n \"publicKeyID\": \"public_key_id\",\n \"privateKey\": \"private_key\",\n \"passphrase\": \"pass_phrase\"\n }\n },\n \"enterpriseID\": \"123456\"\n}\n
This class is used as a base for the rest of the Box integrations; however, it can also be used on its own to obtain the Box client, which is created at class initialization. Examples:
b = Box(\n client_id=\"client_id\",\n client_secret=\"client_secret\",\n enterprise_id=\"enterprise_id\",\n jwt_key_id=\"jwt_key_id\",\n rsa_private_key_data=\"rsa_private_key_data\",\n rsa_private_key_passphrase=\"rsa_private_key_passphrase\",\n)\nb.client\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.auth_options","title":"auth_options property
","text":"auth_options\n
Get a dictionary of authentication options, that can be handily used in the child classes
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.client","title":"client class-attribute
instance-attribute
","text":"client: SkipValidation[Client] = None\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.client_id","title":"client_id class-attribute
instance-attribute
","text":"client_id: Union[SecretStr, SecretBytes] = Field(default=..., alias='clientID', description='Client ID from the Box Developer console.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.client_secret","title":"client_secret class-attribute
instance-attribute
","text":"client_secret: Union[SecretStr, SecretBytes] = Field(default=..., alias='clientSecret', description='Client Secret from the Box Developer console.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.enterprise_id","title":"enterprise_id class-attribute
instance-attribute
","text":"enterprise_id: Union[SecretStr, SecretBytes] = Field(default=..., alias='enterpriseID', description='Enterprise ID from the Box Developer console.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.jwt_key_id","title":"jwt_key_id class-attribute
instance-attribute
","text":"jwt_key_id: Union[SecretStr, SecretBytes] = Field(default=..., alias='publicKeyID', description='PublicKeyID for the public/private generated key pair.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.rsa_private_key_data","title":"rsa_private_key_data class-attribute
instance-attribute
","text":"rsa_private_key_data: Union[SecretStr, SecretBytes] = Field(default=..., alias='privateKey', description='Private key generated in the app management console.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.rsa_private_key_passphrase","title":"rsa_private_key_passphrase class-attribute
instance-attribute
","text":"rsa_private_key_passphrase: Union[SecretStr, SecretBytes] = Field(default=..., alias='passphrase', description='Private key passphrase generated in the app management console.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/integrations/box.py
def execute(self):\n # Plug to be able to unit test ABC\n pass\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.init_client","title":"init_client","text":"init_client()\n
Set up the Box client.
Source code in src/koheesio/integrations/box.py
def init_client(self):\n \"\"\"Set up the Box client.\"\"\"\n if not self.client:\n self.client = Client(JWTAuth(**self.auth_options))\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvFileReader","title":"koheesio.integrations.box.BoxCsvFileReader","text":"BoxCsvFileReader(**data)\n
Class facilitates reading one or multiple CSV files with the same structure directly from Box and producing Spark Dataframe.
Notes To manually identify the ID of the file in Box, open the file through Web UI, and copy ID from the page URL, e.g. https://foo.ent.box.com/file/1234567890 , where 1234567890 is the ID.
Examples:
from koheesio.steps.integrations.box import BoxCsvFileReader\nfrom pyspark.sql.types import StructType\n\nschema = StructType(...)\nb = BoxCsvFileReader(\n client_id=\"\",\n client_secret=\"\",\n enterprise_id=\"\",\n jwt_key_id=\"\",\n rsa_private_key_data=\"\",\n rsa_private_key_passphrase=\"\",\n file=[\"1\", \"2\"],\n schema=schema,\n).execute()\nb.df.show()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvFileReader.file","title":"file class-attribute
instance-attribute
","text":"file: Union[str, list[str]] = Field(default=..., description='ID or list of IDs for the files to read.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvFileReader.execute","title":"execute","text":"execute()\n
Loop through the list of provided file identifiers and load data into dataframe. For traceability purposes the following columns will be added to the dataframe: * meta_file_id: the identifier of the file on Box * meta_file_name: name of the file
Returns:
Type Description DataFrame
Source code in src/koheesio/integrations/box.py
def execute(self):\n \"\"\"\n Loop through the list of provided file identifiers and load data into dataframe.\n For traceability purposes the following columns will be added to the dataframe:\n * meta_file_id: the identifier of the file on Box\n * meta_file_name: name of the file\n\n Returns\n -------\n DataFrame\n \"\"\"\n df = None\n for f in self.file:\n self.log.debug(f\"Reading contents of file with the ID '{f}' into Spark DataFrame\")\n file = self.client.file(file_id=f)\n data = file.content().decode(\"utf-8\").splitlines()\n rdd = self.spark.sparkContext.parallelize(data)\n temp_df = self.spark.read.csv(rdd, header=True, schema=self.schema_, **self.params)\n temp_df = (\n temp_df\n # fmt: off\n .withColumn(\"meta_file_id\", lit(file.object_id))\n .withColumn(\"meta_file_name\", lit(file.get().name))\n .withColumn(\"meta_load_timestamp\", expr(\"to_utc_timestamp(current_timestamp(), current_timezone())\"))\n # fmt: on\n )\n\n df = temp_df if not df else df.union(temp_df)\n\n self.output.df = df\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvPathReader","title":"koheesio.integrations.box.BoxCsvPathReader","text":"BoxCsvPathReader(**data)\n
Read all CSV files from the specified path into the dataframe. Files can be filtered using the regular expression in the 'filter' parameter. The default behavior is to read all CSV / TXT files from the specified path.
Notes The class does not contain archival capability as it is presumed that the user wants to make sure that the full pipeline is successful (for example, the source data was transformed and saved) prior to moving the source files. Use BoxToBoxFileMove class instead and provide the list of IDs from 'file_id' output.
Examples:
from koheesio.steps.integrations.box import BoxCsvPathReader\n\nauth_params = {...}\nb = BoxCsvPathReader(**auth_params, path=\"foo/bar/\").execute()\nb.df # Spark Dataframe\n... # do something with the dataframe\nfrom koheesio.steps.integrations.box import BoxToBoxFileMove\n\nbm = BoxToBoxFileMove(**auth_params, file=b.file_id, path=\"/foo/bar/archive\")\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvPathReader.filter","title":"filter class-attribute
instance-attribute
","text":"filter: Optional[str] = Field(default='.csv|.txt$', description='[Optional] Regexp to filter folder contents')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvPathReader.path","title":"path class-attribute
instance-attribute
","text":"path: str = Field(default=..., description='Box path')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvPathReader.execute","title":"execute","text":"execute()\n
Identify the list of files from the source Box path that match the desired filter and load them into a Dataframe
Source code in src/koheesio/integrations/box.py
def execute(self):\n    \"\"\"\n    Identify the list of files from the source Box path that match the desired filter and load them into a Dataframe\n    \"\"\"\n    folder = BoxFolderGet.from_step(self).execute().folder\n\n    # Identify the list of files that should be processed\n    files = [item for item in folder.get_items() if item.type == \"file\" and re.search(self.filter, item.name)]\n\n    if len(files) > 0:\n        self.log.info(\n            f\"A total of {len(files)} files that match the filter '{self.filter}' have been detected in {self.path}.\"\n            f\" They will be loaded into Spark Dataframe: {files}\"\n        )\n    else:\n        raise BoxPathIsEmptyError(f\"Path '{self.path}' is empty or none of the files match the filter '{self.filter}'\")\n\n    file = [file_id.object_id for file_id in files]\n    self.output.df = BoxCsvFileReader.from_step(self, file=file).read()\n    self.output.file = file  # e.g. if files should be archived after pipeline is successful\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase","title":"koheesio.integrations.box.BoxFileBase","text":"BoxFileBase(**data)\n
Generic class to facilitate interactions with Box folders.
Box SDK provides a File class with various properties and methods to interact with Box files. The object can be obtained in multiple ways: * provide a Box file identifier to the file
parameter (the identifier can be obtained, for example, from the URL) * provide an existing object to the file
parameter (boxsdk.object.file.File)
Notes Refer to BoxFolderBase for more info about folder
and path
parameters
See Also boxsdk.object.file.File
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase.files","title":"files class-attribute
instance-attribute
","text":"files: conlist(Union[File, str], min_length=1) = Field(default=..., alias='file', description='List of Box file objects or identifiers')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase.folder","title":"folder class-attribute
instance-attribute
","text":"folder: Optional[Union[Folder, str]] = Field(default=None, description='Existing folder object or folder identifier')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase.path","title":"path class-attribute
instance-attribute
","text":"path: Optional[str] = Field(default=None, description='Path to the Box folder, for example: `folder/sub-folder/lz')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase.action","title":"action","text":"action(file: File, folder: Folder)\n
Abstract class for File level actions.
Source code in src/koheesio/integrations/box.py
def action(self, file: File, folder: Folder):\n \"\"\"\n Abstract class for File level actions.\n \"\"\"\n raise NotImplementedError\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase.execute","title":"execute","text":"execute()\n
Generic execute method for all BoxToBox interactions. Deals with getting the correct folder and file objects from various parameter inputs
Source code in src/koheesio/integrations/box.py
def execute(self):\n \"\"\"\n Generic execute method for all BoxToBox interactions. Deals with getting the correct folder and file objects\n from various parameter inputs\n \"\"\"\n if self.path:\n _folder = BoxFolderGet.from_step(self).execute().folder\n else:\n _folder = self.client.folder(folder_id=self.folder) if isinstance(self.folder, str) else self.folder\n\n for _file in self.files:\n _file = self.client.file(file_id=_file) if isinstance(_file, str) else _file\n self.action(file=_file, folder=_folder)\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter","title":"koheesio.integrations.box.BoxFileWriter","text":"BoxFileWriter(**data)\n
Write file or a file-like object to Box.
Examples:
from koheesio.steps.integrations.box import BoxFileWriter\n\nauth_params = {...}\nf1 = BoxFileWriter(**auth_params, path=\"/foo/bar\", file=\"path/to/my/file.ext\").execute()\n# or\nimport io\n\nb = io.BytesIO(b\"my-sample-data\")\nf2 = BoxFileWriter(**auth_params, path=\"/foo/bar\", file=b, name=\"file.ext\").execute()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.description","title":"description class-attribute
instance-attribute
","text":"description: Optional[str] = Field(None, description='Optional description to add to the file in Box')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.file","title":"file class-attribute
instance-attribute
","text":"file: Union[str, BytesIO] = Field(default=..., description='Path to file or a file-like object')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.file_name","title":"file_name class-attribute
instance-attribute
","text":"file_name: Optional[str] = Field(default=None, description=\"When file path or name is provided to 'file' parameter, this will override the original name.When binary stream is provided, the 'name' should be used to set the desired name for the Box file.\")\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.Output","title":"Output","text":"Output class for BoxFileWriter.
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.Output.file","title":"file class-attribute
instance-attribute
","text":"file: File = Field(default=..., description='File object in Box')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.Output.shared_link","title":"shared_link class-attribute
instance-attribute
","text":"shared_link: str = Field(default=..., description='Shared link for the Box file')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.action","title":"action","text":"action()\n
Source code in src/koheesio/integrations/box.py
def action(self):\n _file = self.file\n _name = self.file_name\n\n if isinstance(_file, str):\n _name = _name if _name else PurePath(_file).name\n with open(_file, \"rb\") as f:\n _file = BytesIO(f.read())\n\n folder: Folder = BoxFolderGet.from_step(self, create_sub_folders=True).execute().folder\n folder.preflight_check(size=0, name=_name)\n\n self.log.info(f\"Uploading file '{_name}' to Box folder '{folder.get().name}'...\")\n _box_file: File = folder.upload_stream(file_stream=_file, file_name=_name, file_description=self.description)\n\n self.output.file = _box_file\n self.output.shared_link = _box_file.get_shared_link()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.execute","title":"execute","text":"execute() -> Output\n
Source code in src/koheesio/integrations/box.py
def execute(self) -> Output:\n self.action()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.validate_name_for_binary_data","title":"validate_name_for_binary_data","text":"validate_name_for_binary_data(values)\n
Validate 'file_name' parameter when providing a binary input for 'file'.
Source code in src/koheesio/integrations/box.py
@model_validator(mode=\"before\")\ndef validate_name_for_binary_data(cls, values):\n \"\"\"Validate 'file_name' parameter when providing a binary input for 'file'.\"\"\"\n file, file_name = values.get(\"file\"), values.get(\"file_name\")\n if not isinstance(file, str) and not file_name:\n raise AttributeError(\"The parameter 'file_name' is mandatory when providing a binary input for 'file'.\")\n\n return values\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase","title":"koheesio.integrations.box.BoxFolderBase","text":"BoxFolderBase(**data)\n
Generic class to facilitate interactions with Box folders.
Box SDK provides a Folder class with various properties and methods to interact with Box folders. The object can be obtained in multiple ways: * provide a Box folder identifier to the folder
parameter (the identifier can be obtained, for example, from the URL) * provide an existing object to the folder
parameter (boxsdk.object.folder.Folder) * provide a filesystem-like path to the path
parameter
See Also boxsdk.object.folder.Folder
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.folder","title":"folder class-attribute
instance-attribute
","text":"folder: Optional[Union[Folder, str]] = Field(default=None, description='Existing folder object or folder identifier')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.path","title":"path class-attribute
instance-attribute
","text":"path: Optional[str] = Field(default=None, description='Path to the Box folder, for example: `folder/sub-folder/lz')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.root","title":"root class-attribute
instance-attribute
","text":"root: Optional[Union[Folder, str]] = Field(default='0', description='Folder object or identifier of the folder that should be used as root')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.Output","title":"Output","text":"Define outputs for the BoxFolderBase class
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.Output.folder","title":"folder class-attribute
instance-attribute
","text":"folder: Optional[Folder] = Field(default=None, description='Box folder object')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.action","title":"action","text":"action()\n
Placeholder for 'action' method, that should be implemented in the child classes
Returns:
Type Description Folder or None
Source code in src/koheesio/integrations/box.py
def action(self):\n \"\"\"\n Placeholder for 'action' method, that should be implemented in the child classes\n\n Returns\n -------\n Folder or None\n \"\"\"\n raise NotImplementedError\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.execute","title":"execute","text":"execute() -> Output\n
Source code in src/koheesio/integrations/box.py
def execute(self) -> Output:\n self.output.folder = self.action()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.validate_folder_or_path","title":"validate_folder_or_path","text":"validate_folder_or_path()\n
Validations for 'folder' and 'path' parameter usage
Source code in src/koheesio/integrations/box.py
@model_validator(mode=\"after\")\ndef validate_folder_or_path(self):\n    \"\"\"\n    Validations for 'folder' and 'path' parameter usage\n    \"\"\"\n    folder_value = self.folder\n    path_value = self.path\n\n    if folder_value and path_value:\n        raise AttributeError(\"Cannot use 'folder' and 'path' parameters at the same time\")\n\n    if not folder_value and not path_value:\n        raise AttributeError(\"Neither 'folder' nor 'path' parameters are set\")\n\n    return self\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderCreate","title":"koheesio.integrations.box.BoxFolderCreate","text":"BoxFolderCreate(**data)\n
Explicitly create the new Box folder object and parent directories.
Examples:
from koheesio.steps.integrations.box import BoxFolderCreate\n\nauth_params = {...}\nfolder = BoxFolderCreate(**auth_params, path=\"/foo/bar\").execute()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderCreate.create_sub_folders","title":"create_sub_folders class-attribute
instance-attribute
","text":"create_sub_folders: bool = Field(default=True, description='Create sub-folders recursively if the path does not exist.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderCreate.validate_folder","title":"validate_folder","text":"validate_folder(folder)\n
Validate 'folder' parameter
Source code in src/koheesio/integrations/box.py
@field_validator(\"folder\")\ndef validate_folder(cls, folder):\n \"\"\"\n Validate 'folder' parameter\n \"\"\"\n if folder:\n raise AttributeError(\"Only 'path' parameter is allowed in the context of folder creation.\")\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderDelete","title":"koheesio.integrations.box.BoxFolderDelete","text":"BoxFolderDelete(**data)\n
Delete existing Box folder based on object, identifier or path.
Examples:
from koheesio.steps.integrations.box import BoxFolderDelete\n\nauth_params = {...}\nBoxFolderDelete(**auth_params, path=\"/foo/bar\").execute()\n# or\nBoxFolderDelete(**auth_params, folder=\"1\").execute()\n# or\nfolder = BoxFolderGet(**auth_params, path=\"/foo/bar\").execute().folder\nBoxFolderDelete(**auth_params, folder=folder).execute()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderDelete.action","title":"action","text":"action()\n
Delete folder action
Returns:
Type Description None
Source code in src/koheesio/integrations/box.py
def action(self):\n \"\"\"\n Delete folder action\n\n Returns\n -------\n None\n \"\"\"\n if self.folder:\n folder = self._obj_from_id\n else: # path\n folder = BoxFolderGet.from_step(self).action()\n\n self.log.info(f\"Deleting Box folder '{folder}'...\")\n folder.delete()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderGet","title":"koheesio.integrations.box.BoxFolderGet","text":"BoxFolderGet(**data)\n
Get the Box folder object for an existing folder or create a new folder and parent directories.
Examples:
from koheesio.steps.integrations.box import BoxFolderGet\n\nauth_params = {...}\nfolder = BoxFolderGet(**auth_params, path=\"/foo/bar\").execute().folder\n# or\nfolder = BoxFolderGet(**auth_params, folder=\"1\").execute().folder\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderGet.create_sub_folders","title":"create_sub_folders class-attribute
instance-attribute
","text":"create_sub_folders: Optional[bool] = Field(False, description='Create sub-folders recursively if the path does not exist.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderGet.action","title":"action","text":"action()\n
Get folder action
Returns:
Name Type Description folder
Folder
Box Folder object as specified in Box SDK
Source code in src/koheesio/integrations/box.py
def action(self):\n \"\"\"\n Get folder action\n\n Returns\n -------\n folder: Folder\n Box Folder object as specified in Box SDK\n \"\"\"\n current_folder_object = None\n\n if self.folder:\n current_folder_object = self._obj_from_id\n\n if self.path:\n cleaned_path_parts = [p for p in PurePath(self.path).parts if p.strip() not in [None, \"\", \" \", \"/\"]]\n current_folder_object = self.client.folder(folder_id=self.root) if isinstance(self.root, str) else self.root\n\n for next_folder_name in cleaned_path_parts:\n current_folder_object = self._get_or_create_folder(current_folder_object, next_folder_name)\n\n self.log.info(f\"Folder identified or created: {current_folder_object}\")\n return current_folder_object\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderNotFoundError","title":"koheesio.integrations.box.BoxFolderNotFoundError","text":"Error when a provided box path does not exist.
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxPathIsEmptyError","title":"koheesio.integrations.box.BoxPathIsEmptyError","text":"Exception when provided Box path is empty or no files matched the mask.
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase","title":"koheesio.integrations.box.BoxReaderBase","text":"BoxReaderBase(**data)\n
Base class for Box readers.
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase.params","title":"params class-attribute
instance-attribute
","text":"params: Optional[Dict[str, Any]] = Field(default_factory=dict, description='[Optional] Set of extra parameters that should be passed to the Spark reader.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase.schema_","title":"schema_ class-attribute
instance-attribute
","text":"schema_: Optional[StructType] = Field(None, alias='schema', description='[Optional] Schema that will be applied during the creation of Spark DataFrame')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase.Output","title":"Output","text":"Make default reader output optional to gracefully handle 'no-files / folder' cases.
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase.Output.df","title":"df class-attribute
instance-attribute
","text":"df: Optional[DataFrame] = Field(default=None, description='The Spark DataFrame')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase.execute","title":"execute abstractmethod
","text":"execute() -> Output\n
Source code in src/koheesio/integrations/box.py
@abstractmethod\ndef execute(self) -> Output:\n raise NotImplementedError\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxToBoxFileCopy","title":"koheesio.integrations.box.BoxToBoxFileCopy","text":"BoxToBoxFileCopy(**data)\n
Copy one or multiple files to the target Box path.
Examples:
from koheesio.steps.integrations.box import BoxToBoxFileCopy\n\nauth_params = {...}\nBoxToBoxFileCopy(**auth_params, file=[\"1\", \"2\"], path=\"/foo/bar\").execute()\n# or\nBoxToBoxFileCopy(**auth_params, file=[\"1\", \"2\"], folder=\"1\").execute()\n# or\nfolder = BoxFolderGet(**auth_params, path=\"/foo/bar\").execute().folder\nBoxToBoxFileCopy(**auth_params, file=[File(), File()], folder=folder).execute()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxToBoxFileCopy.action","title":"action","text":"action(file: File, folder: Folder)\n
Copy file to the desired destination and extend file description with the processing info
Parameters:
Name Type Description Default file
File
File object as specified in Box SDK
required folder
Folder
Folder object as specified in Box SDK
required Source code in src/koheesio/integrations/box.py
def action(self, file: File, folder: Folder):\n \"\"\"\n Copy file to the desired destination and extend file description with the processing info\n\n Parameters\n ----------\n file: File\n File object as specified in Box SDK\n folder: Folder\n Folder object as specified in Box SDK\n \"\"\"\n self.log.info(f\"Copying '{file.get()}' to '{folder.get()}'...\")\n file.copy(parent_folder=folder).update_info(\n data={\"description\": \"\\n\".join([f\"File processed on {datetime.utcnow()}\", file.get()[\"description\"]])}\n )\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxToBoxFileMove","title":"koheesio.integrations.box.BoxToBoxFileMove","text":"BoxToBoxFileMove(**data)\n
Move one or multiple files to the target Box path
Examples:
from koheesio.steps.integrations.box import BoxToBoxFileMove\n\nauth_params = {...}\nBoxToBoxFileMove(**auth_params, file=[\"1\", \"2\"], path=\"/foo/bar\").execute()\n# or\nBoxToBoxFileMove(**auth_params, file=[\"1\", \"2\"], folder=\"1\").execute()\n# or\nfolder = BoxFolderGet(**auth_params, path=\"/foo/bar\").execute().folder\nBoxToBoxFileMove(**auth_params, file=[File(), File()], folder=folder).execute()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxToBoxFileMove.action","title":"action","text":"action(file: File, folder: Folder)\n
Move file to the desired destination and extend file description with the processing info
Parameters:
Name Type Description Default file
File
File object as specified in Box SDK
required folder
Folder
Folder object as specified in Box SDK
required Source code in src/koheesio/integrations/box.py
def action(self, file: File, folder: Folder):\n \"\"\"\n Move file to the desired destination and extend file description with the processing info\n\n Parameters\n ----------\n file: File\n File object as specified in Box SDK\n folder: Folder\n Folder object as specified in Box SDK\n \"\"\"\n self.log.info(f\"Moving '{file.get()}' to '{folder.get()}'...\")\n file.move(parent_folder=folder).update_info(\n data={\"description\": \"\\n\".join([f\"File processed on {datetime.utcnow()}\", file.get()[\"description\"]])}\n )\n
"},{"location":"api_reference/integrations/spark/index.html","title":"Spark","text":""},{"location":"api_reference/integrations/spark/sftp.html","title":"Sftp","text":"This module contains the SFTPWriter class and the SFTPWriteMode enum.
The SFTPWriter class is used to write data to a file on an SFTP server. It uses the Paramiko library to establish an SFTP connection and write data to the server. The data to be written is provided by a BufferWriter, which generates the data in a buffer. See the docstring of the SFTPWriter class for more details. Refer to koheesio.spark.writers.buffer for more details on the BufferWriter interface.
The SFTPWriteMode enum defines the different write modes that the SFTPWriter can use. These modes determine how the SFTPWriter behaves when the file it is trying to write to already exists on the server. For more details on each mode, see the docstring of the SFTPWriteMode enum.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode","title":"koheesio.integrations.spark.sftp.SFTPWriteMode","text":"The different write modes for the SFTPWriter.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--overwrite","title":"OVERWRITE:","text":" - If the file exists, it will be overwritten.
- If it does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--append","title":"APPEND:","text":" - If the file exists, the new data will be appended to it.
- If it does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--ignore","title":"IGNORE:","text":" - If the file exists, the method will return without writing anything.
- If it does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--exclusive","title":"EXCLUSIVE:","text":" - If the file exists, an error will be raised.
- If it does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--backup","title":"BACKUP:","text":" - If the file exists and the new data is different from the existing data, a backup will be created and the file will be overwritten.
- If it does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--update","title":"UPDATE:","text":" - If the file exists and the new data is different from the existing data, the file will be overwritten.
- If the file exists and the new data is the same as the existing data, the method will return without writing anything.
- If the file does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.APPEND","title":"APPEND class-attribute
instance-attribute
","text":"APPEND = 'append'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.BACKUP","title":"BACKUP class-attribute
instance-attribute
","text":"BACKUP = 'backup'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.EXCLUSIVE","title":"EXCLUSIVE class-attribute
instance-attribute
","text":"EXCLUSIVE = 'exclusive'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.IGNORE","title":"IGNORE class-attribute
instance-attribute
","text":"IGNORE = 'ignore'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.OVERWRITE","title":"OVERWRITE class-attribute
instance-attribute
","text":"OVERWRITE = 'overwrite'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.UPDATE","title":"UPDATE class-attribute
instance-attribute
","text":"UPDATE = 'update'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.write_mode","title":"write_mode property
","text":"write_mode\n
Return the write mode for the given SFTPWriteMode.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.from_string","title":"from_string classmethod
","text":"from_string(mode: str)\n
Return the SFTPWriteMode for the given string.
Source code in src/koheesio/integrations/spark/sftp.py
@classmethod\ndef from_string(cls, mode: str):\n \"\"\"Return the SFTPWriteMode for the given string.\"\"\"\n return cls[mode.upper()]\n
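For example (a small sketch; the string is upper-cased and matched against the member names listed above):
from koheesio.integrations.spark.sftp import SFTPWriteMode

mode = SFTPWriteMode.from_string("append")
assert mode is SFTPWriteMode.APPEND

# write_mode exposes the file-open mode that SFTPWriter passes to Paramiko
print(mode.write_mode)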
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter","title":"koheesio.integrations.spark.sftp.SFTPWriter","text":"Write a Dataframe to SFTP through a BufferWriter
Concept - This class uses Paramiko to connect to an SFTP server and write the contents of a buffer to a file on the server.
- This implementation takes inspiration from https://github.com/springml/spark-sftp
Parameters:
Name Type Description Default path
Union[str, Path]
Path to the folder to write to
required file_name
Optional[str]
Name of the file. If not provided, the file name is expected to be part of the path. Make sure to add the desired file extension.
None
host
str
SFTP Host
required port
int
SFTP Port
required username
SecretStr
SFTP Server Username
None
password
SecretStr
SFTP Server Password
None
buffer_writer
BufferWriter
This is the writer that will generate the body of the file that will be written to the specified file through SFTP. Details on how the DataFrame is written to the buffer should be implemented in the implementation of the BufferWriter class. Any BufferWriter can be used here, as long as it implements the BufferWriter interface.
required mode
Write mode: overwrite, append, ignore, exclusive, backup, or update. See the docstring of SFTPWriteMode for more details.
required"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.buffer_writer","title":"buffer_writer class-attribute
instance-attribute
","text":"buffer_writer: InstanceOf[BufferWriter] = Field(default=..., description='This is the writer that will generate the body of the file that will be written to the specified file through SFTP. Details on how the DataFrame is written to the buffer should be implemented in the implementation of the BufferWriter class. Any BufferWriter can be used here, as long as it implements the BufferWriter interface.')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.client","title":"client property
","text":"client: SFTPClient\n
Return the SFTP client. If it doesn't exist, create it.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.file_name","title":"file_name class-attribute
instance-attribute
","text":"file_name: Optional[str] = Field(default=None, description='Name of the file. If not provided, the file name is expected to be part of the path. Make sure to add the desired file extension!', alias='filename')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.host","title":"host class-attribute
instance-attribute
","text":"host: str = Field(default=..., description='SFTP Host')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.mode","title":"mode class-attribute
instance-attribute
","text":"mode: SFTPWriteMode = Field(default=OVERWRITE, description='Write mode: overwrite, append, ignore, exclusive, backup, or update.' + __doc__)\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.password","title":"password class-attribute
instance-attribute
","text":"password: Optional[SecretStr] = Field(default=None, description='SFTP Server Password')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.path","title":"path class-attribute
instance-attribute
","text":"path: Union[str, Path] = Field(default=..., description='Path to the folder to write to', alias='prefix')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.port","title":"port class-attribute
instance-attribute
","text":"port: int = Field(default=..., description='SFTP Port')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.transport","title":"transport property
","text":"transport\n
Return the transport for the SFTP connection. If it doesn't exist, create it.
If the username and password are provided, use them to connect to the SFTP server.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.username","title":"username class-attribute
instance-attribute
","text":"username: Optional[SecretStr] = Field(default=None, description='SFTP Server Username')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.write_mode","title":"write_mode property
","text":"write_mode\n
Return the write mode for the given SFTPWriteMode.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.check_file_exists","title":"check_file_exists","text":"check_file_exists(file_path: str) -> bool\n
Check if a file exists on the SFTP server.
Source code in src/koheesio/integrations/spark/sftp.py
def check_file_exists(self, file_path: str) -> bool:\n \"\"\"\n Check if a file exists on the SFTP server.\n \"\"\"\n try:\n self.client.stat(file_path)\n return True\n except IOError:\n return False\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/integrations/spark/sftp.py
def execute(self):\n buffer_output: InstanceOf[BufferWriter.Output] = self.buffer_writer.write(self.df)\n\n # write buffer to the SFTP server\n try:\n self._handle_write_mode(self.path.as_posix(), buffer_output)\n finally:\n self._close_client()\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.validate_path_and_file_name","title":"validate_path_and_file_name","text":"validate_path_and_file_name(data: dict) -> dict\n
Validate the path, make sure path and file_name are Path objects.
Source code in src/koheesio/integrations/spark/sftp.py
@model_validator(mode=\"before\")\ndef validate_path_and_file_name(cls, data: dict) -> dict:\n \"\"\"Validate the path, make sure path and file_name are Path objects.\"\"\"\n path_or_str = data.get(\"path\")\n\n if isinstance(path_or_str, str):\n # make sure the path is a Path object\n path_or_str = Path(path_or_str)\n\n if not isinstance(path_or_str, Path):\n raise ValueError(f\"Invalid path: {path_or_str}\")\n\n if file_name := data.get(\"file_name\", data.get(\"filename\")):\n path_or_str = path_or_str / file_name\n try:\n del data[\"filename\"]\n except KeyError:\n pass\n data[\"file_name\"] = file_name\n\n data[\"path\"] = path_or_str\n return data\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.validate_sftp_host","title":"validate_sftp_host","text":"validate_sftp_host(v) -> str\n
Validate the host
Source code in src/koheesio/integrations/spark/sftp.py
@field_validator(\"host\")\ndef validate_sftp_host(cls, v) -> str:\n \"\"\"Validate the host\"\"\"\n # remove the sftp:// prefix if present\n if v.startswith(\"sftp://\"):\n v = v.replace(\"sftp://\", \"\")\n\n # remove the trailing slash if present\n if v.endswith(\"/\"):\n v = v[:-1]\n\n return v\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.write_file","title":"write_file","text":"write_file(file_path: str, buffer_output: InstanceOf[Output])\n
Using Paramiko, write the data in the buffer to SFTP.
Source code in src/koheesio/integrations/spark/sftp.py
def write_file(self, file_path: str, buffer_output: InstanceOf[BufferWriter.Output]):\n \"\"\"\n Using Paramiko, write the data in the buffer to SFTP.\n \"\"\"\n with self.client.open(file_path, self.write_mode) as file:\n self.log.debug(f\"Writing file {file_path} to SFTP...\")\n file.write(buffer_output.read())\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendCsvToSftp","title":"koheesio.integrations.spark.sftp.SendCsvToSftp","text":"Write a DataFrame to an SFTP server as a CSV file.
This class uses the PandasCsvBufferWriter to generate the CSV data and the SFTPWriter to write the data to the SFTP server.
Example from koheesio.spark.writers import SendCsvToSftp\n\nwriter = SendCsvToSftp(\n    # SFTP Parameters\n    host=\"sftp.example.com\",\n    port=22,\n    username=\"user\",\n    password=\"password\",\n    path=\"/path/to/folder\",\n    file_name=\"file.tsv.gz\",\n    # CSV Parameters\n    header=True,\n    sep=\"\\t\",\n    quote='\"',\n    timestampFormat=\"%Y-%m-%d\",\n    lineSep=os.linesep,\n    compression=\"gzip\",\n    index=False,\n)\n\nwriter.write(df)\n
In this example, the DataFrame df
is written to the file file.tsv.gz
in the folder /path/to/folder
on the SFTP server. The file is written as a CSV file with a tab delimiter (TSV), double quotes as the quote character, and gzip compression.
Parameters:
Name Type Description Default path
Union[str, Path]
Path to the folder to write to.
required file_name
Optional[str]
Name of the file. If not provided, it's expected to be part of the path.
required host
str
SFTP Host.
required port
int
SFTP Port.
required username
SecretStr
SFTP Server Username.
required password
SecretStr
SFTP Server Password.
required mode
Write mode: overwrite, append, ignore, exclusive, backup, or update.
required header
Whether to write column names as the first line. Default is True.
required sep
Field delimiter for the output file. Default is ','.
required quote
Character used to quote fields. Default is '\"'.
required quoteAll
Whether all values should be enclosed in quotes. Default is False.
required escape
Character used to escape sep and quote when needed. Default is '\\'.
required timestampFormat
Date format for datetime objects. Default is '%Y-%m-%dT%H:%M:%S.%f'.
required lineSep
Character used as line separator. Default is os.linesep.
required compression
Compression to use for the output data. Default is None.
required For more details on the CSV parameters, refer to the PandasCsvBufferWriter class documentation.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendCsvToSftp.buffer_writer","title":"buffer_writer class-attribute
instance-attribute
","text":"buffer_writer: PandasCsvBufferWriter = Field(default=None, validate_default=False)\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendCsvToSftp.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/integrations/spark/sftp.py
def execute(self):\n SFTPWriter.execute(self)\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendCsvToSftp.set_up_buffer_writer","title":"set_up_buffer_writer","text":"set_up_buffer_writer() -> SendCsvToSftp\n
Set up the buffer writer, passing all CSV related options to it.
Source code in src/koheesio/integrations/spark/sftp.py
@model_validator(mode=\"after\")\ndef set_up_buffer_writer(self) -> \"SendCsvToSftp\":\n \"\"\"Set up the buffer writer, passing all CSV related options to it.\"\"\"\n self.buffer_writer = PandasCsvBufferWriter(**self.get_options(options_type=\"kohesio_pandas_buffer_writer\"))\n return self\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendJsonToSftp","title":"koheesio.integrations.spark.sftp.SendJsonToSftp","text":"Write a DataFrame to an SFTP server as a JSON file.
This class uses the PandasJsonBufferWriter to generate the JSON data and the SFTPWriter to write the data to the SFTP server.
Example from koheesio.spark.writers import SendJsonToSftp\n\nwriter = SendJsonToSftp(\n # SFTP Parameters (Inherited from SFTPWriter)\n host=\"sftp.example.com\",\n port=22,\n username=\"user\",\n password=\"password\",\n path=\"/path/to/folder\",\n file_name=\"file.json.gz\",\n # JSON Parameters (Inherited from PandasJsonBufferWriter)\n orient=\"records\",\n date_format=\"iso\",\n double_precision=2,\n date_unit=\"ms\",\n lines=False,\n compression=\"gzip\",\n index=False,\n)\n\nwriter.write(df)\n
In this example, the DataFrame df
is written to the file file.json.gz
in the folder /path/to/folder
on the SFTP server. The file is written as a JSON file with gzip compression.
Parameters:
Name Type Description Default path
Union[str, Path]
Path to the folder on the SFTP server.
required file_name
Optional[str]
Name of the file, including extension. If not provided, expected to be part of the path.
required host
str
SFTP Host.
required port
int
SFTP Port.
required username
SecretStr
SFTP Server Username.
required password
SecretStr
SFTP Server Password.
required mode
Write mode: overwrite, append, ignore, exclusive, backup, or update.
required orient
Format of the JSON string. Default is 'records'.
required lines
If True, output is one JSON object per line. Only used when orient='records'. Default is True.
required date_format
Type of date conversion. Default is 'iso'.
required double_precision
Decimal places for encoding floating point values. Default is 10.
required force_ascii
If True, encoded string is ASCII. Default is True.
required compression
Compression to use for output data. Default is None.
required See Also For more details on the JSON parameters, refer to the PandasJsonBufferWriter class documentation.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendJsonToSftp.buffer_writer","title":"buffer_writer class-attribute
instance-attribute
","text":"buffer_writer: PandasJsonBufferWriter = Field(default=None, validate_default=False)\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendJsonToSftp.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/integrations/spark/sftp.py
def execute(self):\n SFTPWriter.execute(self)\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendJsonToSftp.set_up_buffer_writer","title":"set_up_buffer_writer","text":"set_up_buffer_writer() -> SendJsonToSftp\n
Set up the buffer writer, passing all JSON related options to it.
Source code in src/koheesio/integrations/spark/sftp.py
@model_validator(mode=\"after\")\ndef set_up_buffer_writer(self) -> \"SendJsonToSftp\":\n \"\"\"Set up the buffer writer, passing all JSON related options to it.\"\"\"\n self.buffer_writer = PandasJsonBufferWriter(\n **self.get_options(), compression=self.compression, columns=self.columns\n )\n return self\n
"},{"location":"api_reference/integrations/spark/dq/index.html","title":"Dq","text":""},{"location":"api_reference/integrations/spark/dq/spark_expectations.html","title":"Spark expectations","text":"Koheesio step for running data quality rules with Spark Expectations engine.
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation","title":"koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation","text":"Run DQ rules for an input dataframe with Spark Expectations engine.
References Spark Expectations: https://engineering.nike.com/spark-expectations/1.0.0/
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.drop_meta_column","title":"drop_meta_column class-attribute
instance-attribute
","text":"drop_meta_column: bool = Field(default=False, alias='drop_meta_columns', description='Whether to drop meta columns added by spark expectations on the output df')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.enable_debugger","title":"enable_debugger class-attribute
instance-attribute
","text":"enable_debugger: bool = Field(default=False, alias='debugger', description='...')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.error_writer_format","title":"error_writer_format class-attribute
instance-attribute
","text":"error_writer_format: Optional[str] = Field(default='delta', alias='dataframe_writer_format', description='The format used to write to the err table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.error_writer_mode","title":"error_writer_mode class-attribute
instance-attribute
","text":"error_writer_mode: Optional[Union[str, BatchOutputMode]] = Field(default=APPEND, alias='dataframe_writer_mode', description='The write mode that will be used to write to the err table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.error_writing_options","title":"error_writing_options class-attribute
instance-attribute
","text":"error_writing_options: Optional[Dict[str, str]] = Field(default_factory=dict, alias='error_writing_options', description='Options for writing to the error table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.format","title":"format class-attribute
instance-attribute
","text":"format: str = Field(default='delta', alias='dataframe_writer_format', description='The format used to write to the stats and err table. Separate output formats can be specified for each table using the error_writer_format and stats_writer_format params')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.mode","title":"mode class-attribute
instance-attribute
","text":"mode: Union[str, BatchOutputMode] = Field(default=APPEND, alias='dataframe_writer_mode', description='The write mode that will be used to write to the err and stats table. Separate output modes can be specified for each table using the error_writer_mode and stats_writer_mode params')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.product_id","title":"product_id class-attribute
instance-attribute
","text":"product_id: str = Field(default=..., description='Spark Expectations product identifier')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.rules_table","title":"rules_table class-attribute
instance-attribute
","text":"rules_table: str = Field(default=..., alias='product_rules_table', description='DQ rules table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.se_user_conf","title":"se_user_conf class-attribute
instance-attribute
","text":"se_user_conf: Dict[str, Any] = Field(default={se_notifications_enable_email: False, se_notifications_enable_slack: False}, alias='user_conf', description='SE user provided confs', validate_default=False)\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.statistics_streaming","title":"statistics_streaming class-attribute
instance-attribute
","text":"statistics_streaming: Dict[str, Any] = Field(default={se_enable_streaming: False}, alias='stats_streaming_options', description='SE stats streaming options ', validate_default=False)\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.statistics_table","title":"statistics_table class-attribute
instance-attribute
","text":"statistics_table: str = Field(default=..., alias='dq_stats_table_name', description='DQ stats table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.stats_writer_format","title":"stats_writer_format class-attribute
instance-attribute
","text":"stats_writer_format: Optional[str] = Field(default='delta', alias='stats_writer_format', description='The format used to write to the stats table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.stats_writer_mode","title":"stats_writer_mode class-attribute
instance-attribute
","text":"stats_writer_mode: Optional[Union[str, BatchOutputMode]] = Field(default=APPEND, alias='stats_writer_mode', description='The write mode that will be used to write to the stats table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.target_table","title":"target_table class-attribute
instance-attribute
","text":"target_table: str = Field(default=..., alias='target_table_name', description=\"The table that will contain good records. Won't write to it, but will write to the err table with same name plus _err suffix\")\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.Output","title":"Output","text":"Output of the SparkExpectationsTransformation step.
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.Output.error_table_writer","title":"error_table_writer class-attribute
instance-attribute
","text":"error_table_writer: WrappedDataFrameWriter = Field(default=..., description='Spark Expectations error table writer')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.Output.rules_df","title":"rules_df class-attribute
instance-attribute
","text":"rules_df: DataFrame = Field(default=..., description='Output dataframe')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.Output.se","title":"se class-attribute
instance-attribute
","text":"se: SparkExpectations = Field(default=..., description='Spark Expectations object')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.Output.stats_table_writer","title":"stats_table_writer class-attribute
instance-attribute
","text":"stats_table_writer: WrappedDataFrameWriter = Field(default=..., description='Spark Expectations stats table writer')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.execute","title":"execute","text":"execute() -> Output\n
Apply data quality rules to a dataframe using the out-of-the-box SE decorator
Source code in src/koheesio/integrations/spark/dq/spark_expectations.py
def execute(self) -> Output:\n \"\"\"\n Apply data quality rules to a dataframe using the out-of-the-box SE decorator\n \"\"\"\n # read rules table\n rules_df = self.spark.read.table(self.rules_table).cache()\n self.output.rules_df = rules_df\n\n @self._se.with_expectations(\n target_table=self.target_table,\n user_conf=self.se_user_conf,\n # Below params are `False` by default, however exposing them here for extra visibility\n # The writes can be handled by downstream Koheesio steps\n write_to_table=False,\n write_to_temp_table=False,\n )\n def inner(df: DataFrame) -> DataFrame:\n \"\"\"Just a wrapper to be able to use Spark Expectations decorator\"\"\"\n return df\n\n output_df = inner(self.df)\n\n if self.drop_meta_column:\n output_df = output_df.drop(\"meta_dq_run_id\", \"meta_dq_run_datetime\")\n\n self.output.df = output_df\n
"},{"location":"api_reference/models/index.html","title":"Models","text":"Models package creates models that can be used to base other classes on.
- Every model should be at least a pydantic BaseModel, but can also be a Step, or a StepOutput.
- Every model is expected to be an ABC (Abstract Base Class)
- Optionally, a model can inherit ExtraParamsMixin, which provides unpacking of extra kwargs into the
extra_params
dict property, removing the need to create a dict before passing kwargs to a model initializer (see the sketch after this list).
A Model class can be exceptionally handy when you need similar Pydantic models in multiple places, for example across Transformation and Reader classes.
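A small sketch of that ExtraParamsMixin behaviour, assuming ExtraParamsMixin is importable from koheesio.models as the package description suggests (the model and its fields are made up for illustration):
from koheesio.models import BaseModel, ExtraParamsMixin


class ReaderOptions(BaseModel, ExtraParamsMixin):
    table: str


opts = ReaderOptions(table="my_table", fetchsize=1000, timeout=30)
print(opts.extra_params)  # the extra kwargs ('fetchsize', 'timeout') are collected here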
"},{"location":"api_reference/models/index.html#koheesio.models.ListOfColumns","title":"koheesio.models.ListOfColumns module-attribute
","text":"ListOfColumns = Annotated[List[str], BeforeValidator(_list_of_columns_validation)]\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel","title":"koheesio.models.BaseModel","text":"Base model for all models.
Extends pydantic BaseModel with some additional configuration. To be used as a base class for all models in Koheesio instead of pydantic.BaseModel.
Additional methods and properties: Different Modes This BaseModel class supports lazy mode. This means that validation of the items stored in the class can be called at will instead of being forced to run it upfront.
-
Normal mode: you need to know the values ahead of time
normal_mode = YourOwnModel(a=\"foo\", b=42)\n
-
Lazy mode: being able to defer the validation until later
lazy_mode = YourOwnModel.lazy()\nlazy_mode.a = \"foo\"\nlazy_mode.b = 42\nlazy_mode.validate_output()\n
The prime advantage of using lazy mode is that you don't have to know all your outputs up front, and can add them as they become available. All while still being able to validate that you have collected all your output at the end. -
With statements: With statements are also allowed. The validate_output
method from the earlier example will run upon exit of the with-statement.
with YourOwnModel.lazy() as with_output:\n with_output.a = \"foo\"\n with_output.b = 42\n
Note: that a lazy mode BaseModel object is required to work with a with-statement.
Examples:
from koheesio.models import BaseModel\n\n\nclass Person(BaseModel):\n name: str\n age: int\n\n\n# Using the lazy method to create an instance without immediate validation\nperson = Person.lazy()\n\n# Setting attributes\nperson.name = \"John Doe\"\nperson.age = 30\n\n# Now we validate the instance\nperson.validate_output()\n\nprint(person)\n
In this example, the Person instance is created without immediate validation. The attributes name and age are set afterward. The validate_output
method is then called to validate the instance.
Koheesio specific configuration: Koheesio models are configured differently from Pydantic defaults. The following configuration is used:
-
extra=\"allow\"
This setting allows for extra fields that are not specified in the model definition. If a field is present in the data but not in the model, it will not raise an error. Pydantic default is \"ignore\", which means that extra attributes are ignored.
-
arbitrary_types_allowed=True
This setting allows for fields in the model to be of any type. This is useful when you want to include fields in your model that are not standard Python types. Pydantic default is False, which means that fields must be of a standard Python type.
-
populate_by_name=True
This setting allows an aliased field to be populated by its name as given by the model attribute, as well as the alias. This was known as allow_population_by_field_name in pydantic v1. Pydantic default is False, which means that fields can only be populated by their alias.
-
validate_assignment=False
This setting determines whether the model should be revalidated when the data is changed. If set to True
, every time a field is assigned a new value, the entire model is validated again.
Pydantic default is (also) False
, which means that the model is not revalidated when the data is changed. The default behavior of Pydantic is to validate the data when the model is created. In case the user changes the data after the model is created, the model is not revalidated.
-
revalidate_instances=\"subclass-instances\"
This setting determines whether to revalidate models during validation if the instance is a subclass of the model. This is important as inheritance is used a lot in Koheesio. Pydantic default is never
, which means that the model and dataclass instances are not revalidated during validation.
-
validate_default=True
This setting determines whether to validate default values during validation. When set to True, default values are checked during the validation process. We opt to set this to True, as we are attempting to make sure that the data is valid prior to running / executing any Step. Pydantic default is False, which means that default values are not validated during validation.
-
frozen=False
This setting determines whether the model is immutable. If set to True, once a model is created, its fields cannot be changed. Pydantic default is also False, which means that the model is mutable.
-
coerce_numbers_to_str=True
This setting determines whether to convert number fields to strings. When set to True, enables automatic coercion of any Number
type to str
. Pydantic doesn't allow number types (int
, float
, Decimal
) to be coerced as type str
by default.
-
use_enum_values=True
This setting determines whether to use the values of Enum fields. If set to True, the actual value of the Enum is used instead of the reference. Pydantic default is False, which means that the reference to the Enum is used.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--fields","title":"Fields","text":"Every Koheesio BaseModel has two fields: name
and description
. These fields are used to provide a name and a description to the model.
-
name
: This is the name of the Model. If not provided, it defaults to the class name.
-
description
: This is the description of the Model. It has several default behaviors:
- If not provided, it defaults to the docstring of the class.
- If the docstring is not provided, it defaults to the name of the class.
- For multi-line descriptions, it has the following behaviors:
- Only the first non-empty line is used.
- Empty lines are removed.
- Only the first 3 lines are considered.
- Only the first 120 characters are considered.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--validators","title":"Validators","text":" _set_name_and_description
: Set the name and description of the Model as per the rules mentioned above.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--properties","title":"Properties","text":" log
: Returns a logger with the name of the class.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--class-methods","title":"Class Methods","text":" from_basemodel
: Returns a new BaseModel instance based on the data of another BaseModel. from_context
: Creates BaseModel instance from a given Context. from_dict
: Creates BaseModel instance from a given dictionary. from_json
: Creates BaseModel instance from a given JSON string. from_toml
: Creates BaseModel object from a given toml file. from_yaml
: Creates BaseModel object from a given yaml file. lazy
: Constructs the model without doing validation.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--dunder-methods","title":"Dunder Methods","text":" __add__
: Allows to add two BaseModel instances together. __enter__
: Allows for using the model in a with-statement. __exit__
: Allows for using the model in a with-statement. __setitem__
: Set Item dunder method for BaseModel. __getitem__
: Get Item dunder method for BaseModel.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--instance-methods","title":"Instance Methods","text":" hasattr
: Check if given key is present in the model. get
: Get an attribute of the model, but don't fail if not present. merge
: Merge key,value map with self. set
: Allows for subscribing / assigning to class[key]
. to_context
: Converts the BaseModel instance to a Context object. to_dict
: Converts the BaseModel instance to a dictionary. to_json
: Converts the BaseModel instance to a JSON string. to_yaml
: Converts the BaseModel instance to a YAML string.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.description","title":"description class-attribute
instance-attribute
","text":"description: Optional[str] = Field(default=None, description='Description of the Model')\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.log","title":"log property
","text":"log: Logger\n
Returns a logger with the name of the class
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(extra='allow', arbitrary_types_allowed=True, populate_by_name=True, validate_assignment=False, revalidate_instances='subclass-instances', validate_default=True, frozen=False, coerce_numbers_to_str=True, use_enum_values=True)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.name","title":"name class-attribute
instance-attribute
","text":"name: Optional[str] = Field(default=None, description='Name of the Model')\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_basemodel","title":"from_basemodel classmethod
","text":"from_basemodel(basemodel: BaseModel, **kwargs)\n
Returns a new BaseModel instance based on the data of another BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_basemodel(cls, basemodel: BaseModel, **kwargs):\n \"\"\"Returns a new BaseModel instance based on the data of another BaseModel\"\"\"\n kwargs = {**basemodel.model_dump(), **kwargs}\n return cls(**kwargs)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_context","title":"from_context classmethod
","text":"from_context(context: Context) -> BaseModel\n
Creates BaseModel instance from a given Context
You have to make sure that the Context object has the necessary attributes to create the model.
Examples:
class SomeStep(BaseModel):\n foo: str\n\n\ncontext = Context(foo=\"bar\")\nsome_step = SomeStep.from_context(context)\nprint(some_step.foo) # prints 'bar'\n
Parameters:
Name Type Description Default context
Context
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_context(cls, context: Context) -> BaseModel:\n \"\"\"Creates BaseModel instance from a given Context\n\n You have to make sure that the Context object has the necessary attributes to create the model.\n\n Examples\n --------\n ```python\n class SomeStep(BaseModel):\n foo: str\n\n\n context = Context(foo=\"bar\")\n some_step = SomeStep.from_context(context)\n print(some_step.foo) # prints 'bar'\n ```\n\n Parameters\n ----------\n context: Context\n\n Returns\n -------\n BaseModel\n \"\"\"\n return cls(**context)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_dict","title":"from_dict classmethod
","text":"from_dict(data: Dict[str, Any]) -> BaseModel\n
Creates BaseModel instance from a given dictionary
Parameters:
Name Type Description Default data
Dict[str, Any]
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_dict(cls, data: Dict[str, Any]) -> BaseModel:\n \"\"\"Creates BaseModel instance from a given dictionary\n\n Parameters\n ----------\n data: Dict[str, Any]\n\n Returns\n -------\n BaseModel\n \"\"\"\n return cls(**data)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_json","title":"from_json classmethod
","text":"from_json(json_file_or_str: Union[str, Path]) -> BaseModel\n
Creates BaseModel instance from a given JSON string
BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored in the BaseModel object, which is not possible with the standard json library.
See Also Context.from_json : Deserializes a JSON string to a Context object
Parameters:
Name Type Description Default json_file_or_str
Union[str, Path]
Pathlike string or Path that points to the json file or string containing json
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_json(cls, json_file_or_str: Union[str, Path]) -> BaseModel:\n \"\"\"Creates BaseModel instance from a given JSON string\n\n BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses\n jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored\n in the BaseModel object, which is not possible with the standard json library.\n\n See Also\n --------\n Context.from_json : Deserializes a JSON string to a Context object\n\n Parameters\n ----------\n json_file_or_str : Union[str, Path]\n Pathlike string or Path that points to the json file or string containing json\n\n Returns\n -------\n BaseModel\n \"\"\"\n _context = Context.from_json(json_file_or_str)\n return cls.from_context(_context)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_toml","title":"from_toml classmethod
","text":"from_toml(toml_file_or_str: Union[str, Path]) -> BaseModel\n
Creates BaseModel object from a given toml file
Note: BaseModel offloads the serialization and deserialization of the TOML string to Context class.
Parameters:
Name Type Description Default toml_file_or_str
Union[str, Path]
Pathlike string or Path that points to the toml file, or string containing toml
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_toml(cls, toml_file_or_str: Union[str, Path]) -> BaseModel:\n \"\"\"Creates BaseModel object from a given toml file\n\n Note: BaseModel offloads the serialization and deserialization of the TOML string to Context class.\n\n Parameters\n ----------\n toml_file_or_str: str or Path\n Pathlike string or Path that points to the toml file, or string containing toml\n\n Returns\n -------\n BaseModel\n \"\"\"\n _context = Context.from_toml(toml_file_or_str)\n return cls.from_context(_context)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_yaml","title":"from_yaml classmethod
","text":"from_yaml(yaml_file_or_str: str) -> BaseModel\n
Creates BaseModel object from a given yaml file
Note: BaseModel offloads the serialization and deserialization of the YAML string to Context class.
Parameters:
Name Type Description Default yaml_file_or_str
str
Pathlike string or Path that points to the yaml file, or string containing yaml
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_yaml(cls, yaml_file_or_str: str) -> BaseModel:\n \"\"\"Creates BaseModel object from a given yaml file\n\n Note: BaseModel offloads the serialization and deserialization of the YAML string to Context class.\n\n Parameters\n ----------\n yaml_file_or_str: str or Path\n Pathlike string or Path that points to the yaml file, or string containing yaml\n\n Returns\n -------\n BaseModel\n \"\"\"\n _context = Context.from_yaml(yaml_file_or_str)\n return cls.from_context(_context)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.get","title":"get","text":"get(key: str, default: Optional[Any] = None)\n
Get an attribute of the model, but don't fail if not present
Similar to dict.get()
Examples:
step_output = StepOutput(foo=\"bar\")\nstep_output.get(\"foo\") # returns 'bar'\nstep_output.get(\"non_existent_key\", \"oops\") # returns 'oops'\n
Parameters:
Name Type Description Default key
str
name of the key to get
required default
Optional[Any]
Default value in case the attribute does not exist
None
Returns:
Type Description Any
The value of the attribute
Source code in src/koheesio/models/__init__.py
def get(self, key: str, default: Optional[Any] = None):\n \"\"\"Get an attribute of the model, but don't fail if not present\n\n Similar to dict.get()\n\n Examples\n --------\n ```python\n step_output = StepOutput(foo=\"bar\")\n step_output.get(\"foo\") # returns 'bar'\n step_output.get(\"non_existent_key\", \"oops\") # returns 'oops'\n ```\n\n Parameters\n ----------\n key: str\n name of the key to get\n default: Optional[Any]\n Default value in case the attribute does not exist\n\n Returns\n -------\n Any\n The value of the attribute\n \"\"\"\n if self.hasattr(key):\n return self.__getitem__(key)\n return default\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.hasattr","title":"hasattr","text":"hasattr(key: str) -> bool\n
Check if given key is present in the model
Parameters:
Name Type Description Default key
str
required Returns:
Type Description bool
Source code in src/koheesio/models/__init__.py
def hasattr(self, key: str) -> bool:\n \"\"\"Check if given key is present in the model\n\n Parameters\n ----------\n key: str\n\n Returns\n -------\n bool\n \"\"\"\n return hasattr(self, key)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.lazy","title":"lazy classmethod
","text":"lazy()\n
Constructs the model without doing validation
Essentially an alias to BaseModel.construct()
Source code in src/koheesio/models/__init__.py
@classmethod\ndef lazy(cls):\n \"\"\"Constructs the model without doing validation\n\n Essentially an alias to BaseModel.construct()\n \"\"\"\n return cls.model_construct()\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.merge","title":"merge","text":"merge(other: Union[Dict, BaseModel])\n
Merge key,value map with self
Functionally similar to adding two dicts together; like running {**dict_a, **dict_b}
.
Examples:
step_output = StepOutput(foo=\"bar\")\nstep_output.merge({\"lorem\": \"ipsum\"}) # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n
Parameters:
Name Type Description Default other
Union[Dict, BaseModel]
Dict or another instance of a BaseModel class that will be added to self
required Source code in src/koheesio/models/__init__.py
def merge(self, other: Union[Dict, BaseModel]):\n \"\"\"Merge key,value map with self\n\n Functionally similar to adding two dicts together; like running `{**dict_a, **dict_b}`.\n\n Examples\n --------\n ```python\n step_output = StepOutput(foo=\"bar\")\n step_output.merge({\"lorem\": \"ipsum\"}) # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n ```\n\n Parameters\n ----------\n other: Union[Dict, BaseModel]\n Dict or another instance of a BaseModel class that will be added to self\n \"\"\"\n if isinstance(other, BaseModel):\n other = other.model_dump() # ensures we really have a dict\n\n for k, v in other.items():\n self.set(k, v)\n\n return self\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.set","title":"set","text":"set(key: str, value: Any)\n
Allows for subscribing / assigning to class[key]
.
Examples:
step_output = StepOutput(foo=\"bar\")\nstep_output.set(foo\", \"baz\") # overwrites 'foo' to be 'baz'\n
Parameters:
Name Type Description Default key
str
The key of the attribute to assign to
required value
Any
Value that should be assigned to the given key
required Source code in src/koheesio/models/__init__.py
def set(self, key: str, value: Any):\n \"\"\"Allows for subscribing / assigning to `class[key]`.\n\n Examples\n --------\n ```python\n step_output = StepOutput(foo=\"bar\")\n step_output.set(\"foo\", \"baz\") # overwrites 'foo' to be 'baz'\n ```\n\n Parameters\n ----------\n key: str\n The key of the attribute to assign to\n value: Any\n Value that should be assigned to the given key\n \"\"\"\n self.__setitem__(key, value)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.to_context","title":"to_context","text":"to_context() -> Context\n
Converts the BaseModel instance to a Context object
Returns:
Type Description Context
Source code in src/koheesio/models/__init__.py
def to_context(self) -> Context:\n \"\"\"Converts the BaseModel instance to a Context object\n\n Returns\n -------\n Context\n \"\"\"\n return Context(**self.to_dict())\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.to_dict","title":"to_dict","text":"to_dict() -> Dict[str, Any]\n
Converts the BaseModel instance to a dictionary
Returns:
Type Description Dict[str, Any]
Source code in src/koheesio/models/__init__.py
def to_dict(self) -> Dict[str, Any]:\n \"\"\"Converts the BaseModel instance to a dictionary\n\n Returns\n -------\n Dict[str, Any]\n \"\"\"\n return self.model_dump()\n
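A minimal usage sketch (the value shown in the comment assumes foo is the only field set on the model):
step_output = StepOutput(foo=\"bar\")\nstep_output.to_dict() # returns a plain dict, e.g. {'foo': 'bar'}\n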
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.to_json","title":"to_json","text":"to_json(pretty: bool = False)\n
Converts the BaseModel instance to a JSON string
BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored in the BaseModel object, which is not possible with the standard json library.
See Also Context.to_json : Serializes a Context object to a JSON string
Parameters:
Name Type Description Default pretty
bool
Toggles whether to return a pretty json string or not
False
Returns:
Type Description str
containing all parameters of the BaseModel instance
Source code in src/koheesio/models/__init__.py
def to_json(self, pretty: bool = False):\n \"\"\"Converts the BaseModel instance to a JSON string\n\n BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses\n jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored\n in the BaseModel object, which is not possible with the standard json library.\n\n See Also\n --------\n Context.to_json : Serializes a Context object to a JSON string\n\n Parameters\n ----------\n pretty : bool, optional, default=False\n Toggles whether to return a pretty json string or not\n\n Returns\n -------\n str\n containing all parameters of the BaseModel instance\n \"\"\"\n _context = self.to_context()\n return _context.to_json(pretty=pretty)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.to_yaml","title":"to_yaml","text":"to_yaml(clean: bool = False) -> str\n
Converts the BaseModel instance to a YAML string
BaseModel offloads the serialization and deserialization of the YAML string to Context class.
Parameters:
Name Type Description Default clean
bool
Toggles whether to remove !!python/object:...
from yaml or not. Default: False
False
Returns:
Type Description str
containing all parameters of the BaseModel instance
Source code in src/koheesio/models/__init__.py
def to_yaml(self, clean: bool = False) -> str:\n \"\"\"Converts the BaseModel instance to a YAML string\n\n BaseModel offloads the serialization and deserialization of the YAML string to Context class.\n\n Parameters\n ----------\n clean: bool\n Toggles whether to remove `!!python/object:...` from yaml or not.\n Default: False\n\n Returns\n -------\n str\n containing all parameters of the BaseModel instance\n \"\"\"\n _context = self.to_context()\n return _context.to_yaml(clean=clean)\n
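A minimal usage sketch; clean=True strips the !!python/object:... tags from the generated YAML:
step_output = StepOutput(foo=\"bar\")\nprint(step_output.to_yaml(clean=True)) # YAML string without !!python/object tags\n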
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.validate","title":"validate","text":"validate() -> BaseModel\n
Validate the BaseModel instance
This method is used to validate the BaseModel instance. It is used in conjunction with the lazy method to validate the instance after all the attributes have been set.
This method is intended to be used with the lazy
method. The lazy
method is used to create an instance of the BaseModel without immediate validation. The validate
method is then used to validate the instance after.
Note: in the Pydantic BaseModel, the validate
method throws a deprecated warning. This is because Pydantic recommends using the validate_model
method instead. However, we are using the validate
method here in a different context and a slightly different way.
Examples:
class FooModel(BaseModel):\n foo: str\n lorem: str\n\n\nfoo_model = FooModel.lazy()\nfoo_model.foo = \"bar\"\nfoo_model.lorem = \"ipsum\"\nfoo_model.validate()\n
In this example, the foo_model
instance is created without immediate validation. The attributes foo and lorem are set afterward. The validate
method is then called to validate the instance. Returns:
Type Description BaseModel
The BaseModel instance
Source code in src/koheesio/models/__init__.py
def validate(self) -> BaseModel:\n \"\"\"Validate the BaseModel instance\n\n This method is used to validate the BaseModel instance. It is used in conjunction with the lazy method to\n validate the instance after all the attributes have been set.\n\n This method is intended to be used with the `lazy` method. The `lazy` method is used to create an instance of\n the BaseModel without immediate validation. The `validate` method is then used to validate the instance after.\n\n > Note: in the Pydantic BaseModel, the `validate` method throws a deprecated warning. This is because Pydantic\n recommends using the `validate_model` method instead. However, we are using the `validate` method here in a\n different context and a slightly different way.\n\n Examples\n --------\n ```python\n class FooModel(BaseModel):\n foo: str\n lorem: str\n\n\n foo_model = FooModel.lazy()\n foo_model.foo = \"bar\"\n foo_model.lorem = \"ipsum\"\n foo_model.validate()\n ```\n In this example, the `foo_model` instance is created without immediate validation. The attributes foo and lorem\n are set afterward. The `validate` method is then called to validate the instance.\n\n Returns\n -------\n BaseModel\n The BaseModel instance\n \"\"\"\n return self.model_validate(self.model_dump())\n
"},{"location":"api_reference/models/index.html#koheesio.models.ExtraParamsMixin","title":"koheesio.models.ExtraParamsMixin","text":"Mixin class that adds support for arbitrary keyword arguments to Pydantic models.
The keyword arguments are extracted from the model's values
and moved to a params
dictionary.
"},{"location":"api_reference/models/index.html#koheesio.models.ExtraParamsMixin.extra_params","title":"extra_params cached
property
","text":"extra_params: Dict[str, Any]\n
Extract params (passed as arbitrary kwargs) from values and move them to params dict
"},{"location":"api_reference/models/index.html#koheesio.models.ExtraParamsMixin.params","title":"params class-attribute
instance-attribute
","text":"params: Dict[str, Any] = Field(default_factory=dict)\n
"},{"location":"api_reference/models/sql.html","title":"Sql","text":"This module contains the base class for SQL steps.
"},{"location":"api_reference/models/sql.html#koheesio.models.sql.SqlBaseStep","title":"koheesio.models.sql.SqlBaseStep","text":"Base class for SQL steps
params
are used as placeholders for templating. These are identified with ${placeholder} in the SQL script.
Parameters:
Name Type Description Default sql_path
Path to a SQL file
required sql
SQL script to apply
required params
Placeholders (parameters) for templating. These are identified with ${placeholder}
in the SQL script.
Note: any arbitrary kwargs passed to the class will be added to params.
required"},{"location":"api_reference/models/sql.html#koheesio.models.sql.SqlBaseStep.params","title":"params class-attribute
instance-attribute
","text":"params: Dict[str, Any] = Field(default_factory=dict, description='Placeholders (parameters) for templating. These are identified with ${placeholder} in the SQL script. Note: any arbitrary kwargs passed to the class will be added to params.')\n
"},{"location":"api_reference/models/sql.html#koheesio.models.sql.SqlBaseStep.query","title":"query property
","text":"query\n
Returns the query while performing params replacement
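A sketch of the intended templating behaviour, assuming a concrete subclass; MySqlStep is hypothetical and the resolved query in the comment is the expected result, not verified output:
class MySqlStep(SqlBaseStep):\n    ... # a real subclass would also implement execute()\n\n\nstep = MySqlStep(\n    sql=\"SELECT * FROM ${table_name}\",\n    table_name=\"my_db.my_table\", # arbitrary kwargs are added to params\n)\nstep.query # expected: \"SELECT * FROM my_db.my_table\"\n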
"},{"location":"api_reference/models/sql.html#koheesio.models.sql.SqlBaseStep.sql","title":"sql class-attribute
instance-attribute
","text":"sql: Optional[str] = Field(default=None, description='SQL script to apply')\n
"},{"location":"api_reference/models/sql.html#koheesio.models.sql.SqlBaseStep.sql_path","title":"sql_path class-attribute
instance-attribute
","text":"sql_path: Optional[Union[Path, str]] = Field(default=None, description='Path to a SQL file')\n
"},{"location":"api_reference/notifications/index.html","title":"Notifications","text":"Notification module for sending messages to notification services (e.g. Slack, Email, etc.)
"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity","title":"koheesio.notifications.NotificationSeverity","text":"Enumeration of allowed message severities
"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity.ERROR","title":"ERROR class-attribute
instance-attribute
","text":"ERROR = 'error'\n
"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity.INFO","title":"INFO class-attribute
instance-attribute
","text":"INFO = 'info'\n
"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity.SUCCESS","title":"SUCCESS class-attribute
instance-attribute
","text":"SUCCESS = 'success'\n
"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity.WARN","title":"WARN class-attribute
instance-attribute
","text":"WARN = 'warn'\n
"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity.alert_icon","title":"alert_icon property
","text":"alert_icon: str\n
Return a colored circle in slack markup
"},{"location":"api_reference/notifications/slack.html","title":"Slack","text":"Classes to ease interaction with Slack
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification","title":"koheesio.notifications.slack.SlackNotification","text":"Generic Slack notification class via the Blocks
API
NOTE: channel
parameter is used only with the Slack Web API (https://api.slack.com/messaging/sending); if a webhook is used, the channel specification is not required
Example:
s = SlackNotification(\n url=\"slack-webhook-url\",\n channel=\"channel\",\n message=\"Some *markdown* compatible text\",\n)\ns.execute()\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification.channel","title":"channel class-attribute
instance-attribute
","text":"channel: Optional[str] = Field(default=None, description='Slack channel id')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification.headers","title":"headers class-attribute
instance-attribute
","text":"headers: Optional[Dict[str, Any]] = {'Content-type': 'application/json'}\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification.message","title":"message class-attribute
instance-attribute
","text":"message: str = Field(default=..., description='The message that gets posted to Slack')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification.execute","title":"execute","text":"execute()\n
Generate payload and send post request
Source code in src/koheesio/notifications/slack.py
def execute(self):\n \"\"\"\n Generate payload and send post request\n \"\"\"\n self.data = self.get_payload()\n HttpPostStep.execute(self)\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification.get_payload","title":"get_payload","text":"get_payload()\n
Generate payload with Block Kit
. More details: https://api.slack.com/block-kit
Source code in src/koheesio/notifications/slack.py
def get_payload(self):\n \"\"\"\n Generate payload with `Block Kit`.\n More details: https://api.slack.com/block-kit\n \"\"\"\n payload = {\n \"attachments\": [\n {\n \"blocks\": [\n {\n \"type\": \"section\",\n \"text\": {\n \"type\": \"mrkdwn\",\n \"text\": self.message,\n },\n }\n ],\n }\n ]\n }\n\n if self.channel:\n payload[\"channel\"] = self.channel\n\n return json.dumps(payload)\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity","title":"koheesio.notifications.slack.SlackNotificationWithSeverity","text":"Slack notification class via the Blocks
API with etra severity information and predefined extra fields
Example: from koheesio.steps.integrations.notifications import NotificationSeverity
s = SlackNotificationWithSeverity(\n url=\"slack-webhook-url\",\n channel=\"channel\",\n message=\"Some *markdown* compatible text\"\n severity=NotificationSeverity.ERROR,\n title=\"Title\",\n environment=\"dev\",\n application=\"Application\"\n)\ns.execute()\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.application","title":"application class-attribute
instance-attribute
","text":"application: str = Field(default=..., description='Pipeline or application name')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.environment","title":"environment class-attribute
instance-attribute
","text":"environment: str = Field(default=..., description='Environment description, e.g. dev / qa /prod')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(use_enum_values=False)\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.severity","title":"severity class-attribute
instance-attribute
","text":"severity: NotificationSeverity = Field(default=..., description='Severity of the message')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.timestamp","title":"timestamp class-attribute
instance-attribute
","text":"timestamp: datetime = Field(default=utcnow(), alias='execution_timestamp', description='Pipeline or application execution timestamp')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.title","title":"title class-attribute
instance-attribute
","text":"title: str = Field(default=..., description='Title of your message')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.execute","title":"execute","text":"execute()\n
Generate payload and send post request
Source code in src/koheesio/notifications/slack.py
def execute(self):\n \"\"\"\n Generate payload and send post request\n \"\"\"\n self.message = self.get_payload_message()\n self.data = self.get_payload()\n HttpPostStep.execute(self)\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.get_payload_message","title":"get_payload_message","text":"get_payload_message()\n
Generate payload message based on the predefined set of parameters
Source code in src/koheesio/notifications/slack.py
def get_payload_message(self):\n \"\"\"\n Generate payload message based on the predefined set of parameters\n \"\"\"\n return dedent(\n f\"\"\"\n {self.severity.alert_icon} *{self.severity.name}:* {self.title}\n *Environment:* {self.environment}\n *Application:* {self.application}\n *Message:* {self.message}\n *Timestamp:* {self.timestamp}\n \"\"\"\n )\n
"},{"location":"api_reference/secrets/index.html","title":"Secrets","text":"Module for secret integrations.
Contains the abstract class for various secret integrations, also known as SecretContext.
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret","title":"koheesio.secrets.Secret","text":"Abstract class for various secret integrations. All secrets are wrapped into Context class for easy access. Either existing context can be provided, or new context will be created and returned at runtime.
Secrets are wrapped into the pydantic.SecretStr.
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.context","title":"context class-attribute
instance-attribute
","text":"context: Optional[Context] = Field(Context({}), description='Existing `Context` instance can be used for secrets, otherwise new empty context will be created.')\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.parent","title":"parent class-attribute
instance-attribute
","text":"parent: Optional[str] = Field(default=..., description='Group secrets from one secure path under this friendly name', pattern='^[a-zA-Z0-9_]+$')\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.root","title":"root class-attribute
instance-attribute
","text":"root: Optional[str] = Field(default='secrets', description='All secrets will be grouped under this root.')\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.Output","title":"Output","text":"Output class for Secret.
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.Output.context","title":"context class-attribute
instance-attribute
","text":"context: Context = Field(default=..., description='Koheesio context')\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.encode_secret_values","title":"encode_secret_values classmethod
","text":"encode_secret_values(data: dict)\n
Encode secret values in the dictionary.
Ensures that all values in the dictionary are wrapped in SecretStr.
Source code in src/koheesio/secrets/__init__.py
@classmethod\ndef encode_secret_values(cls, data: dict):\n \"\"\"Encode secret values in the dictionary.\n\n Ensures that all values in the dictionary are wrapped in SecretStr.\n \"\"\"\n encoded_dict = {}\n for key, value in data.items():\n if isinstance(value, dict):\n encoded_dict[key] = cls.encode_secret_values(value)\n else:\n encoded_dict[key] = SecretStr(value)\n return encoded_dict\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.execute","title":"execute","text":"execute()\n
Main method to handle secrets protection and context creation with \"root-parent-secrets\" structure.
Source code in src/koheesio/secrets/__init__.py
def execute(self):\n \"\"\"\n Main method to handle secrets protection and context creation with \"root-parent-secrets\" structure.\n \"\"\"\n context = Context(self.encode_secret_values(data={self.root: {self.parent: self._get_secrets()}}))\n self.output.context = self.context.merge(context=context)\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.get","title":"get","text":"get() -> Context\n
Convenience method to return context with secrets.
Source code in src/koheesio/secrets/__init__.py
def get(self) -> Context:\n \"\"\"\n Convenience method to return context with secrets.\n \"\"\"\n self.execute()\n return self.output.context\n
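A minimal sketch of a custom secret backend; EnvSecret and MY_API_KEY are hypothetical, and the only assumption made is that _get_secrets() is the hook called by execute(), as shown in the source above:
import os\n\nfrom koheesio.secrets import Secret\n\n\nclass EnvSecret(Secret):\n    \"\"\"Illustrative backend that reads a single secret from an environment variable.\"\"\"\n\n    def _get_secrets(self) -> dict:\n        # hook used by Secret.execute(), see the source above\n        return {\"api_key\": os.environ.get(\"MY_API_KEY\", \"\")}\n\n\ncontext = EnvSecret(parent=\"my_service\").get()\ncontext.secrets.my_service.api_key.get_secret_value()\n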
"},{"location":"api_reference/secrets/cerberus.html","title":"Cerberus","text":"Module for retrieving secrets from Cerberus.
Secrets are stored as SecretContext and can be accessed accordingly.
See CerberusSecret for more information.
"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret","title":"koheesio.secrets.cerberus.CerberusSecret","text":"Retrieve secrets from Cerberus and wrap them into Context class for easy access. All secrets are stored under the \"secret\" root and \"parent\". \"Parent\" either derived from the secure data path by replacing \"/\" and \"-\", or manually provided by the user. Secrets are wrapped into the pydantic.SecretStr.
Example:
context = {\n \"secrets\": {\n \"parent\": {\n \"webhook\": SecretStr(\"**********\"),\n \"description\": SecretStr(\"**********\"),\n }\n }\n}\n
Values can be decoded like this:
context.secrets.parent.webhook.get_secret_value()\n
or if working with a dictionary is preferable: for key, value in context.get_all().items():\n value.get_secret_value()\n
"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret.aws_session","title":"aws_session class-attribute
instance-attribute
","text":"aws_session: Optional[Session] = Field(default=None, description='AWS Session to pass to Cerberus client, can be used for local execution.')\n
"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret.path","title":"path class-attribute
instance-attribute
","text":"path: str = Field(default=..., description=\"Secure data path, eg. 'app/my-sdb/my-secrets'\")\n
"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret.token","title":"token class-attribute
instance-attribute
","text":"token: Optional[SecretStr] = Field(default=get('CERBERUS_TOKEN', None), description='Cerberus token, can be used for local development without AWS auth mechanism.Note: Token has priority over AWS session.')\n
"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret.url","title":"url class-attribute
instance-attribute
","text":"url: str = Field(default=..., description='Cerberus URL, eg. https://cerberus.domain.com')\n
"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret.verbose","title":"verbose class-attribute
instance-attribute
","text":"verbose: bool = Field(default=False, description='Enable verbose for Cerberus client')\n
"},{"location":"api_reference/spark/index.html","title":"Spark","text":"Spark step module
"},{"location":"api_reference/spark/index.html#koheesio.spark.AnalysisException","title":"koheesio.spark.AnalysisException module-attribute
","text":"AnalysisException = AnalysisException\n
"},{"location":"api_reference/spark/index.html#koheesio.spark.DataFrame","title":"koheesio.spark.DataFrame module-attribute
","text":"DataFrame = DataFrame\n
"},{"location":"api_reference/spark/index.html#koheesio.spark.SparkSession","title":"koheesio.spark.SparkSession module-attribute
","text":"SparkSession = SparkSession\n
"},{"location":"api_reference/spark/index.html#koheesio.spark.SparkStep","title":"koheesio.spark.SparkStep","text":"Base class for a Spark step
Extends the Step class with SparkSession support. The following: - Spark steps are expected to return a Spark DataFrame as output. - spark property is available to access the active SparkSession instance.
"},{"location":"api_reference/spark/index.html#koheesio.spark.SparkStep.spark","title":"spark property
","text":"spark: Optional[SparkSession]\n
Get active SparkSession instance
"},{"location":"api_reference/spark/index.html#koheesio.spark.SparkStep.Output","title":"Output","text":"Output class for SparkStep
"},{"location":"api_reference/spark/index.html#koheesio.spark.SparkStep.Output.df","title":"df class-attribute
instance-attribute
","text":"df: Optional[DataFrame] = Field(default=None, description='The Spark DataFrame')\n
"},{"location":"api_reference/spark/index.html#koheesio.spark.current_timestamp_utc","title":"koheesio.spark.current_timestamp_utc","text":"current_timestamp_utc(spark: SparkSession) -> Column\n
Get the current timestamp in UTC
Source code in src/koheesio/spark/__init__.py
def current_timestamp_utc(spark: SparkSession) -> Column:\n \"\"\"Get the current timestamp in UTC\"\"\"\n return F.to_utc_timestamp(F.current_timestamp(), spark.conf.get(\"spark.sql.session.timeZone\"))\n
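A minimal usage sketch (spark and df are assumed to already exist):
from koheesio.spark import current_timestamp_utc\n\ndf = df.withColumn(\"created_at_utc\", current_timestamp_utc(spark))\n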
"},{"location":"api_reference/spark/delta.html","title":"Delta","text":"Module for creating and managing Delta tables.
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep","title":"koheesio.spark.delta.DeltaTableStep","text":"Class for creating and managing Delta tables.
DeltaTable aims to provide a simple interface to create and manage Delta tables. It is a wrapper around the Spark SQL API for Delta tables.
Example from koheesio.steps import DeltaTableStep\n\nDeltaTableStep(\n table=\"my_table\",\n database=\"my_database\",\n catalog=\"my_catalog\",\n create_if_not_exists=True,\n default_create_properties={\n \"delta.randomizeFilePrefixes\": \"true\",\n \"delta.checkpoint.writeStatsAsStruct\": \"true\",\n \"delta.minReaderVersion\": \"2\",\n \"delta.minWriterVersion\": \"5\",\n },\n)\n
Methods:
Name Description get_persisted_properties
Get persisted properties of table.
add_property
Alter table and set table property.
add_properties
Alter table and add properties.
execute
Nothing to execute on a Table.
max_version_ts_of_last_execution
Max version timestamp of last execution. If no timestamp is found, returns 1900-01-01 00:00:00. Note: will raise an error if column VERSION_TIMESTAMP
does not exist.
Properties - name -> str Deprecated. Use
.table_name
instead. - table_name -> str Table name.
- dataframe -> DataFrame Returns a DataFrame to be able to interact with this table.
- columns -> Optional[List[str]] Returns all column names as a list.
- has_change_type -> bool Checks if a column named
_change_type
is present in the table. - exists -> bool Check if table exists.
Parameters:
Name Type Description Default table
str
Table name.
required database
str
Database or Schema name.
None
catalog
str
Catalog name.
None
create_if_not_exists
bool
Force table creation if it doesn't exist. Note: Default properties will be applied to the table during CREATION.
False
default_create_properties
Dict[str, str]
Default table properties to be applied during CREATION if force_creation
True.
{\"delta.randomizeFilePrefixes\": \"true\", \"delta.checkpoint.writeStatsAsStruct\": \"true\", \"delta.minReaderVersion\": \"2\", \"delta.minWriterVersion\": \"5\"}
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.catalog","title":"catalog class-attribute
instance-attribute
","text":"catalog: Optional[str] = Field(default=None, description='Catalog name. Note: Can be ignored if using a SparkCatalog that does not support catalog notation (e.g. Hive)')\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.columns","title":"columns property
","text":"columns: Optional[List[str]]\n
Returns all column names as a list.
Example DeltaTableStep(...).columns\n
Would for example return ['age', 'name']
if the table has columns age
and name
."},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.create_if_not_exists","title":"create_if_not_exists class-attribute
instance-attribute
","text":"create_if_not_exists: bool = Field(default=False, alias='force_creation', description=\"Force table creation if it doesn't exist.Note: Default properties will be applied to the table during CREATION.\")\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.database","title":"database class-attribute
instance-attribute
","text":"database: Optional[str] = Field(default=None, description='Database or Schema name.')\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.dataframe","title":"dataframe property
","text":"dataframe: DataFrame\n
Returns a DataFrame to be able to interact with this table
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.default_create_properties","title":"default_create_properties class-attribute
instance-attribute
","text":"default_create_properties: Dict[str, Union[str, bool, int]] = Field(default={'delta.randomizeFilePrefixes': 'true', 'delta.checkpoint.writeStatsAsStruct': 'true', 'delta.minReaderVersion': '2', 'delta.minWriterVersion': '5'}, description='Default table properties to be applied during CREATION if `create_if_not_exists` True')\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.exists","title":"exists property
","text":"exists: bool\n
Check if table exists
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.has_change_type","title":"has_change_type property
","text":"has_change_type: bool\n
Checks if a column named _change_type
is present in the table
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.is_cdf_active","title":"is_cdf_active property
","text":"is_cdf_active: bool\n
Check if CDF property is set and activated
Returns:
Type Description bool
delta.enableChangeDataFeed property is set to 'true'
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.table","title":"table instance-attribute
","text":"table: str\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.table_name","title":"table_name property
","text":"table_name: str\n
Fully qualified table name in the form of catalog.database.table
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.add_properties","title":"add_properties","text":"add_properties(properties: Dict[str, Union[str, bool, int]], override: bool = False)\n
Alter table and add properties.
Parameters:
Name Type Description Default properties
Dict[str, Union[str, int, bool]]
Properties to be added to table.
required override
bool
Enable override of existing value for property in table.
False
Source code in src/koheesio/spark/delta.py
def add_properties(self, properties: Dict[str, Union[str, bool, int]], override: bool = False):\n \"\"\"Alter table and add properties.\n\n Parameters\n ----------\n properties : Dict[str, Union[str, int, bool]]\n Properties to be added to table.\n override : bool, optional, default=False\n Enable override of existing value for property in table.\n\n \"\"\"\n for k, v in properties.items():\n v_str = str(v) if not isinstance(v, bool) else str(v).lower()\n self.add_property(key=k, value=v_str, override=override)\n
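A short usage sketch (table, database and the property value are illustrative):
from koheesio.spark.delta import DeltaTableStep\n\ndt = DeltaTableStep(table=\"my_table\", database=\"my_database\")\ndt.add_properties({\"delta.enableChangeDataFeed\": True}, override=True)\n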
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.add_property","title":"add_property","text":"add_property(key: str, value: Union[str, int, bool], override: bool = False)\n
Alter table and set table property.
Parameters:
Name Type Description Default key
str
Property key(name).
required value
Union[str, int, bool]
Property value.
required override
bool
Enable override of existing value for property in table.
False
Source code in src/koheesio/spark/delta.py
def add_property(self, key: str, value: Union[str, int, bool], override: bool = False):\n \"\"\"Alter table and set table property.\n\n Parameters\n ----------\n key: str\n Property key(name).\n value: Union[str, int, bool]\n Property value.\n override: bool\n Enable override of existing value for property in table.\n\n \"\"\"\n persisted_properties = self.get_persisted_properties()\n v_str = str(value) if not isinstance(value, bool) else str(value).lower()\n\n def _alter_table() -> None:\n property_pair = f\"'{key}'='{v_str}'\"\n\n try:\n # noinspection SqlNoDataSourceInspection\n self.spark.sql(f\"ALTER TABLE {self.table_name} SET TBLPROPERTIES ({property_pair})\")\n self.log.debug(f\"Table `{self.table_name}` has been altered. Property `{property_pair}` added.\")\n except Py4JJavaError as e:\n msg = f\"Property `{key}` can not be applied to table `{self.table_name}`. Exception: {e}\"\n self.log.warning(msg)\n warnings.warn(msg)\n\n if self.exists:\n if key in persisted_properties and persisted_properties[key] != v_str:\n if override:\n self.log.debug(\n f\"Property `{key}` presents in `{self.table_name}` and has value `{persisted_properties[key]}`.\"\n f\"Override is enabled.The value will be changed to `{v_str}`.\"\n )\n _alter_table()\n else:\n self.log.debug(\n f\"Skipping adding property `{key}`, because it is already set \"\n f\"for table `{self.table_name}` to `{v_str}`. To override it, provide override=True\"\n )\n else:\n _alter_table()\n else:\n self.default_create_properties[key] = v_str\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.execute","title":"execute","text":"execute()\n
Nothing to execute on a Table
Source code in src/koheesio/spark/delta.py
def execute(self):\n \"\"\"Nothing to execute on a Table\"\"\"\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.get_column_type","title":"get_column_type","text":"get_column_type(column: str) -> Optional[DataType]\n
Get the type of a column in the table.
Parameters:
Name Type Description Default column
str
Column name.
required Returns:
Type Description Optional[DataType]
Column type.
Source code in src/koheesio/spark/delta.py
def get_column_type(self, column: str) -> Optional[DataType]:\n \"\"\"Get the type of a column in the table.\n\n Parameters\n ----------\n column : str\n Column name.\n\n Returns\n -------\n Optional[DataType]\n Column type.\n \"\"\"\n return self.dataframe.schema[column].dataType if self.columns and column in self.columns else None\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.get_persisted_properties","title":"get_persisted_properties","text":"get_persisted_properties() -> Dict[str, str]\n
Get persisted properties of table.
Returns:
Type Description Dict[str, str]
Persisted properties as a dictionary.
Source code in src/koheesio/spark/delta.py
def get_persisted_properties(self) -> Dict[str, str]:\n \"\"\"Get persisted properties of table.\n\n Returns\n -------\n Dict[str, str]\n Persisted properties as a dictionary.\n \"\"\"\n persisted_properties = {}\n raw_options = self.spark.sql(f\"SHOW TBLPROPERTIES {self.table_name}\").collect()\n\n for ro in raw_options:\n key, value = ro.asDict().values()\n persisted_properties[key] = value\n\n return persisted_properties\n
"},{"location":"api_reference/spark/etl_task.html","title":"Etl task","text":"ETL Task
Extract -> Transform -> Load
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask","title":"koheesio.spark.etl_task.EtlTask","text":"ETL Task
Etl stands for: Extract -> Transform -> Load
This task is a composition of a Reader (extract), a series of Transformations (transform) and a Writer (load). In other words, it reads data from a source, applies a series of transformations, and writes the result to a target.
Parameters:
Name Type Description Default name
str
Name of the task
required description
str
Description of the task
required source
Reader
Source to read from [extract]
required transformations
list[Transformation]
Series of transformations [transform]. The order of the transformations is important!
required target
Writer
Target to write to [load]
required Example from koheesio.tasks import EtlTask\n\nfrom koheesio.steps.readers import CsvReader\nfrom koheesio.steps.transformations.repartition import Repartition\nfrom koheesio.steps.writers import CsvWriter\n\netl_task = EtlTask(\n name=\"My ETL Task\",\n description=\"This is an example ETL task\",\n source=CsvReader(path=\"path/to/source.csv\"),\n transformations=[Repartition(num_partitions=2)],\n target=DummyWriter(),\n)\n\netl_task.execute()\n
This code will read from a CSV file, repartition the DataFrame to 2 partitions, and write the result to the console.
Extending the EtlTask The EtlTask is designed to be a simple and flexible way to define ETL processes. It is not designed to be a one-size-fits-all solution, but rather a starting point for building more complex ETL processes. If you need more complex functionality, you can extend the EtlTask class and override the extract
, transform
and load
methods. You can also implement your own execute
method to define the entire ETL process from scratch should you need more flexibility.
Advantages of using the EtlTask - It is a simple way to define ETL processes
- It is easy to understand and extend
- It is easy to test and debug
- It is easy to maintain and refactor
- It is easy to integrate with other tools and libraries
- It is easy to use in a production environment
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.etl_date","title":"etl_date class-attribute
instance-attribute
","text":"etl_date: datetime = Field(default=utcnow(), description=\"Date time when this object was created as iso format. Example: '2023-01-24T09:39:23.632374'\")\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.source","title":"source class-attribute
instance-attribute
","text":"source: InstanceOf[Reader] = Field(default=..., description='Source to read from [extract]')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.target","title":"target class-attribute
instance-attribute
","text":"target: InstanceOf[Writer] = Field(default=..., description='Target to write to [load]')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.transformations","title":"transformations class-attribute
instance-attribute
","text":"transformations: conlist(min_length=0, item_type=InstanceOf[Transformation]) = Field(default_factory=list, description='Series of transformations', alias='transforms')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.Output","title":"Output","text":"Output class for EtlTask
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.Output.source_df","title":"source_df class-attribute
instance-attribute
","text":"source_df: DataFrame = Field(default=..., description='The Spark DataFrame produced by .extract() method')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.Output.target_df","title":"target_df class-attribute
instance-attribute
","text":"target_df: DataFrame = Field(default=..., description='The Spark DataFrame used by .load() method')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.Output.transform_df","title":"transform_df class-attribute
instance-attribute
","text":"transform_df: DataFrame = Field(default=..., description='The Spark DataFrame produced by .transform() method')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.execute","title":"execute","text":"execute()\n
Run the ETL process
Source code in src/koheesio/spark/etl_task.py
def execute(self):\n \"\"\"Run the ETL process\"\"\"\n self.log.info(f\"Task started at {self.etl_date}\")\n\n # extract from source\n self.output.source_df = self.extract()\n\n # transform\n self.output.transform_df = self.transform(self.output.source_df)\n\n # load to target\n self.output.target_df = self.load(self.output.transform_df)\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.extract","title":"extract","text":"extract() -> DataFrame\n
Read from Source
logging is handled by the Reader.execute()-method's @do_execute decorator
Source code in src/koheesio/spark/etl_task.py
def extract(self) -> DataFrame:\n \"\"\"Read from Source\n\n logging is handled by the Reader.execute()-method's @do_execute decorator\n \"\"\"\n reader: Reader = self.source\n return reader.read()\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.load","title":"load","text":"load(df: DataFrame) -> DataFrame\n
Write to Target
logging is handled by the Writer.execute()-method's @do_execute decorator
Source code in src/koheesio/spark/etl_task.py
def load(self, df: DataFrame) -> DataFrame:\n \"\"\"Write to Target\n\n logging is handled by the Writer.execute()-method's @do_execute decorator\n \"\"\"\n writer: Writer = self.target\n writer.write(df)\n return df\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.run","title":"run","text":"run()\n
alias of execute
Source code in src/koheesio/spark/etl_task.py
def run(self):\n \"\"\"alias of execute\"\"\"\n return self.execute()\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.transform","title":"transform","text":"transform(df: DataFrame) -> DataFrame\n
Transform recursively
logging is handled by the Transformation.execute()-method's @do_execute decorator
Source code in src/koheesio/spark/etl_task.py
def transform(self, df: DataFrame) -> DataFrame:\n \"\"\"Transform recursively\n\n logging is handled by the Transformation.execute()-method's @do_execute decorator\n \"\"\"\n for t in self.transformations:\n df = t.transform(df)\n return df\n
"},{"location":"api_reference/spark/snowflake.html","title":"Snowflake","text":"Snowflake steps and tasks for Koheesio
Every class in this module is a subclass of Step
or Task
and is used to perform operations on Snowflake.
Notes Every Step in this module is based on SnowflakeBaseModel. The following parameters are available for every Step.
Parameters:
Name Type Description Default url
str
Hostname for the Snowflake account, e.g. .snowflakecomputing.com. Alias for sfURL
. required user
str
Login name for the Snowflake user. Alias for sfUser
.
required password
SecretStr
Password for the Snowflake user. Alias for sfPassword
.
required database
str
The database to use for the session after connecting. Alias for sfDatabase
.
required sfSchema
str
The schema to use for the session after connecting. Alias for schema
(\"schema\" is a reserved name in Pydantic, so we use sfSchema
as main name instead).
required role
str
The default security role to use for the session after connecting. Alias for sfRole
.
required warehouse
str
The default virtual warehouse to use for the session after connecting. Alias for sfWarehouse
.
required authenticator
Optional[str]
Authenticator for the Snowflake user. Example: \"okta.com\".
None
options
Optional[Dict[str, Any]]
Extra options to pass to the Snowflake connector.
{\"sfCompress\": \"on\", \"continue_on_error\": \"off\"}
format
str
The default snowflake
format can be used natively in Databricks, use net.snowflake.spark.snowflake
in other environments and make sure to install required JARs.
\"snowflake\"
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn","title":"koheesio.spark.snowflake.AddColumn","text":"Add an empty column to a Snowflake table with given name and DataType
Example AddColumn(\n database=\"MY_DB\",\n schema_=\"MY_SCHEMA\",\n warehouse=\"MY_WH\",\n user=\"gid.account@nike.com\",\n password=Secret(\"super-secret-password\"),\n role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n table=\"MY_TABLE\",\n col=\"MY_COL\",\n dataType=StringType(),\n).execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.column","title":"column class-attribute
instance-attribute
","text":"column: str = Field(default=..., description='The name of the new column')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.table","title":"table class-attribute
instance-attribute
","text":"table: str = Field(default=..., description='The name of the Snowflake table')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.type","title":"type class-attribute
instance-attribute
","text":"type: DataType = Field(default=..., description='The DataType represented as a Spark DataType')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.Output","title":"Output","text":"Output class for AddColumn
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.Output.query","title":"query class-attribute
instance-attribute
","text":"query: str = Field(default=..., description='Query that was executed to add the column')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n query = f\"ALTER TABLE {self.table} ADD COLUMN {self.column} {map_spark_type(self.type)}\".upper()\n self.output.query = query\n RunQuery(**self.get_options(), query=query).execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame","title":"koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame","text":"Create (or Replace) a Snowflake table which has the same schema as a Spark DataFrame
Can be used as any Transformation. The DataFrame is however left unchanged, and only used for determining the schema of the Snowflake Table that is to be created (or replaced).
Example CreateOrReplaceTableFromDataFrame(\n database=\"MY_DB\",\n schema=\"MY_SCHEMA\",\n warehouse=\"MY_WH\",\n user=\"gid.account@nike.com\",\n password=\"super-secret-password\",\n role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n table=\"MY_TABLE\",\n df=df,\n).execute()\n
Or, as a Transformation:
CreateOrReplaceTableFromDataFrame(\n ...\n table=\"MY_TABLE\",\n).transform(df)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.table","title":"table class-attribute
instance-attribute
","text":"table: str = Field(default=..., alias='table_name', description='The name of the (new) table')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.Output","title":"Output","text":"Output class for CreateOrReplaceTableFromDataFrame
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.Output.input_schema","title":"input_schema class-attribute
instance-attribute
","text":"input_schema: StructType = Field(default=..., description='The original schema from the input DataFrame')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.Output.query","title":"query class-attribute
instance-attribute
","text":"query: str = Field(default=..., description='Query that was executed to create the table')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.Output.snowflake_schema","title":"snowflake_schema class-attribute
instance-attribute
","text":"snowflake_schema: str = Field(default=..., description='Derived Snowflake table schema based on the input DataFrame')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n self.output.df = self.df\n\n input_schema = self.df.schema\n self.output.input_schema = input_schema\n\n snowflake_schema = \", \".join([f\"{c.name} {map_spark_type(c.dataType)}\" for c in input_schema])\n self.output.snowflake_schema = snowflake_schema\n\n table_name = f\"{self.database}.{self.sfSchema}.{self.table}\"\n query = f\"CREATE OR REPLACE TABLE {table_name} ({snowflake_schema})\"\n self.output.query = query\n\n RunQuery(**self.get_options(), query=query).execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.DbTableQuery","title":"koheesio.spark.snowflake.DbTableQuery","text":"Read table from Snowflake using the dbtable
option instead of query
Example DbTableQuery(\n database=\"MY_DB\",\n schema_=\"MY_SCHEMA\",\n warehouse=\"MY_WH\",\n user=\"user\",\n password=Secret(\"super-secret-password\"),\n role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n table=\"db.schema.table\",\n).execute().df\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.DbTableQuery.dbtable","title":"dbtable class-attribute
instance-attribute
","text":"dbtable: str = Field(default=..., alias='table', description='The name of the table')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GetTableSchema","title":"koheesio.spark.snowflake.GetTableSchema","text":"Get the schema from a Snowflake table as a Spark Schema
Notes - This Step will execute a
SELECT * FROM <table> LIMIT 1
query to get the schema of the table. - The schema will be stored in the
table_schema
attribute of the output. table_schema
is used as the attribute name to avoid conflicts with the schema
attribute of Pydantic's BaseModel.
Example schema = (\n GetTableSchema(\n database=\"MY_DB\",\n schema_=\"MY_SCHEMA\",\n warehouse=\"MY_WH\",\n user=\"gid.account@nike.com\",\n password=\"super-secret-password\",\n role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n table=\"MY_TABLE\",\n )\n .execute()\n .table_schema\n)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GetTableSchema.table","title":"table class-attribute
instance-attribute
","text":"table: str = Field(default=..., description='The Snowflake table name')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GetTableSchema.Output","title":"Output","text":"Output class for GetTableSchema
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GetTableSchema.Output.table_schema","title":"table_schema class-attribute
instance-attribute
","text":"table_schema: StructType = Field(default=..., serialization_alias='schema', description='The Spark Schema')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GetTableSchema.execute","title":"execute","text":"execute() -> Output\n
Source code in src/koheesio/spark/snowflake.py
def execute(self) -> Output:\n query = f\"SELECT * FROM {self.table} LIMIT 1\" # nosec B608: hardcoded_sql_expressions\n df = Query(**self.get_options(), query=query).execute().df\n self.output.table_schema = df.schema\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnFullyQualifiedObject","title":"koheesio.spark.snowflake.GrantPrivilegesOnFullyQualifiedObject","text":"Grant Snowflake privileges to a set of roles on a fully qualified object, i.e. database.schema.object_name
This class is a subclass of GrantPrivilegesOnObject
and is used to grant privileges on a fully qualified object. The advantage of using this class is that it sets the object name to be fully qualified, i.e. database.schema.object_name
.
Meaning, you can set the database
, schema
and object
separately and the object name will be set to be fully qualified, i.e. database.schema.object_name
.
Example GrantPrivilegesOnFullyQualifiedObject(\n database=\"MY_DB\",\n schema=\"MY_SCHEMA\",\n warehouse=\"MY_WH\",\n ...\n object=\"MY_TABLE\",\n type=\"TABLE\",\n ...\n)\n
In this example, the object name will be set to be fully qualified, i.e. MY_DB.MY_SCHEMA.MY_TABLE
. If you were to use GrantPrivilegesOnObject
instead, you would have to set the object name to be fully qualified yourself.
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnFullyQualifiedObject.set_object_name","title":"set_object_name","text":"set_object_name()\n
Set the object name to be fully qualified, i.e. database.schema.object_name
Source code in src/koheesio/spark/snowflake.py
@model_validator(mode=\"after\")\ndef set_object_name(self):\n \"\"\"Set the object name to be fully qualified, i.e. database.schema.object_name\"\"\"\n # database, schema, obj_name\n db = self.database\n schema = self.model_dump()[\"sfSchema\"] # since \"schema\" is a reserved name\n obj_name = self.object\n\n self.object = f\"{db}.{schema}.{obj_name}\"\n\n return self\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject","title":"koheesio.spark.snowflake.GrantPrivilegesOnObject","text":"A wrapper on Snowflake GRANT privileges
With this Step, you can grant Snowflake privileges to a set of roles on a table, a view, or an object
See Also https://docs.snowflake.com/en/sql-reference/sql/grant-privilege.html
Parameters:
Name Type Description Default warehouse
str
The name of the warehouse. Alias for sfWarehouse
required user
str
The username. Alias for sfUser
required password
SecretStr
The password. Alias for sfPassword
required role
str
The role name
required object
str
The name of the object to grant privileges on
required type
str
The type of object to grant privileges on, e.g. TABLE, VIEW
required privileges
Union[conlist(str, min_length=1), str]
The Privilege/Permission or list of Privileges/Permissions to grant on the given object.
required roles
Union[conlist(str, min_length=1), str]
The Role or list of Roles to grant the privileges to
required Example GrantPermissionsOnTable(\n object=\"MY_TABLE\",\n type=\"TABLE\",\n warehouse=\"MY_WH\",\n user=\"gid.account@nike.com\",\n password=Secret(\"super-secret-password\"),\n role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n permissions=[\"SELECT\", \"INSERT\"],\n).execute()\n
In this example, the APPLICATION.SNOWFLAKE.ADMIN
role will be granted SELECT
and INSERT
privileges on the MY_TABLE
table using the MY_WH
warehouse.
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.object","title":"object class-attribute
instance-attribute
","text":"object: str = Field(default=..., description='The name of the object to grant privileges on')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.privileges","title":"privileges class-attribute
instance-attribute
","text":"privileges: Union[conlist(str, min_length=1), str] = Field(default=..., alias='permissions', description='The Privilege/Permission or list of Privileges/Permissions to grant on the given object. See https://docs.snowflake.com/en/sql-reference/sql/grant-privilege.html')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.roles","title":"roles class-attribute
instance-attribute
","text":"roles: Union[conlist(str, min_length=1), str] = Field(default=..., alias='role', validation_alias='roles', description='The Role or list of Roles to grant the privileges to')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.type","title":"type class-attribute
instance-attribute
","text":"type: str = Field(default=..., description='The type of object to grant privileges on, e.g. TABLE, VIEW')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.Output","title":"Output","text":"Output class for GrantPrivilegesOnObject
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.Output.query","title":"query class-attribute
instance-attribute
","text":"query: conlist(str, min_length=1) = Field(default=..., description='Query that was executed to grant privileges', validate_default=False)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n self.output.query = []\n roles = self.roles\n\n for role in roles:\n query = self.get_query(role)\n self.output.query.append(query)\n RunQuery(**self.get_options(), query=query).execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.get_query","title":"get_query","text":"get_query(role: str)\n
Build the GRANT query
Parameters:
Name Type Description Default role
str
The role name
required Returns:
Name Type Description query
str
The Query that performs the grant
Source code in src/koheesio/spark/snowflake.py
def get_query(self, role: str):\n \"\"\"Build the GRANT query\n\n Parameters\n ----------\n role: str\n The role name\n\n Returns\n -------\n query : str\n The Query that performs the grant\n \"\"\"\n query = f\"GRANT {','.join(self.privileges)} ON {self.type} {self.object} TO ROLE {role}\".upper()\n return query\n
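For illustration, a minimal sketch of the string this method builds, using the same values as the example above (an assumption for illustration, not output captured from a real run):

# assumed values for illustration; mirrors the f-string in get_query above
privileges, object_type, obj, role = ["SELECT", "INSERT"], "TABLE", "MY_TABLE", "APPLICATION.SNOWFLAKE.ADMIN"
query = f"GRANT {','.join(privileges)} ON {object_type} {obj} TO ROLE {role}".upper()
# query == "GRANT SELECT,INSERT ON TABLE MY_TABLE TO ROLE APPLICATION.SNOWFLAKE.ADMIN"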
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.set_roles_privileges","title":"set_roles_privileges","text":"set_roles_privileges(values)\n
Coerce roles and privileges to be lists if they are not already.
Source code in src/koheesio/spark/snowflake.py
@model_validator(mode=\"before\")\ndef set_roles_privileges(cls, values):\n \"\"\"Coerce roles and privileges to be lists if they are not already.\"\"\"\n roles_value = values.get(\"roles\") or values.get(\"role\")\n privileges_value = values.get(\"privileges\")\n\n if not (roles_value and privileges_value):\n raise ValueError(\"You have to specify roles AND privileges when using 'GrantPrivilegesOnObject'.\")\n\n # coerce values to be lists\n values[\"roles\"] = [roles_value] if isinstance(roles_value, str) else roles_value\n values[\"role\"] = values[\"roles\"][0] # hack to keep the validator happy\n values[\"privileges\"] = [privileges_value] if isinstance(privileges_value, str) else privileges_value\n\n return values\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.validate_object_and_object_type","title":"validate_object_and_object_type","text":"validate_object_and_object_type()\n
Validate that the object and type are set.
Source code in src/koheesio/spark/snowflake.py
@model_validator(mode=\"after\")\ndef validate_object_and_object_type(self):\n \"\"\"Validate that the object and type are set.\"\"\"\n object_value = self.object\n if not object_value:\n raise ValueError(\"You must provide an `object`, this should be the name of the object. \")\n\n object_type = self.type\n if not object_type:\n raise ValueError(\n \"You must provide a `type`, e.g. TABLE, VIEW, DATABASE. \"\n \"See https://docs.snowflake.com/en/sql-reference/sql/grant-privilege.html\"\n )\n\n return self\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnTable","title":"koheesio.spark.snowflake.GrantPrivilegesOnTable","text":"Grant Snowflake privileges to a set of roles on a table
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnTable.object","title":"object class-attribute
instance-attribute
","text":"object: str = Field(default=..., alias='table', description='The name of the Table to grant Privileges on. This should be just the name of the table; so without Database and Schema, use sfDatabase/database and sfSchema/schema to set those instead.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnTable.type","title":"type class-attribute
instance-attribute
","text":"type: str = 'TABLE'\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnView","title":"koheesio.spark.snowflake.GrantPrivilegesOnView","text":"Grant Snowflake privileges to a set of roles on a view
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnView.object","title":"object class-attribute
instance-attribute
","text":"object: str = Field(default=..., alias='view', description='The name of the View to grant Privileges on. This should be just the name of the view; so without Database and Schema, use sfDatabase/database and sfSchema/schema to set those instead.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnView.type","title":"type class-attribute
instance-attribute
","text":"type: str = 'VIEW'\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.Query","title":"koheesio.spark.snowflake.Query","text":"Query data from Snowflake and return the result as a DataFrame
Example Query(\n database=\"MY_DB\",\n schema_=\"MY_SCHEMA\",\n warehouse=\"MY_WH\",\n user=\"gid.account@nike.com\",\n password=Secret(\"super-secret-password\"),\n role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n query=\"SELECT * FROM MY_TABLE\",\n).execute().df\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.Query.query","title":"query class-attribute
instance-attribute
","text":"query: str = Field(default=..., description='The query to run')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.Query.get_options","title":"get_options","text":"get_options()\n
add query to options
Source code in src/koheesio/spark/snowflake.py
def get_options(self):\n \"\"\"add query to options\"\"\"\n options = super().get_options()\n options[\"query\"] = self.query\n return options\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.Query.validate_query","title":"validate_query","text":"validate_query(query)\n
Replace escape characters
Source code in src/koheesio/spark/snowflake.py
@field_validator(\"query\")\ndef validate_query(cls, query):\n \"\"\"Replace escape characters\"\"\"\n query = query.replace(\"\\\\n\", \"\\n\").replace(\"\\\\t\", \"\\t\").strip()\n return query\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.RunQuery","title":"koheesio.spark.snowflake.RunQuery","text":"Run a query on Snowflake that does not return a result, e.g. create table statement
This is a wrapper around 'net.snowflake.spark.snowflake.Utils.runQuery' on the JVM
Example RunQuery(\n database=\"MY_DB\",\n schema=\"MY_SCHEMA\",\n warehouse=\"MY_WH\",\n user=\"account\",\n password=\"***\",\n role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n query=\"CREATE TABLE test (col1 string)\",\n).execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.RunQuery.query","title":"query class-attribute
instance-attribute
","text":"query: str = Field(default=..., description='The query to run', alias='sql')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.RunQuery.execute","title":"execute","text":"execute() -> None\n
Source code in src/koheesio/spark/snowflake.py
def execute(self) -> None:\n if not self.query:\n self.log.warning(\"Empty string given as query input, skipping execution\")\n return\n # noinspection PyProtectedMember\n self.spark._jvm.net.snowflake.spark.snowflake.Utils.runQuery(self.get_options(), self.query)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.RunQuery.get_options","title":"get_options","text":"get_options()\n
Source code in src/koheesio/spark/snowflake.py
def get_options(self):\n # Executing the RunQuery without `host` option in Databricks throws:\n # An error occurred while calling z:net.snowflake.spark.snowflake.Utils.runQuery.\n # : java.util.NoSuchElementException: key not found: host\n options = super().get_options()\n options[\"host\"] = options[\"sfURL\"]\n return options\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.RunQuery.validate_query","title":"validate_query","text":"validate_query(query)\n
Replace escape characters
Source code in src/koheesio/spark/snowflake.py
@field_validator(\"query\")\ndef validate_query(cls, query):\n \"\"\"Replace escape characters\"\"\"\n return query.replace(\"\\\\n\", \"\\n\").replace(\"\\\\t\", \"\\t\").strip()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel","title":"koheesio.spark.snowflake.SnowflakeBaseModel","text":"BaseModel for setting up Snowflake Driver options.
Notes - Snowflake is supported natively in Databricks 4.2 and newer: https://docs.snowflake.com/en/user-guide/spark-connector-databricks
- Refer to Snowflake docs for the installation instructions for non-Databricks environments: https://docs.snowflake.com/en/user-guide/spark-connector-install
- Refer to Snowflake docs for connection options: https://docs.snowflake.com/en/user-guide/spark-connector-use#setting-configuration-options-for-the-connector
Parameters:
Name Type Description Default url
str
Hostname for the Snowflake account, e.g. <account>.snowflakecomputing.com. Alias for sfURL
. required user
str
Login name for the Snowflake user. Alias for sfUser
.
required password
SecretStr
Password for the Snowflake user. Alias for sfPassword
.
required database
str
The database to use for the session after connecting. Alias for sfDatabase
.
required sfSchema
str
The schema to use for the session after connecting. Alias for schema
(\"schema\" is a reserved name in Pydantic, so we use sfSchema
as the main name instead).
required role
str
The default security role to use for the session after connecting. Alias for sfRole
.
required warehouse
str
The default virtual warehouse to use for the session after connecting. Alias for sfWarehouse
.
required authenticator
Optional[str]
Authenticator for the Snowflake user. Example: \"okta.com\".
None
options
Optional[Dict[str, Any]]
Extra options to pass to the Snowflake connector.
{\"sfCompress\": \"on\", \"continue_on_error\": \"off\"}
format
str
The default snowflake
format can be used natively in Databricks, use net.snowflake.spark.snowflake
in other environments and make sure to install required JARs.
\"snowflake\"
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.authenticator","title":"authenticator class-attribute
instance-attribute
","text":"authenticator: Optional[str] = Field(default=None, description='Authenticator for the Snowflake user', examples=['okta.com'])\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.database","title":"database class-attribute
instance-attribute
","text":"database: str = Field(default=..., alias='sfDatabase', description='The database to use for the session after connecting')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.format","title":"format class-attribute
instance-attribute
","text":"format: str = Field(default='snowflake', description='The default `snowflake` format can be used natively in Databricks, use `net.snowflake.spark.snowflake` in other environments and make sure to install required JARs.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.options","title":"options class-attribute
instance-attribute
","text":"options: Optional[Dict[str, Any]] = Field(default={'sfCompress': 'on', 'continue_on_error': 'off'}, description='Extra options to pass to the Snowflake connector')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.password","title":"password class-attribute
instance-attribute
","text":"password: SecretStr = Field(default=..., alias='sfPassword', description='Password for the Snowflake user')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.role","title":"role class-attribute
instance-attribute
","text":"role: str = Field(default=..., alias='sfRole', description='The default security role to use for the session after connecting')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.sfSchema","title":"sfSchema class-attribute
instance-attribute
","text":"sfSchema: str = Field(default=..., alias='schema', description='The schema to use for the session after connecting')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.url","title":"url class-attribute
instance-attribute
","text":"url: str = Field(default=..., alias='sfURL', description='Hostname for the Snowflake account, e.g. <account>.snowflakecomputing.com', examples=['example.snowflakecomputing.com'])\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.user","title":"user class-attribute
instance-attribute
","text":"user: str = Field(default=..., alias='sfUser', description='Login name for the Snowflake user')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.warehouse","title":"warehouse class-attribute
instance-attribute
","text":"warehouse: str = Field(default=..., alias='sfWarehouse', description='The default virtual warehouse to use for the session after connecting')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.get_options","title":"get_options","text":"get_options()\n
Get the sfOptions as a dictionary.
Source code in src/koheesio/spark/snowflake.py
def get_options(self):\n \"\"\"Get the sfOptions as a dictionary.\"\"\"\n return {\n key: value\n for key, value in {\n \"sfURL\": self.url,\n \"sfUser\": self.user,\n \"sfPassword\": self.password.get_secret_value(),\n \"authenticator\": self.authenticator,\n \"sfDatabase\": self.database,\n \"sfSchema\": self.sfSchema,\n \"sfRole\": self.role,\n \"sfWarehouse\": self.warehouse,\n **self.options,\n }.items()\n if value is not None\n }\n
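As an illustration, the dictionary built by get_options would look roughly as follows; the connection values are placeholders, and authenticator is omitted because it defaults to None and None values are dropped:

# placeholder values, for illustration only; keys mirror the get_options implementation above
options = {
    "sfURL": "example.snowflakecomputing.com",
    "sfUser": "YOUR_USERNAME",
    "sfPassword": "***",
    "sfDatabase": "db",
    "sfSchema": "schema",
    "sfRole": "ADMIN",
    "sfWarehouse": "MY_WH",
    "sfCompress": "on",          # from the default `options` field
    "continue_on_error": "off",  # from the default `options` field
}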
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeReader","title":"koheesio.spark.snowflake.SnowflakeReader","text":"Wrapper around JdbcReader for Snowflake.
Example sr = SnowflakeReader(\n url=\"foo.snowflakecomputing.com\",\n user=\"YOUR_USERNAME\",\n password=\"***\",\n database=\"db\",\n schema=\"schema\",\n)\ndf = sr.read()\n
Notes - Snowflake is supported natively in Databricks 4.2 and newer: https://docs.snowflake.com/en/user-guide/spark-connector-databricks
- Refer to Snowflake docs for the installation instructions for non-Databricks environments: https://docs.snowflake.com/en/user-guide/spark-connector-install
- Refer to Snowflake docs for connection options: https://docs.snowflake.com/en/user-guide/spark-connector-use#setting-configuration-options-for-the-connector
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeReader.driver","title":"driver class-attribute
instance-attribute
","text":"driver: Optional[str] = None\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeStep","title":"koheesio.spark.snowflake.SnowflakeStep","text":"Expands the SnowflakeBaseModel so that it can be used as a Step
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeTableStep","title":"koheesio.spark.snowflake.SnowflakeTableStep","text":"Expands the SnowflakeStep, adding a 'table' parameter
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeTableStep.table","title":"table class-attribute
instance-attribute
","text":"table: str = Field(default=..., description='The name of the table')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeTableStep.get_options","title":"get_options","text":"get_options()\n
Source code in src/koheesio/spark/snowflake.py
def get_options(self):\n options = super().get_options()\n options[\"table\"] = self.table\n return options\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeTransformation","title":"koheesio.spark.snowflake.SnowflakeTransformation","text":"Adds Snowflake parameters to the Transformation class
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeWriter","title":"koheesio.spark.snowflake.SnowflakeWriter","text":"Class for writing to Snowflake
See Also - koheesio.steps.writers.Writer
- koheesio.steps.writers.BatchOutputMode
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeWriter.insert_type","title":"insert_type class-attribute
instance-attribute
","text":"insert_type: Optional[BatchOutputMode] = Field(APPEND, alias='mode', description='The insertion type, append or overwrite')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeWriter.table","title":"table class-attribute
instance-attribute
","text":"table: str = Field(default=..., description='Target table name')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeWriter.execute","title":"execute","text":"execute()\n
Write to Snowflake
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n \"\"\"Write to Snowflake\"\"\"\n self.log.debug(f\"writing to {self.table} with mode {self.insert_type}\")\n self.df.write.format(self.format).options(**self.get_options()).option(\"dbtable\", self.table).mode(\n self.insert_type\n ).save()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema","title":"koheesio.spark.snowflake.SyncTableAndDataFrameSchema","text":"Sync the schema's of a Snowflake table and a DataFrame. This will add NULL columns for the columns that are not in both and perform type casts where needed.
The Snowflake table will take priority in case of type conflicts.
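A minimal usage sketch; the connection values are placeholders, `spark` is assumed to be an active SparkSession, and the output is accessed via the same .execute().<field> pattern used in the other examples on this page:

from koheesio.spark.snowflake import SyncTableAndDataFrameSchema

# assumes an active SparkSession `spark`; the data is a placeholder
my_df = spark.createDataFrame([(1, "Ohio")], ["id", "state"])

synced = SyncTableAndDataFrameSchema(
    url="example.snowflakecomputing.com",
    user="YOUR_USERNAME",
    password="***",
    database="db",
    schema="schema",
    role="ADMIN",
    warehouse="MY_WH",
    df=my_df,
    table="my_table",  # the Snowflake table to align with
    dry_run=True,      # only report schema differences, do not alter anything
).execute()
synced.df  # with dry_run=True this is the unchanged input DataFrame; differences are only logged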
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.df","title":"df class-attribute
instance-attribute
","text":"df: DataFrame = Field(default=..., description='The Spark DataFrame')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.dry_run","title":"dry_run class-attribute
instance-attribute
","text":"dry_run: Optional[bool] = Field(default=False, description='Only show schema differences, do not apply changes')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.table","title":"table class-attribute
instance-attribute
","text":"table: str = Field(default=..., description='The table name')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output","title":"Output","text":"Output class for SyncTableAndDataFrameSchema
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output.new_df_schema","title":"new_df_schema class-attribute
instance-attribute
","text":"new_df_schema: StructType = Field(default=..., description='New DataFrame schema')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output.new_sf_schema","title":"new_sf_schema class-attribute
instance-attribute
","text":"new_sf_schema: StructType = Field(default=..., description='New Snowflake schema')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output.original_df_schema","title":"original_df_schema class-attribute
instance-attribute
","text":"original_df_schema: StructType = Field(default=..., description='Original DataFrame schema')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output.original_sf_schema","title":"original_sf_schema class-attribute
instance-attribute
","text":"original_sf_schema: StructType = Field(default=..., description='Original Snowflake schema')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output.sf_table_altered","title":"sf_table_altered class-attribute
instance-attribute
","text":"sf_table_altered: bool = Field(default=False, description='Flag to indicate whether Snowflake schema has been altered')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n self.log.warning(\"Snowflake table will always take a priority in case of data type conflicts!\")\n\n # spark side\n df_schema = self.df.schema\n self.output.original_df_schema = deepcopy(df_schema) # using deepcopy to avoid storing in place changes\n df_cols = [c.name.lower() for c in df_schema]\n\n # snowflake side\n sf_schema = GetTableSchema(**self.get_options(), table=self.table).execute().table_schema\n self.output.original_sf_schema = sf_schema\n sf_cols = [c.name.lower() for c in sf_schema]\n\n if self.dry_run:\n # Display differences between Spark DataFrame and Snowflake schemas\n # and provide dummy values that are expected as class outputs.\n self.log.warning(f\"Columns to be added to Snowflake table: {set(df_cols) - set(sf_cols)}\")\n self.log.warning(f\"Columns to be added to Spark DataFrame: {set(sf_cols) - set(df_cols)}\")\n\n self.output.new_df_schema = t.StructType()\n self.output.new_sf_schema = t.StructType()\n self.output.df = self.df\n self.output.sf_table_altered = False\n\n else:\n # Add columns to SnowFlake table that exist in DataFrame\n for df_column in df_schema:\n if df_column.name.lower() not in sf_cols:\n AddColumn(\n **self.get_options(),\n table=self.table,\n column=df_column.name,\n type=df_column.dataType,\n ).execute()\n self.output.sf_table_altered = True\n\n if self.output.sf_table_altered:\n sf_schema = GetTableSchema(**self.get_options(), table=self.table).execute().table_schema\n sf_cols = [c.name.lower() for c in sf_schema]\n\n self.output.new_sf_schema = sf_schema\n\n # Add NULL columns to the DataFrame if they exist in SnowFlake but not in the df\n df = self.df\n for sf_col in self.output.original_sf_schema:\n sf_col_name = sf_col.name.lower()\n if sf_col_name not in df_cols:\n sf_col_type = sf_col.dataType\n df = df.withColumn(sf_col_name, f.lit(None).cast(sf_col_type))\n\n # Put DataFrame columns in the same order as the Snowflake table\n df = df.select(*sf_cols)\n\n self.output.df = df\n self.output.new_df_schema = df.schema\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask","title":"koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask","text":"Synchronize a Delta table to a Snowflake table
- Overwrite - only in batch mode
- Append - supports batch and streaming mode
- Merge - only in streaming mode
Example SynchronizeDeltaToSnowflakeTask(\n url=\"acme.snowflakecomputing.com\",\n user=\"admin\",\n role=\"ADMIN\",\n warehouse=\"SF_WAREHOUSE\",\n database=\"SF_DATABASE\",\n schema=\"SF_SCHEMA\",\n source_table=DeltaTableStep(...),\n target_table=\"my_sf_table\",\n key_columns=[\n \"id\",\n ],\n streaming=False,\n).run()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.checkpoint_location","title":"checkpoint_location class-attribute
instance-attribute
","text":"checkpoint_location: Optional[str] = Field(default=None, description='Checkpoint location to use')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.enable_deletion","title":"enable_deletion class-attribute
instance-attribute
","text":"enable_deletion: Optional[bool] = Field(default=False, description='In case of merge synchronisation_mode add deletion statement in merge query.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.key_columns","title":"key_columns class-attribute
instance-attribute
","text":"key_columns: Optional[List[str]] = Field(default_factory=list, description='Key columns on which merge statements will be MERGE statement will be applied.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.non_key_columns","title":"non_key_columns property
","text":"non_key_columns: List[str]\n
Columns of source table that aren't part of the (composite) primary key
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.persist_staging","title":"persist_staging class-attribute
instance-attribute
","text":"persist_staging: Optional[bool] = Field(default=False, description='In case of debugging, set `persist_staging` to True to retain the staging table for inspection after synchronization.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.reader","title":"reader property
","text":"reader\n
DeltaTable reader
Returns: DeltaTableReader that will yield the source delta table\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.schema_tracking_location","title":"schema_tracking_location class-attribute
instance-attribute
","text":"schema_tracking_location: Optional[str] = Field(default=None, description='Schema tracking location to use. Info: https://docs.delta.io/latest/delta-streaming.html#-schema-tracking')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.source_table","title":"source_table class-attribute
instance-attribute
","text":"source_table: DeltaTableStep = Field(default=..., description='Source delta table to synchronize')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.staging_table","title":"staging_table property
","text":"staging_table\n
Intermediate table on Snowflake where staging results are stored
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.staging_table_name","title":"staging_table_name class-attribute
instance-attribute
","text":"staging_table_name: Optional[str] = Field(default=None, alias='staging_table', description='Optional snowflake staging name', validate_default=False)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.streaming","title":"streaming class-attribute
instance-attribute
","text":"streaming: Optional[bool] = Field(default=False, description=\"Should synchronisation happen in streaming or in batch mode. Streaming is supported in 'APPEND' and 'MERGE' mode. Batch is supported in 'OVERWRITE' and 'APPEND' mode.\")\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.synchronisation_mode","title":"synchronisation_mode class-attribute
instance-attribute
","text":"synchronisation_mode: BatchOutputMode = Field(default=MERGE, description=\"Determines if synchronisation will 'overwrite' any existing table, 'append' new rows or 'merge' with existing rows.\")\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.target_table","title":"target_table class-attribute
instance-attribute
","text":"target_table: str = Field(default=..., description='Target table in snowflake to synchronize to')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.writer","title":"writer property
","text":"writer: Union[ForEachBatchStreamWriter, SnowflakeWriter]\n
Writer to persist to snowflake
Depending on the configured options, this returns a SnowflakeWriter or a ForEachBatchStreamWriter: - OVERWRITE/APPEND mode yields SnowflakeWriter - MERGE mode yields ForEachBatchStreamWriter
Returns:
Type Description Union[ForEachBatchStreamWriter, SnowflakeWriter]
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.writer_","title":"writer_ class-attribute
instance-attribute
","text":"writer_: Optional[Union[ForEachBatchStreamWriter, SnowflakeWriter]] = None\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.drop_table","title":"drop_table","text":"drop_table(snowflake_table)\n
Drop a given snowflake table
Source code in src/koheesio/spark/snowflake.py
def drop_table(self, snowflake_table):\n \"\"\"Drop a given snowflake table\"\"\"\n self.log.warning(f\"Dropping table {snowflake_table} from snowflake\")\n drop_table_query = f\"\"\"DROP TABLE IF EXISTS {snowflake_table}\"\"\"\n query_executor = RunQuery(**self.get_options(), query=drop_table_query)\n query_executor.execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.execute","title":"execute","text":"execute() -> None\n
Source code in src/koheesio/spark/snowflake.py
def execute(self) -> None:\n # extract\n df = self.extract()\n self.output.source_df = df\n\n # synchronize\n self.output.target_df = df\n self.load(df)\n if not self.persist_staging:\n # If it's a streaming job, await for termination before dropping staging table\n if self.streaming:\n self.writer.await_termination()\n self.drop_table(self.staging_table)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.extract","title":"extract","text":"extract() -> DataFrame\n
Extract source table
Source code in src/koheesio/spark/snowflake.py
def extract(self) -> DataFrame:\n \"\"\"\n Extract source table\n \"\"\"\n if self.synchronisation_mode == BatchOutputMode.MERGE:\n if not self.source_table.is_cdf_active:\n raise RuntimeError(\n f\"Source table {self.source_table.table_name} does not have CDF enabled. \"\n f\"Set TBLPROPERTIES ('delta.enableChangeDataFeed' = true) to enable. \"\n f\"Current properties = {self.source_table_properties}\"\n )\n\n df = self.reader.read()\n self.output.source_df = df\n return df\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.load","title":"load","text":"load(df) -> DataFrame\n
Load source table into snowflake
Source code in src/koheesio/spark/snowflake.py
def load(self, df) -> DataFrame:\n \"\"\"Load source table into snowflake\"\"\"\n if self.synchronisation_mode == BatchOutputMode.MERGE:\n self.log.info(f\"Truncating staging table {self.staging_table}\")\n self.truncate_table(self.staging_table)\n self.writer.write(df)\n self.output.target_df = df\n return df\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.run","title":"run","text":"run()\n
alias of execute
Source code in src/koheesio/spark/snowflake.py
def run(self):\n \"\"\"alias of execute\"\"\"\n return self.execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.truncate_table","title":"truncate_table","text":"truncate_table(snowflake_table)\n
Truncate a given snowflake table
Source code in src/koheesio/spark/snowflake.py
def truncate_table(self, snowflake_table):\n \"\"\"Truncate a given snowflake table\"\"\"\n truncate_query = f\"\"\"TRUNCATE TABLE IF EXISTS {snowflake_table}\"\"\"\n query_executor = RunQuery(\n **self.get_options(),\n query=truncate_query,\n )\n query_executor.execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TableExists","title":"koheesio.spark.snowflake.TableExists","text":"Check if the table exists in Snowflake by using INFORMATION_SCHEMA.
Example k = TableExists(\n url=\"foo.snowflakecomputing.com\",\n user=\"YOUR_USERNAME\",\n password=\"***\",\n database=\"db\",\n schema=\"schema\",\n table=\"table\",\n)\n
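The result can then be read from the step's Output, following the same .execute().<output field> pattern as the other examples on this page (a sketch continuing the example above):

exists = k.execute().exists  # True if the table was found in INFORMATION_SCHEMA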
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TableExists.Output","title":"Output","text":"Output class for TableExists
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TableExists.Output.exists","title":"exists class-attribute
instance-attribute
","text":"exists: bool = Field(default=..., description='Whether or not the table exists')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TableExists.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n query = (\n dedent(\n # Force upper case, due to case-sensitivity of where clause\n f\"\"\"\n SELECT *\n FROM INFORMATION_SCHEMA.TABLES\n WHERE TABLE_CATALOG = '{self.database}'\n AND TABLE_SCHEMA = '{self.sfSchema}'\n AND TABLE_TYPE = 'BASE TABLE'\n AND upper(TABLE_NAME) = '{self.table.upper()}'\n \"\"\" # nosec B608: hardcoded_sql_expressions\n )\n .upper()\n .strip()\n )\n\n self.log.debug(f\"Query that was executed to check if the table exists:\\n{query}\")\n\n df = Query(**self.get_options(), query=query).read()\n\n exists = df.count() > 0\n self.log.info(f\"Table {self.table} {'exists' if exists else 'does not exist'}\")\n self.output.exists = exists\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TagSnowflakeQuery","title":"koheesio.spark.snowflake.TagSnowflakeQuery","text":"Provides Snowflake query tag pre-action that can be used to easily find queries through SF history search and further group them for debugging and cost tracking purposes.
Takes in query tag attributes as kwargs and additional Snowflake options dict that can optionally contain other set of pre-actions to be applied to a query, in that case existing pre-action aren't dropped, query tag pre-action will be added to them.
Passed Snowflake options dictionary is not modified in-place, instead anew dictionary containing updated pre-actions is returned.
Notes See this article for explanation: https://select.dev/posts/snowflake-query-tags
Arbitrary tags can be applied, such as team, dataset names, business capability, etc.
Example query_tag = TagSnowflakeQuery(\n    options={\"preactions\": ...},\n    task_name=\"cleanse_task\",\n    pipeline_name=\"ingestion-pipeline\",\n    etl_date=\"2022-01-01\",\n    pipeline_execution_time=\"2022-01-01T00:00:00\",\n    task_execution_time=\"2022-01-01T01:00:00\",\n    environment=\"dev\",\n    trace_id=\"e0fdec43-a045-46e5-9705-acd4f3f96045\",\n    span_id=\"cb89abea-1c12-471f-8b12-546d2d66f6cb\",\n).execute().options\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TagSnowflakeQuery.options","title":"options class-attribute
instance-attribute
","text":"options: Dict = Field(default_factory=dict, description='Additional Snowflake options, optionally containing additional preactions')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TagSnowflakeQuery.Output","title":"Output","text":"Output class for AddQueryTag
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TagSnowflakeQuery.Output.options","title":"options class-attribute
instance-attribute
","text":"options: Dict = Field(default=..., description='Copy of provided SF options, with added query tag preaction')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TagSnowflakeQuery.execute","title":"execute","text":"execute()\n
Add query tag preaction to Snowflake options
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n \"\"\"Add query tag preaction to Snowflake options\"\"\"\n tag_json = json.dumps(self.extra_params, indent=4, sort_keys=True)\n tag_preaction = f\"ALTER SESSION SET QUERY_TAG = '{tag_json}';\"\n preactions = self.options.get(\"preactions\", \"\")\n preactions = f\"{preactions}\\n{tag_preaction}\".strip()\n updated_options = dict(self.options)\n updated_options[\"preactions\"] = preactions\n self.output.options = updated_options\n
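For illustration, a small sketch of the pre-action string this builds; the tag keys and values below are assumptions, not required fields:

import json

# assumed tag values, for illustration only; mirrors the execute() implementation above
tag_json = json.dumps({"pipeline_name": "ingestion-pipeline", "team": "data-eng"}, indent=4, sort_keys=True)
tag_preaction = f"ALTER SESSION SET QUERY_TAG = '{tag_json}';"
# tag_preaction is appended to whatever is already present in options["preactions"]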
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.map_spark_type","title":"koheesio.spark.snowflake.map_spark_type","text":"map_spark_type(spark_type: DataType)\n
Translates Spark DataFrame Schema type to SnowFlake type
Basic Types Snowflake Type StringType STRING NullType STRING BooleanType BOOLEAN Numeric Types Snowflake Type LongType BIGINT IntegerType INT ShortType SMALLINT DoubleType DOUBLE FloatType FLOAT NumericType FLOAT ByteType BINARY Date / Time Types Snowflake Type DateType DATE TimestampType TIMESTAMP Advanced Types Snowflake Type DecimalType DECIMAL MapType VARIANT ArrayType VARIANT StructType VARIANT References - Spark SQL DataTypes: https://spark.apache.org/docs/latest/sql-ref-datatypes.html
- Snowflake DataTypes: https://docs.snowflake.com/en/sql-reference/data-types.html
Parameters:
Name Type Description Default spark_type
DataType
DataType taken out of the StructField
required Returns:
Type Description str
The Snowflake data type
Source code in src/koheesio/spark/snowflake.py
def map_spark_type(spark_type: t.DataType):\n \"\"\"\n Translates Spark DataFrame Schema type to SnowFlake type\n\n | Basic Types | Snowflake Type |\n |-------------------|----------------|\n | StringType | STRING |\n | NullType | STRING |\n | BooleanType | BOOLEAN |\n\n | Numeric Types | Snowflake Type |\n |-------------------|----------------|\n | LongType | BIGINT |\n | IntegerType | INT |\n | ShortType | SMALLINT |\n | DoubleType | DOUBLE |\n | FloatType | FLOAT |\n | NumericType | FLOAT |\n | ByteType | BINARY |\n\n | Date / Time Types | Snowflake Type |\n |-------------------|----------------|\n | DateType | DATE |\n | TimestampType | TIMESTAMP |\n\n | Advanced Types | Snowflake Type |\n |-------------------|----------------|\n | DecimalType | DECIMAL |\n | MapType | VARIANT |\n | ArrayType | VARIANT |\n | StructType | VARIANT |\n\n References\n ----------\n - Spark SQL DataTypes: https://spark.apache.org/docs/latest/sql-ref-datatypes.html\n - Snowflake DataTypes: https://docs.snowflake.com/en/sql-reference/data-types.html\n\n Parameters\n ----------\n spark_type : pyspark.sql.types.DataType\n DataType taken out of the StructField\n\n Returns\n -------\n str\n The Snowflake data type\n \"\"\"\n # StructField means that the entire Field was passed, we need to extract just the dataType before continuing\n if isinstance(spark_type, t.StructField):\n spark_type = spark_type.dataType\n\n # Check if the type is DayTimeIntervalType\n if isinstance(spark_type, t.DayTimeIntervalType):\n warn(\n \"DayTimeIntervalType is being converted to STRING. \"\n \"Consider converting to a more supported date/time/timestamp type in Snowflake.\"\n )\n\n # fmt: off\n # noinspection PyUnresolvedReferences\n data_type_map = {\n # Basic Types\n t.StringType: \"STRING\",\n t.NullType: \"STRING\",\n t.BooleanType: \"BOOLEAN\",\n\n # Numeric Types\n t.LongType: \"BIGINT\",\n t.IntegerType: \"INT\",\n t.ShortType: \"SMALLINT\",\n t.DoubleType: \"DOUBLE\",\n t.FloatType: \"FLOAT\",\n t.NumericType: \"FLOAT\",\n t.ByteType: \"BINARY\",\n t.BinaryType: \"VARBINARY\",\n\n # Date / Time Types\n t.DateType: \"DATE\",\n t.TimestampType: \"TIMESTAMP\",\n t.DayTimeIntervalType: \"STRING\",\n\n # Advanced Types\n t.DecimalType:\n f\"DECIMAL({spark_type.precision},{spark_type.scale})\" # pylint: disable=no-member\n if isinstance(spark_type, t.DecimalType) else \"DECIMAL(38,0)\",\n t.MapType: \"VARIANT\",\n t.ArrayType: \"VARIANT\",\n t.StructType: \"VARIANT\",\n }\n return data_type_map.get(type(spark_type), 'STRING')\n
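A short usage sketch based on the mapping above; map_spark_type is imported from the module documented on this page:

import pyspark.sql.types as t
from koheesio.spark.snowflake import map_spark_type

map_spark_type(t.StringType())                # -> "STRING"
map_spark_type(t.DecimalType(38, 18))         # -> "DECIMAL(38,18)"
map_spark_type(t.ArrayType(t.IntegerType()))  # -> "VARIANT"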
"},{"location":"api_reference/spark/utils.html","title":"Utils","text":"Spark Utility functions
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.spark_minor_version","title":"koheesio.spark.utils.spark_minor_version module-attribute
","text":"spark_minor_version: float = get_spark_minor_version()\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype","title":"koheesio.spark.utils.SparkDatatype","text":"Allowed spark datatypes
The following table lists the data types that are supported by Spark SQL.
Data type SQL name ByteType BYTE, TINYINT ShortType SHORT, SMALLINT IntegerType INT, INTEGER LongType LONG, BIGINT FloatType FLOAT, REAL DoubleType DOUBLE DecimalType DECIMAL, DEC, NUMERIC StringType STRING BinaryType BINARY BooleanType BOOLEAN TimestampType TIMESTAMP, TIMESTAMP_LTZ DateType DATE ArrayType ARRAY MapType MAP NullType VOID Not supported yet - TimestampNTZType TIMESTAMP_NTZ
- YearMonthIntervalType INTERVAL YEAR, INTERVAL YEAR TO MONTH, INTERVAL MONTH
- DayTimeIntervalType INTERVAL DAY, INTERVAL DAY TO HOUR, INTERVAL DAY TO MINUTE, INTERVAL DAY TO SECOND, INTERVAL HOUR, INTERVAL HOUR TO MINUTE, INTERVAL HOUR TO SECOND, INTERVAL MINUTE, INTERVAL MINUTE TO SECOND, INTERVAL SECOND
See Also https://spark.apache.org/docs/latest/sql-ref-datatypes.html#supported-data-types
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.ARRAY","title":"ARRAY class-attribute
instance-attribute
","text":"ARRAY = 'array'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.BIGINT","title":"BIGINT class-attribute
instance-attribute
","text":"BIGINT = 'long'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.BINARY","title":"BINARY class-attribute
instance-attribute
","text":"BINARY = 'binary'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.BOOLEAN","title":"BOOLEAN class-attribute
instance-attribute
","text":"BOOLEAN = 'boolean'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.BYTE","title":"BYTE class-attribute
instance-attribute
","text":"BYTE = 'byte'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.DATE","title":"DATE class-attribute
instance-attribute
","text":"DATE = 'date'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.DEC","title":"DEC class-attribute
instance-attribute
","text":"DEC = 'decimal'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.DECIMAL","title":"DECIMAL class-attribute
instance-attribute
","text":"DECIMAL = 'decimal'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.DOUBLE","title":"DOUBLE class-attribute
instance-attribute
","text":"DOUBLE = 'double'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.FLOAT","title":"FLOAT class-attribute
instance-attribute
","text":"FLOAT = 'float'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.INT","title":"INT class-attribute
instance-attribute
","text":"INT = 'integer'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.INTEGER","title":"INTEGER class-attribute
instance-attribute
","text":"INTEGER = 'integer'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.LONG","title":"LONG class-attribute
instance-attribute
","text":"LONG = 'long'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.MAP","title":"MAP class-attribute
instance-attribute
","text":"MAP = 'map'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.NUMERIC","title":"NUMERIC class-attribute
instance-attribute
","text":"NUMERIC = 'decimal'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.REAL","title":"REAL class-attribute
instance-attribute
","text":"REAL = 'float'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.SHORT","title":"SHORT class-attribute
instance-attribute
","text":"SHORT = 'short'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.SMALLINT","title":"SMALLINT class-attribute
instance-attribute
","text":"SMALLINT = 'short'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.STRING","title":"STRING class-attribute
instance-attribute
","text":"STRING = 'string'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.TIMESTAMP","title":"TIMESTAMP class-attribute
instance-attribute
","text":"TIMESTAMP = 'timestamp'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.TIMESTAMP_LTZ","title":"TIMESTAMP_LTZ class-attribute
instance-attribute
","text":"TIMESTAMP_LTZ = 'timestamp'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.TINYINT","title":"TINYINT class-attribute
instance-attribute
","text":"TINYINT = 'byte'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.VOID","title":"VOID class-attribute
instance-attribute
","text":"VOID = 'void'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.spark_type","title":"spark_type property
","text":"spark_type: DataType\n
Returns the spark type for the given enum value
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.from_string","title":"from_string classmethod
","text":"from_string(value: str) -> SparkDatatype\n
Allows for getting the right Enum value by simply passing a string value. This method is not case-sensitive.
Source code in src/koheesio/spark/utils.py
@classmethod\ndef from_string(cls, value: str) -> \"SparkDatatype\":\n \"\"\"Allows for getting the right Enum value by simply passing a string value\n This method is not case-sensitive\n \"\"\"\n return getattr(cls, value.upper())\n
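A short usage sketch; the import path follows the module documented on this page:

from koheesio.spark.utils import SparkDatatype

SparkDatatype.from_string("bigint")  # -> SparkDatatype.BIGINT
SparkDatatype.from_string("BigInt")  # case-insensitive, same result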
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.get_spark_minor_version","title":"koheesio.spark.utils.get_spark_minor_version","text":"get_spark_minor_version() -> float\n
Returns the minor version of the spark instance.
For example, if the spark version is 3.3.2, this function would return 3.3
Source code in src/koheesio/spark/utils.py
def get_spark_minor_version() -> float:\n \"\"\"Returns the minor version of the spark instance.\n\n For example, if the spark version is 3.3.2, this function would return 3.3\n \"\"\"\n return float(\".\".join(spark_version.split(\".\")[:2]))\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.on_databricks","title":"koheesio.spark.utils.on_databricks","text":"on_databricks() -> bool\n
Retrieve if we're running on databricks or elsewhere
Source code in src/koheesio/spark/utils.py
def on_databricks() -> bool:\n \"\"\"Retrieve if we're running on databricks or elsewhere\"\"\"\n dbr_version = os.getenv(\"DATABRICKS_RUNTIME_VERSION\", None)\n return dbr_version is not None and dbr_version != \"\"\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.schema_struct_to_schema_str","title":"koheesio.spark.utils.schema_struct_to_schema_str","text":"schema_struct_to_schema_str(schema: StructType) -> str\n
Converts a StructType to a schema str
Source code in src/koheesio/spark/utils.py
def schema_struct_to_schema_str(schema: StructType) -> str:\n \"\"\"Converts a StructType to a schema str\"\"\"\n if not schema:\n return \"\"\n return \",\\n\".join([f\"{field.name} {field.dataType.typeName().upper()}\" for field in schema.fields])\n
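A short usage sketch showing the string form produced for a small schema:

from pyspark.sql.types import IntegerType, StringType, StructField, StructType
from koheesio.spark.utils import schema_struct_to_schema_str

schema = StructType([StructField("id", IntegerType()), StructField("name", StringType())])
schema_struct_to_schema_str(schema)  # -> "id INTEGER,\nname STRING"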
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.spark_data_type_is_array","title":"koheesio.spark.utils.spark_data_type_is_array","text":"spark_data_type_is_array(data_type: DataType) -> bool\n
Check if the column's dataType is of type ArrayType
Source code in src/koheesio/spark/utils.py
def spark_data_type_is_array(data_type: DataType) -> bool:\n \"\"\"Check if the column's dataType is of type ArrayType\"\"\"\n return isinstance(data_type, ArrayType)\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.spark_data_type_is_numeric","title":"koheesio.spark.utils.spark_data_type_is_numeric","text":"spark_data_type_is_numeric(data_type: DataType) -> bool\n
Check if the column's dataType is of a numeric type
Source code in src/koheesio/spark/utils.py
def spark_data_type_is_numeric(data_type: DataType) -> bool:\n    \"\"\"Check if the column's dataType is of a numeric type\"\"\"\n    return isinstance(data_type, (IntegerType, LongType, FloatType, DoubleType, DecimalType))\n
"},{"location":"api_reference/spark/readers/index.html","title":"Readers","text":"Readers are a type of Step that read data from a source based on the input parameters and stores the result in self.output.df.
For a comprehensive guide on the usage, examples, and additional features of Reader classes, please refer to the reference/concepts/steps/readers section of the Koheesio documentation.
"},{"location":"api_reference/spark/readers/index.html#koheesio.spark.readers.Reader","title":"koheesio.spark.readers.Reader","text":"Base class for all Readers
A Reader is a Step that reads data from a source based on the input parameters and stores the result in self.output.df (DataFrame).
When implementing a Reader, the execute() method should be implemented. The execute() method should read from the source and store the result in self.output.df.
The Reader class implements a standard read() method that calls the execute() method and returns the result. This method can be used to read data from a Reader without having to call the execute() method directly. The read() method does not need to be implemented in the child class.
Every Reader has a SparkSession available as self.spark. This is the currently active SparkSession.
The Reader class also implements a shorthand for accessing the output DataFrame through the df property. If the output.df is None, .execute() will be run first.
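A minimal sketch of a custom Reader following the contract described above; the class, its data, and the active SparkSession are assumptions for illustration, and the import path follows the source reference below:

from koheesio.spark.readers import Reader

class GreetingReader(Reader):
    """Hypothetical Reader that produces a one-row DataFrame."""

    def execute(self):
        # a Reader must store its result in self.output.df;
        # self.spark is the currently active SparkSession, as described above
        self.output.df = self.spark.createDataFrame([("hello",)], ["greeting"])

df = GreetingReader().read()  # read() runs execute() and returns self.output.df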
"},{"location":"api_reference/spark/readers/index.html#koheesio.spark.readers.Reader.df","title":"df property
","text":"df: Optional[DataFrame]\n
Shorthand for accessing self.output.df. If the output.df is None, .execute() will be run first
"},{"location":"api_reference/spark/readers/index.html#koheesio.spark.readers.Reader.execute","title":"execute abstractmethod
","text":"execute()\n
Execute on a Reader should handle self.output.df (output) as a minimum. Read from whichever source -> store the result in self.output.df
Source code in src/koheesio/spark/readers/__init__.py
@abstractmethod\ndef execute(self):\n \"\"\"Execute on a Reader should handle self.output.df (output) as a minimum\n Read from whichever source -> store result in self.output.df\n \"\"\"\n # self.output.df # output dataframe\n ...\n
"},{"location":"api_reference/spark/readers/index.html#koheesio.spark.readers.Reader.read","title":"read","text":"read() -> Optional[DataFrame]\n
Read from a Reader without having to call the execute() method directly
Source code in src/koheesio/spark/readers/__init__.py
def read(self) -> Optional[DataFrame]:\n \"\"\"Read from a Reader without having to call the execute() method directly\"\"\"\n self.execute()\n return self.output.df\n
"},{"location":"api_reference/spark/readers/delta.html","title":"Delta","text":"Read data from a Delta table and return a DataFrame or DataStream
Classes:
Name Description DeltaTableReader
Reads data from a Delta table and returns a DataFrame
DeltaTableStreamReader
Reads data from a Delta table and returns a DataStream
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.STREAMING_ONLY_OPTIONS","title":"koheesio.spark.readers.delta.STREAMING_ONLY_OPTIONS module-attribute
","text":"STREAMING_ONLY_OPTIONS = ['ignore_deletes', 'ignore_changes', 'starting_version', 'starting_timestamp', 'schema_tracking_location']\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.STREAMING_SCHEMA_WARNING","title":"koheesio.spark.readers.delta.STREAMING_SCHEMA_WARNING module-attribute
","text":"STREAMING_SCHEMA_WARNING = '\\nImportant!\\nAlthough you can start the streaming source from a specified version or timestamp, the schema of the streaming source is always the latest schema of the Delta table. You must ensure there is no incompatible schema change to the Delta table after the specified version or timestamp. Otherwise, the streaming source may return incorrect results when reading the data with an incorrect schema.'\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader","title":"koheesio.spark.readers.delta.DeltaTableReader","text":"Reads data from a Delta table and returns a DataFrame Delta Table can be read in batch or streaming mode It also supports reading change data feed (CDF) in both batch mode and streaming mode
Parameters:
Name Type Description Default table
Union[DeltaTableStep, str]
The table to read
required filter_cond
Optional[Union[Column, str]]
Filter condition to apply to the dataframe. Filters can be provided by using Column or string expressions. For example: f.col('state') == 'Ohio'
, state = 'Ohio'
or (col('col1') > 3) & (col('col2') < 9)
required columns
Columns to select from the table. One or many columns can be provided as strings. For example: ['col1', 'col2']
, ['col1']
or 'col1'
required streaming
Optional[bool]
Whether to read the table as a Stream or not
required read_change_feed
bool
readChangeFeed: Change Data Feed (CDF) feature allows Delta tables to track row-level changes between versions of a Delta table. When enabled on a Delta table, the runtime records 'change events' for all the data written into the table. This includes the row data along with metadata indicating whether the specified row was inserted, deleted, or updated. See: https://docs.databricks.com/delta/delta-change-data-feed.html
required starting_version
str
startingVersion: The Delta Lake version to start from. All table changes starting from this version (inclusive) will be read by the streaming source. You can obtain the commit versions from the version column of the DESCRIBE HISTORY command output.
required starting_timestamp
str
startingTimestamp: The timestamp to start from. All table changes committed at or after the timestamp (inclusive) will be read by the streaming source. Either provide a timestamp string (e.g. 2019-01-01T00:00:00.000Z) or a date string (e.g. 2019-01-01)
required ignore_deletes
bool
ignoreDeletes: Ignore transactions that delete data at partition boundaries. Note: Only supported for streaming tables. For more info see https://docs.databricks.com/structured-streaming/delta-lake.html#ignore-updates-and-deletes
required ignore_changes
bool
ignoreChanges: re-process updates if files had to be rewritten in the source table due to a data changing operation such as UPDATE, MERGE INTO, DELETE (within partitions), or OVERWRITE. Unchanged rows may still be emitted, therefore your downstream consumers should be able to handle duplicates. Deletes are not propagated downstream. ignoreChanges subsumes ignoreDeletes. Therefore if you use ignoreChanges, your stream will not be disrupted by either deletions or updates to the source table.
required"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.columns","title":"columns class-attribute
instance-attribute
","text":"columns: Optional[ListOfColumns] = Field(default=None, description=\"Columns to select from the table. One or many columns can be provided as strings. For example: `['col1', 'col2']`, `['col1']` or `'col1'` \")\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.filter_cond","title":"filter_cond class-attribute
instance-attribute
","text":"filter_cond: Optional[Union[Column, str]] = Field(default=None, alias='filterCondition', description=\"Filter condition to apply to the dataframe. Filters can be provided by using Column or string expressions For example: `f.col('state') == 'Ohio'`, `state = 'Ohio'` or `(col('col1') > 3) & (col('col2') < 9)`\")\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.ignore_changes","title":"ignore_changes class-attribute
instance-attribute
","text":"ignore_changes: bool = Field(default=False, alias='ignoreChanges', description='ignoreChanges: re-process updates if files had to be rewritten in the source table due to a data changing operation such as UPDATE, MERGE INTO, DELETE (within partitions), or OVERWRITE. Unchanged rows may still be emitted, therefore your downstream consumers should be able to handle duplicates. Deletes are not propagated downstream. ignoreChanges subsumes ignoreDeletes. Therefore if you use ignoreChanges, your stream will not be disrupted by either deletions or updates to the source table.')\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.ignore_deletes","title":"ignore_deletes class-attribute
instance-attribute
","text":"ignore_deletes: bool = Field(default=False, alias='ignoreDeletes', description='ignoreDeletes: Ignore transactions that delete data at partition boundaries. Note: Only supported for streaming tables. For more info see https://docs.databricks.com/structured-streaming/delta-lake.html#ignore-updates-and-deletes')\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.read_change_feed","title":"read_change_feed class-attribute
instance-attribute
","text":"read_change_feed: bool = Field(default=False, alias='readChangeFeed', description=\"Change Data Feed (CDF) feature allows Delta tables to track row-level changes between versions of a Delta table. When enabled on a Delta table, the runtime records 'change events' for all the data written into the table. This includes the row data along with metadata indicating whether the specified row was inserted, deleted, or updated. See: https://docs.databricks.com/delta/delta-change-data-feed.html\")\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.reader","title":"reader property
","text":"reader: Union[DataStreamReader, DataFrameReader]\n
Return the reader for the DeltaTableReader based on the streaming
attribute
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.schema_tracking_location","title":"schema_tracking_location class-attribute
instance-attribute
","text":"schema_tracking_location: Optional[str] = Field(default=None, alias='schemaTrackingLocation', description='schemaTrackingLocation: Track the location of source schema. Note: Recommend to enable Delta reader version: 3 and writer version: 7 for this option. For more info see https://docs.delta.io/latest/delta-column-mapping.html' + STREAMING_SCHEMA_WARNING)\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.skip_change_commits","title":"skip_change_commits class-attribute
instance-attribute
","text":"skip_change_commits: bool = Field(default=False, alias='skipChangeCommits', description='skipChangeCommits: Skip processing of change commits. Note: Only supported for streaming tables. (not supported in Open Source Delta Implementation). Prefer using skipChangeCommits over ignoreDeletes and ignoreChanges starting DBR12.1 and above. For more info see https://docs.databricks.com/structured-streaming/delta-lake.html#skip-change-commits')\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.starting_timestamp","title":"starting_timestamp class-attribute
instance-attribute
","text":"starting_timestamp: Optional[str] = Field(default=None, alias='startingTimestamp', description='startingTimestamp: The timestamp to start from. All table changes committed at or after the timestamp (inclusive) will be read by the streaming source. Either provide a timestamp string (e.g. 2019-01-01T00:00:00.000Z) or a date string (e.g. 2019-01-01)' + STREAMING_SCHEMA_WARNING)\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.starting_version","title":"starting_version class-attribute
instance-attribute
","text":"starting_version: Optional[str] = Field(default=None, alias='startingVersion', description='startingVersion: The Delta Lake version to start from. All table changes starting from this version (inclusive) will be read by the streaming source. You can obtain the commit versions from the version column of the DESCRIBE HISTORY command output.' + STREAMING_SCHEMA_WARNING)\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.streaming","title":"streaming class-attribute
instance-attribute
","text":"streaming: Optional[bool] = Field(default=False, description='Whether to read the table as a Stream or not')\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.table","title":"table class-attribute
instance-attribute
","text":"table: Union[DeltaTableStep, str] = Field(default=..., description='The table to read')\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.temp_view_name","title":"temp_view_name property
","text":"temp_view_name\n
Get the temporary view name for the dataframe for SQL queries
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.view","title":"view property
","text":"view\n
Create a temporary view of the dataframe for SQL queries
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/readers/delta.py
def execute(self):\n df = self.reader.table(self.table.table_name)\n if self.filter_cond is not None:\n df = df.filter(f.expr(self.filter_cond) if isinstance(self.filter_cond, str) else self.filter_cond)\n if self.columns is not None:\n df = df.select(*self.columns)\n self.output.df = df\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.get_options","title":"get_options","text":"get_options() -> Dict[str, Any]\n
Get the options for the DeltaTableReader based on the streaming
attribute
Source code in src/koheesio/spark/readers/delta.py
def get_options(self) -> Dict[str, Any]:\n \"\"\"Get the options for the DeltaTableReader based on the `streaming` attribute\"\"\"\n options = {\n # Enable Change Data Feed (CDF) feature\n \"readChangeFeed\": self.read_change_feed,\n # Initial position, one of:\n \"startingVersion\": self.starting_version,\n \"startingTimestamp\": self.starting_timestamp,\n }\n\n # Streaming only options\n if self.streaming:\n options = {\n **options,\n # Ignore updates and deletes, one of:\n \"ignoreDeletes\": self.ignore_deletes,\n \"ignoreChanges\": self.ignore_changes,\n \"skipChangeCommits\": self.skip_change_commits,\n \"schemaTrackingLocation\": self.schema_tracking_location,\n }\n # Batch only options\n else:\n pass # there are none... for now :)\n\n def normalize(v: Union[str, bool]):\n \"\"\"normalize values\"\"\"\n # True becomes \"true\", False becomes \"false\"\n v = str(v).lower() if isinstance(v, bool) else v\n return v\n\n # Any options with `value == None` are filtered out\n return {k: normalize(v) for k, v in options.items() if v is not None}\n
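To illustrate how these pieces fit together, here is a minimal usage sketch (not part of the original reference; the table name is hypothetical):
from koheesio.spark.readers.delta import DeltaTableReader\n\n# Batch read with an optional filter and column selection (table name is hypothetical)\nreader = DeltaTableReader(table=\"my_schema.my_table\", filter_cond=\"state = 'Ohio'\", columns=[\"id\", \"state\"])\ndf = reader.read()\n\n# Streaming read: streaming-only options such as ignoreDeletes are picked up by get_options()\nstream_reader = DeltaTableReader(table=\"my_schema.my_table\", streaming=True, ignore_deletes=True)\nprint(stream_reader.get_options())  # booleans are normalized to 'true'/'false'; None values are dropped\n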
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.set_temp_view_name","title":"set_temp_view_name","text":"set_temp_view_name()\n
Set a temporary view name for the dataframe for SQL queries
Source code in src/koheesio/spark/readers/delta.py
@model_validator(mode=\"after\")\ndef set_temp_view_name(self):\n \"\"\"Set a temporary view name for the dataframe for SQL queries\"\"\"\n table_name = self.table.table\n vw_name = get_random_string(prefix=f\"tmp_{table_name}\")\n self.__temp_view_name__ = vw_name\n return self\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableStreamReader","title":"koheesio.spark.readers.delta.DeltaTableStreamReader","text":"Reads data from a Delta table and returns a DataStream
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableStreamReader.streaming","title":"streaming class-attribute
instance-attribute
","text":"streaming: bool = True\n
"},{"location":"api_reference/spark/readers/dummy.html","title":"Dummy","text":"A simple DummyReader that returns a DataFrame with an id-column of the given range
"},{"location":"api_reference/spark/readers/dummy.html#koheesio.spark.readers.dummy.DummyReader","title":"koheesio.spark.readers.dummy.DummyReader","text":"A simple DummyReader that returns a DataFrame with an id-column of the given range
Can be used in place of any Reader without having to read from a real source.
Wraps SparkSession.range(). The output DataFrame will have a single column named \"id\" of type Long, with a length equal to the given range.
Parameters:
Name Type Description Default range
int
How large to make the Dataframe
required Example from koheesio.spark.readers.dummy import DummyReader\n\noutput_df = DummyReader(range=100).read()\n
output_df: Output DataFrame will have a single column named \"id\" of type Long
containing 100 rows (0-99).
id 0 1 ... 99"},{"location":"api_reference/spark/readers/dummy.html#koheesio.spark.readers.dummy.DummyReader.range","title":"range class-attribute
instance-attribute
","text":"range: int = Field(default=100, description='How large to make the Dataframe')\n
"},{"location":"api_reference/spark/readers/dummy.html#koheesio.spark.readers.dummy.DummyReader.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/readers/dummy.py
def execute(self):\n self.output.df = self.spark.range(self.range)\n
"},{"location":"api_reference/spark/readers/file_loader.html","title":"File loader","text":"Generic file Readers for different file formats.
Supported file formats: - CSV - Parquet - Avro - JSON - ORC - Text
Examples:
from koheesio.spark.readers import (\n CsvReader,\n ParquetReader,\n AvroReader,\n JsonReader,\n OrcReader,\n)\n\ncsv_reader = CsvReader(path=\"path/to/file.csv\", header=True)\nparquet_reader = ParquetReader(path=\"path/to/file.parquet\")\navro_reader = AvroReader(path=\"path/to/file.avro\")\njson_reader = JsonReader(path=\"path/to/file.json\")\norc_reader = OrcReader(path=\"path/to/file.orc\")\n
For more information about the available options, see Spark's official documentation.
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.AvroReader","title":"koheesio.spark.readers.file_loader.AvroReader","text":"Reads an Avro file.
This class is a convenience class that sets the format
field to FileFormat.avro
.
Extra parameters can be passed to the reader using the extra_params
attribute or as keyword arguments.
Example:
reader = AvroReader(path=\"path/to/file.avro\", mergeSchema=True)\n
Make sure to have the spark-avro
package installed in your environment.
For more information about the available options, see the official documentation.
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.AvroReader.format","title":"format class-attribute
instance-attribute
","text":"format: FileFormat = avro\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.CsvReader","title":"koheesio.spark.readers.file_loader.CsvReader","text":"Reads a CSV file.
This class is a convenience class that sets the format
field to FileFormat.csv
.
Extra parameters can be passed to the reader using the extra_params
attribute or as keyword arguments.
Example:
reader = CsvReader(path=\"path/to/file.csv\", header=True)\n
For more information about the available options, see the official pyspark documentation and read about CSV data source.
Also see the data sources generic options.
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.CsvReader.format","title":"format class-attribute
instance-attribute
","text":"format: FileFormat = csv\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat","title":"koheesio.spark.readers.file_loader.FileFormat","text":"Supported file formats.
This enum represents the supported file formats that can be used with the FileLoader class. The available file formats are: - csv: Comma-separated values format - parquet: Apache Parquet format - avro: Apache Avro format - json: JavaScript Object Notation format - orc: Apache ORC format - text: Plain text format
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.avro","title":"avro class-attribute
instance-attribute
","text":"avro = 'avro'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.csv","title":"csv class-attribute
instance-attribute
","text":"csv = 'csv'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.json","title":"json class-attribute
instance-attribute
","text":"json = 'json'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.orc","title":"orc class-attribute
instance-attribute
","text":"orc = 'orc'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.parquet","title":"parquet class-attribute
instance-attribute
","text":"parquet = 'parquet'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.text","title":"text class-attribute
instance-attribute
","text":"text = 'text'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader","title":"koheesio.spark.readers.file_loader.FileLoader","text":"Generic file reader.
Available file formats:\n- CSV\n- Parquet\n- Avro\n- JSON\n- ORC\n- Text (default)\n\nExtra parameters can be passed to the reader using the `extra_params` attribute or as keyword arguments.\n\nExample:\n```python\nreader = FileLoader(path=\"path/to/textfile.txt\", format=\"text\", header=True, lineSep=\"\\n\")\n```\n
For more information about the available options, see Spark's\n[official pyspark documentation](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.text.html)\nand [read about text data source](https://spark.apache.org/docs/latest/sql-data-sources-text.html).\n\nAlso see the [data sources generic options](https://spark.apache.org/docs/3.5.0/sql-data-sources-generic-options.html).\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader.format","title":"format class-attribute
instance-attribute
","text":"format: FileFormat = Field(default=text, description='File format to read')\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader.path","title":"path class-attribute
instance-attribute
","text":"path: Union[Path, str] = Field(default=..., description='Path to the file to read')\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader.schema_","title":"schema_ class-attribute
instance-attribute
","text":"schema_: Optional[Union[StructType, str]] = Field(default=None, description='Schema to use when reading the file', validate_default=False, alias='schema')\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader.ensure_path_is_str","title":"ensure_path_is_str","text":"ensure_path_is_str(v)\n
Ensure that the path is a string as required by Spark.
Source code in src/koheesio/spark/readers/file_loader.py
@field_validator(\"path\")\ndef ensure_path_is_str(cls, v):\n \"\"\"Ensure that the path is a string as required by Spark.\"\"\"\n if isinstance(v, Path):\n return str(v.absolute().as_posix())\n return v\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader.execute","title":"execute","text":"execute()\n
Reads the file using the specified format, schema, while applying any extra parameters.
Source code in src/koheesio/spark/readers/file_loader.py
def execute(self):\n \"\"\"Reads the file using the specified format, schema, while applying any extra parameters.\"\"\"\n reader = self.spark.read.format(self.format)\n\n if self.schema_:\n reader.schema(self.schema_)\n\n if self.extra_params:\n reader = reader.options(**self.extra_params)\n\n self.output.df = reader.load(self.path)\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.JsonReader","title":"koheesio.spark.readers.file_loader.JsonReader","text":"Reads a JSON file.
This class is a convenience class that sets the format
field to FileFormat.json
.
Extra parameters can be passed to the reader using the extra_params
attribute or as keyword arguments.
Example:
reader = JsonReader(path=\"path/to/file.json\", allowComments=True)\n
For more information about the available options, see the official pyspark documentation and read about JSON data source.
Also see the data sources generic options.
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.JsonReader.format","title":"format class-attribute
instance-attribute
","text":"format: FileFormat = json\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.OrcReader","title":"koheesio.spark.readers.file_loader.OrcReader","text":"Reads an ORC file.
This class is a convenience class that sets the format
field to FileFormat.orc
.
Extra parameters can be passed to the reader using the extra_params
attribute or as keyword arguments.
Example:
reader = OrcReader(path=\"path/to/file.orc\", mergeSchema=True)\n
For more information about the available options, see the official documentation and read about ORC data source.
Also see the data sources generic options.
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.OrcReader.format","title":"format class-attribute
instance-attribute
","text":"format: FileFormat = orc\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.ParquetReader","title":"koheesio.spark.readers.file_loader.ParquetReader","text":"Reads a Parquet file.
This class is a convenience class that sets the format
field to FileFormat.parquet
.
Extra parameters can be passed to the reader using the extra_params
attribute or as keyword arguments.
Example:
reader = ParquetReader(path=\"path/to/file.parquet\", mergeSchema=True)\n
For more information about the available options, see the official pyspark documentation and read about Parquet data source.
Also see the data sources generic options.
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.ParquetReader.format","title":"format class-attribute
instance-attribute
","text":"format: FileFormat = parquet\n
"},{"location":"api_reference/spark/readers/hana.html","title":"Hana","text":"HANA reader.
"},{"location":"api_reference/spark/readers/hana.html#koheesio.spark.readers.hana.HanaReader","title":"koheesio.spark.readers.hana.HanaReader","text":"Wrapper around JdbcReader for SAP HANA
Notes - Refer to JdbcReader for the list of all available parameters.
- Refer to SAP HANA Client Interface Programming Reference docs for the list of all available connection string parameters: https://help.sap.com/docs/SAP_HANA_CLIENT/f1b440ded6144a54ada97ff95dac7adf/109397c2206a4ab2a5386d494f4cf75e.html
Example Note: jars should be added to the Spark session manually. This class does not take care of that.
This example depends on the SAP HANA ngdbc
JAR. e.g. ngdbc-2.5.49.
from koheesio.spark.readers.hana import HanaReader\njdbc_hana = HanaReader(\n    url=\"jdbc:sap://<domain_or_ip>:<port>/?<options>\",\n    user=\"YOUR_USERNAME\",\n    password=\"***\",\n    dbtable=\"schemaname.tablename\"\n)\ndf = jdbc_hana.read()\n
Parameters:
Name Type Description Default url
str
JDBC connection string. Refer to SAP HANA docs for the list of all available connection string parameters. Example: jdbc:sap://<domain_or_ip>:<port>[/?<options>] required user
str
required password
SecretStr
required dbtable
str
Database table name, also include schema name
required options
Optional[Dict[str, Any]]
Extra options to pass to the SAP HANA JDBC driver. Refer to SAP HANA docs for the list of all available connection string parameters. Example: {\"fetchsize\": 2000, \"numPartitions\": 10}
required query
Optional[str]
Query
required format
str
The type of format to load. Defaults to 'jdbc'. Should not be changed.
required driver
str
Driver name. Be aware that the driver jar needs to be passed to the task. Should not be changed.
required"},{"location":"api_reference/spark/readers/hana.html#koheesio.spark.readers.hana.HanaReader.driver","title":"driver class-attribute
instance-attribute
","text":"driver: str = Field(default='com.sap.db.jdbc.Driver', description='Make sure that the necessary JARs are available in the cluster: ngdbc-2-x.x.x.x')\n
"},{"location":"api_reference/spark/readers/hana.html#koheesio.spark.readers.hana.HanaReader.options","title":"options class-attribute
instance-attribute
","text":"options: Optional[Dict[str, Any]] = Field(default={'fetchsize': 2000, 'numPartitions': 10}, description='Extra options to pass to the SAP HANA JDBC driver')\n
"},{"location":"api_reference/spark/readers/jdbc.html","title":"Jdbc","text":"Module for reading data from JDBC sources.
Classes:
Name Description JdbcReader
Reader for JDBC tables.
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader","title":"koheesio.spark.readers.jdbc.JdbcReader","text":"Reader for JDBC tables.
Wrapper around Spark's jdbc read format
Notes - Query has precedence over dbtable. If query and dbtable both are filled in, dbtable will be ignored!
- Extra options to the spark reader can be passed through the
options
input. Refer to Spark documentation for details: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html - Consider using
fetchsize
as one of the options, as it greatly increases the performance of the reader - Consider using
numPartitions
, partitionColumn
, lowerBound
, upperBound
together with a real or synthetic partitioning column, as this will improve the reader performance
When implementing a JDBC reader, the get_options()
method should be implemented. The method should return a dict of options required for the specific JDBC driver. The get_options()
method can be overridden in the child class. Additionally, the driver
parameter should be set to the name of the JDBC driver. Be aware that the driver jar needs to be included in the Spark session; this class does not (and can not) take care of that!
Example Note: jars should be added to the Spark session manually. This class does not take care of that.
This example depends on the jar for MS SQL: https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/9.2.1.jre8/mssql-jdbc-9.2.1.jre8.jar
from koheesio.spark.readers.jdbc import JdbcReader\n\njdbc_mssql = JdbcReader(\n driver=\"com.microsoft.sqlserver.jdbc.SQLServerDriver\",\n url=\"jdbc:sqlserver://10.xxx.xxx.xxx:1433;databaseName=YOUR_DATABASE\",\n user=\"YOUR_USERNAME\",\n password=\"***\",\n dbtable=\"schemaname.tablename\",\n options={\"fetchsize\": 100},\n)\ndf = jdbc_mssql.read()\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.dbtable","title":"dbtable class-attribute
instance-attribute
","text":"dbtable: Optional[str] = Field(default=None, description='Database table name, also include schema name')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.driver","title":"driver class-attribute
instance-attribute
","text":"driver: str = Field(default=..., description='Driver name. Be aware that the driver jar needs to be passed to the task')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.format","title":"format class-attribute
instance-attribute
","text":"format: str = Field(default='jdbc', description=\"The type of format to load. Defaults to 'jdbc'.\")\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.options","title":"options class-attribute
instance-attribute
","text":"options: Optional[Dict[str, Any]] = Field(default_factory=dict, description='Extra options to pass to spark reader')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.password","title":"password class-attribute
instance-attribute
","text":"password: SecretStr = Field(default=..., description='Password belonging to the username')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.query","title":"query class-attribute
instance-attribute
","text":"query: Optional[str] = Field(default=None, description='Query')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.url","title":"url class-attribute
instance-attribute
","text":"url: str = Field(default=..., description='URL for the JDBC driver. Note, in some environments you need to use the IP Address instead of the hostname of the server.')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.user","title":"user class-attribute
instance-attribute
","text":"user: str = Field(default=..., description='User to authenticate to the server')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.execute","title":"execute","text":"execute()\n
Wrapper around Spark's jdbc read format
Source code in src/koheesio/spark/readers/jdbc.py
def execute(self):\n \"\"\"Wrapper around Spark's jdbc read format\"\"\"\n\n # Can't have both dbtable and query empty\n if not self.dbtable and not self.query:\n raise ValueError(\"Please do not leave dbtable and query both empty!\")\n\n if self.query and self.dbtable:\n self.log.info(\"Both 'query' and 'dbtable' are filled in, 'dbtable' will be ignored!\")\n\n options = self.get_options()\n\n if pw := self.password:\n options[\"password\"] = pw.get_secret_value()\n\n if query := self.query:\n options[\"query\"] = query\n self.log.info(f\"Executing query: {self.query}\")\n else:\n options[\"dbtable\"] = self.dbtable\n\n self.output.df = self.spark.read.format(self.format).options(**options).load()\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.get_options","title":"get_options","text":"get_options()\n
Dictionary of options required for the specific JDBC driver.
Note: override this method if driver requires custom names, e.g. Snowflake: sfUrl
, sfUser
, etc.
Source code in src/koheesio/spark/readers/jdbc.py
def get_options(self):\n \"\"\"\n Dictionary of options required for the specific JDBC driver.\n\n Note: override this method if driver requires custom names, e.g. Snowflake: `sfUrl`, `sfUser`, etc.\n \"\"\"\n return {\n \"driver\": self.driver,\n \"url\": self.url,\n \"user\": self.user,\n \"password\": self.password,\n **self.options,\n }\n
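As a hedged sketch of the note above (the class and option names are illustrative, not an existing Koheesio reader), a child class could override get_options() to rename the keys that its driver expects:
from koheesio.spark.readers.jdbc import JdbcReader\n\n\nclass MyCustomJdbcReader(JdbcReader):\n    \"\"\"Hypothetical reader for a driver that expects custom option names.\"\"\"\n\n    driver: str = \"com.example.jdbc.Driver\"  # illustrative driver class name\n\n    def get_options(self):\n        # rename the generic keys to whatever this driver expects\n        return {\n            \"customUrl\": self.url,\n            \"customUser\": self.user,\n            \"password\": self.password,\n            \"driver\": self.driver,\n            **self.options,\n        }\n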
"},{"location":"api_reference/spark/readers/kafka.html","title":"Kafka","text":"Module for KafkaReader and KafkaStreamReader.
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader","title":"koheesio.spark.readers.kafka.KafkaReader","text":"Reader for Kafka topics.
Wrapper around Spark's kafka read format. Supports both batch and streaming reads.
Parameters:
Name Type Description Default read_broker
str
Kafka brokers to read from. Should be passed as a single string with multiple brokers passed in a comma separated list
required topic
str
Kafka topic to consume.
required streaming
Optional[bool]
Whether to read the kafka topic as a stream or not.
required params
Optional[Dict[str, str]]
Arbitrary options to be applied when creating NSP Reader. If a user provides values for subscribe
or kafka.bootstrap.servers
, they will be ignored in favor of configuration passed through topic
and read_broker
respectively. Defaults to an empty dictionary.
required Notes - The
read_broker
and topic
parameters are required. - The
streaming
parameter defaults to False
. - The
params
parameter defaults to an empty dictionary. This parameter is also aliased as kafka_options
. - Any extra kafka options can also be passed as keyword arguments; these will be merged with the
params
parameter
Example from koheesio.spark.readers.kafka import KafkaReader\n\nkafka_reader = KafkaReader(\n read_broker=\"kafka-broker-1:9092,kafka-broker-2:9092\",\n topic=\"my-topic\",\n streaming=True,\n # extra kafka options can be passed as key-word arguments\n startingOffsets=\"earliest\",\n)\n
In the example above, the KafkaReader
will read from the my-topic
Kafka topic, using the brokers kafka-broker-1:9092
and kafka-broker-2:9092
. The reader will read the topic as a stream and will start reading from the earliest available offset.
The stream can be started by calling the read
or execute
method on the kafka_reader
object.
Note: The KafkaStreamReader
could be used in the example above to achieve the same result. streaming
would default to True
in that case and could be omitted from the parameters.
See Also - Official Spark Documentation: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.batch_reader","title":"batch_reader property
","text":"batch_reader\n
Returns the Spark read object for batch processing.
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.logged_option_keys","title":"logged_option_keys property
","text":"logged_option_keys\n
Keys that are allowed to be logged for the options.
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.options","title":"options property
","text":"options\n
Merge fixed parameters with arbitrary options provided by user.
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.params","title":"params class-attribute
instance-attribute
","text":"params: Optional[Dict[str, str]] = Field(default_factory=dict, alias='kafka_options', description=\"Arbitrary options to be applied when creating NSP Reader. If a user provides values for 'subscribe' or 'kafka.bootstrap.servers', they will be ignored in favor of configuration passed through 'topic' and 'read_broker' respectively.\")\n
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.read_broker","title":"read_broker class-attribute
instance-attribute
","text":"read_broker: str = Field(..., description='Kafka brokers to read from, should be passed as a single string with multiple brokers passed in a comma separated list')\n
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.reader","title":"reader property
","text":"reader\n
Returns the appropriate reader based on the streaming flag.
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.stream_reader","title":"stream_reader property
","text":"stream_reader\n
Returns the Spark readStream object.
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.streaming","title":"streaming class-attribute
instance-attribute
","text":"streaming: Optional[bool] = Field(default=False, description='Whether to read the kafka topic as a stream or not. Defaults to False.')\n
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.topic","title":"topic class-attribute
instance-attribute
","text":"topic: str = Field(default=..., description='Kafka topic to consume.')\n
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/readers/kafka.py
def execute(self):\n applied_options = {k: v for k, v in self.options.items() if k in self.logged_option_keys}\n self.log.debug(f\"Applying options {applied_options}\")\n\n self.output.df = self.reader.format(\"kafka\").options(**self.options).load()\n
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaStreamReader","title":"koheesio.spark.readers.kafka.KafkaStreamReader","text":"KafkaStreamReader is a KafkaReader that reads data as a stream
This class is identical to KafkaReader, with the streaming
parameter defaulting to True
.
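A minimal sketch (broker and topic are placeholders):
from koheesio.spark.readers.kafka import KafkaStreamReader\n\n# extra kafka options can still be passed as keyword arguments\nstream_reader = KafkaStreamReader(read_broker=\"kafka-broker-1:9092\", topic=\"my-topic\", startingOffsets=\"earliest\")\ndf = stream_reader.read()\n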
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaStreamReader.streaming","title":"streaming class-attribute
instance-attribute
","text":"streaming: bool = True\n
"},{"location":"api_reference/spark/readers/memory.html","title":"Memory","text":"Create Spark DataFrame directly from the data stored in a Python variable
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.DataFormat","title":"koheesio.spark.readers.memory.DataFormat","text":"Data formats supported by the InMemoryDataReader
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.DataFormat.CSV","title":"CSV class-attribute
instance-attribute
","text":"CSV = 'csv'\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.DataFormat.JSON","title":"JSON class-attribute
instance-attribute
","text":"JSON = 'json'\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader","title":"koheesio.spark.readers.memory.InMemoryDataReader","text":"Directly read data from a Python variable and convert it to a Spark DataFrame.
Read data that is stored in one of the supported formats (see DataFormat
) directly from a Python variable and convert it to a Spark DataFrame. Use cases include converting the JSON output of an API into a DataFrame, or reading CSV data received through an API (e.g. the Box API).
The advantage of using this reader is that it reads the data directly from the Python variable, without the need to store it on disk first. This can be useful when the data is small and does not need to be stored permanently.
Parameters:
Name Type Description Default data
Union[str, list, dict, bytes]
Source data
required format
DataFormat
File / data format
required schema_
Optional[StructType]
Schema that will be applied during the creation of Spark DataFrame
None
params
Optional[Dict[str, Any]]
Set of extra parameters that should be passed to the appropriate reader (csv / json). Optionally, the user can pass the parameters that are specific to the reader (e.g. multiLine
for JSON reader) as keyword arguments. These will be merged with the params
parameter.
dict
Example # Read CSV data from a string\ndf1 = InMemoryDataReader(format=DataFormat.CSV, data='foo,bar\nA,1\nB,2')\n\n# Read JSON data from a string\ndf2 = InMemoryDataReader(format=DataFormat.JSON, data='{\"foo\": \"A\", \"bar\": 1}')\n\n# Read JSON data from a list of JSON strings\ndf3 = InMemoryDataReader(format=DataFormat.JSON, data=['{\"foo\": \"A\", \"bar\": 1}', '{\"foo\": \"B\", \"bar\": 2}'])\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader.data","title":"data class-attribute
instance-attribute
","text":"data: Union[str, list, dict, bytes] = Field(default=..., description='Source data')\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader.format","title":"format class-attribute
instance-attribute
","text":"format: DataFormat = Field(default=..., description='File / data format')\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader.params","title":"params class-attribute
instance-attribute
","text":"params: Optional[Dict[str, Any]] = Field(default_factory=dict, description='[Optional] Set of extra parameters that should be passed to the appropriate reader (csv / json)')\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader.schema_","title":"schema_ class-attribute
instance-attribute
","text":"schema_: Optional[StructType] = Field(default=None, alias='schema', description='[Optional] Schema that will be applied during the creation of Spark DataFrame')\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader.execute","title":"execute","text":"execute()\n
Execute method appropriate to the specific data format
Source code in src/koheesio/spark/readers/memory.py
def execute(self):\n \"\"\"\n Execute method appropriate to the specific data format\n \"\"\"\n _func = getattr(InMemoryDataReader, f\"_{self.format}\")\n _df = partial(_func, self, self._rdd)()\n self.output.df = _df\n
"},{"location":"api_reference/spark/readers/metastore.html","title":"Metastore","text":"Create Spark DataFrame from table in Metastore
"},{"location":"api_reference/spark/readers/metastore.html#koheesio.spark.readers.metastore.MetastoreReader","title":"koheesio.spark.readers.metastore.MetastoreReader","text":"Reader for tables/views from Spark Metastore
Parameters:
Name Type Description Default table
str
Table name in spark metastore
required"},{"location":"api_reference/spark/readers/metastore.html#koheesio.spark.readers.metastore.MetastoreReader.table","title":"table class-attribute
instance-attribute
","text":"table: str = Field(default=..., description='Table name in spark metastore')\n
"},{"location":"api_reference/spark/readers/metastore.html#koheesio.spark.readers.metastore.MetastoreReader.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/readers/metastore.py
def execute(self):\n self.output.df = self.spark.table(self.table)\n
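For illustration (the table name is hypothetical), reading a metastore table is a one-liner:
from koheesio.spark.readers.metastore import MetastoreReader\n\ndf = MetastoreReader(table=\"my_database.my_table\").read()\n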
"},{"location":"api_reference/spark/readers/rest_api.html","title":"Rest api","text":"This module provides the RestApiReader class for interacting with RESTful APIs.
The RestApiReader class is designed to fetch data from RESTful APIs and store the response in a DataFrame. It supports different transports, e.g. Paginated Http or Async HTTP. The main entry point is the execute
method, which performs the transport.execute() call and provides the data from the API calls.
For more details on how to use this class and its methods, refer to the class docstring.
"},{"location":"api_reference/spark/readers/rest_api.html#koheesio.spark.readers.rest_api.RestApiReader","title":"koheesio.spark.readers.rest_api.RestApiReader","text":"A reader class that executes an API call and stores the response in a DataFrame.
Parameters:
Name Type Description Default transport
Union[InstanceOf[AsyncHttpGetStep], InstanceOf[HttpGetStep]]
The HTTP transport step.
required spark_schema
Union[str, StructType, List[str], Tuple[str, ...], AtomicType]
The pyspark schema of the response.
required Attributes:
Name Type Description transport
Union[InstanceOf[AsyncHttpGetStep], InstanceOf[HttpGetStep]]
The HTTP transport step.
spark_schema
Union[str, StructType, List[str], Tuple[str, ...], AtomicType]
The pyspark schema of the response.
Returns:
Type Description Output
The output of the reader, which includes the DataFrame.
Examples:
Here are some examples of how to use this class:
Example 1: Paginated Transport
import requests\nfrom requests.adapters import HTTPAdapter\nfrom urllib3 import Retry\n\nfrom koheesio.steps.http import HttpGetStep, PaginatedHtppGetStep\nfrom koheesio.spark.readers.rest_api import RestApiReader\n\nmax_retries = 3\nsession = requests.Session()\nretry_logic = Retry(total=max_retries, status_forcelist=[503])\nsession.mount(\"https://\", HTTPAdapter(max_retries=retry_logic))\nsession.mount(\"http://\", HTTPAdapter(max_retries=retry_logic))\n\ntransport = PaginatedHtppGetStep(\n    url=\"https://api.example.com/data?page={page}\",\n    paginate=True,\n    pages=3,\n    session=session,\n)\ntask = RestApiReader(transport=transport, spark_schema=\"id: int, page:int, value: string\")\ntask.execute()\nall_data = [row.asDict() for row in task.output.df.collect()]\n
Example 2: Async Transport
from aiohttp import ClientSession, TCPConnector\nfrom aiohttp_retry import ExponentialRetry\nfrom yarl import URL\n\nfrom koheesio.steps.asyncio.http import AsyncHttpGetStep\nfrom koheesio.spark.readers.rest_api import RestApiReader\n\nsession = ClientSession()\nurls = [URL(\"http://httpbin.org/get\"), URL(\"http://httpbin.org/get\")]\nretry_options = ExponentialRetry()\nconnector = TCPConnector(limit=10)\ntransport = AsyncHttpGetStep(\n client_session=session,\n url=urls,\n retry_options=retry_options,\n connector=connector,\n)\n\ntask = RestApiReader(transport=transport, spark_schema=\"id: int, page:int, value: string\")\ntask.execute()\nall_data = [row.asDict() for row in task.output.df.collect()]\n
"},{"location":"api_reference/spark/readers/rest_api.html#koheesio.spark.readers.rest_api.RestApiReader.spark_schema","title":"spark_schema class-attribute
instance-attribute
","text":"spark_schema: Union[str, StructType, List[str], Tuple[str, ...], AtomicType] = Field(..., description='The pyspark schema of the response')\n
"},{"location":"api_reference/spark/readers/rest_api.html#koheesio.spark.readers.rest_api.RestApiReader.transport","title":"transport class-attribute
instance-attribute
","text":"transport: Union[InstanceOf[AsyncHttpGetStep], InstanceOf[HttpGetStep]] = Field(..., description='HTTP transport step', exclude=True)\n
"},{"location":"api_reference/spark/readers/rest_api.html#koheesio.spark.readers.rest_api.RestApiReader.execute","title":"execute","text":"execute() -> Output\n
Executes the API call and stores the response in a DataFrame.
Returns:
Type Description Output
The output of the reader, which includes the DataFrame.
Source code in src/koheesio/spark/readers/rest_api.py
def execute(self) -> Reader.Output:\n \"\"\"\n Executes the API call and stores the response in a DataFrame.\n\n Returns\n -------\n Reader.Output\n The output of the reader, which includes the DataFrame.\n \"\"\"\n raw_data = self.transport.execute()\n\n if isinstance(raw_data, HttpGetStep.Output):\n data = raw_data.response_json\n elif isinstance(raw_data, AsyncHttpGetStep.Output):\n data = [d for d, _ in raw_data.responses_urls] # type: ignore\n\n if data:\n self.output.df = self.spark.createDataFrame(data=data, schema=self.spark_schema) # type: ignore\n
"},{"location":"api_reference/spark/readers/snowflake.html","title":"Snowflake","text":"Module containing Snowflake reader classes.
This module contains classes for reading data from Snowflake. The classes are used to create a Spark DataFrame from a Snowflake table or a query.
Classes:
Name Description SnowflakeReader
Reader for Snowflake tables.
Query
Reader for Snowflake queries.
DbTableQuery
Reader for Snowflake queries that return a single row.
Notes The classes are defined in the koheesio.steps.integrations.snowflake module; this module simply inherits from the classes defined there.
See Also - koheesio.spark.readers.Reader Base class for all Readers.
- koheesio.steps.integrations.snowflake Module containing Snowflake classes.
More detailed class descriptions can be found in the class docstrings.
"},{"location":"api_reference/spark/readers/spark_sql_reader.html","title":"Spark sql reader","text":"This module contains the SparkSqlReader class which reads the SparkSQL compliant query and returns the dataframe.
"},{"location":"api_reference/spark/readers/spark_sql_reader.html#koheesio.spark.readers.spark_sql_reader.SparkSqlReader","title":"koheesio.spark.readers.spark_sql_reader.SparkSqlReader","text":"SparkSqlReader reads the SparkSQL compliant query and returns the dataframe.
This SQL can originate from a string or a file and may contain placeholders (parameters) for templating. - Placeholders are identified with ${placeholder}. - Placeholders can be passed as explicit params (params) or as implicit params (kwargs).
Example SQL script (example.sql):
SELECT id, id + 1 AS incremented_id, ${dynamic_column} AS extra_column\nFROM ${table_name}\n
Python code:
from koheesio.spark.readers import SparkSqlReader\n\nreader = SparkSqlReader(\n    sql_path=\"example.sql\",\n    # params can also be passed as kwargs\n    dynamic_column=\"name\",\n    table_name=\"my_table\",\n)\nreader.execute()\n
In this example, the SQL script is read from a file and the placeholders are replaced with the given params. The resulting SQL query is:
SELECT id, id + 1 AS incremented_id, name AS extra_column\nFROM my_table\n
The query is then executed and the resulting DataFrame is stored in the output.df
attribute.
Parameters:
Name Type Description Default sql_path
str or Path
Path to a SQL file
required sql
str
SQL query to execute
required params
dict
Placeholders (parameters) for templating. These are identified with ${placeholder} in the SQL script.
required Notes Any arbitrary kwargs passed to the class will be added to params.
"},{"location":"api_reference/spark/readers/spark_sql_reader.html#koheesio.spark.readers.spark_sql_reader.SparkSqlReader.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/readers/spark_sql_reader.py
def execute(self):\n self.output.df = self.spark.sql(self.query)\n
"},{"location":"api_reference/spark/readers/teradata.html","title":"Teradata","text":"Teradata reader.
"},{"location":"api_reference/spark/readers/teradata.html#koheesio.spark.readers.teradata.TeradataReader","title":"koheesio.spark.readers.teradata.TeradataReader","text":"Wrapper around JdbcReader for Teradata.
Notes - Consider using synthetic partitioning column when using partitioned read:
MOD(HASHBUCKET(HASHROW(<TABLE>.<COLUMN>)), <NUM_PARTITIONS>)
- Relevant jars should be added to the Spark session manually. This class does not take care of that.
See Also - Refer to JdbcReader for the list of all available parameters.
- Refer to Teradata docs for the list of all available connection string parameters: https://teradata-docs.s3.amazonaws.com/doc/connectivity/jdbc/reference/current/jdbcug_chapter_2.html#BABJIHBJ
Example This example depends on the Teradata terajdbc4
JAR. e.g. terajdbc4-17.20.00.15. Keep in mind that older versions of terajdbc4
drivers also require tdgssconfig
JAR.
from koheesio.spark.readers.teradata import TeradataReader\n\ntd = TeradataReader(\n url=\"jdbc:teradata://<domain_or_ip>/logmech=ldap,charset=utf8,database=<db>,type=fastexport, maybenull=on\",\n user=\"YOUR_USERNAME\",\n password=\"***\",\n dbtable=\"schemaname.tablename\",\n)\n
Parameters:
Name Type Description Default url
str
JDBC connection string. Refer to Teradata docs for the list of all available connection string parameters. Example: jdbc:teradata://<domain_or_ip>/logmech=ldap,charset=utf8,database=<db>,type=fastexport, maybenull=on
required user
str
Username
required password
SecretStr
Password
required dbtable
str
Database table name, also include schema name
required options
Optional[Dict[str, Any]]
Extra options to pass to the Teradata JDBC driver. Refer to Teradata docs for the list of all available connection string parameters.
{\"fetchsize\": 2000, \"numPartitions\": 10}
query
Optional[str]
Query
None
format
str
The type of format to load. Defaults to 'jdbc'. Should not be changed.
required driver
str
Driver name. Be aware that the driver jar needs to be passed to the task. Should not be changed.
required"},{"location":"api_reference/spark/readers/teradata.html#koheesio.spark.readers.teradata.TeradataReader.driver","title":"driver class-attribute
instance-attribute
","text":"driver: str = Field('com.teradata.jdbc.TeraDriver', description='Make sure that the necessary JARs are available in the cluster: terajdbc4-x.x.x.x')\n
"},{"location":"api_reference/spark/readers/teradata.html#koheesio.spark.readers.teradata.TeradataReader.options","title":"options class-attribute
instance-attribute
","text":"options: Optional[Dict[str, Any]] = Field({'fetchsize': 2000, 'numPartitions': 10}, description='Extra options to pass to the Teradata JDBC driver')\n
"},{"location":"api_reference/spark/readers/databricks/index.html","title":"Databricks","text":""},{"location":"api_reference/spark/readers/databricks/autoloader.html","title":"Autoloader","text":"Read from a location using Databricks' autoloader
Autoloader can ingest JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE file formats.
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader","title":"koheesio.spark.readers.databricks.autoloader.AutoLoader","text":"Read from a location using Databricks' autoloader
Autoloader can ingest JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE file formats.
Notes autoloader
is a Spark Structured Streaming
function!
Although most transformations are compatible with Spark Structured Streaming
, not all of them are. As a result, be mindful with your downstream transformations.
Parameters:
Name Type Description Default format
Union[str, AutoLoaderFormat]
The file format, used in cloudFiles.format
. Autoloader supports JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE file formats.
required location
str
The location where the files are located, used in cloudFiles.location
required schema_location
str
The location for storing inferred schema and supporting schema evolution, used in cloudFiles.schemaLocation
.
required options
Optional[Dict[str, str]]
Extra inputs to provide to the autoloader. For a full list of inputs, see https://docs.databricks.com/ingestion/auto-loader/options.html
{}
Example from koheesio.spark.readers.databricks import AutoLoader, AutoLoaderFormat\n\nresult_df = AutoLoader(\n format=AutoLoaderFormat.JSON,\n location=\"some_s3_path\",\n schema_location=\"other_s3_path\",\n options={\"multiLine\": \"true\"},\n).read()\n
See Also Some other useful documentation:
- autoloader: https://docs.databricks.com/ingestion/auto-loader/index.html
- Spark Structured Streaming: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.format","title":"format class-attribute
instance-attribute
","text":"format: Union[str, AutoLoaderFormat] = Field(default=..., description=__doc__)\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.location","title":"location class-attribute
instance-attribute
","text":"location: str = Field(default=..., description='The location where the files are located, used in `cloudFiles.location`')\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.options","title":"options class-attribute
instance-attribute
","text":"options: Optional[Dict[str, str]] = Field(default_factory=dict, description='Extra inputs to provide to the autoloader. For a full list of inputs, see https://docs.databricks.com/ingestion/auto-loader/options.html')\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.schema_location","title":"schema_location class-attribute
instance-attribute
","text":"schema_location: str = Field(default=..., alias='schemaLocation', description='The location for storing inferred schema and supporting schema evolution, used in `cloudFiles.schemaLocation`.')\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.execute","title":"execute","text":"execute()\n
Reads from the given location with the given options using Autoloader
Source code in src/koheesio/spark/readers/databricks/autoloader.py
def execute(self):\n \"\"\"Reads from the given location with the given options using Autoloader\"\"\"\n self.output.df = self.reader().load(self.location)\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.get_options","title":"get_options","text":"get_options()\n
Get the options for the autoloader
Source code in src/koheesio/spark/readers/databricks/autoloader.py
def get_options(self):\n \"\"\"Get the options for the autoloader\"\"\"\n self.options.update(\n {\n \"cloudFiles.format\": self.format,\n \"cloudFiles.schemaLocation\": self.schema_location,\n }\n )\n return self.options\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.reader","title":"reader","text":"reader()\n
Return the reader for the autoloader
Source code in src/koheesio/spark/readers/databricks/autoloader.py
def reader(self):\n \"\"\"Return the reader for the autoloader\"\"\"\n return self.spark.readStream.format(\"cloudFiles\").options(**self.get_options())\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.validate_format","title":"validate_format","text":"validate_format(format_specified)\n
Validate format
value
Source code in src/koheesio/spark/readers/databricks/autoloader.py
@field_validator(\"format\")\ndef validate_format(cls, format_specified):\n \"\"\"Validate `format` value\"\"\"\n if isinstance(format_specified, str):\n if format_specified.upper() in [f.value.upper() for f in AutoLoaderFormat]:\n format_specified = getattr(AutoLoaderFormat, format_specified.upper())\n return str(format_specified.value)\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat","title":"koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat","text":"The file format, used in cloudFiles.format
Autoloader supports JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE file formats.
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.AVRO","title":"AVRO class-attribute
instance-attribute
","text":"AVRO = 'avro'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.BINARYFILE","title":"BINARYFILE class-attribute
instance-attribute
","text":"BINARYFILE = 'binaryfile'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.CSV","title":"CSV class-attribute
instance-attribute
","text":"CSV = 'csv'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.JSON","title":"JSON class-attribute
instance-attribute
","text":"JSON = 'json'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.ORC","title":"ORC class-attribute
instance-attribute
","text":"ORC = 'orc'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.PARQUET","title":"PARQUET class-attribute
instance-attribute
","text":"PARQUET = 'parquet'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.TEXT","title":"TEXT class-attribute
instance-attribute
","text":"TEXT = 'text'\n
"},{"location":"api_reference/spark/transformations/index.html","title":"Transformations","text":"This module contains the base classes for all transformations.
See class docstrings for more information.
References For a comprehensive guide on the usage, examples, and additional features of Transformation classes, please refer to the reference/concepts/steps/transformations section of the Koheesio documentation.
Classes:
Name Description Transformation
Base class for all transformations
ColumnsTransformation
Extended Transformation class with a preset validator for handling column(s) data
ColumnsTransformationWithTarget
Extended ColumnsTransformation class with an additional target_column
field
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation","title":"koheesio.spark.transformations.ColumnsTransformation","text":"Extended Transformation class with a preset validator for handling column(s) data with a standardized input for a single column or multiple columns.
Concept A ColumnsTransformation is a Transformation with a standardized input for column or columns. The columns
are stored as a list. Either a single string, or a list of strings can be passed to enter the columns
. column
and columns
are aliases to one another - internally the name columns
should be used though.
columns
are stored as a list - either a single string, or a list of strings can be passed to enter the
columns
column
and columns
are aliases to one another - internally the name columns
should be used though.
If more than one column is passed, the behavior of the Class changes this way: - the transformation will be run in a loop against all the given columns
Configuring the ColumnsTransformation The ColumnsTransformation class has a ColumnConfig
class that can be used to configure the behavior of the class. This class has the following fields: - run_for_all_data_type
allows to run the transformation for all columns of a given type.
-
limit_data_type
allows to limit the transformation to a specific data type.
-
data_type_strict_mode
Toggles strict mode for data type validation. Will only work if limit_data_type
is set.
Note that Data types need to be specified as a SparkDatatype enum.
See the docstrings of the ColumnConfig
class for more information. See the SparkDatatype enum for a list of available data types.
Users should not have to interact with the ColumnConfig
class directly.
Parameters:
Name Type Description Default columns
The column (or list of columns) to apply the transformation to. Alias: column
required Example from pyspark.sql import functions as f\nfrom koheesio.steps.transformations import ColumnsTransformation\n\n\nclass AddOne(ColumnsTransformation):\n def execute(self):\n for column in self.get_columns():\n self.output.df = self.df.withColumn(column, f.col(column) + 1)\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.columns","title":"columns class-attribute
instance-attribute
","text":"columns: ListOfColumns = Field(default='', alias='column', description='The column (or list of columns) to apply the transformation to. Alias: column')\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.data_type_strict_mode_is_set","title":"data_type_strict_mode_is_set property
","text":"data_type_strict_mode_is_set: bool\n
Returns True if data_type_strict_mode is set
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.limit_data_type_is_set","title":"limit_data_type_is_set property
","text":"limit_data_type_is_set: bool\n
Returns True if limit_data_type is set
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.run_for_all_is_set","title":"run_for_all_is_set property
","text":"run_for_all_is_set: bool\n
Returns True if the transformation should be run for all columns of a given type
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.ColumnConfig","title":"ColumnConfig","text":"Koheesio ColumnsTransformation specific Config
Parameters:
Name Type Description Default run_for_all_data_type
allows running the transformation for all columns of a given type. A user can trigger this behavior by either omitting the columns
parameter or by passing a single *
as a column name. In both cases, the run_for_all_data_type
will be used to determine the data type. Value should be passed as a SparkDatatype enum. (default: [None])
required limit_data_type
allows limiting the transformation to a specific data type. Value should be passed as a SparkDatatype enum. (default: [None])
required data_type_strict_mode
Toggles strict mode for data type validation. Will only work if limit_data_type
is set. - when True, a ValueError will be raised if any column does not adhere to the limit_data_type
- when False, a warning will be logged and the column will be skipped instead (default: False)
required"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.ColumnConfig.data_type_strict_mode","title":"data_type_strict_mode class-attribute
instance-attribute
","text":"data_type_strict_mode: bool = False\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type: Optional[List[SparkDatatype]] = [None]\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type: Optional[List[SparkDatatype]] = [None]\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.column_type_of_col","title":"column_type_of_col","text":"column_type_of_col(col: Union[str, Column], df: Optional[DataFrame] = None, simple_return_mode: bool = True) -> Union[DataType, str]\n
Returns the dataType of a Column object as a string.
The Column object does not have a type attribute, so we have to ask the DataFrame for its schema and find the type based on the column name. We retrieve the name of the column from the Column object by calling toString() via the JVM.
Examples:
input_df:
| str_column | int_column |
|------------|------------|
| hello      | 1          |
| world      | 2          |
# using the AddOne transformation from the example above\nadd_one = AddOne(\n columns=[\"str_column\", \"int_column\"],\n df=input_df,\n)\nadd_one.column_type_of_col(\"str_column\") # returns \"string\"\nadd_one.column_type_of_col(\"int_column\") # returns \"integer\"\n# returns IntegerType\nadd_one.column_type_of_col(\"int_column\", simple_return_mode=False)\n
Parameters:
Name Type Description Default col
Union[str, Column]
The column to check the type of
required df
Optional[DataFrame]
The DataFrame belonging to the column. If not provided, the DataFrame passed to the constructor will be used.
None
simple_return_mode
bool
If True, the return value will be a simple string. If False, the return value will be a Spark DataType object.
True
Returns:
Name Type Description datatype
str
The type of the column as a string (or as a DataType object when simple_return_mode is False)
Source code in src/koheesio/spark/transformations/__init__.py
def column_type_of_col(\n self, col: Union[str, Column], df: Optional[DataFrame] = None, simple_return_mode: bool = True\n) -> Union[DataType, str]:\n \"\"\"\n Returns the dataType of a Column object as a string.\n\n The Column object does not have a type attribute, so we have to ask the DataFrame its schema and find the type\n based on the column name. We retrieve the name of the column from the Column object by calling toString() from\n the JVM.\n\n Examples\n --------\n __input_df:__\n | str_column | int_column |\n |------------|------------|\n | hello | 1 |\n | world | 2 |\n\n ```python\n # using the AddOne transformation from the example above\n add_one = AddOne(\n columns=[\"str_column\", \"int_column\"],\n df=input_df,\n )\n add_one.column_type_of_col(\"str_column\") # returns \"string\"\n add_one.column_type_of_col(\"int_column\") # returns \"integer\"\n # returns IntegerType\n add_one.column_type_of_col(\"int_column\", simple_return_mode=False)\n ```\n\n Parameters\n ----------\n col: Union[str, Column]\n The column to check the type of\n\n df: Optional[DataFrame]\n The DataFrame belonging to the column. If not provided, the DataFrame passed to the constructor\n will be used.\n\n simple_return_mode: bool\n If True, the return value will be a simple string. If False, the return value will be a SparkDatatype enum.\n\n Returns\n -------\n datatype: str\n The type of the column as a string\n \"\"\"\n df = df or self.df\n if not df:\n raise RuntimeError(\"No valid Dataframe was passed\")\n\n if not isinstance(col, Column):\n col = f.col(col)\n\n # ask the JVM for the name of the column\n # noinspection PyProtectedMember\n col_name = col._jc.toString()\n\n # In order to check the datatype of the column, we have to ask the DataFrame its schema\n df_col = [c for c in df.schema if c.name == col_name][0]\n\n if simple_return_mode:\n return SparkDatatype(df_col.dataType.typeName()).value\n\n return df_col.dataType\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.get_all_columns_of_specific_type","title":"get_all_columns_of_specific_type","text":"get_all_columns_of_specific_type(data_type: Union[str, SparkDatatype]) -> List[str]\n
Get all columns from the dataframe of a given type
A DataFrame needs to be available in order to get the columns. If no DataFrame is available, a ValueError will be raised.
Note: only one data type can be passed to this method. If you want to get columns of multiple data types, you have to call this method multiple times.
Parameters:
Name Type Description Default data_type
Union[str, SparkDatatype]
The data type to get the columns for
required Returns:
Type Description List[str]
A list of column names of the given data type
Source code in src/koheesio/spark/transformations/__init__.py
def get_all_columns_of_specific_type(self, data_type: Union[str, SparkDatatype]) -> List[str]:\n \"\"\"Get all columns from the dataframe of a given type\n\n A DataFrame needs to be available in order to get the columns. If no DataFrame is available, a ValueError will\n be raised.\n\n Note: only one data type can be passed to this method. If you want to get columns of multiple data types, you\n have to call this method multiple times.\n\n Parameters\n ----------\n data_type: Union[str, SparkDatatype]\n The data type to get the columns for\n\n Returns\n -------\n List[str]\n A list of column names of the given data type\n \"\"\"\n if not self.df:\n raise ValueError(\"No dataframe available - cannot get columns\")\n\n expected_data_type = (SparkDatatype.from_string(data_type) if isinstance(data_type, str) else data_type).value\n\n columns_of_given_type: List[str] = [\n col for col in self.df.columns if self.df.schema[col].dataType.typeName() == expected_data_type\n ]\n return columns_of_given_type\n
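As noted above, only one data type can be passed per call; reusing the AddOne transformation and input_df from the earlier example, a usage sketch could look like this:
add_one = AddOne(columns=[\"str_column\", \"int_column\"], df=input_df)\nadd_one.get_all_columns_of_specific_type(\"string\")  # [\"str_column\"]\nadd_one.get_all_columns_of_specific_type(\"integer\")  # [\"int_column\"]\n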
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.get_columns","title":"get_columns","text":"get_columns() -> iter\n
Return an iterator of the columns
Source code in src/koheesio/spark/transformations/__init__.py
def get_columns(self) -> iter:\n \"\"\"Return an iterator of the columns\"\"\"\n # If `run_for_all_is_set` is True, we want to run the transformation for all columns of a given type\n if self.run_for_all_is_set:\n columns = []\n for data_type in self.ColumnConfig.run_for_all_data_type:\n columns += self.get_all_columns_of_specific_type(data_type)\n else:\n columns = self.columns\n\n for column in columns:\n if self.is_column_type_correct(column):\n yield column\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.get_limit_data_types","title":"get_limit_data_types","text":"get_limit_data_types()\n
Get the limit_data_type as a list of strings
Source code in src/koheesio/spark/transformations/__init__.py
def get_limit_data_types(self):\n \"\"\"Get the limit_data_type as a list of strings\"\"\"\n return [dt.value for dt in self.ColumnConfig.limit_data_type]\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.is_column_type_correct","title":"is_column_type_correct","text":"is_column_type_correct(column)\n
Check if column type is correct and handle it if not, when limit_data_type is set
Source code in src/koheesio/spark/transformations/__init__.py
def is_column_type_correct(self, column):\n \"\"\"Check if column type is correct and handle it if not, when limit_data_type is set\"\"\"\n if not self.limit_data_type_is_set:\n return True\n\n if self.column_type_of_col(column) in (limit_data_types := self.get_limit_data_types()):\n return True\n\n # Raises a ValueError if the Column object is not of a given type and data_type_strict_mode is set\n if self.data_type_strict_mode_is_set:\n raise ValueError(\n f\"Critical error: {column} is not of type {limit_data_types}. Exception is raised because \"\n f\"`data_type_strict_mode` is set to True for {self.name}.\"\n )\n\n # Otherwise, throws a warning that the Column object is not of a given type\n self.log.warning(f\"Column `{column}` is not of type `{limit_data_types}` and will be skipped.\")\n return False\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.set_columns","title":"set_columns","text":"set_columns(columns_value)\n
Validate columns through the columns configuration provided
Source code in src/koheesio/spark/transformations/__init__.py
@field_validator(\"columns\", mode=\"before\")\ndef set_columns(cls, columns_value):\n \"\"\"Validate columns through the columns configuration provided\"\"\"\n columns = columns_value\n run_for_all_data_type = cls.ColumnConfig.run_for_all_data_type\n\n if run_for_all_data_type and len(columns) == 0:\n columns = [\"*\"]\n\n if columns[0] == \"*\" and not run_for_all_data_type:\n raise ValueError(\"Cannot use '*' as a column name when no run_for_all_data_type is set\")\n\n return columns\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformationWithTarget","title":"koheesio.spark.transformations.ColumnsTransformationWithTarget","text":"Extended ColumnsTransformation class with an additional target_column
field
Using this class makes implementing Transformations significantly easier.
Concept A ColumnsTransformationWithTarget
is a ColumnsTransformation
with an additional target_column
field. This field can be used to store the result of the transformation in a new column.
If the target_column
is not provided, the result will be stored in the source column.
If more than one column is passed, the behavior of the class changes as follows:
- the transformation will be run in a loop against all the given columns
- the renaming of the columns is handled automatically
- the
target_column
will be used as a suffix. Leaving this blank will result in the original columns being renamed
The func
method should be implemented in the child class. This method should return the transformation that will be applied to the column(s). The execute method (already preset) will use the get_columns_with_target
method to loop over all the columns and apply this function to transform the DataFrame.
Parameters:
Name Type Description Default columns
ListOfColumns
The column (or list of columns) to apply the transformation to. Alias: column. If not provided, the run_for_all_data_type from the ColumnConfig will be used to determine which columns the transformation is run for (i.e. all columns of the configured data type).
*
target_column
Optional[str]
The name of the column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this input will be used as a suffix instead.
None
Example Writing your own transformation using the ColumnsTransformationWithTarget
class:
from pyspark.sql import Column\nfrom koheesio.steps.transformations import ColumnsTransformationWithTarget\n\n\nclass AddOneWithTarget(ColumnsTransformationWithTarget):\n def func(self, col: Column):\n return col + 1\n
In the above example, the func
method is implemented to add 1 to the values of a given column.
In order to use this transformation, we can call the transform
method:
from pyspark.sql import SparkSession\n\n# create a DataFrame with 3 rows\ndf = SparkSession.builder.getOrCreate().range(3)\n\noutput_df = AddOneWithTarget(column=\"id\", target_column=\"new_id\").transform(df)\n
The output_df
will now contain the original DataFrame with an additional column called new_id
with the values of id
+ 1.
output_df:
id new_id 0 1 1 2 2 3 Note: The target_column
will be used as a suffix when more than one column is given as source. Leaving this blank will result in the original columns being renamed.
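To illustrate the suffix behavior, a small sketch reusing the AddOneWithTarget class defined above (the id_copy column is made up for this example):
from pyspark.sql import SparkSession, functions as f\n\ndf = SparkSession.builder.getOrCreate().range(3).withColumn(\"id_copy\", f.col(\"id\"))\n\n# with multiple source columns, target_column acts as a suffix\noutput_df = AddOneWithTarget(columns=[\"id\", \"id_copy\"], target_column=\"plus_one\").transform(df)\n# resulting columns: id, id_copy, id_plus_one, id_copy_plus_one\n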
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformationWithTarget.target_column","title":"target_column class-attribute
instance-attribute
","text":"target_column: Optional[str] = Field(default=None, alias='target_suffix', description='The column to store the result in. If not provided, the result will be stored in the sourcecolumn. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix')\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformationWithTarget.execute","title":"execute","text":"execute()\n
Execute on a ColumnsTransformationWithTarget handles self.df (input) and set self.output.df (output) This can be left unchanged, and hence should not be implemented in the child class.
Source code in src/koheesio/spark/transformations/__init__.py
def execute(self):\n \"\"\"Execute on a ColumnsTransformationWithTarget handles self.df (input) and set self.output.df (output)\n This can be left unchanged, and hence should not be implemented in the child class.\n \"\"\"\n df = self.df\n\n for target_column, column in self.get_columns_with_target():\n func = self.func # select the applicable function\n df = df.withColumn(\n target_column,\n func(f.col(column)),\n )\n\n self.output.df = df\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformationWithTarget.func","title":"func abstractmethod
","text":"func(column: Column) -> Column\n
The function that will be run on a single Column of the DataFrame
The func
method should be implemented in the child class. This method should return the transformation that will be applied to the column(s). The execute method (already preset) will use the get_columns_with_target
method to loop over all the columns and apply this function to transform the DataFrame.
Parameters:
Name Type Description Default column
Column
The column to apply the transformation to
required Returns:
Type Description Column
The transformed column
Source code in src/koheesio/spark/transformations/__init__.py
@abstractmethod\ndef func(self, column: Column) -> Column:\n \"\"\"The function that will be run on a single Column of the DataFrame\n\n The `func` method should be implemented in the child class. This method should return the transformation that\n will be applied to the column(s). The execute method (already preset) will use the `get_columns_with_target`\n method to loop over all the columns and apply this function to transform the DataFrame.\n\n Parameters\n ----------\n column: Column\n The column to apply the transformation to\n\n Returns\n -------\n Column\n The transformed column\n \"\"\"\n raise NotImplementedError\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformationWithTarget.get_columns_with_target","title":"get_columns_with_target","text":"get_columns_with_target() -> iter\n
Return an iterator of the columns
Works just like in get_columns from the ColumnsTransformation class except that it handles the target_column
as well.
If more than one column is passed, the behavior of the class changes as follows: - the transformation will be run in a loop against all the given columns - the target_column will be used as a suffix. Leaving this blank will result in the original columns being renamed.
Returns:
Type Description iter
An iterator of tuples containing the target column name and the original column name
Source code in src/koheesio/spark/transformations/__init__.py
def get_columns_with_target(self) -> iter:\n \"\"\"Return an iterator of the columns\n\n Works just like in get_columns from the ColumnsTransformation class except that it handles the `target_column`\n as well.\n\n If more than one column is passed, the behavior of the Class changes this way:\n - the transformation will be run in a loop against all the given columns\n - the target_column will be used as a suffix. Leaving this blank will result in the original columns being\n renamed.\n\n Returns\n -------\n iter\n An iterator of tuples containing the target column name and the original column name\n \"\"\"\n columns = [*self.get_columns()]\n\n for column in columns:\n # ensures that we at least use the original column name\n target_column = self.target_column or column\n\n if len(columns) > 1: # target_column becomes a suffix when more than 1 column is given\n # dict.fromkeys is used to avoid duplicates in the name while maintaining order\n _cols = [column, target_column]\n target_column = \"_\".join(list(dict.fromkeys(_cols)))\n\n yield target_column, column\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.Transformation","title":"koheesio.spark.transformations.Transformation","text":"Base class for all transformations
Concept A Transformation is a Step that takes a DataFrame as input and returns a DataFrame as output. The DataFrame is transformed based on the logic implemented in the execute
method. Any additional parameters that are needed for the transformation can be passed to the constructor.
Parameters:
Name Type Description Default df
The DataFrame to apply the transformation to. If not provided, the DataFrame has to be passed to the transform method.
required Example from koheesio.steps.transformations import Transformation\nfrom pyspark.sql import functions as f\n\n\nclass AddOne(Transformation):\n def execute(self):\n self.output.df = self.df.withColumn(\"new_column\", f.col(\"old_column\") + 1)\n
In the example above, the execute
method is implemented to add 1 to the values of the old_column
and store the result in a new column called new_column
.
In order to use this transformation, we can call the transform
method:
from pyspark.sql import SparkSession\n\n# create a DataFrame with 3 rows\ndf = SparkSession.builder.getOrCreate().range(3)\n\noutput_df = AddOne().transform(df)\n
The output_df
will now contain the original DataFrame with an additional column called new_column
with the values of old_column
+ 1.
output_df:
id new_column 0 1 1 2 2 3 ... Alternatively, we can pass the DataFrame to the constructor and call the execute
or transform
method without any arguments:
output_df = AddOne(df).transform()\n# or\noutput_df = AddOne(df).execute().output.df\n
Note that the transform method was not implemented explicitly in the AddOne class. This is because the transform
method is already implemented in the Transformation
class. This means that all classes that inherit from the Transformation class will have the transform
method available. Only the execute method needs to be implemented.
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.Transformation.df","title":"df class-attribute
instance-attribute
","text":"df: Optional[DataFrame] = Field(default=None, description='The Spark DataFrame')\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.Transformation.execute","title":"execute abstractmethod
","text":"execute() -> Output\n
Execute on a Transformation should handle self.df (input) and set self.output.df (output)
This method should be implemented in the child class. The input DataFrame is available as self.df
and the output DataFrame should be stored in self.output.df
.
For example:
def execute(self):\n self.output.df = self.df.withColumn(\"new_column\", f.col(\"old_column\") + 1)\n
The transform method will call this method and return the output DataFrame.
Source code in src/koheesio/spark/transformations/__init__.py
@abstractmethod\ndef execute(self) -> SparkStep.Output:\n \"\"\"Execute on a Transformation should handle self.df (input) and set self.output.df (output)\n\n This method should be implemented in the child class. The input DataFrame is available as `self.df` and the\n output DataFrame should be stored in `self.output.df`.\n\n For example:\n ```python\n def execute(self):\n self.output.df = self.df.withColumn(\"new_column\", f.col(\"old_column\") + 1)\n ```\n\n The transform method will call this method and return the output DataFrame.\n \"\"\"\n # self.df # input dataframe\n # self.output.df # output dataframe\n self.output.df = ... # implement the transformation logic\n raise NotImplementedError\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.Transformation.transform","title":"transform","text":"transform(df: Optional[DataFrame] = None) -> DataFrame\n
Execute the transformation and return the output DataFrame
Note: when creating a child from this, don't implement this transform method. Instead, implement execute!
See Also Transformation.execute
Parameters:
Name Type Description Default df
Optional[DataFrame]
The DataFrame to apply the transformation to. If not provided, the DataFrame passed to the constructor will be used.
None
Returns:
Type Description DataFrame
The transformed DataFrame
Source code in src/koheesio/spark/transformations/__init__.py
def transform(self, df: Optional[DataFrame] = None) -> DataFrame:\n \"\"\"Execute the transformation and return the output DataFrame\n\n Note: when creating a child from this, don't implement this transform method. Instead, implement execute!\n\n See Also\n --------\n `Transformation.execute`\n\n Parameters\n ----------\n df: Optional[DataFrame]\n The DataFrame to apply the transformation to. If not provided, the DataFrame passed to the constructor\n will be used.\n\n Returns\n -------\n DataFrame\n The transformed DataFrame\n \"\"\"\n self.df = df or self.df\n if not self.df:\n raise RuntimeError(\"No valid Dataframe was passed\")\n self.execute()\n return self.output.df\n
"},{"location":"api_reference/spark/transformations/arrays.html","title":"Arrays","text":"A collection of classes for performing various transformations on arrays in PySpark.
These transformations include operations such as removing duplicates, exploding arrays into separate rows, reversing the order of elements, sorting elements, removing certain values, and calculating aggregate statistics like minimum, maximum, sum, mean, and median.
Concept - Every transformation in this module is implemented as a class that inherits from the
ArrayTransformation
class. - The
ArrayTransformation
class is a subclass of ColumnsTransformationWithTarget
- The
ArrayTransformation
class implements the func
method, which is used to define the transformation logic. - The
func
method takes a column
as input and returns a Column
object. - The
Column
object is a PySpark column that can be used to perform transformations on a DataFrame column. - The
ArrayTransformation
limits the data type of the transformation to array by setting the ColumnConfig
class to run_for_all_data_type = [SparkDatatype.ARRAY]
and limit_data_type = [SparkDatatype.ARRAY]
.
See Also - koheesio.spark.transformations Module containing all transformation classes.
- koheesio.spark.transformations.ColumnsTransformationWithTarget Base class for all transformations that operate on columns and have a target column.
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySortAsc","title":"koheesio.spark.transformations.arrays.ArraySortAsc module-attribute
","text":"ArraySortAsc = ArraySort\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayDistinct","title":"koheesio.spark.transformations.arrays.ArrayDistinct","text":"Remove duplicates from array
Example ArrayDistinct(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayDistinct.filter_empty","title":"filter_empty class-attribute
instance-attribute
","text":"filter_empty: bool = Field(default=True, description='Remove null, nan, and empty values from array. Default is True.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayDistinct.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n _fn = F.array_distinct(column)\n\n # noinspection PyUnresolvedReferences\n element_type = self.column_type_of_col(column, None, False).elementType\n is_numeric = spark_data_type_is_numeric(element_type)\n\n if self.filter_empty:\n # Remove null values from array\n if spark_minor_version >= 3.4:\n # Run array_compact if spark version is 3.4 or higher\n # https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.array_compact.html\n # pylint: disable=E0611\n from pyspark.sql.functions import array_compact as _array_compact\n\n _fn = _array_compact(_fn)\n # pylint: enable=E0611\n else:\n # Otherwise, remove null from array using array_except\n _fn = F.array_except(_fn, F.array(F.lit(None)))\n\n # Remove nan or empty values from array (depends on the type of the elements in array)\n if is_numeric:\n # Remove nan from array (float/int/numbers)\n _fn = F.array_except(_fn, F.array(F.lit(float(\"nan\")).cast(element_type)))\n else:\n # Remove empty values from array (string/text)\n _fn = F.array_except(_fn, F.array(F.lit(\"\"), F.lit(\" \")))\n\n return _fn\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMax","title":"koheesio.spark.transformations.arrays.ArrayMax","text":"Return the maximum value in the array
Example ArrayMax(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMax.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n # Call for processing of nan values\n column = super().func(column)\n\n return F.array_max(column)\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMean","title":"koheesio.spark.transformations.arrays.ArrayMean","text":"Return the mean of the values in the array.
Note: Only numeric values are supported for calculating the mean.
Example ArrayMean(column=\"array_column\", target_column=\"average\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMean.func","title":"func","text":"func(column: Column) -> Column\n
Calculate the mean of the values in the array
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n \"\"\"Calculate the mean of the values in the array\"\"\"\n # raise an error if the array contains non-numeric elements\n element_type = self.column_type_of_col(col=column, df=None, simple_return_mode=False).elementType\n\n if not spark_data_type_is_numeric(element_type):\n raise ValueError(\n f\"{column = } contains non-numeric values. The array type is {element_type}. \"\n f\"Only numeric values are supported for calculating a mean.\"\n )\n\n _sum = ArraySum.from_step(self).func(column)\n # Call for processing of nan values\n column = super().func(column)\n _size = F.size(column)\n # return 0 if the size of the array is 0 to avoid division by zero\n return F.when(_size == 0, F.lit(0)).otherwise(_sum / _size)\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMedian","title":"koheesio.spark.transformations.arrays.ArrayMedian","text":"Return the median of the values in the array.
The median is the middle value in a sorted, ascending or descending, list of numbers.
- If the size of the array is even, the median is the average of the two middle numbers.
- If the size of the array is odd, the median is the middle number.
Note: Only numeric values are supported for calculating the median.
Example ArrayMedian(column=\"array_column\", target_column=\"median\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMedian.func","title":"func","text":"func(column: Column) -> Column\n
Calculate the median of the values in the array
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n \"\"\"Calculate the median of the values in the array\"\"\"\n # Call for processing of nan values\n column = super().func(column)\n\n sorted_array = ArraySort.from_step(self).func(column)\n _size: Column = F.size(sorted_array)\n\n # Calculate the middle index. If the size is odd, PySpark discards the fractional part.\n # Use floor function to ensure the result is an integer\n middle: Column = F.floor((_size + 1) / 2).cast(\"int\")\n\n # Define conditions\n is_size_zero: Column = _size == 0\n is_column_null: Column = column.isNull()\n is_size_even: Column = _size % 2 == 0\n\n # Define actions / responses\n # For even-sized arrays, calculate the average of the two middle elements\n average_of_middle_elements = (F.element_at(sorted_array, middle) + F.element_at(sorted_array, middle + 1)) / 2\n # For odd-sized arrays, select the middle element\n middle_element = F.element_at(sorted_array, middle)\n # In case the array is empty, return either None or 0\n none_value = F.lit(None)\n zero_value = F.lit(0)\n\n median = (\n # Check if the size of the array is 0\n F.when(\n is_size_zero,\n # If the size of the array is 0 and the column is null, return None\n # If the size of the array is 0 and the column is not null, return 0\n F.when(is_column_null, none_value).otherwise(zero_value),\n ).otherwise(\n # If the size of the array is not 0, calculate the median\n F.when(is_size_even, average_of_middle_elements).otherwise(middle_element)\n )\n )\n\n return median\n
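To make the even/odd behavior described above concrete, a usage sketch with made-up data:
from pyspark.sql import SparkSession\nfrom koheesio.spark.transformations.arrays import ArrayMedian\n\nspark = SparkSession.builder.getOrCreate()\ndf = spark.createDataFrame([(1, [3.0, 1.0, 2.0]), (2, [1.0, 2.0, 3.0, 4.0])], [\"id\", \"array_column\"])\n\noutput_df = ArrayMedian(column=\"array_column\", target_column=\"median\").transform(df)\n# row 1 (odd-sized array)  -> median 2.0 (the middle element after sorting)\n# row 2 (even-sized array) -> median 2.5 (average of the two middle elements)\n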
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMin","title":"koheesio.spark.transformations.arrays.ArrayMin","text":"Return the minimum value in the array
Example ArrayMin(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMin.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n return F.array_min(column)\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayNullNanProcess","title":"koheesio.spark.transformations.arrays.ArrayNullNanProcess","text":"Process an array by removing NaN and/or NULL values from elements.
Parameters:
Name Type Description Default keep_nan
bool
Whether to keep NaN values in the array. If set to True, the NaN values will be kept in the array.
False
keep_null
bool
Whether to keep NULL values in the array. If set to True, the NULL values will be kept in the array.
False
Returns:
Name Type Description column
Column
The processed column with NaN and/or NULL values removed from elements.
Examples:
>>> input_data = [(1, [1.1, 2.1, 4.1, float(\"nan\")])]\n>>> input_schema = StructType([StructField(\"id\", IntegerType(), True),\n StructField(\"array_float\", ArrayType(FloatType()), True),\n])\n>>> spark = SparkSession.builder.getOrCreate()\n>>> df = spark.createDataFrame(input_data, schema=input_schema)\n>>> transformer = ArrayNullNanProcess(column=\"array_float\", keep_nan=False)\n>>> transformer.transform(df)\n>>> print(transformer.output.df.collect()[0].asDict()[\"array_float\"])\n[1.1, 2.1, 4.1]\n\n>>> input_data = [(1, [1.1, 2.2, 4.1, float(\"nan\")])]\n>>> input_schema = StructType([StructField(\"id\", IntegerType(), True),\n StructField(\"array_float\", ArrayType(FloatType()), True),\n])\n>>> spark = SparkSession.builder.getOrCreate()\n>>> df = spark.createDataFrame(input_data, schema=input_schema)\n>>> transformer = ArrayNullNanProcess(column=\"array_float\", keep_nan=True)\n>>> transformer.transform(df)\n>>> print(transformer.output.df.collect()[0].asDict()[\"array_float\"])\n[1.1, 2.2, 4.1, nan]\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayNullNanProcess.keep_nan","title":"keep_nan class-attribute
instance-attribute
","text":"keep_nan: bool = Field(False, description='Whether to keep nan values in the array. Default is False. If set to True, the nan values will be kept in the array.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayNullNanProcess.keep_null","title":"keep_null class-attribute
instance-attribute
","text":"keep_null: bool = Field(False, description='Whether to keep null values in the array. Default is False. If set to True, the null values will be kept in the array.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayNullNanProcess.func","title":"func","text":"func(column: Column) -> Column\n
Process the given column by removing NaN and/or NULL values from elements.
Parameters: column : Column The column to be processed.
Returns: column : Column The processed column with NaN and/or NULL values removed from elements.
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n \"\"\"\n Process the given column by removing NaN and/or NULL values from elements.\n\n Parameters:\n -----------\n column : Column\n The column to be processed.\n\n Returns:\n --------\n column : Column\n The processed column with NaN and/or NULL values removed from elements.\n \"\"\"\n\n def apply_logic(x: Column):\n if self.keep_nan is False and self.keep_null is False:\n logic = x.isNotNull() & ~F.isnan(x)\n elif self.keep_nan is False:\n logic = ~F.isnan(x)\n elif self.keep_null is False:\n logic = x.isNotNull()\n\n return logic\n\n if self.keep_nan is False or self.keep_null is False:\n column = F.filter(column, apply_logic)\n\n return column\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayRemove","title":"koheesio.spark.transformations.arrays.ArrayRemove","text":"Remove a certain value from the array
Parameters:
Name Type Description Default keep_nan
bool
Whether to keep NaN values in the array. If set to True, the NaN values will be kept in the array.
False
keep_null
bool
Whether to keep NULL values in the array. If set to True, the NULL values will be kept in the array.
False
Example ArrayRemove(column=\"array_column\", value=\"value_to_remove\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayRemove.make_distinct","title":"make_distinct class-attribute
instance-attribute
","text":"make_distinct: bool = Field(default=False, description='Whether to remove duplicates from the array.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayRemove.value","title":"value class-attribute
instance-attribute
","text":"value: Any = Field(default=None, description='The value to remove from the array.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayRemove.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n value = self.value\n\n column = super().func(column)\n\n def filter_logic(x: Column, _val: Any):\n if self.keep_null and self.keep_nan:\n logic = (x != F.lit(_val)) | x.isNull() | F.isnan(x)\n elif self.keep_null:\n logic = (x != F.lit(_val)) | x.isNull()\n elif self.keep_nan:\n logic = (x != F.lit(_val)) | F.isnan(x)\n else:\n logic = x != F.lit(_val)\n\n return logic\n\n # Check if the value is iterable (i.e., a list, tuple, or set)\n if isinstance(value, (list, tuple, set)):\n result = reduce(lambda res, val: F.filter(res, lambda x: filter_logic(x, val)), value, column)\n else:\n # If the value is not iterable, simply remove the value from the array\n result = F.filter(column, lambda x: filter_logic(x, value))\n\n if self.make_distinct:\n result = F.array_distinct(result)\n\n return result\n
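As the filter logic above shows, value may also be an iterable (list, tuple, or set); a short sketch with made-up values:
ArrayRemove(column=\"array_column\", value=[1, 2])  # removes both 1 and 2 from the array\nArrayRemove(column=\"array_column\", value=\"n/a\", make_distinct=True)  # also deduplicates the result\n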
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayReverse","title":"koheesio.spark.transformations.arrays.ArrayReverse","text":"Reverse the order of elements in the array
Example ArrayReverse(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayReverse.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n return F.reverse(column)\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySort","title":"koheesio.spark.transformations.arrays.ArraySort","text":"Sort the elements in the array
By default, the elements are sorted in ascending order. To sort the elements in descending order, set the reverse
parameter to True.
Example ArraySort(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySort.reverse","title":"reverse class-attribute
instance-attribute
","text":"reverse: bool = Field(default=False, description='Sort the elements in the array in a descending order. Default is False.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySort.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n column = F.array_sort(column)\n if self.reverse:\n # Reverse the order of elements in the array\n column = ArrayReverse.from_step(self).func(column)\n return column\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySortDesc","title":"koheesio.spark.transformations.arrays.ArraySortDesc","text":"Sort the elements in the array in descending order
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySortDesc.reverse","title":"reverse class-attribute
instance-attribute
","text":"reverse: bool = True\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySum","title":"koheesio.spark.transformations.arrays.ArraySum","text":"Return the sum of the values in the array
Parameters:
Name Type Description Default keep_nan
bool
Whether to keep NaN values in the array. If set to True, the NaN values will be kept in the array.
False
keep_null
bool
Whether to keep NULL values in the array. If set to True, the NULL values will be kept in the array.
False
Example ArraySum(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySum.func","title":"func","text":"func(column: Column) -> Column\n
Using the aggregate
function to sum the values in the array
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n \"\"\"Using the `aggregate` function to sum the values in the array\"\"\"\n # raise an error if the array contains non-numeric elements\n element_type = self.column_type_of_col(column, None, False).elementType\n if not spark_data_type_is_numeric(element_type):\n raise ValueError(\n f\"{column = } contains non-numeric values. The array type is {element_type}. \"\n f\"Only numeric values are supported for summing.\"\n )\n\n # remove na values from array.\n column = super().func(column)\n\n # Using the `aggregate` function to sum the values in the array by providing the initial value as 0.0 and the\n # lambda function to add the elements together. Pyspark will automatically infer the type of the initial value\n # making 0.0 valid for both integer and float types.\n initial_value = F.lit(0.0)\n return F.aggregate(column, initial_value, lambda accumulator, x: accumulator + x)\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayTransformation","title":"koheesio.spark.transformations.arrays.ArrayTransformation","text":"Base class for array transformations
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayTransformation.ColumnConfig","title":"ColumnConfig","text":"Set the data type of the Transformation to array
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayTransformation.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [ARRAY]\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayTransformation.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [ARRAY]\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayTransformation.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n raise NotImplementedError(\"This is an abstract class\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.Explode","title":"koheesio.spark.transformations.arrays.Explode","text":"Explode the array into separate rows
Example Explode(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.Explode.distinct","title":"distinct class-attribute
instance-attribute
","text":"distinct: bool = Field(False, description='Remove duplicates from the exploded array. Default is False.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.Explode.preserve_nulls","title":"preserve_nulls class-attribute
instance-attribute
","text":"preserve_nulls: bool = Field(True, description='Preserve rows with null values in the exploded array by using explode_outer instead of explode.Default is True.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.Explode.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n if self.distinct:\n column = ArrayDistinct.from_step(self).func(column)\n return F.explode_outer(column) if self.preserve_nulls else F.explode(column)\n
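A short sketch of the two parameters handled in func above:
Explode(column=\"array_column\")  # uses explode_outer, keeping rows where the array is null\nExplode(column=\"array_column\", distinct=True, preserve_nulls=False)  # deduplicate first, then use plain explode\n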
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ExplodeDistinct","title":"koheesio.spark.transformations.arrays.ExplodeDistinct","text":"Explode the array into separate rows while removing duplicates and empty values
Example ExplodeDistinct(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ExplodeDistinct.distinct","title":"distinct class-attribute
instance-attribute
","text":"distinct: bool = True\n
"},{"location":"api_reference/spark/transformations/camel_to_snake.html","title":"Camel to snake","text":"Class for converting DataFrame column names from camel case to snake case.
"},{"location":"api_reference/spark/transformations/camel_to_snake.html#koheesio.spark.transformations.camel_to_snake.camel_to_snake_re","title":"koheesio.spark.transformations.camel_to_snake.camel_to_snake_re module-attribute
","text":"camel_to_snake_re = compile('([a-z0-9])([A-Z])')\n
"},{"location":"api_reference/spark/transformations/camel_to_snake.html#koheesio.spark.transformations.camel_to_snake.CamelToSnakeTransformation","title":"koheesio.spark.transformations.camel_to_snake.CamelToSnakeTransformation","text":"Converts column names from camel case to snake cases
Parameters:
Name Type Description Default columns
Optional[ListOfColumns]
The column or columns to convert. If no columns are specified, all columns will be converted. A list of columns or a single column can be specified. For example: [\"column1\", \"column2\"]
or \"column1\"
None
Example input_df:
camelCaseColumn snake_case_column ... ... output_df = CamelToSnakeTransformation(column=\"camelCaseColumn\").transform(input_df)\n
output_df:
camel_case_column snake_case_column ... ... In this example, the column camelCaseColumn
is converted to camel_case_column
.
Note: the data in the columns is not changed, only the column names.
"},{"location":"api_reference/spark/transformations/camel_to_snake.html#koheesio.spark.transformations.camel_to_snake.CamelToSnakeTransformation.columns","title":"columns class-attribute
instance-attribute
","text":"columns: Optional[ListOfColumns] = Field(default='', alias='column', description=\"The column or columns to convert. If no columns are specified, all columns will be converted. A list of columns or a single column can be specified. For example: `['column1', 'column2']` or `'column1'` \")\n
"},{"location":"api_reference/spark/transformations/camel_to_snake.html#koheesio.spark.transformations.camel_to_snake.CamelToSnakeTransformation.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/transformations/camel_to_snake.py
def execute(self):\n _df = self.df\n\n # Prepare columns input:\n columns = self.df.columns if self.columns == [\"*\"] else self.columns\n\n for column in columns:\n _df = _df.withColumnRenamed(column, convert_camel_to_snake(column))\n\n self.output.df = _df\n
"},{"location":"api_reference/spark/transformations/camel_to_snake.html#koheesio.spark.transformations.camel_to_snake.convert_camel_to_snake","title":"koheesio.spark.transformations.camel_to_snake.convert_camel_to_snake","text":"convert_camel_to_snake(name: str)\n
Converts a string from camelCase to snake_case.
Parameters: name : str The string to be converted.
Returns: str The converted string in snake_case.
Source code in src/koheesio/spark/transformations/camel_to_snake.py
def convert_camel_to_snake(name: str):\n \"\"\"\n Converts a string from camelCase to snake_case.\n\n Parameters:\n ----------\n name : str\n The string to be converted.\n\n Returns:\n --------\n str\n The converted string in snake_case.\n \"\"\"\n return camel_to_snake_re.sub(r\"\\1_\\2\", name).lower()\n
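A quick illustration of the regex-based conversion defined above:
from koheesio.spark.transformations.camel_to_snake import convert_camel_to_snake\n\nconvert_camel_to_snake(\"camelCaseColumn\")  # 'camel_case_column'\nconvert_camel_to_snake(\"myColumn2Name\")  # 'my_column2_name'\n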
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html","title":"Cast to datatype","text":"Transformations to cast a column or set of columns to a given datatype.
Each one of these have been vetted to throw warnings when wrong datatypes are passed (to skip erroring any job or pipeline).
Furthermore, detailed tests have been added to ensure that types are actually compatible as prescribed.
Concept - One can use the CastToDataType class directly, or use one of the more specific subclasses.
- Each subclass is compatible with 'run_for_all' - meaning that if no column or columns are specified, it will be run for all compatible data types.
- Compatible Data types are specified in the ColumnConfig class of each subclass, and are documented in the docstring of each subclass.
See class docstrings for more information
Note Dates, Arrays and Maps are not supported by this module.
- for dates, use the koheesio.spark.transformations.date_time module
- for arrays, use the koheesio.spark.transformations.arrays module
Classes:
Name Description CastToDatatype:
Cast a column or set of columns to a given datatype
CastToByte
Cast to Byte (a.k.a. tinyint)
CastToShort
Cast to Short (a.k.a. smallint)
CastToInteger
Cast to Integer (a.k.a. int)
CastToLong
Cast to Long (a.k.a. bigint)
CastToFloat
Cast to Float (a.k.a. real)
CastToDouble
Cast to Double
CastToDecimal
Cast to Decimal (a.k.a. decimal, numeric, dec, BigDecimal)
CastToString
Cast to String
CastToBinary
Cast to Binary (a.k.a. byte array)
CastToBoolean
Cast to Boolean
CastToTimestamp
Cast to Timestamp
Note The following parameters are common to all classes in this module:
Parameters:
Name Type Description Default columns
ListOfColumns
Name of the source column(s). Alias: column
required target_column
str
Name of the target column or alias if more than one column is specified. Alias: target_alias
required datatype
str or SparkDatatype
Datatype to cast to. Choose from SparkDatatype Enum (only needs to be specified in CastToDatatype, all other classes have a fixed datatype)
required"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBinary","title":"koheesio.spark.transformations.cast_to_datatype.CastToBinary","text":"Cast to Binary (a.k.a. byte array)
Unsupported datatypes: Following casts are not supported
will raise an error in Spark:
- float
- double
- decimal
- boolean
- timestamp
- date
- array<...>
- map<...,...>
Supported datatypes: Following casts are supported:
- byte
- short
- integer
- long
- string
Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBinary.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = BINARY\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBinary.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToBinary class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBinary.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBinary.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, STRING]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBoolean","title":"koheesio.spark.transformations.cast_to_datatype.CastToBoolean","text":"Cast to Boolean
Unsupported datatypes: Following casts are not supported
- binary
- array<...>
- map<...,...>
Supported datatypes: Following casts are supported:
- byte
- short
- integer
- long
- float
- double
- decimal
- timestamp
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string converts to null
- date converts to null
- void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBoolean.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = BOOLEAN\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBoolean.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToBoolean class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBoolean.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBoolean.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, TIMESTAMP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToByte","title":"koheesio.spark.transformations.cast_to_datatype.CastToByte","text":"Cast to Byte (a.k.a. tinyint)
Represents 1-byte signed integer numbers. The range of numbers is from -128 to 127.
Unsupported datatypes: Following casts are not supported
will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes: Following casts are supported:
- boolean
- timestamp
- decimal
- double
- float
- long
- integer
- short
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string converts to null
- timestamp range of values too small for timestamp to have any meaning
- date converts to null
- void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToByte.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = BYTE\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToByte.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToByte class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToByte.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, STRING, TIMESTAMP, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToByte.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [SHORT, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, BOOLEAN]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDatatype","title":"koheesio.spark.transformations.cast_to_datatype.CastToDatatype","text":"Cast a column or set of columns to a given datatype
Wrapper around pyspark.sql.Column.cast
Concept This class acts as the base class for all the other CastTo* classes. It is compatible with 'run_for_all' - meaning that if no column or columns are specified, it will be run for all compatible data types.
Example input_df:
c1 c2 1 2 3 4 output_df = CastToDatatype(\n column=\"c1\",\n datatype=\"string\",\n target_alias=\"c1\",\n).transform(input_df)\n
output_df:
c1 c2 \"1\" 2 \"3\" 4 In the example above, the column c1
is cast to a string datatype. The column c2
is not affected.
Parameters:
Name Type Description Default columns
ListOfColumns
Name of the source column(s). Alias: column
required datatype
str or SparkDatatype
Datatype to cast to. Choose from SparkDatatype Enum
required target_column
str
Name of the target column or alias if more than one column is specified. Alias: target_alias
required"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDatatype.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = Field(default=..., description='Datatype. Choose from SparkDatatype Enum')\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDatatype.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/cast_to_datatype.py
def func(self, column: Column) -> Column:\n # This is to let the IDE explicitly know that the datatype is not a string, but a `SparkDatatype` Enum\n datatype: SparkDatatype = self.datatype\n return column.cast(datatype.spark_type())\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDatatype.validate_datatype","title":"validate_datatype","text":"validate_datatype(datatype_value) -> SparkDatatype\n
Validate the datatype.
Source code in src/koheesio/spark/transformations/cast_to_datatype.py
@field_validator(\"datatype\")\ndef validate_datatype(cls, datatype_value) -> SparkDatatype:\n \"\"\"Validate the datatype.\"\"\"\n # handle string input\n try:\n if isinstance(datatype_value, str):\n datatype_value: SparkDatatype = SparkDatatype.from_string(datatype_value)\n return datatype_value\n\n # and let SparkDatatype handle the rest\n datatype_value: SparkDatatype = SparkDatatype.from_string(datatype_value.value)\n\n except AttributeError as e:\n raise AttributeError(f\"Invalid datatype: {datatype_value}\") from e\n\n return datatype_value\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal","title":"koheesio.spark.transformations.cast_to_datatype.CastToDecimal","text":"Cast to Decimal (a.k.a. decimal, numeric, dec, BigDecimal)
Represents arbitrary-precision signed decimal numbers. Backed internally by java.math.BigDecimal
. A BigDecimal consists of an arbitrary precision integer unscaled value and a 32-bit integer scale.
The DecimalType must have a fixed precision (the maximum total number of digits) and scale (the number of digits to the right of the decimal point). For example, (5, 2) can support values from -999.99 to 999.99.
The precision can be up to 38; the scale must be less than or equal to the precision.
Spark sets the default precision and scale to (10, 0) when creating a DecimalType. However, when inferring schema from decimal.Decimal objects, it will be DecimalType(38, 18).
For compatibility reasons, Koheesio sets the default precision and scale to (38, 18) when creating a DecimalType.
Unsupported datatypes: The following casts are not supported and will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes: The following casts are supported:
- byte
- short
- integer
- long
- float
- double
- boolean
- timestamp
- date
- string
- void
- decimal Spark will convert existing decimals to null if the precision and scale don't fit the data
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string converts to null
- date converts to null
- void skipped by default
Parameters:
Name Type Description Default columns
ListOfColumns
Name of the source column(s). Alias: column
*
target_column
str
Name of the target column or alias if more than one column is specified. Alias: target_alias
required precision
conint(gt=0, le=38)
the maximum (i.e. total) number of digits (default: 38). Must be > 0.
38
scale
conint(ge=0, le=18)
the number of digits to the right of the decimal point (default: 18). Must be >= 0.
18
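For illustration, a minimal sketch showing explicit precision and scale (the column name amount and the input DataFrame input_df are assumptions):
from koheesio.spark.transformations.cast_to_datatype import CastToDecimal

output_df = CastToDecimal(
    column="amount",
    precision=12,  # maximum total number of digits
    scale=2,       # number of digits to the right of the decimal point
).transform(input_df)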
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = DECIMAL\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.precision","title":"precision class-attribute
instance-attribute
","text":"precision: conint(gt=0, le=38) = Field(default=38, description='The maximum total number of digits (precision) of the decimal. Must be > 0. Default is 38')\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.scale","title":"scale class-attribute
instance-attribute
","text":"scale: conint(ge=0, le=18) = Field(default=18, description='The number of digits to the right of the decimal point (scale). Must be >= 0. Default is 18')\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToDecimal class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, BOOLEAN, TIMESTAMP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/cast_to_datatype.py
def func(self, column: Column) -> Column:\n return column.cast(self.datatype.spark_type(precision=self.precision, scale=self.scale))\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.validate_scale_and_precisions","title":"validate_scale_and_precisions","text":"validate_scale_and_precisions()\n
Validate the precision and scale values.
Source code in src/koheesio/spark/transformations/cast_to_datatype.py
@model_validator(mode=\"after\")\ndef validate_scale_and_precisions(self):\n \"\"\"Validate the precision and scale values.\"\"\"\n precision_value = self.precision\n scale_value = self.scale\n\n if scale_value == precision_value:\n self.log.warning(\"scale and precision are equal, this will result in a null value\")\n if scale_value > precision_value:\n raise ValueError(\"scale must be < precision\")\n\n return self\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDouble","title":"koheesio.spark.transformations.cast_to_datatype.CastToDouble","text":"Cast to Double
Represents 8-byte double-precision floating point numbers. The range of numbers is from -1.7976931348623157E308 to 1.7976931348623157E308.
Unsupported datatypes: The following casts are not supported and will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes: The following casts are supported:
- byte
- short
- integer
- long
- float
- decimal
- boolean
- timestamp
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string converts to null
- date converts to null
- void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDouble.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = DOUBLE\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDouble.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToDouble class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDouble.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDouble.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, FLOAT, DECIMAL, BOOLEAN, TIMESTAMP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToFloat","title":"koheesio.spark.transformations.cast_to_datatype.CastToFloat","text":"Cast to Float (a.k.a. real)
Represents 4-byte single-precision floating point numbers. The range of numbers is from -3.402823E38 to 3.402823E38.
Unsupported datatypes: The following casts are not supported and will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes: The following casts are supported:
- byte
- short
- integer
- long
- double
- decimal
- boolean
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- timestamp precision is lost (use CastToDouble instead)
- string converts to null
- date converts to null
- void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToFloat.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = FLOAT\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToFloat.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToFloat class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToFloat.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToFloat.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, DOUBLE, DECIMAL, BOOLEAN]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToInteger","title":"koheesio.spark.transformations.cast_to_datatype.CastToInteger","text":"Cast to Integer (a.k.a. int)
Represents 4-byte signed integer numbers. The range of numbers is from -2147483648 to 2147483647.
Unsupported datatypes: The following casts are not supported and will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes: The following casts are supported:
- byte
- short
- long
- float
- double
- decimal
- boolean
- timestamp
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string converts to null
- date converts to null
- void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToInteger.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = INTEGER\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToInteger.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToInteger class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToInteger.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToInteger.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, LONG, FLOAT, DOUBLE, DECIMAL, BOOLEAN, TIMESTAMP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToLong","title":"koheesio.spark.transformations.cast_to_datatype.CastToLong","text":"Cast to Long (a.k.a. bigint)
Represents 8-byte signed integer numbers. The range of numbers is from -9223372036854775808 to 9223372036854775807.
Unsupported datatypes: The following casts are not supported and will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes: The following casts are supported:
- byte
- short
- integer
- float
- double
- decimal
- boolean
- timestamp
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string converts to null
- date converts to null
- void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToLong.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = LONG\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToLong.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToLong class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToLong.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToLong.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, INTEGER, FLOAT, DOUBLE, DECIMAL, BOOLEAN, TIMESTAMP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToShort","title":"koheesio.spark.transformations.cast_to_datatype.CastToShort","text":"Cast to Short (a.k.a. smallint)
Represents 2-byte signed integer numbers. The range of numbers is from -32768 to 32767.
Unsupported datatypes: The following casts are not supported and will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes: The following casts are supported:
- byte
- integer
- long
- float
- double
- decimal
- string
- boolean
- timestamp
- date
- void
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string converts to null
- timestamp range of values too small for timestamp to have any meaning
- date converts to null
- void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToShort.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = SHORT\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToShort.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToShort class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToShort.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, STRING, TIMESTAMP, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToShort.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, BOOLEAN]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToString","title":"koheesio.spark.transformations.cast_to_datatype.CastToString","text":"Cast to String
Supported datatypes: The following casts are supported:
- byte
- short
- integer
- long
- float
- double
- decimal
- binary
- boolean
- timestamp
- date
- array
- map
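For illustration, a minimal sketch (the array column tags and the input DataFrame input_df are assumptions): cast an array column to its string representation, writing the result to a new column.
from koheesio.spark.transformations.cast_to_datatype import CastToString

output_df = CastToString(column="tags", target_alias="tags_str").transform(input_df)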
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToString.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = STRING\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToString.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToString class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToString.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToString.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, BINARY, BOOLEAN, TIMESTAMP, DATE, ARRAY, MAP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToTimestamp","title":"koheesio.spark.transformations.cast_to_datatype.CastToTimestamp","text":"Cast to Timestamp
A numeric timestamp is the number of seconds since January 1, 1970 00:00:00.000000000 UTC. It is not advised to use this cast on small integers, as the range of values is too small for the timestamp to have any meaning.
For more fine-grained control over the timestamp format, use the date_time
module. This allows for parsing strings to timestamps and vice versa.
See Also - koheesio.spark.transformations.date_time
- https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#timestamp-pattern
Unsupported datatypes: The following casts are not supported:
- binary
- array<...>
- map<...,...>
Supported datatypes: The following casts are supported:
- integer
- long
- float
- double
- decimal
- date
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- boolean: range of values too small for timestamp to have any meaning
- byte: range of values too small for timestamp to have any meaning
- string: converts to null in most cases, use
date_time
module instead - short: range of values too small for timestamp to have any meaning
- void: skipped by default
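For illustration, a minimal sketch (the column event_time, assumed to hold epoch seconds as a long, and the input DataFrame input_df are assumptions):
from koheesio.spark.transformations.cast_to_datatype import CastToTimestamp

output_df = CastToTimestamp(
    column="event_time",      # seconds since 1970-01-01 00:00:00 UTC
    target_alias="event_ts",  # write the resulting timestamp to a new column
).transform(input_df)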
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToTimestamp.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = TIMESTAMP\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToTimestamp.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToTimestamp class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToTimestamp.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, BOOLEAN, BYTE, SHORT, STRING, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToTimestamp.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, DATE]\n
"},{"location":"api_reference/spark/transformations/drop_column.html","title":"Drop column","text":"This module defines the DropColumn class, a subclass of ColumnsTransformation.
"},{"location":"api_reference/spark/transformations/drop_column.html#koheesio.spark.transformations.drop_column.DropColumn","title":"koheesio.spark.transformations.drop_column.DropColumn","text":"Drop one or more columns
The DropColumn class is used to drop one or more columns from a PySpark DataFrame. It wraps the pyspark.sql.DataFrame.drop
function and can handle either a single string or a list of strings as input.
If a specified column does not exist in the DataFrame, no error or warning is thrown, and all existing columns will remain.
Expected behavior - When the
column
does not exist, all columns will remain (no error or warning is thrown) - Either a single string, or a list of strings can be specified
Example df:
product amount country Banana lemon orange 1000 USA Carrots Blueberries 1500 USA Beans 1600 USA output_df = DropColumn(column=\"product\").transform(df)\n
output_df:
amount country 1000 USA 1500 USA 1600 USA In this example, the product
column is dropped from the DataFrame df
.
"},{"location":"api_reference/spark/transformations/drop_column.html#koheesio.spark.transformations.drop_column.DropColumn.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/transformations/drop_column.py
def execute(self):\n self.log.info(f\"{self.column=}\")\n self.output.df = self.df.drop(*self.columns)\n
"},{"location":"api_reference/spark/transformations/dummy.html","title":"Dummy","text":"Dummy transformation for testing purposes.
"},{"location":"api_reference/spark/transformations/dummy.html#koheesio.spark.transformations.dummy.DummyTransformation","title":"koheesio.spark.transformations.dummy.DummyTransformation","text":"Dummy transformation for testing purposes.
This transformation adds a new column hello
to the DataFrame with the value world
.
It is intended for testing purposes or for use in examples or reference documentation.
Example input_df:
id 1 output_df = DummyTransformation().transform(input_df)\n
output_df:
id hello 1 world In this example, the hello
column is added to the DataFrame input_df
.
"},{"location":"api_reference/spark/transformations/dummy.html#koheesio.spark.transformations.dummy.DummyTransformation.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/transformations/dummy.py
def execute(self):\n self.output.df = self.df.withColumn(\"hello\", lit(\"world\"))\n
"},{"location":"api_reference/spark/transformations/get_item.html","title":"Get item","text":"Transformation to wrap around the pyspark getItem function
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem","title":"koheesio.spark.transformations.get_item.GetItem","text":"Get item from list or map (dictionary)
Wrapper around pyspark.sql.functions.getItem
GetItem
is strict about the data type of the column. If the column is not a list or a map, an error will be raised.
Note Only MapType and ArrayType are supported.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to get the item from. Alias: column
required target_column
Optional[str]
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
None
key
Union[int, str]
The key (or index) to get from the list or map. If the column is a list (ArrayType), this should be an integer. If the column is a dict (MapType), this should be a string. Alias: index
required Example"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem--example-with-list-arraytype","title":"Example with list (ArrayType)","text":"By specifying an integer for the parameter \"key\", getItem knows to get the element at index n of a list (index starts at 0).
input_df:
id content 1 [1, 2, 3] 2 [4, 5] 3 [6] 4 [] output_df = GetItem(\n column=\"content\",\n index=1, # get the second element of the list\n target_column=\"item\",\n).transform(input_df)\n
output_df:
id content item 1 [1, 2, 3] 2 2 [4, 5] 5 3 [6] null 4 [] null"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem--example-with-a-dict-maptype","title":"Example with a dict (MapType)","text":"input_df:
id content 1 {key1 -> value1} 2 {key1 -> value2} 3 {key2 -> hello} 4 {key2 -> world} output_df = GetItem(\n    column=\"content\",\n    key=\"key2\",\n    target_column=\"item\",\n).transform(input_df)\n
As we request the key to be \"key2\", the first 2 rows will be null, because they do not contain \"key2\". output_df:
id content item 1 {key1 -> value1} null 2 {key1 -> value2} null 3 {key2 -> hello} hello 4 {key2 -> world} world"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.key","title":"key class-attribute
instance-attribute
","text":"key: Union[int, str] = Field(default=..., alias='index', description='The key (or index) to get from the list or map. If the column is a list (ArrayType), this should be an integer. If the column is a dict (MapType), this should be a string. Alias: index')\n
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.ColumnConfig","title":"ColumnConfig","text":"Limit the data types to ArrayType and MapType.
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.ColumnConfig.data_type_strict_mode","title":"data_type_strict_mode class-attribute
instance-attribute
","text":"data_type_strict_mode = True\n
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = run_for_all_data_type\n
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [ARRAY, MAP]\n
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/get_item.py
def func(self, column: Column) -> Column:\n return get_item(column, self.key)\n
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.get_item","title":"koheesio.spark.transformations.get_item.get_item","text":"get_item(column: Column, key: Union[str, int])\n
Wrapper around pyspark.sql.functions.getItem
Parameters:
Name Type Description Default column
Column
The column to get the item from
required key
Union[str, int]
The key (or index) to get from the list or map. If the column is a list (ArrayType), this should be an integer. If the column is a dict (MapType), this should be a string.
required Returns:
Type Description Column
The column with the item
Source code in src/koheesio/spark/transformations/get_item.py
def get_item(column: Column, key: Union[str, int]):\n \"\"\"\n Wrapper around pyspark.sql.functions.getItem\n\n Parameters\n ----------\n column : Column\n The column to get the item from\n key : Union[str, int]\n The key (or index) to get from the list or map. If the column is a list (ArrayType), this should be an integer.\n If the column is a dict (MapType), this should be a string.\n\n Returns\n -------\n Column\n The column with the item\n \"\"\"\n return column.getItem(key)\n
"},{"location":"api_reference/spark/transformations/hash.html","title":"Hash","text":"Module for hashing data using SHA-2 family of hash functions
See the docstring of the Sha2Hash class for more information.
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.HASH_ALGORITHM","title":"koheesio.spark.transformations.hash.HASH_ALGORITHM module-attribute
","text":"HASH_ALGORITHM = Literal[224, 256, 384, 512]\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.STRING","title":"koheesio.spark.transformations.hash.STRING module-attribute
","text":"STRING = STRING\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.Sha2Hash","title":"koheesio.spark.transformations.hash.Sha2Hash","text":"hash the value of 1 or more columns using SHA-2 family of hash functions
Mild wrapper around pyspark.sql.functions.sha2
- https://spark.apache.org/docs/3.3.2/api/python/reference/api/pyspark.sql.functions.sha2.html
Returns the hex string result of SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512).
Note This function allows concatenating the values of multiple columns together prior to hashing.
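For illustration, a minimal usage sketch (the column names and the input DataFrame input_df are assumptions); the parameters are described below:
from koheesio.spark.transformations.hash import Sha2Hash

output_df = Sha2Hash(
    columns=["first_name", "last_name"],  # values are concatenated with the delimiter before hashing
    delimiter="|",
    num_bits=256,
    target_column="name_sha256",
).transform(input_df)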
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to hash. Alias: column
required delimiter
Optional[str]
Optional separator for the string that will eventually be hashed. Defaults to '|'
|
num_bits
Optional[HASH_ALGORITHM]
Algorithm to use for sha2 hash. Defaults to 256. Should be one of 224, 256, 384, 512
256
target_column
str
The generated hash will be written to the column name specified here
required"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.Sha2Hash.delimiter","title":"delimiter class-attribute
instance-attribute
","text":"delimiter: Optional[str] = Field(default='|', description=\"Optional separator for the string that will eventually be hashed. Defaults to '|'\")\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.Sha2Hash.num_bits","title":"num_bits class-attribute
instance-attribute
","text":"num_bits: Optional[HASH_ALGORITHM] = Field(default=256, description='Algorithm to use for sha2 hash. Defaults to 256. Should be one of 224, 256, 384, 512')\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.Sha2Hash.target_column","title":"target_column class-attribute
instance-attribute
","text":"target_column: str = Field(default=..., description='The generated hash will be written to the column name specified here')\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.Sha2Hash.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/transformations/hash.py
def execute(self):\n columns = list(self.get_columns())\n self.output.df = (\n self.df.withColumn(\n self.target_column, sha2_hash(columns=columns, delimiter=self.delimiter, num_bits=self.num_bits)\n )\n if columns\n else self.df\n )\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.sha2_hash","title":"koheesio.spark.transformations.hash.sha2_hash","text":"sha2_hash(columns: List[str], delimiter: Optional[str] = '|', num_bits: Optional[HASH_ALGORITHM] = 256)\n
hash the value of 1 or more columns using SHA-2 family of hash functions
Mild wrapper around pyspark.sql.functions.sha2
- https://spark.apache.org/docs/3.3.2/api/python/reference/api/pyspark.sql.functions.sha2.html
Returns the hex string result of SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512). This function allows concatenating the values of multiple columns together prior to hashing.
If a null is passed, the result will also be null.
Parameters:
Name Type Description Default columns
List[str]
The columns to hash
required delimiter
Optional[str]
Optional separator for the string that will eventually be hashed. Defaults to '|'
|
num_bits
Optional[HASH_ALGORITHM]
Algorithm to use for sha2 hash. Defaults to 256. Should be one of 224, 256, 384, 512
256
Source code in src/koheesio/spark/transformations/hash.py
def sha2_hash(columns: List[str], delimiter: Optional[str] = \"|\", num_bits: Optional[HASH_ALGORITHM] = 256):\n \"\"\"\n hash the value of 1 or more columns using SHA-2 family of hash functions\n\n Mild wrapper around pyspark.sql.functions.sha2\n\n - https://spark.apache.org/docs/3.3.2/api/python/reference/api/pyspark.sql.functions.sha2.html\n\n Returns the hex string result of SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512).\n This function allows concatenating the values of multiple columns together prior to hashing.\n\n If a null is passed, the result will also be null.\n\n Parameters\n ----------\n columns : List[str]\n The columns to hash\n delimiter : Optional[str], optional, default=|\n Optional separator for the string that will eventually be hashed. Defaults to '|'\n num_bits : Optional[HASH_ALGORITHM], optional, default=256\n Algorithm to use for sha2 hash. Defaults to 256. Should be one of 224, 256, 384, 512\n \"\"\"\n # make sure all columns are of type pyspark.sql.Column and cast to string\n _columns = []\n for c in columns:\n if isinstance(c, str):\n c: Column = col(c)\n _columns.append(c.cast(STRING.spark_type()))\n\n # concatenate columns if more than 1 column is provided\n if len(_columns) > 1:\n column = concat_ws(delimiter, *_columns)\n else:\n column = _columns[0]\n\n return sha2(column, num_bits)\n
"},{"location":"api_reference/spark/transformations/lookup.html","title":"Lookup","text":"Lookup transformation for joining two dataframes together
Classes:
Name Description JoinMapping
TargetColumn
JoinType
JoinHint
DataframeLookup
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup","title":"koheesio.spark.transformations.lookup.DataframeLookup","text":"Lookup transformation for joining two dataframes together
Parameters:
Name Type Description Default df
DataFrame
The left Spark DataFrame
required other
DataFrame
The right Spark DataFrame
required on
List[JoinMapping] | JoinMapping
List of join mappings. If only one mapping is passed, it can be passed as a single object.
required targets
List[TargetColumn] | TargetColumn
List of target columns. If only one target is passed, it can be passed as a single object.
required how
JoinType
What type of join to perform. Defaults to left. See JoinType for more information.
required hint
JoinHint
What type of join hint to use. Defaults to None. See JoinHint for more information.
required Example from pyspark.sql import SparkSession\nfrom koheesio.spark.transformations.lookup import (\n DataframeLookup,\n JoinMapping,\n TargetColumn,\n JoinType,\n)\n\nspark = SparkSession.builder.getOrCreate()\n\n# create the dataframes\nleft_df = spark.createDataFrame([(1, \"A\"), (2, \"B\")], [\"id\", \"value\"])\nright_df = spark.createDataFrame([(1, \"A\"), (3, \"C\")], [\"id\", \"value\"])\n\n# perform the lookup\nlookup = DataframeLookup(\n df=left_df,\n other=right_df,\n on=JoinMapping(source_column=\"id\", joined_column=\"id\"),\n targets=TargetColumn(target_column=\"value\", target_column_alias=\"right_value\"),\n how=JoinType.LEFT,\n)\n\noutput_df = lookup.transform()\n
output_df:
id value right_value 1 A A 2 B null In this example, the left_df
and right_df
dataframes are joined together using the id
column. The value
column from the right_df
is aliased as right_value
in the output dataframe.
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.df","title":"df class-attribute
instance-attribute
","text":"df: DataFrame = Field(default=None, description='The left Spark DataFrame')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.hint","title":"hint class-attribute
instance-attribute
","text":"hint: Optional[JoinHint] = Field(default=None, description='What type of join hint to use. Defaults to None. ' + __doc__)\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.how","title":"how class-attribute
instance-attribute
","text":"how: Optional[JoinType] = Field(default=LEFT, description='What type of join to perform. Defaults to left. ' + __doc__)\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.on","title":"on class-attribute
instance-attribute
","text":"on: Union[List[JoinMapping], JoinMapping] = Field(default=..., alias='join_mapping', description='List of join mappings. If only one mapping is passed, it can be passed as a single object.')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.other","title":"other class-attribute
instance-attribute
","text":"other: DataFrame = Field(default=None, description='The right Spark DataFrame')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.targets","title":"targets class-attribute
instance-attribute
","text":"targets: Union[List[TargetColumn], TargetColumn] = Field(default=..., alias='target_columns', description='List of target columns. If only one target is passed, it can be passed as a single object.')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.Output","title":"Output","text":"Output for the lookup transformation
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.Output.left_df","title":"left_df class-attribute
instance-attribute
","text":"left_df: DataFrame = Field(default=..., description='The left Spark DataFrame')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.Output.right_df","title":"right_df class-attribute
instance-attribute
","text":"right_df: DataFrame = Field(default=..., description='The right Spark DataFrame')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.execute","title":"execute","text":"execute() -> Output\n
Execute the lookup transformation
Source code in src/koheesio/spark/transformations/lookup.py
def execute(self) -> Output:\n \"\"\"Execute the lookup transformation\"\"\"\n # prepare the right dataframe\n prepared_right_df = self.get_right_df().select(\n *[join_mapping.column for join_mapping in self.on],\n *[target.column for target in self.targets],\n )\n if self.hint:\n prepared_right_df = prepared_right_df.hint(self.hint)\n\n # generate the output\n self.output.left_df = self.df\n self.output.right_df = prepared_right_df\n self.output.df = self.df.join(\n prepared_right_df,\n on=[join_mapping.source_column for join_mapping in self.on],\n how=self.how,\n )\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.get_right_df","title":"get_right_df","text":"get_right_df() -> DataFrame\n
Get the right side dataframe
Source code in src/koheesio/spark/transformations/lookup.py
def get_right_df(self) -> DataFrame:\n \"\"\"Get the right side dataframe\"\"\"\n return self.other\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.set_list","title":"set_list","text":"set_list(value)\n
Ensure that we can pass either a single object, or a list of objects
Source code in src/koheesio/spark/transformations/lookup.py
@field_validator(\"on\", \"targets\")\ndef set_list(cls, value):\n \"\"\"Ensure that we can pass either a single object, or a list of objects\"\"\"\n return [value] if not isinstance(value, list) else value\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinHint","title":"koheesio.spark.transformations.lookup.JoinHint","text":"Supported join hints
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinHint.BROADCAST","title":"BROADCAST class-attribute
instance-attribute
","text":"BROADCAST = 'broadcast'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinHint.MERGE","title":"MERGE class-attribute
instance-attribute
","text":"MERGE = 'merge'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinMapping","title":"koheesio.spark.transformations.lookup.JoinMapping","text":"Mapping for joining two dataframes together
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinMapping.column","title":"column property
","text":"column: Column\n
Get the join mapping as a pyspark.sql.Column object
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinMapping.other_column","title":"other_column instance-attribute
","text":"other_column: str\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinMapping.source_column","title":"source_column instance-attribute
","text":"source_column: str\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType","title":"koheesio.spark.transformations.lookup.JoinType","text":"Supported join types
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.ANTI","title":"ANTI class-attribute
instance-attribute
","text":"ANTI = 'anti'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.CROSS","title":"CROSS class-attribute
instance-attribute
","text":"CROSS = 'cross'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.FULL","title":"FULL class-attribute
instance-attribute
","text":"FULL = 'full'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.INNER","title":"INNER class-attribute
instance-attribute
","text":"INNER = 'inner'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.LEFT","title":"LEFT class-attribute
instance-attribute
","text":"LEFT = 'left'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.RIGHT","title":"RIGHT class-attribute
instance-attribute
","text":"RIGHT = 'right'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.SEMI","title":"SEMI class-attribute
instance-attribute
","text":"SEMI = 'semi'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.TargetColumn","title":"koheesio.spark.transformations.lookup.TargetColumn","text":"Target column for the joined dataframe
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.TargetColumn.column","title":"column property
","text":"column: Column\n
Get the target column as a pyspark.sql.Column object
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.TargetColumn.target_column","title":"target_column instance-attribute
","text":"target_column: str\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.TargetColumn.target_column_alias","title":"target_column_alias instance-attribute
","text":"target_column_alias: str\n
"},{"location":"api_reference/spark/transformations/repartition.html","title":"Repartition","text":"Repartition Transformation
"},{"location":"api_reference/spark/transformations/repartition.html#koheesio.spark.transformations.repartition.Repartition","title":"koheesio.spark.transformations.repartition.Repartition","text":"Wrapper around DataFrame.repartition
With repartition, the number of partitions can be given as an optional value. If this is not provided, a default value is used. The default number of partitions is defined by the spark config 'spark.sql.shuffle.partitions', for which the default value is 200 and will never exceed the number or rows in the DataFrame (whichever is value is lower).
If columns are omitted, the entire DataFrame is repartitioned without considering the particular values in the columns.
Parameters:
Name Type Description Default column
Optional[Union[str, List[str]]]
Name of the source column(s). If omitted, the entire DataFrame is repartitioned without considering the particular values in the columns. Alias: columns
None
num_partitions
Optional[int]
The number of partitions to repartition to. If omitted, the default number of partitions is used as defined by the spark config 'spark.sql.shuffle.partitions'.
None
Example Repartition(column=[\"c1\", \"c2\"], num_partitions=3) # results in 3 partitions\nRepartition(column=\"c1\", num_partitions=2) # results in 2 partitions\nRepartition(column=[\"c1\", \"c2\"]) # results in <= 200 partitions\nRepartition(num_partitions=5) # results in 5 partitions\n
"},{"location":"api_reference/spark/transformations/repartition.html#koheesio.spark.transformations.repartition.Repartition.columns","title":"columns class-attribute
instance-attribute
","text":"columns: Optional[ListOfColumns] = Field(default='', alias='column', description='Name of the source column(s)')\n
"},{"location":"api_reference/spark/transformations/repartition.html#koheesio.spark.transformations.repartition.Repartition.numPartitions","title":"numPartitions class-attribute
instance-attribute
","text":"numPartitions: Optional[int] = Field(default=None, alias='num_partitions', description=\"The number of partitions to repartition to. If omitted, the default number of partitions is used as defined by the spark config 'spark.sql.shuffle.partitions'.\")\n
"},{"location":"api_reference/spark/transformations/repartition.html#koheesio.spark.transformations.repartition.Repartition.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/transformations/repartition.py
def execute(self):\n # Prepare columns input:\n columns = self.df.columns if self.columns == [\"*\"] else self.columns\n # Prepare repartition input:\n # num_partitions comes first, but if it is not provided it should not be included as None.\n repartition_inputs = [i for i in [self.numPartitions, *columns] if i]\n self.output.df = self.df.repartition(*repartition_inputs)\n
"},{"location":"api_reference/spark/transformations/replace.html","title":"Replace","text":"Transformation to replace a particular value in a column with another one
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace","title":"koheesio.spark.transformations.replace.Replace","text":"Replace a particular value in a column with another one
Can handle empty strings (\"\") as well as NULL / None values.
Unsupported datatypes: The following casts are not supported and will raise an error in Spark:
- binary
- boolean
- array<...>
- map<...,...>
Supported datatypes: The following casts are supported:
- byte
- short
- integer
- long
- float
- double
- decimal
- timestamp
- date
- string
- void skipped by default
Any supported non-string datatype will be cast to string before the replacement is done.
Example input_df:
id string 1 hello 2 world 3 output_df = Replace(\n column=\"string\",\n from_value=\"hello\",\n to_value=\"programmer\",\n).transform(input_df)\n
output_df:
id string 1 programmer 2 world 3 In this example, the value \"hello\" in the column \"string\" is replaced with \"programmer\".
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.from_value","title":"from_value class-attribute
instance-attribute
","text":"from_value: Optional[str] = Field(default=None, alias='from', description=\"The original value that needs to be replaced. If no value is given, all 'null' values will be replaced with the to_value\")\n
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.to_value","title":"to_value class-attribute
instance-attribute
","text":"to_value: str = Field(default=..., alias='to', description='The new value to replace this with')\n
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.ColumnConfig","title":"ColumnConfig","text":"Column type configurations for the column to be replaced
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, VOID]\n
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, STRING, TIMESTAMP, DATE]\n
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/replace.py
def func(self, column: Column) -> Column:\n return replace(column=column, from_value=self.from_value, to_value=self.to_value)\n
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.replace","title":"koheesio.spark.transformations.replace.replace","text":"replace(column: Union[Column, str], to_value: str, from_value: Optional[str] = None)\n
Function to replace a particular value in a column with another one
Source code in src/koheesio/spark/transformations/replace.py
def replace(column: Union[Column, str], to_value: str, from_value: Optional[str] = None):\n \"\"\"Function to replace a particular value in a column with another one\"\"\"\n # make sure we have a Column object\n if isinstance(column, str):\n column = col(column)\n\n if not from_value:\n condition = column.isNull()\n else:\n condition = column == from_value\n\n return when(condition, lit(to_value)).otherwise(column)\n
"},{"location":"api_reference/spark/transformations/row_number_dedup.html","title":"Row number dedup","text":"This module contains the RowNumberDedup class, which performs a row_number deduplication operation on a DataFrame.
See the docstring of the RowNumberDedup class for more information.
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup","title":"koheesio.spark.transformations.row_number_dedup.RowNumberDedup","text":"A class used to perform a row_number deduplication operation on a DataFrame.
This class is a specialized transformation that extends the ColumnsTransformation class. It sorts the DataFrame based on the provided sort columns and assigns a row_number to each row. It then filters the DataFrame to keep only the first row (row_number 1) for each group of duplicates. The row_number of each row can be stored in a specified target column or a default column named \"meta_row_number_column\". The class also provides an option to preserve meta columns (like the row_number column) in the output DataFrame.
Attributes:
Name Type Description columns
list
List of columns to apply the transformation to. If a single '*' is passed as a column name or if the columns parameter is omitted, the transformation will be applied to all columns of the data types specified in run_for_all_data_type
of the ColumnConfig. (inherited from ColumnsTransformation)
sort_columns
list
List of columns that the DataFrame will be sorted by.
target_column
(str, optional)
Column where the row_number of each row will be stored.
preserve_meta
(bool, optional)
Flag that determines whether the meta columns should be kept in the output DataFrame.
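For illustration, a minimal sketch (the column names and the input DataFrame input_df are assumptions): keep one row per customer_id, preferring the highest updated_at.
from koheesio.spark.transformations.row_number_dedup import RowNumberDedup

output_df = RowNumberDedup(
    columns=["customer_id"],      # window partition: what defines a group of duplicates
    sort_columns=["updated_at"],  # ranking within each group (string columns are ordered DESC, see window_spec)
    preserve_meta=False,          # drop the helper row_number column from the result
).transform(input_df)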
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.preserve_meta","title":"preserve_meta class-attribute
instance-attribute
","text":"preserve_meta: bool = Field(default=False, description=\"If true, meta columns are kept in output dataframe. Defaults to 'False'\")\n
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.sort_columns","title":"sort_columns class-attribute
instance-attribute
","text":"sort_columns: conlist(Union[str, Column], min_length=0) = Field(default_factory=list, alias='sort_column', description='List of orderBy columns. If only one column is passed, it can be passed as a single object.')\n
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.target_column","title":"target_column class-attribute
instance-attribute
","text":"target_column: Optional[Union[str, Column]] = Field(default='meta_row_number_column', alias='target_suffix', description='The column to store the result in. If not provided, the result will be stored in the sourcecolumn. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix')\n
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.window_spec","title":"window_spec property
","text":"window_spec: WindowSpec\n
Builds a WindowSpec object based on the columns defined in the configuration.
The WindowSpec object is used to define a window frame over which functions are applied in Spark. This method partitions the data by the columns returned by the get_columns
method and then orders the partitions by the columns specified in sort_columns
.
Notes The order of the columns in the WindowSpec object is preserved. If a column is passed as a string, it is converted to a Column object with DESC ordering.
Returns:
Type Description WindowSpec
A WindowSpec object that can be used to define a window frame in Spark.
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.execute","title":"execute","text":"execute() -> Output\n
Performs the row_number deduplication operation on the DataFrame.
This method sorts the DataFrame based on the provided sort columns, assigns a row_number to each row, and then filters the DataFrame to keep only the top-row_number row for each group of duplicates. The row_number of each row is stored in the target column. If preserve_meta is False, the method also drops the target column from the DataFrame.
Source code in src/koheesio/spark/transformations/row_number_dedup.py
def execute(self) -> RowNumberDedup.Output:\n \"\"\"\n Performs the row_number deduplication operation on the DataFrame.\n\n This method sorts the DataFrame based on the provided sort columns, assigns a row_number to each row,\n and then filters the DataFrame to keep only the top-row_number row for each group of duplicates.\n The row_number of each row is stored in the target column. If preserve_meta is False,\n the method also drops the target column from the DataFrame.\n \"\"\"\n df = self.df\n window_spec = self.window_spec\n\n # if target_column is a string, convert it to a Column object\n if isinstance((target_column := self.target_column), str):\n target_column = col(target_column)\n\n # dedup the dataframe based on the window spec\n df = df.withColumn(self.target_column, row_number().over(window_spec)).filter(target_column == 1).select(\"*\")\n\n if not self.preserve_meta:\n df = df.drop(target_column)\n\n self.output.df = df\n
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.set_sort_columns","title":"set_sort_columns","text":"set_sort_columns(columns_value)\n
Validates and optimizes the sort_columns parameter.
This method ensures that sort_columns is a list (or single object) of unique strings or Column objects. It removes any empty strings or None values from the list and deduplicates the columns.
Parameters:
Name Type Description Default columns_value
Union[str, Column, List[Union[str, Column]]]
The value of the sort_columns parameter.
required Returns:
Type Description List[Union[str, Column]]
The optimized and deduplicated list of sort columns.
Source code in src/koheesio/spark/transformations/row_number_dedup.py
@field_validator(\"sort_columns\", mode=\"before\")\ndef set_sort_columns(cls, columns_value):\n \"\"\"\n Validates and optimizes the sort_columns parameter.\n\n This method ensures that sort_columns is a list (or single object) of unique strings or Column objects.\n It removes any empty strings or None values from the list and deduplicates the columns.\n\n Parameters\n ----------\n columns_value : Union[str, Column, List[Union[str, Column]]]\n The value of the sort_columns parameter.\n\n Returns\n -------\n List[Union[str, Column]]\n The optimized and deduplicated list of sort columns.\n \"\"\"\n # Convert single string or Column object to a list\n columns = [columns_value] if isinstance(columns_value, (str, Column)) else [*columns_value]\n\n # Remove empty strings, None, etc.\n columns = [c for c in columns if (isinstance(c, Column) and c is not None) or (isinstance(c, str) and c)]\n\n dedup_columns = []\n seen = set()\n\n # Deduplicate the columns while preserving the order\n for column in columns:\n if str(column) not in seen:\n dedup_columns.append(column)\n seen.add(str(column))\n\n return dedup_columns\n
"},{"location":"api_reference/spark/transformations/sql_transform.html","title":"Sql transform","text":"SQL Transform module
SQL Transform module provides an easy interface to transform a dataframe using SQL. This SQL can originate from a string or a file and may contain placeholders for templating.
"},{"location":"api_reference/spark/transformations/sql_transform.html#koheesio.spark.transformations.sql_transform.SqlTransform","title":"koheesio.spark.transformations.sql_transform.SqlTransform","text":"SQL Transform module provides an easy interface to transform a dataframe using SQL.
This SQL can originate from a string or a file and may contain placeholders (parameters) for templating.
- Placeholders are identified with
${placeholder}
. - Placeholders can be passed as explicit params (params) or as implicit params (kwargs).
Example sql script:
SELECT id, id + 1 AS incremented_id, ${dynamic_column} AS extra_column\nFROM ${table_name}\n
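A usage sketch, assuming the query is supplied through a sql field (not shown in this section) and an input DataFrame named input_df; ${table_name} is injected by the step itself, while ${dynamic_column} is resolved from the keyword argument.
from koheesio.spark.transformations.sql_transform import SqlTransform\n\noutput_df = SqlTransform(\n    sql=\"SELECT id, ${dynamic_column} AS extra_column FROM ${table_name}\",\n    dynamic_column=\"'foo'\",  # implicit param passed as a kwarg\n).transform(input_df)\n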
"},{"location":"api_reference/spark/transformations/sql_transform.html#koheesio.spark.transformations.sql_transform.SqlTransform.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/transformations/sql_transform.py
def execute(self):\n table_name = get_random_string(prefix=\"sql_transform\")\n self.params = {**self.params, \"table_name\": table_name}\n\n df = self.df\n df.createOrReplaceTempView(table_name)\n query = self.query\n\n self.output.df = self.spark.sql(query)\n
"},{"location":"api_reference/spark/transformations/transform.html","title":"Transform","text":"Transform module
Transform aims to provide an easy interface for calling transformations on a Spark DataFrame, where the transformation is a function that accepts a DataFrame (df) and any number of keyword args.
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform","title":"koheesio.spark.transformations.transform.Transform","text":"Transform(func: Callable, params: Dict = None, df: DataFrame = None, **kwargs)\n
Transform aims to provide an easy interface for calling transformations on a Spark DataFrame, where the transformation is a function that accepts a DataFrame (df) and any number of keyword args.
The implementation is inspired by and based upon: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.transform.html
Parameters:
Name Type Description Default func
Callable
The function to be called on the DataFrame.
required params
Dict
The keyword arguments to be passed to the function. Defaults to None. Alternatively, keyword arguments can be passed directly as keyword arguments - they will be merged with the params
dictionary.
None
Example Source code in src/koheesio/spark/transformations/transform.py
def __init__(self, func: Callable, params: Dict = None, df: DataFrame = None, **kwargs):\n params = {**(params or {}), **kwargs}\n super().__init__(func=func, params=params, df=df)\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform--a-function-compatible-with-transform","title":"a function compatible with Transform:","text":"def some_func(df, a: str, b: str):\n return df.withColumn(a, f.lit(b))\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform--verbose-style-input-in-transform","title":"verbose style input in Transform","text":"Transform(func=some_func, params={\"a\": \"foo\", \"b\": \"bar\"})\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform--shortened-style-notation-easier-to-read","title":"shortened style notation (easier to read)","text":"Transform(some_func, a=\"foo\", b=\"bar\")\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform--when-too-much-input-is-given-transform-will-ignore-extra-input","title":"when too much input is given, Transform will ignore extra input","text":"Transform(\n some_func,\n a=\"foo\",\n # ignored input\n c=\"baz\",\n title=42,\n author=\"Adams\",\n # order of params input should not matter\n b=\"bar\",\n)\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform--using-the-from_func-classmethod","title":"using the from_func classmethod","text":"SomeFunc = Transform.from_func(some_func, a=\"foo\")\nsome_func = SomeFunc(b=\"bar\")\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform.func","title":"func class-attribute
instance-attribute
","text":"func: Callable = Field(default=None, description='The function to be called on the DataFrame.')\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform.execute","title":"execute","text":"execute()\n
Call the function on the DataFrame with the given keyword arguments.
Source code in src/koheesio/spark/transformations/transform.py
def execute(self):\n \"\"\"Call the function on the DataFrame with the given keyword arguments.\"\"\"\n func, kwargs = get_args_for_func(self.func, self.params)\n self.output.df = self.df.transform(func=func, **kwargs)\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform.from_func","title":"from_func classmethod
","text":"from_func(func: Callable, **kwargs) -> Callable[..., Transform]\n
Create a Transform class from a function. Useful for creating a new class with a different name.
This method uses the functools.partial
function to create a new class with the given function and keyword arguments. This way you can pre-define some of the keyword arguments for the function that might be needed for the specific use case.
Example CustomTransform = Transform.from_func(some_func, a=\"foo\")\nsome_func = CustomTransform(b=\"bar\")\n
In this example, CustomTransform
is a Transform class with the function some_func
and the keyword argument a
set to \"foo\". When calling some_func(b=\"bar\")
, the function some_func
will be called with the keyword arguments a=\"foo\"
and b=\"bar\"
.
Source code in src/koheesio/spark/transformations/transform.py
@classmethod\ndef from_func(cls, func: Callable, **kwargs) -> Callable[..., Transform]:\n \"\"\"Create a Transform class from a function. Useful for creating a new class with a different name.\n\n This method uses the `functools.partial` function to create a new class with the given function and keyword\n arguments. This way you can pre-define some of the keyword arguments for the function that might be needed for\n the specific use case.\n\n Example\n -------\n ```python\n CustomTransform = Transform.from_func(some_func, a=\"foo\")\n some_func = CustomTransform(b=\"bar\")\n ```\n\n In this example, `CustomTransform` is a Transform class with the function `some_func` and the keyword argument\n `a` set to \"foo\". When calling `some_func(b=\"bar\")`, the function `some_func` will be called with the keyword\n arguments `a=\"foo\"` and `b=\"bar\"`.\n \"\"\"\n return partial(cls, func=func, **kwargs)\n
"},{"location":"api_reference/spark/transformations/uuid5.html","title":"Uuid5","text":"Ability to generate UUID5 using native pyspark (no udf)
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5","title":"koheesio.spark.transformations.uuid5.HashUUID5","text":"Generate a UUID with the UUID5 algorithm
Spark does not provide an inbuilt API to generate version 5 UUIDs, hence we have to use a custom implementation to provide this capability.
Prerequisites: this function has no side effects. Be aware, though, that in most cases your data is expected to be clean (e.g. trimmed of leading and trailing spaces)
Concept UUID5 is based on the SHA-1 hash of a namespace identifier (which is a UUID) and a name (which is a string). https://docs.python.org/3/library/uuid.html#uuid.uuid5
Based on https://github.com/MrPowers/quinn/pull/96 with the difference that since Spark 3.0.0 an OVERLAY function from ANSI SQL 2016 is available which saves coding space and string allocation(s) in place of CONCAT + SUBSTRING.
For more info on OVERLAY, see: https://docs.databricks.com/sql/language-manual/functions/overlay.html
Example Input is a DataFrame with two columns:
id string 1 hello 2 world 3 Input parameters:
- source_columns = [\"id\", \"string\"]
- target_column = \"uuid5\"
Result:
id string uuid5 1 hello f3e99bbd-85ae-5dc3-bf6e-cd0022a0ebe6 2 world b48e880f-c289-5c94-b51f-b9d21f9616c0 3 2193a99d-222e-5a0c-a7d6-48fbe78d2708 In code:
HashUUID5(source_columns=[\"id\", \"string\"], target_column=\"uuid5\").transform(input_df)\n
In this example, the id
and string
columns are concatenated and hashed using the UUID5 algorithm. The result is stored in the uuid5
column.
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.delimiter","title":"delimiter class-attribute
instance-attribute
","text":"delimiter: Optional[str] = Field(default='|', description='Separator for the string that will eventually be hashed')\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.description","title":"description class-attribute
instance-attribute
","text":"description: str = 'Generate a UUID with the UUID5 algorithm'\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.extra_string","title":"extra_string class-attribute
instance-attribute
","text":"extra_string: Optional[str] = Field(default='', description='In case of collisions, one can pass an extra string to hash on.')\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.namespace","title":"namespace class-attribute
instance-attribute
","text":"namespace: Optional[Union[str, UUID]] = Field(default='', description='Namespace DNS')\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.source_columns","title":"source_columns class-attribute
instance-attribute
","text":"source_columns: ListOfColumns = Field(default=..., description=\"List of columns that should be hashed. Should contain the name of at least 1 column. A list of columns or a single column can be specified. For example: `['column1', 'column2']` or `'column1'`\")\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.target_column","title":"target_column class-attribute
instance-attribute
","text":"target_column: str = Field(default=..., description='The generated UUID will be written to the column name specified here')\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.execute","title":"execute","text":"execute() -> None\n
Source code in src/koheesio/spark/transformations/uuid5.py
def execute(self) -> None:\n ns = f.lit(uuid5_namespace(self.namespace).bytes)\n self.log.info(f\"UUID5 namespace '{ns}' derived from '{self.namespace}'\")\n cols_to_hash = f.concat_ws(self.delimiter, *self.source_columns)\n cols_to_hash = f.concat(f.lit(self.extra_string), cols_to_hash)\n cols_to_hash = f.encode(cols_to_hash, \"utf-8\")\n cols_to_hash = f.concat(ns, cols_to_hash)\n source_columns_sha1 = f.sha1(cols_to_hash)\n variant_part = f.substring(source_columns_sha1, 17, 4)\n variant_part = f.conv(variant_part, 16, 2)\n variant_part = f.lpad(variant_part, 16, \"0\")\n variant_part = f.overlay(variant_part, f.lit(\"10\"), 1, 2) # RFC 4122 variant.\n variant_part = f.lower(f.conv(variant_part, 2, 16))\n target_col_uuid = f.concat_ws(\n \"-\",\n f.substring(source_columns_sha1, 1, 8),\n f.substring(source_columns_sha1, 9, 4),\n f.concat(f.lit(\"5\"), f.substring(source_columns_sha1, 14, 3)), # Set version.\n variant_part,\n f.substring(source_columns_sha1, 21, 12),\n )\n # Applying the transformation to the input df, storing the result in the column specified in `target_column`.\n self.output.df = self.df.withColumn(self.target_column, target_col_uuid)\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.hash_uuid5","title":"koheesio.spark.transformations.uuid5.hash_uuid5","text":"hash_uuid5(input_value: str, namespace: Optional[Union[str, UUID]] = '', extra_string: Optional[str] = '')\n
pure python implementation of HashUUID5
See: https://docs.python.org/3/library/uuid.html#uuid.uuid5
Parameters:
Name Type Description Default input_value
str
value that will be hashed
required namespace
Optional[str | UUID]
namespace DNS
''
extra_string
Optional[str]
optional extra string that will be prepended to the input_value
''
Returns:
Type Description str
uuid.UUID (uuid5) cast to string
Source code in src/koheesio/spark/transformations/uuid5.py
def hash_uuid5(\n input_value: str,\n namespace: Optional[Union[str, uuid.UUID]] = \"\",\n extra_string: Optional[str] = \"\",\n):\n \"\"\"pure python implementation of HashUUID5\n\n See: https://docs.python.org/3/library/uuid.html#uuid.uuid5\n\n Parameters\n ----------\n input_value : str\n value that will be hashed\n namespace : Optional[str | uuid.UUID]\n namespace DNS\n extra_string : Optional[str]\n optional extra string that will be prepended to the input_value\n\n Returns\n -------\n str\n uuid.UUID (uuid5) cast to string\n \"\"\"\n if not isinstance(namespace, uuid.UUID):\n hashed_namespace = uuid5_namespace(namespace)\n else:\n hashed_namespace = namespace\n return str(uuid.uuid5(hashed_namespace, (extra_string + input_value)))\n
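A small illustrative call; the input value is arbitrary and no Spark session is needed for this helper.
from koheesio.spark.transformations.uuid5 import hash_uuid5\n\n# hashes against the default DNS namespace and returns the UUID5 as a string\nprint(hash_uuid5(\"hello|world\"))\n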
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.uuid5_namespace","title":"koheesio.spark.transformations.uuid5.uuid5_namespace","text":"uuid5_namespace(ns: Optional[Union[str, UUID]]) -> UUID\n
Helper function used to provide a UUID5 hashed namespace based on the passed str
Parameters:
Name Type Description Default ns
Optional[Union[str, UUID]]
A str, an empty string (or None), or an existing UUID can be passed
required Returns:
Type Description UUID
UUID5 hashed namespace
Source code in src/koheesio/spark/transformations/uuid5.py
def uuid5_namespace(ns: Optional[Union[str, uuid.UUID]]) -> uuid.UUID:\n \"\"\"Helper function used to provide a UUID5 hashed namespace based on the passed str\n\n Parameters\n ----------\n ns : Optional[Union[str, uuid.UUID]]\n A str, an empty string (or None), or an existing UUID can be passed\n\n Returns\n -------\n uuid.UUID\n UUID5 hashed namespace\n \"\"\"\n # if we already have a UUID, we just return it\n if isinstance(ns, uuid.UUID):\n return ns\n\n # if ns is empty or none, we simply return the default NAMESPACE_DNS\n if not ns:\n ns = uuid.NAMESPACE_DNS\n return ns\n\n # else we hash the string against the NAMESPACE_DNS\n ns = uuid.uuid5(uuid.NAMESPACE_DNS, ns)\n return ns\n
"},{"location":"api_reference/spark/transformations/date_time/index.html","title":"Date time","text":"Module that holds the transformations that can be used for date and time related operations.
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone","title":"koheesio.spark.transformations.date_time.ChangeTimeZone","text":"Allows for the value of a column to be changed from one timezone to another
Adding useful metadata When add_target_timezone
is enabled (default), an additional column is created documenting which timezone a field has been converted to. Additionally, the suffix added to this column can be customized (default value is _timezone
).
Example Input:
target_column = \"some_column_name\"\ntarget_timezone = \"EST\"\nadd_target_timezone = True # default value\ntimezone_column_suffix = \"_timezone\" # default value\n
Output:
column name = \"some_column_name_timezone\" # notice the suffix\ncolumn value = \"EST\"\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.add_target_timezone","title":"add_target_timezone class-attribute
instance-attribute
","text":"add_target_timezone: bool = Field(default=True, description='Toggles whether the target timezone is added as a column. True by default.')\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.from_timezone","title":"from_timezone class-attribute
instance-attribute
","text":"from_timezone: str = Field(default=..., alias='source_timezone', description='Timezone of the source_column value. Timezone fields are validated against the `TZ database name` column in this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones')\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.target_timezone_column_suffix","title":"target_timezone_column_suffix class-attribute
instance-attribute
","text":"target_timezone_column_suffix: Optional[str] = Field(default='_timezone', alias='suffix', description=\"Allows to customize the suffix that is added to the target_timezone column. Defaults to '_timezone'. Note: this will be ignored if 'add_target_timezone' is set to False\")\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.to_timezone","title":"to_timezone class-attribute
instance-attribute
","text":"to_timezone: str = Field(default=..., alias='target_timezone', description='Target timezone. Timezone fields are validated against the `TZ database name` column in this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones')\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/transformations/date_time/__init__.py
def execute(self):\n df = self.df\n\n for target_column, column in self.get_columns_with_target():\n func = self.func # select the applicable function\n df = df.withColumn(\n target_column,\n func(f.col(column)),\n )\n\n # document which timezone a field has been converted to\n if self.add_target_timezone:\n df = df.withColumn(f\"{target_column}{self.target_timezone_column_suffix}\", f.lit(self.to_timezone))\n\n self.output.df = df\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/date_time/__init__.py
def func(self, column: Column) -> Column:\n return change_timezone(column=column, source_timezone=self.from_timezone, target_timezone=self.to_timezone)\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.validate_no_duplicate_timezones","title":"validate_no_duplicate_timezones","text":"validate_no_duplicate_timezones(values)\n
Validate that source and target timezone are not the same
Source code in src/koheesio/spark/transformations/date_time/__init__.py
@model_validator(mode=\"before\")\ndef validate_no_duplicate_timezones(cls, values):\n \"\"\"Validate that source and target timezone are not the same\"\"\"\n from_timezone_value = values.get(\"from_timezone\")\n to_timezone_value = values.get(\"o_timezone\")\n\n if from_timezone_value == to_timezone_value:\n raise ValueError(\"Timezone conversions from and to the same timezones are not valid.\")\n\n return values\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.validate_timezone","title":"validate_timezone","text":"validate_timezone(timezone_value)\n
Validate that the timezone is a valid timezone.
Source code in src/koheesio/spark/transformations/date_time/__init__.py
@field_validator(\"from_timezone\", \"to_timezone\")\ndef validate_timezone(cls, timezone_value):\n \"\"\"Validate that the timezone is a valid timezone.\"\"\"\n if timezone_value not in all_timezones_set:\n raise ValueError(\n \"Not a valid timezone. Refer to the `TZ database name` column here: \"\n \"https://en.wikipedia.org/wiki/List_of_tz_database_time_zones\"\n )\n return timezone_value\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.DateFormat","title":"koheesio.spark.transformations.date_time.DateFormat","text":"wrapper around pyspark.sql.functions.date_format
See Also - https://spark.apache.org/docs/3.3.2/api/python/reference/pyspark.sql/api/pyspark.sql.functions.date_format.html
- https://spark.apache.org/docs/3.3.2/sql-ref-datetime-pattern.html
Concept This Transformation converts a date/timestamp/string to a string in the format specified by the given date format.
A pattern could be for instance dd.MM.yyyy
and could return a string like \u201818.03.1993\u2019. All pattern letters of datetime pattern can be used, see: https://spark.apache.org/docs/3.3.2/sql-ref-datetime-pattern.html
How to use If more than one column is passed, the behavior of the class changes as follows
- the transformation will be run in a loop against all the given columns
- the target_column will be used as a suffix. Leaving this blank will result in the original columns being renamed.
Example source_column value: datetime.date(2020, 1, 1)\ntarget: \"yyyyMMdd HH:mm\"\noutput: \"20200101 00:00\"\n
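A usage sketch based on the example above, with assumed column names order_date and ship_date and an input DataFrame named input_df; the second call illustrates the multi-column behaviour where target_column acts as a suffix.
from koheesio.spark.transformations.date_time import DateFormat\n\n# single column -> result stored in a new target column\noutput_df = DateFormat(\n    column=\"order_date\",\n    target_column=\"order_date_formatted\",\n    format=\"yyyyMMdd HH:mm\",\n).transform(input_df)\n\n# multiple columns -> target_column acts as a suffix (order_date_fmt, ship_date_fmt)\noutput_df = DateFormat(\n    columns=[\"order_date\", \"ship_date\"],\n    target_column=\"fmt\",\n    format=\"yyyyMMdd HH:mm\",\n).transform(input_df)\n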
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.DateFormat.format","title":"format class-attribute
instance-attribute
","text":"format: str = Field(..., description='The format for the resulting string. See https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html')\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.DateFormat.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/date_time/__init__.py
def func(self, column: Column) -> Column:\n return date_format(column, self.format)\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ToTimestamp","title":"koheesio.spark.transformations.date_time.ToTimestamp","text":"wrapper around pyspark.sql.functions.to_timestamp
Converts a Column (or set of Columns) into pyspark.sql.types.TimestampType
using the specified format. Specify formats according to the datetime pattern: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
.
Functionally equivalent to col.cast(\"timestamp\").
See Also Related Koheesio classes:
- koheesio.spark.transformations.ColumnsTransformation : Base class for ColumnsTransformation. Defines column / columns field + recursive logic
- koheesio.spark.transformations.ColumnsTransformationWithTarget : Defines target_column / target_suffix field
pyspark.sql.functions:
- datetime pattern : https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
Example"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ToTimestamp--basic-usage-example","title":"Basic usage example:","text":"input_df:
t \"1997-02-28 10:30:00\" t
is a string
tts = ToTimestamp(\n # since the source column is the same as the target in this example, 't' will be overwritten\n column=\"t\",\n target_column=\"t\",\n format=\"yyyy-MM-dd HH:mm:ss\",\n)\noutput_df = tts.transform(input_df)\n
output_df:
t datetime.datetime(1997, 2, 28, 10, 30) Now t
is a timestamp
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ToTimestamp--multiple-columns-at-once","title":"Multiple columns at once:","text":"input_df:
t1 t2 \"1997-02-28 10:30:00\" \"2007-03-31 11:40:10\" t1
and t2
are strings
tts = ToTimestamp(\n columns=[\"t1\", \"t2\"],\n # 'target_suffix' is synonymous with 'target_column'\n target_suffix=\"new\",\n format=\"yyyy-MM-dd HH:mm:ss\",\n)\noutput_df = tts.transform(input_df).select(\"t1_new\", \"t2_new\")\n
output_df:
t1_new t2_new datetime.datetime(1997, 2, 28, 10, 30) datetime.datetime(2007, 3, 31, 11, 40) Now t1_new
and t2_new
are both timestamps
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ToTimestamp.format","title":"format class-attribute
instance-attribute
","text":"format: str = Field(default=..., description='The date format for of the timestamp field. See https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html')\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ToTimestamp.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/date_time/__init__.py
def func(self, column: Column) -> Column:\n # convert string to timestamp\n converted_col = to_timestamp(column, self.format)\n return when(column.isNull(), lit(None)).otherwise(converted_col)\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.change_timezone","title":"koheesio.spark.transformations.date_time.change_timezone","text":"change_timezone(column: Union[str, Column], source_timezone: str, target_timezone: str)\n
Helper function to change from one timezone to another
wrapper around pyspark.sql.functions.from_utc_timestamp
and to_utc_timestamp
Parameters:
Name Type Description Default column
Union[str, Column]
The column to change the timezone of
required source_timezone
str
The timezone of the source_column value. Timezone fields are validated against the TZ database name
column in this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones
required target_timezone
str
The target timezone. Timezone fields are validated against the TZ database name
column in this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones
required Source code in src/koheesio/spark/transformations/date_time/__init__.py
def change_timezone(column: Union[str, Column], source_timezone: str, target_timezone: str):\n \"\"\"Helper function to change from one timezone to another\n\n wrapper around `pyspark.sql.functions.from_utc_timestamp` and `to_utc_timestamp`\n\n Parameters\n ----------\n column : Union[str, Column]\n The column to change the timezone of\n source_timezone : str\n The timezone of the source_column value. Timezone fields are validated against the `TZ database name` column in\n this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones\n target_timezone : str\n The target timezone. Timezone fields are validated against the `TZ database name` column in this list:\n https://en.wikipedia.org/wiki/List_of_tz_database_time_zones\n\n \"\"\"\n column = col(column) if isinstance(column, str) else column\n return from_utc_timestamp((to_utc_timestamp(column, source_timezone)), target_timezone)\n
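A sketch of calling the helper directly, assuming a DataFrame named df with a timestamp column event_time (the name is illustrative).
from pyspark.sql.functions import col\n\nfrom koheesio.spark.transformations.date_time import change_timezone\n\n# convert event_time from UTC to EST and store it in a new column\ndf = df.withColumn(\"event_time_est\", change_timezone(col(\"event_time\"), \"UTC\", \"EST\"))\n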
"},{"location":"api_reference/spark/transformations/date_time/interval.html","title":"Interval","text":"This module provides a DateTimeColumn
class that extends the Column
class from PySpark. It allows for adding or subtracting an interval value from a datetime column.
This can be used to reflect a change in a given date / time column in a more human-readable way.
Please refer to the Spark SQL documentation for a list of valid interval values: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#interval-literal
Background The aim is to easily add or subtract an 'interval' value to a datetime column. An interval value is a string that represents a time interval. For example, '1 day', '1 month', '5 years', '1 minute 30 seconds', '10 milliseconds', etc. These can be used to reflect a change in a given date / time column in a more human-readable way.
Typically, this can be done using the date_add()
and date_sub()
functions in Spark SQL. However, these functions only support adding or subtracting a single unit of time measured in days. Using an interval gives us much more flexibility; however, Spark SQL does not provide a function to add or subtract an interval value from a datetime column through the python API directly, so we have to use the expr()
function so that we can use SQL directly.
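For reference, a sketch of the raw approach this module wraps, assuming a DataFrame named df with a timestamp column my_column.
from pyspark.sql.functions import expr\n\n# add one day via an interval literal, going through SQL with expr()\ndf = df.withColumn(\"one_day_later\", expr(\"try_add(my_column, interval '1 day')\"))\n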
This module provides a DateTimeColumn
class that extends the Column
class from PySpark. It allows for adding or subtracting an interval value from a datetime column using the +
and -
operators.
Additionally, this module provides two transformation classes that can be used as a transformation step in a pipeline:
DateTimeAddInterval
: adds an interval value to a datetime column DateTimeSubtractInterval
: subtracts an interval value from a datetime column
These classes are subclasses of ColumnsTransformationWithTarget
and hence can be used to perform transformations on multiple columns at once.
The above transformations both use the provided adjust_time()
function to perform the actual transformation.
See also: Related Koheesio classes:
- koheesio.spark.transformations.ColumnsTransformation : Base class for ColumnsTransformation. Defines column / columns field + recursive logic
- koheesio.spark.transformations.ColumnsTransformationWithTarget : Defines target_column / target_suffix field
pyspark.sql.functions:
- https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#interval-literal
- https://spark.apache.org/docs/latest/api/sql/index.html
- https://spark.apache.org/docs/latest/api/sql/#try_add
- https://spark.apache.org/docs/latest/api/sql/#try_subtract
Classes:
Name Description DateTimeColumn
A datetime column that can be adjusted by adding or subtracting an interval value using the +
and -
operators.
DateTimeAddInterval
A transformation that adds an interval value to a datetime column. This class is a subclass of ColumnsTransformationWithTarget
and hence can be used as a transformation step in a pipeline. See ColumnsTransformationWithTarget
for more information.
DateTimeSubtractInterval
A transformation that subtracts an interval value from a datetime column. This class is a subclass of ColumnsTransformationWithTarget
and hence can be used as a transformation step in a pipeline. See ColumnsTransformationWithTarget
for more information.
Note the DateTimeAddInterval
and DateTimeSubtractInterval
classes are very similar. The only difference is that one adds an interval value to a datetime column, while the other subtracts an interval value from a datetime column.
Functions:
Name Description dt_column
Converts a column to a DateTimeColumn
. This function aims to be a drop-in replacement for pyspark.sql.functions.col
that returns a DateTimeColumn
instead of a Column
.
adjust_time
Adjusts a datetime column by adding or subtracting an interval value.
validate_interval
Validates a given interval string.
Example"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval--various-ways-to-create-and-interact-with-datetimecolumn","title":"Various ways to create and interact with DateTimeColumn
:","text":" - Create a
DateTimeColumn
from a string: dt_column(\"my_column\")
- Create a
DateTimeColumn
from a Column
: dt_column(df.my_column)
- Use the
+
and -
operators to add or subtract an interval value from a DateTimeColumn
: dt_column(\"my_column\") + \"1 day\"
dt_column(\"my_column\") - \"1 month\"
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval--functional-examples-using-adjust_time","title":"Functional examples using adjust_time()
:","text":" - Add 1 day to a column:
adjust_time(\"my_column\", operation=\"add\", interval=\"1 day\")
- Subtract 1 month from a column:
adjust_time(\"my_column\", operation=\"subtract\", interval=\"1 month\")
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval--as-a-transformation-step","title":"As a transformation step:","text":"from koheesio.spark.transformations.date_time.interval import (\n DateTimeAddInterval,\n)\n\ninput_df = spark.createDataFrame([(1, \"2022-01-01 00:00:00\")], [\"id\", \"my_column\"])\n\n# add 1 day to my_column and store the result in a new column called 'one_day_later'\noutput_df = DateTimeAddInterval(column=\"my_column\", target_column=\"one_day_later\", interval=\"1 day\").transform(input_df)\n
output_df: id my_column one_day_later 1 2022-01-01 00:00:00 2022-01-02 00:00:00 DateTimeSubtractInterval
works in a similar way, but subtracts an interval value from a datetime column.
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.Operations","title":"koheesio.spark.transformations.date_time.interval.Operations module-attribute
","text":"Operations = Literal['add', 'subtract']\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval","title":"koheesio.spark.transformations.date_time.interval.DateTimeAddInterval","text":"A transformation that adds or subtracts a specified interval from a datetime column.
See also: pyspark.sql.functions:
- https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#interval-literal
- https://spark.apache.org/docs/latest/api/sql/index.html#interval
Parameters:
Name Type Description Default interval
str
The interval to add to the datetime column.
required operation
Operations
The operation to perform. Must be either 'add' or 'subtract'.
add
Example"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval--add-1-day-to-a-column","title":"add 1 day to a column","text":"DateTimeAddInterval(\n column=\"my_column\",\n interval=\"1 day\",\n).transform(df)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval--subtract-1-month-from-my_column-and-store-the-result-in-a-new-column-called-one_month_earlier","title":"subtract 1 month from my_column
and store the result in a new column called one_month_earlier
","text":"DateTimeSubtractInterval(\n column=\"my_column\",\n target_column=\"one_month_earlier\",\n interval=\"1 month\",\n)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval.interval","title":"interval class-attribute
instance-attribute
","text":"interval: str = Field(default=..., description='The interval to add to the datetime column.', examples=['1 day', '5 years', '3 months'])\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval.operation","title":"operation class-attribute
instance-attribute
","text":"operation: Operations = Field(default='add', description=\"The operation to perform. Must be either 'add' or 'subtract'.\")\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval.validate_interval","title":"validate_interval class-attribute
instance-attribute
","text":"validate_interval = field_validator('interval')(validate_interval)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/date_time/interval.py
def func(self, column: Column):\n return adjust_time(column, operation=self.operation, interval=self.interval)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeColumn","title":"koheesio.spark.transformations.date_time.interval.DateTimeColumn","text":"A datetime column that can be adjusted by adding or subtracting an interval value using the +
and -
operators.
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeColumn.from_column","title":"from_column classmethod
","text":"from_column(column: Column)\n
Create a DateTimeColumn from an existing Column
Source code in src/koheesio/spark/transformations/date_time/interval.py
@classmethod\ndef from_column(cls, column: Column):\n \"\"\"Create a DateTimeColumn from an existing Column\"\"\"\n return cls(column._jc)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeSubtractInterval","title":"koheesio.spark.transformations.date_time.interval.DateTimeSubtractInterval","text":"Subtracts a specified interval from a datetime column.
Works in the same way as DateTimeAddInterval
, but subtracts the specified interval from the datetime column. See DateTimeAddInterval
for more information.
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeSubtractInterval.operation","title":"operation class-attribute
instance-attribute
","text":"operation: Operations = Field(default='subtract', description=\"The operation to perform. Must be either 'add' or 'subtract'.\")\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.adjust_time","title":"koheesio.spark.transformations.date_time.interval.adjust_time","text":"adjust_time(column: Column, operation: Operations, interval: str) -> Column\n
Adjusts a datetime column by adding or subtracting an interval value.
This can be used to reflect a change in a given date / time column in a more human-readable way.
See also Please refer to the Spark SQL documentation for a list of valid interval values: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#interval-literal
Example Parameters:
Name Type Description Default column
Column
The datetime column to adjust.
required operation
Operations
The operation to perform. Must be either 'add' or 'subtract'.
required interval
str
The value to add or subtract. Must be a valid interval string.
required Returns:
Type Description Column
The adjusted datetime column.
Source code in src/koheesio/spark/transformations/date_time/interval.py
def adjust_time(column: Column, operation: Operations, interval: str) -> Column:\n \"\"\"\n Adjusts a datetime column by adding or subtracting an interval value.\n\n This can be used to reflect a change in a given date / time column in a more human-readable way.\n\n\n See also\n --------\n Please refer to the Spark SQL documentation for a list of valid interval values:\n https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#interval-literal\n\n ### pyspark.sql.functions:\n\n * https://spark.apache.org/docs/latest/api/sql/index.html#interval\n * https://spark.apache.org/docs/latest/api/sql/#try_add\n * https://spark.apache.org/docs/latest/api/sql/#try_subtract\n\n Example\n --------\n ### add 1 day to a column\n ```python\n adjust_time(\"my_column\", operation=\"add\", interval=\"1 day\")\n ```\n\n ### subtract 1 month from a column\n ```python\n adjust_time(\"my_column\", operation=\"subtract\", interval=\"1 month\")\n ```\n\n ### or, a much more complicated example\n\n In this example, we add 5 days, 3 hours, 7 minutes, 30 seconds, and 1 millisecond to a column called `my_column`.\n ```python\n adjust_time(\n \"my_column\",\n operation=\"add\",\n interval=\"5 days 3 hours 7 minutes 30 seconds 1 millisecond\",\n )\n ```\n\n Parameters\n ----------\n column : Column\n The datetime column to adjust.\n operation : Operations\n The operation to perform. Must be either 'add' or 'subtract'.\n interval : str\n The value to add or subtract. Must be a valid interval string.\n\n Returns\n -------\n Column\n The adjusted datetime column.\n \"\"\"\n\n # check that value is a valid interval\n interval = validate_interval(interval)\n\n column_name = column._jc.toString()\n\n # determine the operation to perform\n try:\n operation = {\n \"add\": \"try_add\",\n \"subtract\": \"try_subtract\",\n }[operation]\n except KeyError as e:\n raise ValueError(f\"Operation '{operation}' is not valid. Must be either 'add' or 'subtract'.\") from e\n\n # perform the operation\n _expression = f\"{operation}({column_name}, interval '{interval}')\"\n column = expr(_expression)\n\n return column\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.adjust_time--pysparksqlfunctions","title":"pyspark.sql.functions:","text":" - https://spark.apache.org/docs/latest/api/sql/index.html#interval
- https://spark.apache.org/docs/latest/api/sql/#try_add
- https://spark.apache.org/docs/latest/api/sql/#try_subtract
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.adjust_time--add-1-day-to-a-column","title":"add 1 day to a column","text":"adjust_time(\"my_column\", operation=\"add\", interval=\"1 day\")\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.adjust_time--subtract-1-month-from-a-column","title":"subtract 1 month from a column","text":"adjust_time(\"my_column\", operation=\"subtract\", interval=\"1 month\")\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.adjust_time--or-a-much-more-complicated-example","title":"or, a much more complicated example","text":"In this example, we add 5 days, 3 hours, 7 minutes, 30 seconds, and 1 millisecond to a column called my_column
.
adjust_time(\n \"my_column\",\n operation=\"add\",\n interval=\"5 days 3 hours 7 minutes 30 seconds 1 millisecond\",\n)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.dt_column","title":"koheesio.spark.transformations.date_time.interval.dt_column","text":"dt_column(column: Union[str, Column]) -> DateTimeColumn\n
Convert a column to a DateTimeColumn
Aims to be a drop-in replacement for pyspark.sql.functions.col
that returns a DateTimeColumn instead of a Column.
Example Parameters:
Name Type Description Default column
Union[str, Column]
The column (or name of the column) to convert to a DateTimeColumn
required Source code in src/koheesio/spark/transformations/date_time/interval.py
def dt_column(column: Union[str, Column]) -> DateTimeColumn:\n \"\"\"Convert a column to a DateTimeColumn\n\n Aims to be a drop-in replacement for `pyspark.sql.functions.col` that returns a DateTimeColumn instead of a Column.\n\n Example\n --------\n ### create a DateTimeColumn from a string\n ```python\n dt_column(\"my_column\")\n ```\n\n ### create a DateTimeColumn from a Column\n ```python\n dt_column(df.my_column)\n ```\n\n Parameters\n ----------\n column : Union[str, Column]\n The column (or name of the column) to convert to a DateTimeColumn\n \"\"\"\n if isinstance(column, str):\n column = col(column)\n elif not isinstance(column, Column):\n raise TypeError(f\"Expected column to be of type str or Column, got {type(column)} instead.\")\n return DateTimeColumn.from_column(column)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.dt_column--create-a-datetimecolumn-from-a-string","title":"create a DateTimeColumn from a string","text":"dt_column(\"my_column\")\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.dt_column--create-a-datetimecolumn-from-a-column","title":"create a DateTimeColumn from a Column","text":"dt_column(df.my_column)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.validate_interval","title":"koheesio.spark.transformations.date_time.interval.validate_interval","text":"validate_interval(interval: str)\n
Validate an interval string
Parameters:
Name Type Description Default interval
str
The interval string to validate
required Raises:
Type Description ValueError
If the interval string is invalid
Source code in src/koheesio/spark/transformations/date_time/interval.py
def validate_interval(interval: str):\n \"\"\"Validate an interval string\n\n Parameters\n ----------\n interval : str\n The interval string to validate\n\n Raises\n ------\n ValueError\n If the interval string is invalid\n \"\"\"\n try:\n expr(f\"interval '{interval}'\")\n except ParseException as e:\n raise ValueError(f\"Value '{interval}' is not a valid interval.\") from e\n return interval\n
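For illustration (an active Spark session is assumed, since validation goes through expr()):
from koheesio.spark.transformations.date_time.interval import validate_interval\n\nvalidate_interval(\"1 day 2 hours\")  # returns the string unchanged\n# validate_interval(\"not an interval\")  # would raise ValueError\n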
"},{"location":"api_reference/spark/transformations/strings/index.html","title":"Strings","text":"Adds a number of Transformations that are intended to be used with StringType column input. Some will work with other types however, but will output StringType or an array of StringType.
These Transformations take full advantage of Koheesio's ColumnsTransformationWithTarget class, allowing a user to apply column transformations to multiple columns at once. See the class docstrings for more information.
The following Transformations are included:
change_case:
Lower
Converts a string column to lower case. Upper
Converts a string column to upper case. TitleCase
or InitCap
Converts a string column to title case, where each word starts with a capital letter.
concat:
Concat
Concatenates multiple input columns together into a single column, optionally using the given separator.
pad:
Pad
Pads the values of source_column
with the character
up until it reaches length
of characters LPad
Pad with a character on the left side of the string. RPad
Pad with a character on the right side of the string.
regexp:
RegexpExtract
Extract a specific group matched by a Java regexp from the specified string column. RegexpReplace
Searches for the given regexp and replaces all instances with what is in 'replacement'.
replace:
Replace
Replace all instances of a string in a column with another string.
split:
SplitAll
Splits the contents of a column on the basis of a split_pattern. SplitAtFirstMatch
Like SplitAll, but only splits the string once. You can specify whether you want the first or second part.
substring:
Substring
Extracts a substring from a string column starting at the given position.
trim:
Trim
Trim whitespace from the beginning and/or end of a string. LTrim
Trim whitespace from the beginning of a string. RTrim
Trim whitespace from the end of a string.
"},{"location":"api_reference/spark/transformations/strings/change_case.html","title":"Change case","text":"Convert the case of a string column to upper case, lower case, or title case
Classes:
Name Description `Lower`
Converts a string column to lower case.
`Upper`
Converts a string column to upper case.
`TitleCase` or `InitCap`
Converts a string column to title case, where each word starts with a capital letter.
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.InitCap","title":"koheesio.spark.transformations.strings.change_case.InitCap module-attribute
","text":"InitCap = TitleCase\n
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.LowerCase","title":"koheesio.spark.transformations.strings.change_case.LowerCase","text":"This function makes the contents of a column lower case.
Wraps the pyspark.sql.functions.lower
function.
Warnings If the type of the column is not string, LowerCase
will not be run. A Warning will be thrown indicating this.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The name of the column or columns to convert to lower case. Alias: column. Lower case will be applied to all columns in the list. Column is required to be of string type.
required target_column
The name of the column to store the result in. If None, the result will be stored in the same column as the input.
required Example input_df:
product amount country Banana lemon orange 1000 USA Carrots Blueberries 1500 USA Beans 1600 USA output_df = LowerCase(column=\"product\", target_column=\"product_lower\").transform(df)\n
output_df:
product amount country product_lower Banana lemon orange 1000 USA banana lemon orange Carrots Blueberries 1500 USA carrots blueberries Beans 1600 USA beans In this example, the column product
is converted to product_lower
and the contents of this column are converted to lower case.
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.LowerCase.ColumnConfig","title":"ColumnConfig","text":"Limit data type to string
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.LowerCase.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [STRING]\n
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.LowerCase.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [STRING]\n
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.LowerCase.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/change_case.py
def func(self, column: Column):\n return lower(column)\n
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.TitleCase","title":"koheesio.spark.transformations.strings.change_case.TitleCase","text":"This function makes the contents of a column title case. This means that every word starts with an upper case.
Wraps the pyspark.sql.functions.initcap
function.
Warnings If the type of the column is not string, TitleCase will not be run. A Warning will be thrown indicating this.
Parameters:
Name Type Description Default columns
The name of the column or columns to convert to title case. Alias: column. Title case will be applied to all columns in the list. Column is required to be of string type.
required target_column
The name of the column to store the result in. If None, the result will be stored in the same column as the input.
required Example input_df:
product amount country Banana lemon orange 1000 USA Carrots blueberries 1500 USA Beans 1600 USA output_df = TitleCase(column=\"product\", target_column=\"product_title\").transform(df)\n
output_df:
product amount country product_title Banana lemon orange 1000 USA Banana Lemon Orange Carrots blueberries 1500 USA Carrots Blueberries Beans 1600 USA Beans In this example, the column product
is converted to product_title
and the contents of this column are converted to title case (each word now starts with an upper case).
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.TitleCase.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/change_case.py
def func(self, column: Column):\n return initcap(column)\n
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.UpperCase","title":"koheesio.spark.transformations.strings.change_case.UpperCase","text":"This function makes the contents of a column upper case.
Wraps the pyspark.sql.functions.upper
function.
Warnings If the type of the column is not string, UpperCase
will not be run. A Warning will be thrown indicating this.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The name of the column or columns to convert to upper case. Alias: column. Upper case will be applied to all columns in the list. Column is required to be of string type.
required target_column
The name of the column to store the result in. If None, the result will be stored in the same column as the input.
required Examples:
input_df:
product amount country Banana lemon orange 1000 USA Carrots Blueberries 1500 USA Beans 1600 USA output_df = UpperCase(column=\"product\", target_column=\"product_upper\").transform(df)\n
output_df:
product amount country product_upper Banana lemon orange 1000 USA BANANA LEMON ORANGE Carrots Blueberries 1500 USA CARROTS BLUEBERRIES Beans 1600 USA BEANS In this example, the column product
is converted to product_upper
and the contents of this column are converted to upper case.
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.UpperCase.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/change_case.py
def func(self, column: Column):\n return upper(column)\n
"},{"location":"api_reference/spark/transformations/strings/concat.html","title":"Concat","text":"Concatenates multiple input columns together into a single column, optionally using a given separator.
"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat","title":"koheesio.spark.transformations.strings.concat.Concat","text":"This is a wrapper around PySpark concat() and concat_ws() functions
Concatenates multiple input columns together into a single column, optionally using a given separator. The function works with strings, date/timestamps, binary, and compatible array columns.
Concept When working with arrays, the function will return the result of the concatenation of the elements in the array.
- If a spacer is used, the resulting output will be a string with the elements of the array separated by the spacer.
- If no spacer is used, the resulting output will be an array with the elements of the array concatenated together.
When working with date/timestamps, the function will return the result of the concatenation of the values. The timestamp is converted to a string using the default format of yyyy-MM-dd HH:mm:ss
.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to concatenate. Alias: column. If at least one of the values is None or null, the resulting string will also be None/null (except for when using arrays). Columns can be of any type, but must ideally be of the same type. Different types can be used, but the function will convert them to string values first.
required target_column
Optional[str]
Target column name. When this is left blank, a name will be generated by concatenating the names of the source columns with an '_'.
None
spacer
Optional[str]
Optional spacer / separator symbol. Defaults to None. When left blank, no spacer is used
None
Example"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat--example-using-a-string-column-and-a-timestamp-column","title":"Example using a string column and a timestamp column","text":"input_df:
column_a column_b text 1997-02-28 10:30:00 output_df = Concat(\n columns=[\"column_a\", \"column_b\"],\n target_column=\"concatenated_column\",\n spacer=\"--\",\n).transform(input_df)\n
output_df:
column_a column_b concatenated_column text 1997-02-28 10:30:00 text--1997-02-28 10:30:00 In the example above, the resulting column is a string column.
If we had left out the spacer, the resulting column would have had the value of text1997-02-28 10:30:00
(a string). Note that the timestamp is converted to a string using the default format of yyyy-MM-dd HH:mm:ss
.
"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat--example-using-two-array-columns","title":"Example using two array columns","text":"input_df:
array_col_1 array_col_2 [text1, text2] [text3, None] output_df = Concat(\n    columns=[\"array_col_1\", \"array_col_2\"],\n    target_column=\"concatenated_column\",\n    spacer=\"--\",\n).transform(input_df)\n
output_df:
array_col_1 array_col_2 concatenated_column [text1, text2] [text3, None] \"text1--text2--text3\" Note that the null value in the second array is ignored. If we had left out the spacer, the resulting column would have been an array with the values of [\"text1\", \"text2\", \"text3\"]
.
Array columns can only be concatenated with another array column. If you want to concatenate an array column with a non-array value, you will have to convert said column to an array first.
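As a complement to the examples above, a minimal sketch (same hypothetical input_df) of the no-spacer variant: when the spacer is omitted, the two array columns are merged into a single array instead of a string.
output_df = Concat(\n    columns=[\"array_col_1\", \"array_col_2\"],\n    target_column=\"concatenated_column\",\n).transform(input_df)\n# concatenated_column is now an ArrayType column, e.g. [\"text1\", \"text2\", \"text3\"]\n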
"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat.spacer","title":"spacer class-attribute
instance-attribute
","text":"spacer: Optional[str] = Field(default=None, description='Optional spacer / separator symbol. Defaults to None. When left blank, no spacer is used', alias='sep')\n
"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat.target_column","title":"target_column class-attribute
instance-attribute
","text":"target_column: Optional[str] = Field(default=None, description=\"Target column name. When this is left blank, a name will be generated by concatenating the names of the source columns with an '_'.\")\n
"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat.execute","title":"execute","text":"execute() -> DataFrame\n
Source code in src/koheesio/spark/transformations/strings/concat.py
def execute(self) -> DataFrame:\n columns = [col(s) for s in self.get_columns()]\n self.output.df = self.df.withColumn(\n self.target_column, concat_ws(self.spacer, *columns) if self.spacer else concat(*columns)\n )\n
"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat.get_target_column","title":"get_target_column","text":"get_target_column(target_column_value, values)\n
Get the target column name if it is not provided.
If not provided, a name will be generated by concatenating the names of the source columns with an '_'.
Source code in src/koheesio/spark/transformations/strings/concat.py
@field_validator(\"target_column\")\ndef get_target_column(cls, target_column_value, values):\n \"\"\"Get the target column name if it is not provided.\n\n If not provided, a name will be generated by concatenating the names of the source columns with an '_'.\"\"\"\n if not target_column_value:\n columns_value: List = values[\"columns\"]\n columns = list(dict.fromkeys(columns_value)) # dict.fromkeys is used to dedup while maintaining order\n return \"_\".join(columns)\n\n return target_column_value\n
"},{"location":"api_reference/spark/transformations/strings/pad.html","title":"Pad","text":"Pad the values of a column with a character up until it reaches a certain length.
Classes:
Name Description Pad
Pads the values of source_column
with the character
up until it reaches length
of characters
LPad
Pad with a character on the left side of the string.
RPad
Pad with a character on the right side of the string.
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.LPad","title":"koheesio.spark.transformations.strings.pad.LPad module-attribute
","text":"LPad = Pad\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.pad_directions","title":"koheesio.spark.transformations.strings.pad.pad_directions module-attribute
","text":"pad_directions = Literal['left', 'right']\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.Pad","title":"koheesio.spark.transformations.strings.pad.Pad","text":"Pads the values of source_column
with the character
up until it reaches length
of characters The direction
param can be changed to apply either a left or a right pad. Defaults to left pad.
Wraps the lpad
and rpad
functions from PySpark.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to pad. Alias: column
required target_column
Optional[str]
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
None
character
constr(min_length=1)
The character to use for padding
required length
PositiveInt
Positive integer to indicate the intended length
required direction
Optional[pad_directions]
On which side to add the characters. Either \"left\" or \"right\". Defaults to \"left\"
left
Example input_df:
column hello world output_df = Pad(\n column=\"column\",\n target_column=\"padded_column\",\n character=\"*\",\n length=10,\n direction=\"right\",\n).transform(input_df)\n
output_df:
column padded_column hello hello***** world world***** Note: in the example above, we could have used the RPad class instead of Pad with direction=\"right\" to achieve the same result.
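For completeness, a sketch of the equivalent call using the RPad subclass mentioned in the note above (same hypothetical input_df):
output_df = RPad(\n    column=\"column\",\n    target_column=\"padded_column\",\n    character=\"*\",\n    length=10,\n).transform(input_df)\n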
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.Pad.character","title":"character class-attribute
instance-attribute
","text":"character: constr(min_length=1) = Field(default=..., description='The character to use for padding')\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.Pad.direction","title":"direction class-attribute
instance-attribute
","text":"direction: Optional[pad_directions] = Field(default='left', description='On which side to add the characters . Either \"left\" or \"right\". Defaults to \"left\"')\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.Pad.length","title":"length class-attribute
instance-attribute
","text":"length: PositiveInt = Field(default=..., description='Positive integer to indicate the intended length')\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.Pad.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/pad.py
def func(self, column: Column):\n func = lpad if self.direction == \"left\" else rpad\n return func(column, self.length, self.character)\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.RPad","title":"koheesio.spark.transformations.strings.pad.RPad","text":"Pad with a character on the right side of the string.
See Pad class docstring for more information.
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.RPad.direction","title":"direction class-attribute
instance-attribute
","text":"direction: Optional[pad_directions] = 'right'\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html","title":"Regexp","text":"String transformations using regular expressions.
This module contains transformations that use regular expressions to transform strings.
Classes:
Name Description RegexpExtract
Extract a specific group matched by a Java regexp from the specified string column.
RegexpReplace
Searches for the given regexp and replaces all instances with what is in 'replacement'.
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract","title":"koheesio.spark.transformations.strings.regexp.RegexpExtract","text":"Extract a specific group matched by a Java regexp from the specified string column. If the regexp did not match, or the specified group did not match, an empty string is returned.
A wrapper around the pyspark regexp_extract function
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to extract from. Alias: column
required target_column
Optional[str]
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
None
regexp
str
The Java regular expression to extract
required index
Optional[int]
When there are more groups in the match, you can indicate which one you want. 0 means the whole match. 1 and above are groups within that match.
0
Example"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract--extracting-the-year-and-week-number-from-a-string","title":"Extracting the year and week number from a string","text":"Let's say we have a column containing the year and week in a format like Y## W#
and we would like to extract the week numbers.
input_df:
YWK 2020 W1 2021 WK2 output_df = RegexpExtract(\n column=\"YWK\",\n target_column=\"week_number\",\n regexp=\"Y([0-9]+) ?WK?([0-9]+)\",\n index=2, # remember that this is 1-indexed! So 2 will get the week number in this example.\n).transform(input_df)\n
output_df:
YWK week_number 2020 W1 1 2021 WK2 2"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract--using-the-same-example-but-extracting-the-year-instead","title":"Using the same example, but extracting the year instead","text":"If you want to extract the year, you can use index=1.
output_df = RegexpExtract(\n column=\"YWK\",\n target_column=\"year\",\n regexp=\"Y([0-9]+) ?WK?([0-9]+)\",\n index=1, # remember that this is 1-indexed! So 1 will get the year in this example.\n).transform(input_df)\n
output_df:
YWK year 2020 W1 2020 2021 WK2 2021"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract.index","title":"index class-attribute
instance-attribute
","text":"index: Optional[int] = Field(default=0, description='When there are more groups in the match, you can indicate which one you want. 0 means the whole match. 1 and above are groups within that match.')\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract.regexp","title":"regexp class-attribute
instance-attribute
","text":"regexp: str = Field(default=..., description='The Java regular expression to extract')\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/regexp.py
def func(self, column: Column):\n return regexp_extract(column, self.regexp, self.index)\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpReplace","title":"koheesio.spark.transformations.strings.regexp.RegexpReplace","text":"Searches for the given regexp and replaces all instances with what is in 'replacement'.
A wrapper around the pyspark regexp_replace function
Parameters:
Name Type Description Default columns
The column (or list of columns) to replace in. Alias: column
required target_column
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
required regexp
The regular expression to replace
required replacement
String to replace matched pattern with.
required Examples:
input_df: | content | |------------| | hello world|
Let's say you want to replace 'hello'.
output_df = RegexpReplace(\n column=\"content\",\n target_column=\"replaced\",\n regexp=\"hello\",\n replacement=\"gutentag\",\n).transform(input_df)\n
output_df: | content | replaced | |------------|---------------| | hello world| gutentag world|
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpReplace.regexp","title":"regexp class-attribute
instance-attribute
","text":"regexp: str = Field(default=..., description='The regular expression to replace')\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpReplace.replacement","title":"replacement class-attribute
instance-attribute
","text":"replacement: str = Field(default=..., description='String to replace matched pattern with.')\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpReplace.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/regexp.py
def func(self, column: Column):\n return regexp_replace(column, self.regexp, self.replacement)\n
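Because regexp is a full regular expression, character classes work as well; a small sketch assuming a hypothetical input_df with a phone column:
output_df = RegexpReplace(\n    column=\"phone\",\n    regexp=\"[^0-9]\",\n    replacement=\"\",\n).transform(input_df)\n# strips every non-digit character, e.g. \"(+31) 6-1234\" becomes \"3161234\"\n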
"},{"location":"api_reference/spark/transformations/strings/replace.html","title":"Replace","text":"String replacements without using regular expressions.
"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace","title":"koheesio.spark.transformations.strings.replace.Replace","text":"Replace all instances of a string in a column with another string.
This transformation uses PySpark when().otherwise() functions.
Notes - If original_value is not set, the transformation will replace all null values with new_value
- If original_value is set, the transformation will replace all values matching original_value with new_value
- Numeric values are supported, but will be cast to string in the process
- Replace is meant for simple string replacements. If more advanced replacements are needed, use the
RegexpReplace
transformation instead.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to replace values in. Alias: column
required target_column
Optional[str]
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
None
original_value
Optional[str]
The original value that needs to be replaced. Alias: from
None
new_value
str
The new value to replace this with. Alias: to
required Examples:
input_df:
column hello world None"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace--replace-all-null-values-with-a-new-value","title":"Replace all null values with a new value","text":"output_df = Replace(\n column=\"column\",\n target_column=\"replaced_column\",\n original_value=None, # This is the default value, so it can be omitted\n new_value=\"programmer\",\n).transform(input_df)\n
output_df:
column replaced_column hello hello world world None programmer"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace--replace-all-instances-of-a-string-in-a-column-with-another-string","title":"Replace all instances of a string in a column with another string","text":"output_df = Replace(\n column=\"column\",\n target_column=\"replaced_column\",\n original_value=\"world\",\n new_value=\"programmer\",\n).transform(input_df)\n
output_df:
column replaced_column hello hello world programmer None None"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace.new_value","title":"new_value class-attribute
instance-attribute
","text":"new_value: str = Field(default=..., alias='to', description='The new value to replace this with')\n
"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace.original_value","title":"original_value class-attribute
instance-attribute
","text":"original_value: Optional[str] = Field(default=None, alias='from', description='The original value that needs to be replaced')\n
"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace.cast_values_to_str","title":"cast_values_to_str","text":"cast_values_to_str(value)\n
Cast values to string if they are not None
Source code in src/koheesio/spark/transformations/strings/replace.py
@field_validator(\"original_value\", \"new_value\", mode=\"before\")\ndef cast_values_to_str(cls, value):\n \"\"\"Cast values to string if they are not None\"\"\"\n if value:\n return str(value)\n
"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/replace.py
def func(self, column: Column):\n when_statement = (\n when(column.isNull(), lit(self.new_value))\n if not self.original_value\n else when(\n column == self.original_value,\n lit(self.new_value),\n )\n )\n return when_statement.otherwise(column)\n
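As the notes mention, numeric values are accepted and cast to string by the cast_values_to_str validator shown above; a minimal sketch, assuming a hypothetical input_df with a numeric amount column:
output_df = Replace(\n    column=\"amount\",\n    original_value=1500,  # cast to \"1500\" by the validator\n    new_value=2000,  # cast to \"2000\" by the validator\n).transform(input_df)\n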
"},{"location":"api_reference/spark/transformations/strings/split.html","title":"Split","text":"Splits the contents of a column on basis of a split_pattern
Classes:
Name Description SplitAll
Splits the contents of a column on the basis of a split_pattern.
SplitAtFirstMatch
Like SplitAll, but only splits the string once. You can specify whether you want the first or second part.
"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAll","title":"koheesio.spark.transformations.strings.split.SplitAll","text":"This function splits the contents of a column on basis of a split_pattern.
It splits at al the locations the pattern is found. The new column will be of ArrayType.
Wraps the pyspark.sql.functions.split function.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to split. Alias: column
required target_column
Optional[str]
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
None
split_pattern
str
This is the pattern that will be used to split the column contents.
required Example"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAll--splitting-with-a-space-as-a-pattern","title":"Splitting with a space as a pattern:","text":"input_df:
product amount country Banana lemon orange 1000 USA Carrots Blueberries 1500 USA Beans 1600 USA output_df = SplitAll(column=\"product\", target_column=\"split\", split_pattern=\" \").transform(input_df)\n
output_df:
product amount country split Banana lemon orange 1000 USA [\"Banana\", \"lemon\", \"orange\"] Carrots Blueberries 1500 USA [\"Carrots\", \"Blueberries\"] Beans 1600 USA [\"Beans\"]
instance-attribute
","text":"split_pattern: str = Field(default=..., description='The pattern to split the column contents.')\n
"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAll.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/split.py
def func(self, column: Column):\n return split(column, pattern=self.split_pattern)\n
"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAtFirstMatch","title":"koheesio.spark.transformations.strings.split.SplitAtFirstMatch","text":"Like SplitAll, but only splits the string once. You can specify whether you want the first or second part..
Note - SplitAtFirstMatch splits the string only once. To specify whether you want the first part or the second you can toggle the parameter retrieve_first_part.
- The new column will be of StringType.
- If you want to split a column more than once, you should call this function multiple times.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to split. Alias: column
required target_column
Optional[str]
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
None
split_pattern
str
This is the pattern that will be used to split the column contents.
required retrieve_first_part
Optional[bool]
Takes the first part of the split when true, the second part when False. Other parts are ignored. Defaults to True.
True
Example"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAtFirstMatch--splitting-with-a-space-as-a-pattern","title":"Splitting with a space as a pattern:","text":"input_df:
product amount country Banana lemon orange 1000 USA Carrots Blueberries 1500 USA Beans 1600 USA output_df = SplitAtFirstMatch(column=\"product\", target_column=\"split_first\", split_pattern=\"an\").transform(input_df)\n
output_df:
product amount country split_first Banana lemon orange 1000 USA B Carrots Blueberries 1500 USA Carrots Blueberries Beans 1600 USA Be"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAtFirstMatch.retrieve_first_part","title":"retrieve_first_part class-attribute
instance-attribute
","text":"retrieve_first_part: Optional[bool] = Field(default=True, description='Takes the first part of the split when true, the second part when False. Other parts are ignored.')\n
"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAtFirstMatch.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/split.py
def func(self, column: Column):\n split_func = split(column, pattern=self.split_pattern)\n\n # first part\n if self.retrieve_first_part:\n return split_func.getItem(0)\n\n # or, second part\n return coalesce(split_func.getItem(1), lit(\"\"))\n
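A short sketch of the retrieve_first_part=False branch shown above, which returns the part after the first match and falls back to an empty string when there is no second part. The greeting column and its value \"hello world\" are hypothetical:
output_df = SplitAtFirstMatch(\n    column=\"greeting\",\n    target_column=\"second_part\",\n    split_pattern=\" \",\n    retrieve_first_part=False,\n).transform(input_df)\n# for the value \"hello world\", second_part becomes \"world\"\n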
"},{"location":"api_reference/spark/transformations/strings/substring.html","title":"Substring","text":"Extracts a substring from a string column starting at the given position.
"},{"location":"api_reference/spark/transformations/strings/substring.html#koheesio.spark.transformations.strings.substring.Substring","title":"koheesio.spark.transformations.strings.substring.Substring","text":"Extracts a substring from a string column starting at the given position.
This is a wrapper around PySpark substring() function
Notes - Numeric columns will be cast to string
- start is 1-indexed, not 0-indexed!
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to substring. Alias: column
required target_column
Optional[str]
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
None
start
PositiveInt
Positive int. Defines where to begin the substring from. The first character of the field has index 1!
required length
Optional[int]
Optional. If not provided, the substring will go until end of string.
-1
Example"},{"location":"api_reference/spark/transformations/strings/substring.html#koheesio.spark.transformations.strings.substring.Substring--extract-a-substring-from-a-string-column-starting-at-the-given-position","title":"Extract a substring from a string column starting at the given position.","text":"input_df:
column skyscraper output_df = Substring(\n column=\"column\",\n target_column=\"substring_column\",\n start=3, # 1-indexed! So this will start at the 3rd character\n length=4,\n).transform(input_df)\n
output_df:
column substring_column skyscraper yscr"},{"location":"api_reference/spark/transformations/strings/substring.html#koheesio.spark.transformations.strings.substring.Substring.length","title":"length class-attribute
instance-attribute
","text":"length: Optional[int] = Field(default=-1, description='The target length for the string. use -1 to perform until end')\n
"},{"location":"api_reference/spark/transformations/strings/substring.html#koheesio.spark.transformations.strings.substring.Substring.start","title":"start class-attribute
instance-attribute
","text":"start: PositiveInt = Field(default=..., description='The starting position')\n
"},{"location":"api_reference/spark/transformations/strings/substring.html#koheesio.spark.transformations.strings.substring.Substring.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/substring.py
def func(self, column: Column):\n return when(column.isNull(), None).otherwise(substring(column, self.start, self.length)).cast(StringType())\n
"},{"location":"api_reference/spark/transformations/strings/trim.html","title":"Trim","text":"Trim whitespace from the beginning and/or end of a string.
Classes:
Name Description - `Trim`
Trim whitespace from the beginning and/or end of a string.
- `LTrim`
Trim whitespace from the beginning of a string.
- `RTrim`
Trim whitespace from the end of a string.
See class docstrings for more information.
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.trim_type","title":"koheesio.spark.transformations.strings.trim.trim_type module-attribute
","text":"trim_type = Literal['left', 'right', 'left-right']\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.LTrim","title":"koheesio.spark.transformations.strings.trim.LTrim","text":"Trim whitespace from the beginning of a string. Alias: LeftTrim
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.LTrim.direction","title":"direction class-attribute
instance-attribute
","text":"direction: trim_type = 'left'\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.RTrim","title":"koheesio.spark.transformations.strings.trim.RTrim","text":"Trim whitespace from the end of a string. Alias: RightTrim
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.RTrim.direction","title":"direction class-attribute
instance-attribute
","text":"direction: trim_type = 'right'\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim","title":"koheesio.spark.transformations.strings.trim.Trim","text":"Trim whitespace from the beginning and/or end of a string.
This is a wrapper around PySpark ltrim() and rtrim() functions
The direction
parameter can be changed to apply either a left or a right trim. Defaults to left AND right trim.
Note: If the type of the column is not string, Trim will not be run. A Warning will be thrown indicating this
Parameters:
Name Type Description Default columns
The column (or list of columns) to trim. Alias: column If no columns are provided, all string columns will be trimmed.
required target_column
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
required direction
On which side to remove the spaces. Either \"left\", \"right\" or \"left-right\". Defaults to \"left-right\"
required Examples:
input_df: | column | |-----------| | \" hello \" |
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim--trim-whitespace-from-the-beginning-of-a-string","title":"Trim whitespace from the beginning of a string","text":"output_df = Trim(column=\"column\", target_column=\"trimmed_column\", direction=\"left\").transform(input_df)\n
output_df: | column | trimmed_column | |-----------|----------------| | \" hello \" | \"hello \" |
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim--trim-whitespace-from-both-sides-of-a-string","title":"Trim whitespace from both sides of a string","text":"output_df = Trim(\n column=\"column\",\n target_column=\"trimmed_column\",\n direction=\"left-right\", # default value\n).transform(input_df)\n
output_df: | column | trimmed_column | |-----------|----------------| | \" hello \" | \"hello\" |
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim--trim-whitespace-from-the-end-of-a-string","title":"Trim whitespace from the end of a string","text":"output_df = Trim(column=\"column\", target_column=\"trimmed_column\", direction=\"right\").transform(input_df)\n
output_df: | column | trimmed_column | |-----------|----------------| | \" hello \" | \" hello\" |
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.columns","title":"columns class-attribute
instance-attribute
","text":"columns: ListOfColumns = Field(default='*', alias='column', description='The column (or list of columns) to trim. Alias: column. If no columns are provided, all stringcolumns will be trimmed.')\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.direction","title":"direction class-attribute
instance-attribute
","text":"direction: trim_type = Field(default='left-right', description=\"On which side to remove the spaces. Either 'left', 'right' or 'left-right'\")\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.ColumnConfig","title":"ColumnConfig","text":"Limit data types to string only.
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [STRING]\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [STRING]\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/trim.py
def func(self, column: Column):\n if self.direction == \"left\":\n return f.ltrim(column)\n\n if self.direction == \"right\":\n return f.rtrim(column)\n\n # both (left-right)\n return f.rtrim(f.ltrim(column))\n
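Since LTrim and RTrim only override the direction default, they can be used as drop-in shortcuts for Trim; a small sketch using the same hypothetical input_df as in the examples above:
from koheesio.spark.transformations.strings.trim import LTrim\n\noutput_df = LTrim(column=\"column\", target_column=\"trimmed_column\").transform(input_df)\n# equivalent to Trim(column=\"column\", target_column=\"trimmed_column\", direction=\"left\")\n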
"},{"location":"api_reference/spark/writers/index.html","title":"Writers","text":"The Writer class is used to write the DataFrame to a target.
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode","title":"koheesio.spark.writers.BatchOutputMode","text":"For Batch:
- append: Append the contents of the DataFrame to the output table, default option in Koheesio.
- overwrite: overwrite the existing data.
- ignore: ignore the operation (i.e. no-op).
- error or errorifexists: throw an exception at runtime.
- merge: update matching data in the table and insert rows that do not exist.
- merge_all: update matching data in the table and insert rows that do not exist.
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.APPEND","title":"APPEND class-attribute
instance-attribute
","text":"APPEND = 'append'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.ERROR","title":"ERROR class-attribute
instance-attribute
","text":"ERROR = 'error'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.ERRORIFEXISTS","title":"ERRORIFEXISTS class-attribute
instance-attribute
","text":"ERRORIFEXISTS = 'error'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.IGNORE","title":"IGNORE class-attribute
instance-attribute
","text":"IGNORE = 'ignore'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.MERGE","title":"MERGE class-attribute
instance-attribute
","text":"MERGE = 'merge'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.MERGEALL","title":"MERGEALL class-attribute
instance-attribute
","text":"MERGEALL = 'merge_all'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.MERGE_ALL","title":"MERGE_ALL class-attribute
instance-attribute
","text":"MERGE_ALL = 'merge_all'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.OVERWRITE","title":"OVERWRITE class-attribute
instance-attribute
","text":"OVERWRITE = 'overwrite'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.StreamingOutputMode","title":"koheesio.spark.writers.StreamingOutputMode","text":"For Streaming:
- append: only the new rows in the streaming DataFrame will be written to the sink.
- complete: all the rows in the streaming DataFrame/Dataset will be written to the sink every time there are some updates.
- update: only the rows that were updated in the streaming DataFrame/Dataset will be written to the sink every time there are some updates. If the query doesn't contain aggregations, it will be equivalent to append mode.
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.StreamingOutputMode.APPEND","title":"APPEND class-attribute
instance-attribute
","text":"APPEND = 'append'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.StreamingOutputMode.COMPLETE","title":"COMPLETE class-attribute
instance-attribute
","text":"COMPLETE = 'complete'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.StreamingOutputMode.UPDATE","title":"UPDATE class-attribute
instance-attribute
","text":"UPDATE = 'update'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer","title":"koheesio.spark.writers.Writer","text":"The Writer class is used to write the DataFrame to a target.
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer.df","title":"df class-attribute
instance-attribute
","text":"df: Optional[DataFrame] = Field(default=None, description='The Spark DataFrame')\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer.format","title":"format class-attribute
instance-attribute
","text":"format: str = Field(default='delta', description='The format of the output')\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer.streaming","title":"streaming property
","text":"streaming: bool\n
Check if the DataFrame is a streaming DataFrame or not.
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer.execute","title":"execute abstractmethod
","text":"execute()\n
Execute on a Writer should handle writing of the self.df (input) as a minimum
Source code in src/koheesio/spark/writers/__init__.py
@abstractmethod\ndef execute(self):\n \"\"\"Execute on a Writer should handle writing of the self.df (input) as a minimum\"\"\"\n # self.df # input dataframe\n ...\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer.write","title":"write","text":"write(df: Optional[DataFrame] = None) -> Output\n
Write the DataFrame to the output using execute() and return the output.
If no DataFrame is passed, the self.df will be used. If no self.df is set, a RuntimeError will be thrown.
Source code in src/koheesio/spark/writers/__init__.py
def write(self, df: Optional[DataFrame] = None) -> SparkStep.Output:\n \"\"\"Write the DataFrame to the output using execute() and return the output.\n\n If no DataFrame is passed, the self.df will be used.\n If no self.df is set, a RuntimeError will be thrown.\n \"\"\"\n self.df = df or self.df\n if not self.df:\n raise RuntimeError(\"No valid Dataframe was passed\")\n self.execute()\n return self.output\n
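A minimal sketch of what a concrete Writer subclass could look like, based only on the interface shown above; the ShowWriter name and its body are illustrative and not part of Koheesio:
from koheesio.spark.writers import Writer\n\n\nclass ShowWriter(Writer):\n    \"\"\"Illustrative writer that simply displays the DataFrame.\"\"\"\n\n    def execute(self):\n        # write() guarantees self.df is set before execute() is called\n        self.df.show()\n\n\nShowWriter().write(df=input_df)  # input_df is a hypothetical Spark DataFrame\n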
"},{"location":"api_reference/spark/writers/buffer.html","title":"Buffer","text":"This module contains classes for writing data to a buffer before writing to the final destination.
The BufferWriter
class is a base class for writers that write to a buffer first. It provides methods for writing, reading, and resetting the buffer, as well as checking if the buffer is compressed and compressing the buffer.
The PandasCsvBufferWriter
class is a subclass of BufferWriter
that writes a Spark DataFrame to CSV file(s) using Pandas. It is not meant to be used for writing huge amounts of data, but rather for writing smaller amounts of data to more arbitrary file systems (e.g., SFTP).
The PandasJsonBufferWriter
class is a subclass of BufferWriter
that writes a Spark DataFrame to JSON file(s) using Pandas. It is not meant to be used for writing huge amounts of data, but rather for writing smaller amounts of data to more arbitrary file systems (e.g., SFTP).
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter","title":"koheesio.spark.writers.buffer.BufferWriter","text":"Base class for writers that write to a buffer first, before writing to the final destination.
execute()
method should implement how the incoming DataFrame is written to the buffer object (e.g. BytesIO) in the output.
The default implementation uses a SpooledTemporaryFile
as the buffer. This is a file-like object that starts off stored in memory and automatically rolls over to a temporary file on disk if it exceeds a certain size. A SpooledTemporaryFile
behaves similar to BytesIO
, but with the added benefit of being able to handle larger amounts of data.
This approach provides a balance between speed and memory usage, allowing for fast in-memory operations for smaller amounts of data while still being able to handle larger amounts of data that would not otherwise fit in memory.
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output","title":"Output","text":"Output class for BufferWriter
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.buffer","title":"buffer class-attribute
instance-attribute
","text":"buffer: InstanceOf[SpooledTemporaryFile] = Field(default_factory=partial(SpooledTemporaryFile, mode='w+b', max_size=0), exclude=True)\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.compress","title":"compress","text":"compress()\n
Compress the file_buffer in place using GZIP
Source code in src/koheesio/spark/writers/buffer.py
def compress(self):\n \"\"\"Compress the file_buffer in place using GZIP\"\"\"\n # check if the buffer is already compressed\n if self.is_compressed():\n self.logger.warn(\"Buffer is already compressed. Nothing to compress...\")\n return self\n\n # compress the file_buffer\n file_buffer = self.buffer\n compressed = gzip.compress(file_buffer.read())\n\n # write the compressed content back to the buffer\n self.reset_buffer()\n self.buffer.write(compressed)\n\n return self # to allow for chaining\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.is_compressed","title":"is_compressed","text":"is_compressed()\n
Check if the buffer is compressed.
Source code in src/koheesio/spark/writers/buffer.py
def is_compressed(self):\n \"\"\"Check if the buffer is compressed.\"\"\"\n self.rewind_buffer()\n magic_number_present = self.buffer.read(2) == b\"\\x1f\\x8b\"\n self.rewind_buffer()\n return magic_number_present\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.read","title":"read","text":"read()\n
Read the buffer
Source code in src/koheesio/spark/writers/buffer.py
def read(self):\n \"\"\"Read the buffer\"\"\"\n self.rewind_buffer()\n data = self.buffer.read()\n self.rewind_buffer()\n return data\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.reset_buffer","title":"reset_buffer","text":"reset_buffer()\n
Reset the buffer
Source code in src/koheesio/spark/writers/buffer.py
def reset_buffer(self):\n \"\"\"Reset the buffer\"\"\"\n self.buffer.truncate(0)\n self.rewind_buffer()\n return self\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.rewind_buffer","title":"rewind_buffer","text":"rewind_buffer()\n
Rewind the buffer
Source code in src/koheesio/spark/writers/buffer.py
def rewind_buffer(self):\n \"\"\"Rewind the buffer\"\"\"\n self.buffer.seek(0)\n return self\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.write","title":"write","text":"write(df=None) -> Output\n
Write the DataFrame to the buffer
Source code in src/koheesio/spark/writers/buffer.py
def write(self, df=None) -> Output:\n \"\"\"Write the DataFrame to the buffer\"\"\"\n self.df = df or self.df\n if not self.df:\n raise RuntimeError(\"No valid Dataframe was passed\")\n self.output.reset_buffer()\n self.execute()\n return self.output\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter","title":"koheesio.spark.writers.buffer.PandasCsvBufferWriter","text":"Write a Spark DataFrame to CSV file(s) using Pandas.
Takes inspiration from https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
See also: https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option
Note This class is not meant to be used for writing huge amounts of data. It is meant to be used for writing smaller amounts of data to more arbitrary file systems (e.g. SFTP).
Pyspark vs Pandas The following table shows the mapping between Pyspark, Pandas, and Koheesio properties. Note that the default values are mostly the same as Pyspark's DataFrameWriter
implementation, with some exceptions (see below).
This class implements the most commonly used properties. If a property is not explicitly implemented, it can be accessed through params
.
PySpark Property Default PySpark Pandas Property Default Pandas Koheesio Property Default Koheesio Notes maxRecordsPerFile ... chunksize None max_records_per_file ... Spark property name: spark.sql.files.maxRecordsPerFile sep , sep , sep , lineSep \\n
line_terminator os.linesep lineSep (alias=line_terminator) \\n N/A ... index True index False Determines whether row labels (index) are included in the output header False header True header True quote \" quotechar \" quote (alias=quotechar) \" quoteAll False doublequote True quoteAll (alias=doublequote) False escape \\
escapechar None escapechar (alias=escape) \\ escapeQuotes True N/A N/A N/A ... Not available in Pandas ignoreLeadingWhiteSpace True N/A N/A N/A ... Not available in Pandas ignoreTrailingWhiteSpace True N/A N/A N/A ... Not available in Pandas charToEscapeQuoteEscaping escape or \u0000
N/A N/A N/A ... Not available in Pandas dateFormat yyyy-MM-dd
N/A N/A N/A ... Pandas implements Timestamp, not Date timestampFormat yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]
date_format N/A timestampFormat (alias=date_format) yyyy-MM-dd'T'HHss.SSS Follows PySpark defaults timestampNTZFormat yyyy-MM-dd'T'HH:mm:ss[.SSS]
N/A N/A N/A ... Pandas implements Timestamp, see above compression None compression infer compression None encoding utf-8 encoding utf-8 N/A ... Not explicitly implemented nullValue \"\" na_rep \"\" N/A \"\" Not explicitly implemented emptyValue \"\" na_rep \"\" N/A \"\" Not explicitly implemented N/A ... float_format N/A N/A ... Not explicitly implemented N/A ... decimal N/A N/A ... Not explicitly implemented N/A ... index_label None N/A ... Not explicitly implemented N/A ... columns N/A N/A ... Not explicitly implemented N/A ... mode N/A N/A ... Not explicitly implemented N/A ... quoting N/A N/A ... Not explicitly implemented N/A ... errors N/A N/A ... Not explicitly implemented N/A ... storage_options N/A N/A ... Not explicitly implemented differences with Pyspark: - dateFormat -> Pandas implements Timestamp, not just Date. Hence, Koheesio sets the default to the python equivalent of PySpark's default.
- compression -> Spark does not compress by default, hence Koheesio does not compress by default. Compression can be provided though.
Parameters:
Name Type Description Default header
bool
Whether to write the names of columns as the first line. In Pandas a list of strings can be given assumed to be aliases for the column names - this is not supported in this class. Instead, the column names as used in the dataframe are used as the header. Koheesio sets this default to True to match Pandas' default.
True
sep
str
Field delimiter for the output file. Default is ','.
,
quote
str
String of length 1. Character used to quote fields. In PySpark, this is called 'quote', in Pandas this is called 'quotechar'. Default is '\"'.
\"
quoteAll
bool
A flag indicating whether all values should always be enclosed in quotes in a field. Koheesio sets the default (False) to only escape values containing a quote character - this is Pyspark's default behavior. In Pandas, this is called 'doublequote'. Default is False.
False
escape
str
String of length 1. Character used to escape sep and quotechar when appropriate. Koheesio sets this default to \\
to match Pyspark's default behavior. In Pandas, this field is called 'escapechar', and defaults to None. Default is '\\'.
\\
timestampFormat
str
Sets the string that indicates a date format for datetime objects. Koheesio sets this default to a close equivalent of Pyspark's default (excluding timezone information). In Pandas, this field is called 'date_format'. Note: Pandas does not support Java Timestamps, only Python Timestamps. The Pyspark default is yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]
which mimics the iso8601 format (datetime.isoformat()
). Default is '%Y-%m-%dT%H:%M:%S.%f'.
yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]
lineSep
str, optional, default=
String of length 1. Defines the character used as line separator that should be used for writing. Default is os.linesep.
required compression
Optional[Literal['infer', 'gzip', 'bz2', 'zip', 'xz', 'zstd', 'tar']]
A string representing the compression to use for on-the-fly compression of the output data. Note: Pandas sets this default to 'infer', Koheesio sets this default to 'None' leaving the data uncompressed by default. Can be set to one of 'infer', 'gzip', 'bz2', 'zip', 'xz', 'zstd', or 'tar'. See Pandas documentation for more details.
None
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.compression","title":"compression class-attribute
instance-attribute
","text":"compression: Optional[CompressionOptions] = Field(default=None, description=\"A string representing the compression to use for on-the-fly compression of the output data.Note: Pandas sets this default to 'infer', Koheesio sets this default to 'None' leaving the data uncompressed by default. Can be set to one of 'infer', 'gzip', 'bz2', 'zip', 'xz', 'zstd', or 'tar'. See Pandas documentation for more details.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.escape","title":"escape class-attribute
instance-attribute
","text":"escape: constr(max_length=1) = Field(default='\\\\', description=\"String of length 1. Character used to escape sep and quotechar when appropriate. Koheesio sets this default to `\\\\` to match Pyspark's default behavior. In Pandas, this is called 'escapechar', and defaults to None.\", alias='escapechar')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.header","title":"header class-attribute
instance-attribute
","text":"header: bool = Field(default=True, description=\"Whether to write the names of columns as the first line. In Pandas a list of strings can be given assumed to be aliases for the column names - this is not supported in this class. Instead, the column names as used in the dataframe are used as the header. Koheesio sets this default to True to match Pandas' default.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.index","title":"index class-attribute
instance-attribute
","text":"index: bool = Field(default=False, description='Toggles whether to write row names (index). Default False in Koheesio - pandas default is True.')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.lineSep","title":"lineSep class-attribute
instance-attribute
","text":"lineSep: Optional[constr(max_length=1)] = Field(default=linesep, description='String of length 1. Defines the character used as line separator that should be used for writing.', alias='line_terminator')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.quote","title":"quote class-attribute
instance-attribute
","text":"quote: constr(max_length=1) = Field(default='\"', description=\"String of length 1. Character used to quote fields. In PySpark, this is called 'quote', in Pandas this is called 'quotechar'.\", alias='quotechar')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.quoteAll","title":"quoteAll class-attribute
instance-attribute
","text":"quoteAll: bool = Field(default=False, description=\"A flag indicating whether all values should always be enclosed in quotes in a field. Koheesio set the default (False) to only escape values containing a quote character - this is Pyspark's default behavior. In Pandas, this is called 'doublequote'.\", alias='doublequote')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.sep","title":"sep class-attribute
instance-attribute
","text":"sep: constr(max_length=1) = Field(default=',', description='Field delimiter for the output file')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.timestampFormat","title":"timestampFormat class-attribute
instance-attribute
","text":"timestampFormat: str = Field(default='%Y-%m-%dT%H:%M:%S.%f', description=\"Sets the string that indicates a date format for datetime objects. Koheesio sets this default to a close equivalent of Pyspark's default (excluding timezone information). In Pandas, this field is called 'date_format'. Note: Pandas does not support Java Timestamps, only Python Timestamps. The Pyspark default is `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]` which mimics the iso8601 format (`datetime.isoformat()`).\", alias='date_format')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.Output","title":"Output","text":"Output class for PandasCsvBufferWriter
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.Output.pandas_df","title":"pandas_df class-attribute
instance-attribute
","text":"pandas_df: Optional[DataFrame] = Field(None, description='The Pandas DataFrame that was written')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.execute","title":"execute","text":"execute()\n
Write the DataFrame to the buffer using Pandas to_csv() method. Compression is handled by pandas to_csv() method.
Source code in src/koheesio/spark/writers/buffer.py
def execute(self):\n \"\"\"Write the DataFrame to the buffer using Pandas to_csv() method.\n Compression is handled by pandas to_csv() method.\n \"\"\"\n # convert the Spark DataFrame to a Pandas DataFrame\n self.output.pandas_df = self.df.toPandas()\n\n # create csv file in memory\n file_buffer = self.output.buffer\n self.output.pandas_df.to_csv(file_buffer, **self.get_options(options_type=\"spark\"))\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.get_options","title":"get_options","text":"get_options(options_type: str = 'csv')\n
Returns the options to pass to Pandas' to_csv() method.
Source code in src/koheesio/spark/writers/buffer.py
def get_options(self, options_type: str = \"csv\"):\n \"\"\"Returns the options to pass to Pandas' to_csv() method.\"\"\"\n try:\n import pandas as _pd\n\n # Get the pandas version as a tuple of integers\n pandas_version = tuple(int(i) for i in _pd.__version__.split(\".\"))\n except ImportError:\n raise ImportError(\"Pandas is required to use this writer\")\n\n # Use line_separator for pandas 2.0.0 and later\n line_sep_option_naming = \"line_separator\" if pandas_version >= (2, 0, 0) else \"line_terminator\"\n\n csv_options = {\n \"header\": self.header,\n \"sep\": self.sep,\n \"quotechar\": self.quote,\n \"doublequote\": self.quoteAll,\n \"escapechar\": self.escape,\n \"na_rep\": self.emptyValue or self.nullValue,\n line_sep_option_naming: self.lineSep,\n \"index\": self.index,\n \"date_format\": self.timestampFormat,\n \"compression\": self.compression,\n **self.params,\n }\n\n if options_type == \"spark\":\n csv_options[\"lineterminator\"] = csv_options.pop(line_sep_option_naming)\n elif options_type == \"kohesio_pandas_buffer_writer\":\n csv_options[\"line_terminator\"] = csv_options.pop(line_sep_option_naming)\n\n return csv_options\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter","title":"koheesio.spark.writers.buffer.PandasJsonBufferWriter","text":"Write a Spark DataFrame to JSON file(s) using Pandas.
Takes inspiration from https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html
Note This class is not meant to be used for writing huge amounts of data. It is meant to be used for writing smaller amounts of data to more arbitrary file systems (e.g. SFTP).
Parameters:
Name Type Description Default orient
Format of the resulting JSON string. Default is 'records'.
required lines
Format output as one JSON object per line. Only used when orient='records'. Default is True. - If true, the output will be formatted as one JSON object per line. - If false, the output will be written as a single JSON object. Note: this value is only used when orient='records' and will be ignored otherwise.
required date_format
Type of date conversion. Default is 'iso'. See Date and Timestamp Formats
for a detailed description and more information.
required double_precision
Number of decimal places for encoding floating point values. Default is 10.
required force_ascii
Force encoded string to be ASCII. Default is True.
required compression
A string representing the compression to use for on-the-fly compression of the output data. Koheesio sets this default to None, leaving the data uncompressed. Can be set to 'gzip' optionally. Other compression options are currently not supported by Koheesio for JSON output.
required"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.columns","title":"columns class-attribute
instance-attribute
","text":"columns: Optional[list[str]] = Field(default=None, description='The columns to write. If None, all columns will be written.')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.compression","title":"compression class-attribute
instance-attribute
","text":"compression: Optional[Literal['gzip']] = Field(default=None, description=\"A string representing the compression to use for on-the-fly compression of the output data.Koheesio sets this default to 'None' leaving the data uncompressed by default. Can be set to 'gzip' optionally.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.date_format","title":"date_format class-attribute
instance-attribute
","text":"date_format: Literal['iso', 'epoch'] = Field(default='iso', description=\"Type of date conversion. Default is 'iso'.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.double_precision","title":"double_precision class-attribute
instance-attribute
","text":"double_precision: int = Field(default=10, description='Number of decimal places for encoding floating point values. Default is 10.')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.force_ascii","title":"force_ascii class-attribute
instance-attribute
","text":"force_ascii: bool = Field(default=True, description='Force encoded string to be ASCII. Default is True.')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.lines","title":"lines class-attribute
instance-attribute
","text":"lines: bool = Field(default=True, description=\"Format output as one JSON object per line. Only used when orient='records'. Default is True.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.orient","title":"orient class-attribute
instance-attribute
","text":"orient: Literal['split', 'records', 'index', 'columns', 'values', 'table'] = Field(default='records', description=\"Format of the resulting JSON string. Default is 'records'.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.Output","title":"Output","text":"Output class for PandasJsonBufferWriter
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.Output.pandas_df","title":"pandas_df class-attribute
instance-attribute
","text":"pandas_df: Optional[DataFrame] = Field(None, description='The Pandas DataFrame that was written')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.execute","title":"execute","text":"execute()\n
Write the DataFrame to the buffer using Pandas to_json() method.
Source code in src/koheesio/spark/writers/buffer.py
def execute(self):\n \"\"\"Write the DataFrame to the buffer using Pandas to_json() method.\"\"\"\n df = self.df\n if self.columns:\n df = df[self.columns]\n\n # convert the Spark DataFrame to a Pandas DataFrame\n self.output.pandas_df = df.toPandas()\n\n # create json file in memory\n file_buffer = self.output.buffer\n self.output.pandas_df.to_json(file_buffer, **self.get_options())\n\n # compress the buffer if compression is set\n if self.compression:\n self.output.compress()\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.get_options","title":"get_options","text":"get_options()\n
Returns the options to pass to Pandas' to_json() method.
Source code in src/koheesio/spark/writers/buffer.py
def get_options(self):\n \"\"\"Returns the options to pass to Pandas' to_json() method.\"\"\"\n json_options = {\n \"orient\": self.orient,\n \"date_format\": self.date_format,\n \"double_precision\": self.double_precision,\n \"force_ascii\": self.force_ascii,\n \"lines\": self.lines,\n **self.params,\n }\n\n # ignore the 'lines' parameter if orient is not 'records'\n if self.orient != \"records\":\n del json_options[\"lines\"]\n\n return json_options\n
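Analogous to the CSV writer, a minimal usage sketch for PandasJsonBufferWriter; `df` is assumed to be an existing, small Spark DataFrame, and the fields used are those documented above:

```python
from koheesio.spark.writers.buffer import PandasJsonBufferWriter

writer = PandasJsonBufferWriter(
    df=df,
    orient="records",    # one JSON object per record
    lines=True,          # newline-delimited output (only applies to orient='records')
    compression="gzip",  # optional; execute() compresses the buffer when this is set
)
writer.execute()

json_buffer = writer.output.buffer  # in-memory buffer with the (gzipped) JSON
```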
"},{"location":"api_reference/spark/writers/dummy.html","title":"Dummy","text":"Module for the DummyWriter class.
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter","title":"koheesio.spark.writers.dummy.DummyWriter","text":"A simple DummyWriter that performs the equivalent of a df.show() on the given DataFrame and returns the first row of data as a dict.
This Writer does not actually write anything to a source/destination, but is useful for debugging or testing purposes.
Parameters:
Name Type Description Default n
PositiveInt
Number of rows to show.
20
truncate
bool | PositiveInt
If set to True
, truncate strings longer than 20 chars by default. If set to a number greater than one, truncates long strings to length truncate
and align cells right.
True
vertical
bool
If set to True
, print output rows vertically (one line per column value).
False
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.n","title":"n class-attribute
instance-attribute
","text":"n: PositiveInt = Field(default=20, description='Number of rows to show.', gt=0)\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.truncate","title":"truncate class-attribute
instance-attribute
","text":"truncate: Union[bool, PositiveInt] = Field(default=True, description='If set to ``True``, truncate strings longer than 20 chars by default.If set to a number greater than one, truncates long strings to length ``truncate`` and align cells right.')\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.vertical","title":"vertical class-attribute
instance-attribute
","text":"vertical: bool = Field(default=False, description='If set to ``True``, print output rows vertically (one line per column value).')\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.Output","title":"Output","text":"DummyWriter output
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.Output.df_content","title":"df_content class-attribute
instance-attribute
","text":"df_content: str = Field(default=..., description='The content of the DataFrame as a string')\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.Output.head","title":"head class-attribute
instance-attribute
","text":"head: Dict[str, Any] = Field(default=..., description='The first row of the DataFrame as a dict')\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.execute","title":"execute","text":"execute() -> Output\n
Execute the DummyWriter
Source code in src/koheesio/spark/writers/dummy.py
def execute(self) -> Output:\n \"\"\"Execute the DummyWriter\"\"\"\n df: DataFrame = self.df\n\n # noinspection PyProtectedMember\n df_content = df._jdf.showString(self.n, self.truncate, self.vertical)\n\n # logs the equivalent of doing df.show()\n self.log.info(f\"content of df that was passed to DummyWriter:\\n{df_content}\")\n\n self.output.head = self.df.head().asDict()\n self.output.df_content = df_content\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.int_truncate","title":"int_truncate","text":"int_truncate(truncate_value) -> int\n
Truncate is either a bool or an int.
Parameters: truncate_value : int | bool, optional, default=True If int, specifies the maximum length of the string. If bool and True, defaults to a maximum length of 20 characters.
Returns: int The maximum length of the string.
Source code in src/koheesio/spark/writers/dummy.py
@field_validator(\"truncate\")\ndef int_truncate(cls, truncate_value) -> int:\n \"\"\"\n Truncate is either a bool or an int.\n\n Parameters:\n -----------\n truncate_value : int | bool, optional, default=True\n If int, specifies the maximum length of the string.\n If bool and True, defaults to a maximum length of 20 characters.\n\n Returns:\n --------\n int\n The maximum length of the string.\n\n \"\"\"\n # Same logic as what is inside DataFrame.show()\n if isinstance(truncate_value, bool) and truncate_value is True:\n return 20 # default is 20 chars\n return int(truncate_value) # otherwise 0, or whatever the user specified\n
"},{"location":"api_reference/spark/writers/kafka.html","title":"Kafka","text":"Kafka writer to write batch or streaming data into kafka topics
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter","title":"koheesio.spark.writers.kafka.KafkaWriter","text":"Kafka writer to write batch or streaming data into kafka topics
All kafka specific options can be provided as additional init params
Parameters:
Name Type Description Default broker
str
broker url of the kafka cluster
required topic
str
full topic name to write the data to
required trigger
Optional[Union[Trigger, str, Dict]]
Indicates optionally how to stream the data into kafka, continuous or batch
required checkpoint_location
str
In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs.
required Example KafkaWriter(\n    broker=\"broker.com:9500\",\n    topic=\"test-topic\",\n    trigger=Trigger(continuous=\"5 seconds\"),\n    checkpoint_location=\"s3://bucket/test-topic\",\n    **{\n        \"includeHeaders\": \"true\",\n        \"key.serializer\": \"org.apache.kafka.common.serialization.StringSerializer\",\n        \"value.serializer\": \"org.apache.kafka.common.serialization.StringSerializer\",\n        \"kafka.group.id\": \"test-group\",\n    },\n)\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.batch_writer","title":"batch_writer property
","text":"batch_writer: DataFrameWriter\n
returns a batch writer
Returns:
Type Description DataFrameWriter
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.broker","title":"broker class-attribute
instance-attribute
","text":"broker: str = Field(default=..., description='Kafka brokers to write to')\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.checkpoint_location","title":"checkpoint_location class-attribute
instance-attribute
","text":"checkpoint_location: str = Field(default=..., alias='checkpointLocation', description='In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs. You can configure a query with a checkpoint location, and the query will save all the progress information and the running aggregates to the checkpoint location. This checkpoint location has to be a path in an HDFS compatible file system that is accessible by Spark (e.g. Databricks Unity Catalog External Location)')\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.format","title":"format class-attribute
instance-attribute
","text":"format: str = 'kafka'\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.logged_option_keys","title":"logged_option_keys property
","text":"logged_option_keys\n
keys to be logged
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.options","title":"options property
","text":"options\n
retrieve the kafka options incl topic and broker.
Returns:
Type Description dict
Dict being the combination of kafka options + topic + broker
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.stream_writer","title":"stream_writer property
","text":"stream_writer: DataStreamWriter\n
returns a stream writer
Returns:
Type Description DataStreamWriter
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.streaming_query","title":"streaming_query property
","text":"streaming_query: Optional[Union[str, StreamingQuery]]\n
return the streaming query
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.topic","title":"topic class-attribute
instance-attribute
","text":"topic: str = Field(default=..., description='Kafka topic to write to')\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.trigger","title":"trigger class-attribute
instance-attribute
","text":"trigger: Optional[Union[Trigger, str, Dict]] = Field(Trigger(available_now=True), description='Set the trigger for the stream query. If not set data is processed in batch')\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.writer","title":"writer property
","text":"writer: Union[DataStreamWriter, DataFrameWriter]\n
function to get the writer of the proper type, according to whether the data to be written is a stream or not. This function will also set the trigger property in case of a data stream.
Returns:
Type Description Union[DataStreamWriter, DataFrameWriter]
In case of streaming data -> DataStreamWriter, else -> DataFrameWriter
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.Output","title":"Output","text":"Output of the KafkaWriter
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.Output.streaming_query","title":"streaming_query class-attribute
instance-attribute
","text":"streaming_query: Optional[Union[str, StreamingQuery]] = Field(default=None, description='Query ID of the stream query')\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.execute","title":"execute","text":"execute()\n
Effectively write the data from the dataframe (streaming or batch) to the kafka topic.
Returns:
Type Description Output
streaming_query function can be used to gain insights on running write.
Source code in src/koheesio/spark/writers/kafka.py
def execute(self):\n \"\"\"Effectively write the data from the dataframe (streaming of batch) to kafka topic.\n\n Returns\n -------\n KafkaWriter.Output\n streaming_query function can be used to gain insights on running write.\n \"\"\"\n applied_options = {k: v for k, v in self.options.items() if k in self.logged_option_keys}\n self.log.debug(f\"Applying options {applied_options}\")\n\n self._validate_dataframe()\n\n _writer = self.writer.format(self.format).options(**self.options)\n self.output.streaming_query = _writer.start() if self.streaming else _writer.save()\n
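Building on execute() above, a hedged sketch of pushing a DataFrame to Kafka; the broker, topic and checkpoint values are placeholders, and `df` is assumed to already contain the `key`/`value` columns Kafka expects:

```python
from koheesio.spark.writers.kafka import KafkaWriter

writer = KafkaWriter(
    broker="broker.com:9500",
    topic="test-topic",
    checkpoint_location="s3://bucket/test-topic",
)
writer.write(df)

# For a streaming DataFrame, the query handle is exposed on the output
query = writer.output.streaming_query
```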
"},{"location":"api_reference/spark/writers/snowflake.html","title":"Snowflake","text":"This module contains the SnowflakeWriter class, which is used to write data to Snowflake.
"},{"location":"api_reference/spark/writers/stream.html","title":"Stream","text":"Module that holds some classes and functions to be able to write to a stream
Classes:
Name Description Trigger
class to set the trigger for a stream query
StreamWriter
abstract class for stream writers
ForEachBatchStreamWriter
class to run a writer for each batch
Functions:
Name Description writer_to_foreachbatch
function to be used as batch_function for StreamWriter (sub)classes
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.ForEachBatchStreamWriter","title":"koheesio.spark.writers.stream.ForEachBatchStreamWriter","text":"Runnable ForEachBatchWriter
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.ForEachBatchStreamWriter.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/writers/stream.py
def execute(self):\n self.streaming_query = self.writer.start()\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter","title":"koheesio.spark.writers.stream.StreamWriter","text":"ABC Stream Writer
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.batch_function","title":"batch_function class-attribute
instance-attribute
","text":"batch_function: Optional[Callable] = Field(default=None, description='allows you to run custom batch functions for each micro batch', alias='batch_function_for_each_df')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.checkpoint_location","title":"checkpoint_location class-attribute
instance-attribute
","text":"checkpoint_location: str = Field(default=..., alias='checkpointLocation', description='In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs. You can configure a query with a checkpoint location, and the query will save all the progress information and the running aggregates to the checkpoint location. This checkpoint location has to be a path in an HDFS compatible file system that is accessible by Spark (e.g. Databricks Unity Catalog External Location)')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.output_mode","title":"output_mode class-attribute
instance-attribute
","text":"output_mode: StreamingOutputMode = Field(default=APPEND, alias='outputMode', description=__doc__)\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.stream_writer","title":"stream_writer property
","text":"stream_writer: DataStreamWriter\n
Returns the stream writer for the given DataFrame and settings
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.streaming_query","title":"streaming_query class-attribute
instance-attribute
","text":"streaming_query: Optional[Union[str, StreamingQuery]] = Field(default=None, description='Query ID of the stream query')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.trigger","title":"trigger class-attribute
instance-attribute
","text":"trigger: Optional[Union[Trigger, str, Dict]] = Field(default=Trigger(available_now=True), description='Set the trigger for the stream query. If this is not set it process data as batch')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.writer","title":"writer property
","text":"writer\n
Returns the stream writer since we don't have a batch mode for streams
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.await_termination","title":"await_termination","text":"await_termination(timeout: Optional[int] = None)\n
Await termination of the stream query
Source code in src/koheesio/spark/writers/stream.py
def await_termination(self, timeout: Optional[int] = None):\n \"\"\"Await termination of the stream query\"\"\"\n self.streaming_query.awaitTermination(timeout=timeout)\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.execute","title":"execute abstractmethod
","text":"execute()\n
Source code in src/koheesio/spark/writers/stream.py
@abstractmethod\ndef execute(self):\n raise NotImplementedError\n
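Because execute() is abstract, concrete stream writers supply it themselves. A minimal sketch of what a subclass could look like, modelled directly on ForEachBatchStreamWriter above (illustrative only):

```python
from koheesio.spark.writers.stream import StreamWriter

class MyStreamWriter(StreamWriter):
    """Illustrative subclass: start the already-configured stream writer."""

    def execute(self):
        # self.writer is the DataStreamWriter with trigger, outputMode and
        # checkpointLocation applied; starting it yields the streaming query
        self.streaming_query = self.writer.start()
```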
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger","title":"koheesio.spark.writers.stream.Trigger","text":"Trigger types for a stream query.
Only one trigger can be set!
Example - processingTime='5 seconds'
- continuous='5 seconds'
- availableNow=True
- once=True
See Also - https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.available_now","title":"available_now class-attribute
instance-attribute
","text":"available_now: Optional[bool] = Field(default=None, alias='availableNow', description='if set to True, set a trigger that processes all available data in multiple batches then terminates the query.')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.continuous","title":"continuous class-attribute
instance-attribute
","text":"continuous: Optional[str] = Field(default=None, description=\"a time interval as a string, e.g. '5 seconds', '1 minute'.Set a trigger that runs a continuous query with a given checkpoint interval.\")\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(validate_default=False, extra='forbid')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.once","title":"once class-attribute
instance-attribute
","text":"once: Optional[bool] = Field(default=None, deprecated=True, description='if set to True, set a trigger that processes only one batch of data in a streaming query then terminates the query. use `available_now` instead of `once`.')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.processing_time","title":"processing_time class-attribute
instance-attribute
","text":"processing_time: Optional[str] = Field(default=None, alias='processingTime', description=\"a processing time interval as a string, e.g. '5 seconds', '1 minute'.Set a trigger that runs a microbatch query periodically based on the processing time.\")\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.triggers","title":"triggers property
","text":"triggers\n
Returns a list of tuples with the value for each trigger
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.value","title":"value property
","text":"value: Dict[str, str]\n
Returns the trigger value as a dictionary
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.execute","title":"execute","text":"execute()\n
Returns the trigger value as a dictionary This method can be skipped, as the value can be accessed directly from the value
property
Source code in src/koheesio/spark/writers/stream.py
def execute(self):\n \"\"\"Returns the trigger value as a dictionary\n This method can be skipped, as the value can be accessed directly from the `value` property\n \"\"\"\n self.log.warning(\"Trigger.execute is deprecated. Use Trigger.value directly instead\")\n self.output.value = self.value\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.from_any","title":"from_any classmethod
","text":"from_any(value)\n
Dynamically creates a Trigger class based on either another Trigger class, a passed string value, or a dictionary
This way Trigger.from_any can be used as part of a validator, without needing to worry about supported types
Source code in src/koheesio/spark/writers/stream.py
@classmethod\ndef from_any(cls, value):\n \"\"\"Dynamically creates a Trigger class based on either another Trigger class, a passed string value, or a\n dictionary\n\n This way Trigger.from_any can be used as part of a validator, without needing to worry about supported types\n \"\"\"\n if isinstance(value, Trigger):\n return value\n\n if isinstance(value, str):\n return cls.from_string(value)\n\n if isinstance(value, dict):\n return cls.from_dict(value)\n\n raise RuntimeError(f\"Unable to create Trigger based on the given value: {value}\")\n
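To illustrate, the constructors above are interchangeable; the following sketch uses the trigger strings listed in the from_string examples further below:

```python
from koheesio.spark.writers.stream import Trigger

# Three equivalent ways to express the same trigger
t1 = Trigger(processingTime="5 seconds")                 # via the field alias
t2 = Trigger.from_dict({"processingTime": "5 seconds"})
t3 = Trigger.from_any("processingTime='5 seconds'")

print(t1.value)  # e.g. {'processingTime': '5 seconds'}
```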
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.from_dict","title":"from_dict classmethod
","text":"from_dict(_dict)\n
Creates a Trigger class based on a dictionary
Source code in src/koheesio/spark/writers/stream.py
@classmethod\ndef from_dict(cls, _dict):\n \"\"\"Creates a Trigger class based on a dictionary\"\"\"\n return cls(**_dict)\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.from_string","title":"from_string classmethod
","text":"from_string(trigger: str)\n
Creates a Trigger class based on a string
Example Source code in src/koheesio/spark/writers/stream.py
@classmethod\ndef from_string(cls, trigger: str):\n \"\"\"Creates a Trigger class based on a string\n\n Example\n -------\n ### happy flow\n\n * processingTime='5 seconds'\n * processing_time=\"5 hours\"\n * processingTime=4 minutes\n * once=True\n * once=true\n * available_now=true\n * continuous='3 hours'\n * once=TrUe\n * once=TRUE\n\n ### unhappy flow\n valid values, but should fail the validation check of the class\n\n * availableNow=False\n * continuous=True\n * once=false\n \"\"\"\n import re\n\n trigger_from_string = re.compile(r\"(?P<triggerType>\\w+)=[\\'\\\"]?(?P<value>.+)[\\'\\\"]?\")\n _match = trigger_from_string.match(trigger)\n\n if _match is None:\n raise ValueError(\n f\"Cannot parse value for Trigger: '{trigger}'. \\n\"\n f\"Valid types are {', '.join(cls._all_triggers_with_alias())}\"\n )\n\n trigger_type, value = _match.groups()\n\n # strip the value of any quotes\n value = value.strip(\"'\").strip('\"')\n\n # making value a boolean when given\n value = convert_str_to_bool(value)\n\n return cls.from_dict({trigger_type: value})\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.from_string--happy-flow","title":"happy flow","text":" - processingTime='5 seconds'
- processing_time=\"5 hours\"
- processingTime=4 minutes
- once=True
- once=true
- available_now=true
- continuous='3 hours'
- once=TrUe
- once=TRUE
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.from_string--unhappy-flow","title":"unhappy flow","text":"valid values, but should fail the validation check of the class
- availableNow=False
- continuous=True
- once=false
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.validate_available_now","title":"validate_available_now","text":"validate_available_now(available_now)\n
Validate the available_now trigger value
Source code in src/koheesio/spark/writers/stream.py
@field_validator(\"available_now\", mode=\"before\")\ndef validate_available_now(cls, available_now):\n \"\"\"Validate the available_now trigger value\"\"\"\n # making value a boolean when given\n available_now = convert_str_to_bool(available_now)\n\n # adapted from `pyspark.sql.streaming.readwriter.DataStreamWriter.trigger`\n if available_now is not True:\n raise ValueError(f\"Value for availableNow must be True. Got:{available_now}\")\n return available_now\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.validate_continuous","title":"validate_continuous","text":"validate_continuous(continuous)\n
Validate the continuous trigger value
Source code in src/koheesio/spark/writers/stream.py
@field_validator(\"continuous\", mode=\"before\")\ndef validate_continuous(cls, continuous):\n \"\"\"Validate the continuous trigger value\"\"\"\n # adapted from `pyspark.sql.streaming.readwriter.DataStreamWriter.trigger` except that the if statement is not\n # split in two parts\n if not isinstance(continuous, str):\n raise ValueError(f\"Value for continuous must be a string. Got: {continuous}\")\n\n if len(continuous.strip()) == 0:\n raise ValueError(f\"Value for continuous must be a non empty string. Got: {continuous}\")\n return continuous\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.validate_once","title":"validate_once","text":"validate_once(once)\n
Validate the once trigger value
Source code in src/koheesio/spark/writers/stream.py
@field_validator(\"once\", mode=\"before\")\ndef validate_once(cls, once):\n \"\"\"Validate the once trigger value\"\"\"\n # making value a boolean when given\n once = convert_str_to_bool(once)\n\n # adapted from `pyspark.sql.streaming.readwriter.DataStreamWriter.trigger`\n if once is not True:\n raise ValueError(f\"Value for once must be True. Got: {once}\")\n return once\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.validate_processing_time","title":"validate_processing_time","text":"validate_processing_time(processing_time)\n
Validate the processing time trigger value
Source code in src/koheesio/spark/writers/stream.py
@field_validator(\"processing_time\", mode=\"before\")\ndef validate_processing_time(cls, processing_time):\n \"\"\"Validate the processing time trigger value\"\"\"\n # adapted from `pyspark.sql.streaming.readwriter.DataStreamWriter.trigger`\n if not isinstance(processing_time, str):\n raise ValueError(f\"Value for processing_time must be a string. Got: {processing_time}\")\n\n if len(processing_time.strip()) == 0:\n raise ValueError(f\"Value for processingTime must be a non empty string. Got: {processing_time}\")\n return processing_time\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.validate_triggers","title":"validate_triggers","text":"validate_triggers(triggers: Dict)\n
Validate the trigger value
Source code in src/koheesio/spark/writers/stream.py
@model_validator(mode=\"before\")\ndef validate_triggers(cls, triggers: Dict):\n \"\"\"Validate the trigger value\"\"\"\n params = [*triggers.values()]\n\n # adapted from `pyspark.sql.streaming.readwriter.DataStreamWriter.trigger`; modified to work with pydantic v2\n if not triggers:\n raise ValueError(\"No trigger provided\")\n if len(params) > 1:\n raise ValueError(\"Multiple triggers not allowed.\")\n\n return triggers\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.writer_to_foreachbatch","title":"koheesio.spark.writers.stream.writer_to_foreachbatch","text":"writer_to_foreachbatch(writer: Writer)\n
Call writer.execute
on each batch
To be passed as batch_function for StreamWriter (sub)classes.
Example Source code in src/koheesio/spark/writers/stream.py
def writer_to_foreachbatch(writer: Writer):\n \"\"\"Call `writer.execute` on each batch\n\n To be passed as batch_function for StreamWriter (sub)classes.\n\n Example\n -------\n ### Writing to a Delta table and a Snowflake table\n ```python\n DeltaTableStreamWriter(\n table=\"my_table\",\n checkpointLocation=\"my_checkpointlocation\",\n batch_function=writer_to_foreachbatch(\n SnowflakeWriter(\n **sfOptions,\n table=\"snowflake_table\",\n insert_type=SnowflakeWriter.InsertType.APPEND,\n )\n ),\n )\n ```\n \"\"\"\n\n def inner(df, batch_id: int):\n \"\"\"Inner method\n\n As per the Spark documentation:\n In every micro-batch, the provided function will be called in every micro-batch with (i) the output rows as a\n DataFrame and (ii) the batch identifier. The batchId can be used deduplicate and transactionally write the\n output (that is, the provided Dataset) to external systems. The output DataFrame is guaranteed to exactly\n same for the same batchId (assuming all operations are deterministic in the query).\n \"\"\"\n writer.log.debug(f\"Running batch function for batch {batch_id}\")\n writer.write(df)\n\n return inner\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.writer_to_foreachbatch--writing-to-a-delta-table-and-a-snowflake-table","title":"Writing to a Delta table and a Snowflake table","text":"DeltaTableStreamWriter(\n table=\"my_table\",\n checkpointLocation=\"my_checkpointlocation\",\n batch_function=writer_to_foreachbatch(\n SnowflakeWriter(\n **sfOptions,\n table=\"snowflake_table\",\n insert_type=SnowflakeWriter.InsertType.APPEND,\n )\n ),\n)\n
"},{"location":"api_reference/spark/writers/delta/index.html","title":"Delta","text":"This module is the entry point for the koheesio.spark.writers.delta package.
It imports and exposes the DeltaTableWriter and DeltaTableStreamWriter classes for external use.
Classes: DeltaTableWriter: Class to write data in batch mode to a Delta table. DeltaTableStreamWriter: Class to write data in streaming mode to a Delta table.
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode","title":"koheesio.spark.writers.delta.BatchOutputMode","text":"For Batch:
- append: Append the contents of the DataFrame to the output table, default option in Koheesio.
- overwrite: overwrite the existing data.
- ignore: ignore the operation (i.e. no-op).
- error or errorifexists: throw an exception at runtime.
- merge: update matching data in the table and insert rows that do not exist.
- merge_all: update matching data in the table and insert rows that do not exist.
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.APPEND","title":"APPEND class-attribute
instance-attribute
","text":"APPEND = 'append'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.ERROR","title":"ERROR class-attribute
instance-attribute
","text":"ERROR = 'error'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.ERRORIFEXISTS","title":"ERRORIFEXISTS class-attribute
instance-attribute
","text":"ERRORIFEXISTS = 'error'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.IGNORE","title":"IGNORE class-attribute
instance-attribute
","text":"IGNORE = 'ignore'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.MERGE","title":"MERGE class-attribute
instance-attribute
","text":"MERGE = 'merge'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.MERGEALL","title":"MERGEALL class-attribute
instance-attribute
","text":"MERGEALL = 'merge_all'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.MERGE_ALL","title":"MERGE_ALL class-attribute
instance-attribute
","text":"MERGE_ALL = 'merge_all'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.OVERWRITE","title":"OVERWRITE class-attribute
instance-attribute
","text":"OVERWRITE = 'overwrite'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter","title":"koheesio.spark.writers.delta.DeltaTableStreamWriter","text":"Delta table stream writer
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter.Options","title":"Options","text":"Options for DeltaTableStreamWriter
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter.Options.allow_population_by_field_name","title":"allow_population_by_field_name class-attribute
instance-attribute
","text":"allow_population_by_field_name: bool = Field(default=True, description=' To do convert to Field and pass as .options(**config)')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter.Options.maxBytesPerTrigger","title":"maxBytesPerTrigger class-attribute
instance-attribute
","text":"maxBytesPerTrigger: Optional[str] = Field(default=None, description='How much data to be processed per trigger. The default is 1GB')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter.Options.maxFilesPerTrigger","title":"maxFilesPerTrigger class-attribute
instance-attribute
","text":"maxFilesPerTrigger: int = Field(default == 1000, description='The maximum number of new files to be considered in every trigger (default: 1000).')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/writers/delta/stream.py
def execute(self):\n if self.batch_function:\n self.streaming_query = self.writer.start()\n else:\n self.streaming_query = self.writer.toTable(tableName=self.table.table_name)\n
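A short usage sketch for DeltaTableStreamWriter; the table name and checkpoint path are placeholders, and `streaming_df` is assumed to be a streaming DataFrame:

```python
from koheesio.spark.writers.delta import DeltaTableStreamWriter

writer = DeltaTableStreamWriter(
    table="my_catalog.my_schema.my_table",
    checkpointLocation="s3://bucket/checkpoints/my_table",
)
writer.write(streaming_df)

writer.await_termination(timeout=60)  # optionally block on the streaming query
```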
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter","title":"koheesio.spark.writers.delta.DeltaTableWriter","text":"Delta table Writer for both batch and streaming dataframes.
Example Parameters:
Name Type Description Default table
Union[DeltaTableStep, str]
The table to write to
required output_mode
Optional[Union[str, BatchOutputMode, StreamingOutputMode]]
The output mode to use. Default is BatchOutputMode.APPEND. For streaming, use StreamingOutputMode.
required params
Optional[dict]
Additional parameters to use for specific mode
required"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter--example-for-mergeall","title":"Example for MERGEALL
","text":"DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.MERGEALL,\n output_mode_params={\n \"merge_cond\": \"target.id=source.id\",\n \"update_cond\": \"target.col1_val>=source.col1_val\",\n \"insert_cond\": \"source.col_bk IS NOT NULL\",\n \"target_alias\": \"target\", # <------ DEFAULT, can be changed by providing custom value\n \"source_alias\": \"source\", # <------ DEFAULT, can be changed by providing custom value\n },\n)\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter--example-for-merge","title":"Example for MERGE
","text":"DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.MERGE,\n output_mode_params={\n 'merge_builder': (\n DeltaTable\n .forName(sparkSession=spark, tableOrViewName=<target_table_name>)\n .alias(target_alias)\n .merge(source=df, condition=merge_cond)\n .whenMatchedUpdateAll(condition=update_cond)\n .whenNotMatchedInsertAll(condition=insert_cond)\n )\n }\n )\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter--example-for-merge_1","title":"Example for MERGE
","text":"in case the table isn't created yet, first run will execute an APPEND operation
DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.MERGE,\n output_mode_params={\n \"merge_builder\": [\n {\n \"clause\": \"whenMatchedUpdate\",\n \"set\": {\"value\": \"source.value\"},\n \"condition\": \"<update_condition>\",\n },\n {\n \"clause\": \"whenNotMatchedInsert\",\n \"values\": {\"id\": \"source.id\", \"value\": \"source.value\"},\n \"condition\": \"<insert_condition>\",\n },\n ],\n \"merge_cond\": \"<merge_condition>\",\n },\n)\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter--example-for-append","title":"Example for APPEND","text":"dataframe writer options can be passed as keyword arguments
DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.APPEND,\n partitionOverwriteMode=\"dynamic\",\n mergeSchema=\"false\",\n)\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.format","title":"format class-attribute
instance-attribute
","text":"format: str = 'delta'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.output_mode","title":"output_mode class-attribute
instance-attribute
","text":"output_mode: Optional[Union[BatchOutputMode, StreamingOutputMode]] = Field(default=APPEND, alias='outputMode', description=f'{__doc__}\n{__doc__}')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.params","title":"params class-attribute
instance-attribute
","text":"params: Optional[dict] = Field(default_factory=dict, alias='output_mode_params', description='Additional parameters to use for specific mode')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.partition_by","title":"partition_by class-attribute
instance-attribute
","text":"partition_by: Optional[List[str]] = Field(default=None, alias='partitionBy', description='The list of fields to partition the Delta table on')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.table","title":"table class-attribute
instance-attribute
","text":"table: Union[DeltaTableStep, str] = Field(default=..., description='The table to write to')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.writer","title":"writer property
","text":"writer: Union[DeltaMergeBuilder, DataFrameWriter]\n
Specify DeltaTableWriter
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/writers/delta/batch.py
def execute(self):\n _writer = self.writer\n\n if self.table.create_if_not_exists and not self.table.exists:\n _writer = _writer.options(**self.table.default_create_properties)\n\n if isinstance(_writer, DeltaMergeBuilder):\n _writer.execute()\n else:\n if options := self.params:\n # should we add options only if mode is not merge?\n _writer = _writer.options(**options)\n _writer.saveAsTable(self.table.table_name)\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.get_output_mode","title":"get_output_mode classmethod
","text":"get_output_mode(choice: str, options: Set[Type]) -> Union[BatchOutputMode, StreamingOutputMode]\n
Retrieve an OutputMode by validating choice
against a set of option OutputModes.
Currently supported output modes can be found in:
- BatchOutputMode
- StreamingOutputMode
Source code in src/koheesio/spark/writers/delta/batch.py
@classmethod\ndef get_output_mode(cls, choice: str, options: Set[Type]) -> Union[BatchOutputMode, StreamingOutputMode]:\n \"\"\"Retrieve an OutputMode by validating `choice` against a set of option OutputModes.\n\n Currently supported output modes can be found in:\n\n - BatchOutputMode\n - StreamingOutputMode\n \"\"\"\n for enum_type in options:\n if choice.upper() in [om.value.upper() for om in enum_type]:\n return getattr(enum_type, choice.upper())\n raise AttributeError(\n f\"\"\"\n Invalid outputMode specified '{choice}'. Allowed values are:\n Batch Mode - {BatchOutputMode.__doc__}\n Streaming Mode - {StreamingOutputMode.__doc__}\n \"\"\"\n )\n
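For example, a user-supplied string can be validated against both enums; note that the import paths below are assumptions based on the module locations in this reference:

```python
from koheesio.spark.writers import StreamingOutputMode  # assumed location
from koheesio.spark.writers.delta import BatchOutputMode, DeltaTableWriter

# 'append' resolves to BatchOutputMode.APPEND; an unknown value raises AttributeError
mode = DeltaTableWriter.get_output_mode("append", {BatchOutputMode, StreamingOutputMode})
```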
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter","title":"koheesio.spark.writers.delta.SCD2DeltaTableWriter","text":"A class used to write Slowly Changing Dimension (SCD) Type 2 data to a Delta table.
Attributes:
Name Type Description table
InstanceOf[DeltaTableStep]
The table to merge to.
merge_key
str
The key used for merging data.
include_columns
List[str]
Columns to be merged. Will be selected from DataFrame. Default is all columns.
exclude_columns
List[str]
Columns to be excluded from DataFrame.
scd2_columns
List[str]
List of attributes for SCD2 type (track changes).
scd2_timestamp_col
Optional[Column]
Timestamp column for SCD2 type (track changes). Default to current_timestamp.
scd1_columns
List[str]
List of attributes for SCD1 type (just update).
meta_scd2_struct_col_name
str
SCD2 struct name.
meta_scd2_effective_time_col_name
str
Effective col name.
meta_scd2_is_current_col_name
str
Current col name.
meta_scd2_end_time_col_name
str
End time col name.
target_auto_generated_columns
List[str]
Auto generated columns from target Delta table. Will be used to exclude from merge logic.
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.exclude_columns","title":"exclude_columns class-attribute
instance-attribute
","text":"exclude_columns: List[str] = Field(default_factory=list, description='Columns to be excluded from DataFrame')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.include_columns","title":"include_columns class-attribute
instance-attribute
","text":"include_columns: List[str] = Field(default_factory=list, description='Columns to be merged. Will be selected from DataFrame.Default is all columns')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.merge_key","title":"merge_key class-attribute
instance-attribute
","text":"merge_key: str = Field(..., description='Merge key')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.meta_scd2_effective_time_col_name","title":"meta_scd2_effective_time_col_name class-attribute
instance-attribute
","text":"meta_scd2_effective_time_col_name: str = Field(default='effective_time', description='Effective col name')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.meta_scd2_end_time_col_name","title":"meta_scd2_end_time_col_name class-attribute
instance-attribute
","text":"meta_scd2_end_time_col_name: str = Field(default='end_time', description='End time col name')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.meta_scd2_is_current_col_name","title":"meta_scd2_is_current_col_name class-attribute
instance-attribute
","text":"meta_scd2_is_current_col_name: str = Field(default='is_current', description='Current col name')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.meta_scd2_struct_col_name","title":"meta_scd2_struct_col_name class-attribute
instance-attribute
","text":"meta_scd2_struct_col_name: str = Field(default='_scd2', description='SCD2 struct name')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.scd1_columns","title":"scd1_columns class-attribute
instance-attribute
","text":"scd1_columns: List[str] = Field(default_factory=list, description='List of attributes for scd1 type (just update)')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.scd2_columns","title":"scd2_columns class-attribute
instance-attribute
","text":"scd2_columns: List[str] = Field(default_factory=list, description='List of attributes for scd2 type (track changes)')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.scd2_timestamp_col","title":"scd2_timestamp_col class-attribute
instance-attribute
","text":"scd2_timestamp_col: Optional[Column] = Field(default=None, description='Timestamp column for SCD2 type (track changes). Default to current_timestamp')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.table","title":"table class-attribute
instance-attribute
","text":"table: InstanceOf[DeltaTableStep] = Field(..., description='The table to merge to')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.target_auto_generated_columns","title":"target_auto_generated_columns class-attribute
instance-attribute
","text":"target_auto_generated_columns: List[str] = Field(default_factory=list, description='Auto generated columns from target Delta table. Will be used to exclude from merge logic')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.execute","title":"execute","text":"execute() -> None\n
Execute the SCD Type 2 operation.
This method executes the SCD Type 2 operation on the DataFrame. It validates the existing Delta table, prepares the merge conditions, stages the data, and then performs the merge operation.
Raises:
Type Description TypeError
If the scd2_timestamp_col is not of date or timestamp type. If the source DataFrame is missing any of the required merge columns.
Source code in src/koheesio/spark/writers/delta/scd.py
def execute(self) -> None:\n \"\"\"\n Execute the SCD Type 2 operation.\n\n This method executes the SCD Type 2 operation on the DataFrame.\n It validates the existing Delta table, prepares the merge conditions, stages the data,\n and then performs the merge operation.\n\n Raises\n ------\n TypeError\n If the scd2_timestamp_col is not of date or timestamp type.\n If the source DataFrame is missing any of the required merge columns.\n\n \"\"\"\n self.df: DataFrame\n self.spark: SparkSession\n delta_table = DeltaTable.forName(sparkSession=self.spark, tableOrViewName=self.table.table_name)\n src_alias, cross_alias, dest_alias = \"src\", \"cross\", \"tgt\"\n\n # Prepare required merge columns\n required_merge_columns = [self.merge_key]\n\n if self.scd2_columns:\n required_merge_columns += self.scd2_columns\n\n if self.scd1_columns:\n required_merge_columns += self.scd1_columns\n\n if not all(c in self.df.columns for c in required_merge_columns):\n missing_columns = [c for c in required_merge_columns if c not in self.df.columns]\n raise TypeError(f\"The source DataFrame is missing the columns: {missing_columns!r}\")\n\n # Check that required columns are present in the source DataFrame\n if self.scd2_timestamp_col is not None:\n timestamp_col_type = self.df.select(self.scd2_timestamp_col).schema.fields[0].dataType\n\n if not isinstance(timestamp_col_type, (DateType, TimestampType)):\n raise TypeError(\n f\"The scd2_timestamp_col '{self.scd2_timestamp_col}' must be of date \"\n f\"or timestamp type.Current type is {timestamp_col_type}\"\n )\n\n # Prepare columns to process\n include_columns = self.include_columns if self.include_columns else self.df.columns\n exclude_columns = self.exclude_columns\n columns_to_process = [c for c in include_columns if c not in exclude_columns]\n\n # Constructing column names for SCD2 attributes\n meta_scd2_is_current_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_is_current_col_name}\"\n meta_scd2_effective_time_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_effective_time_col_name}\"\n meta_scd2_end_time_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_end_time_col_name}\"\n\n # Constructing system merge action logic\n system_merge_action = f\"CASE WHEN tgt.{self.merge_key} is NULL THEN 'I' \"\n\n if updates_attrs_scd2 := self._prepare_attr_clause(\n attrs=self.scd2_columns, src_alias=src_alias, dest_alias=dest_alias\n ):\n system_merge_action += f\" WHEN {updates_attrs_scd2} THEN 'UC' \"\n\n if updates_attrs_scd1 := self._prepare_attr_clause(\n attrs=self.scd1_columns, src_alias=src_alias, dest_alias=dest_alias\n ):\n system_merge_action += f\" WHEN {updates_attrs_scd1} THEN 'U' \"\n\n system_merge_action += \" ELSE NULL END\"\n\n # Prepare the staged DataFrame\n staged = (\n self.df.withColumn(\n \"__meta_scd2_timestamp\",\n self._scd2_timestamp(scd2_timestamp_col=self.scd2_timestamp_col, spark=self.spark),\n )\n .transform(\n func=self._prepare_staging,\n delta_table=delta_table,\n merge_action_logic=F.expr(system_merge_action),\n meta_scd2_is_current_col=meta_scd2_is_current_col,\n columns_to_process=columns_to_process,\n src_alias=src_alias,\n dest_alias=dest_alias,\n cross_alias=cross_alias,\n )\n .transform(\n func=self._preserve_existing_target_values,\n meta_scd2_struct_col_name=self.meta_scd2_struct_col_name,\n target_auto_generated_columns=self.target_auto_generated_columns,\n src_alias=src_alias,\n cross_alias=cross_alias,\n dest_alias=dest_alias,\n logger=self.log,\n )\n .withColumn(\"__meta_scd2_end_time\", 
self._scd2_end_time(meta_scd2_end_time_col=meta_scd2_end_time_col))\n .withColumn(\"__meta_scd2_is_current\", self._scd2_is_current())\n .withColumn(\n \"__meta_scd2_effective_time\",\n self._scd2_effective_time(meta_scd2_effective_time_col=meta_scd2_effective_time_col),\n )\n .transform(\n func=self._add_scd2_columns,\n meta_scd2_struct_col_name=self.meta_scd2_struct_col_name,\n meta_scd2_effective_time_col_name=self.meta_scd2_effective_time_col_name,\n meta_scd2_end_time_col_name=self.meta_scd2_end_time_col_name,\n meta_scd2_is_current_col_name=self.meta_scd2_is_current_col_name,\n )\n )\n\n self._prepare_merge_builder(\n delta_table=delta_table,\n dest_alias=dest_alias,\n staged=staged,\n merge_key=self.merge_key,\n columns_to_process=columns_to_process,\n meta_scd2_effective_time_col=meta_scd2_effective_time_col,\n ).execute()\n
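Finally, a hedged usage sketch for SCD2DeltaTableWriter; the DeltaTableStep import path, table name and column names are assumptions for illustration, and `df` is assumed to carry the merge key plus the tracked columns:

```python
from koheesio.spark.delta import DeltaTableStep  # assumed location
from koheesio.spark.writers.delta import SCD2DeltaTableWriter

writer = SCD2DeltaTableWriter(
    table=DeltaTableStep(table="my_catalog.my_schema.dim_customer"),
    merge_key="customer_id",
    scd2_columns=["address", "segment"],  # changes are tracked as new versions
    scd1_columns=["email"],               # changes are updated in place
)
writer.write(df)
```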
"},{"location":"api_reference/spark/writers/delta/batch.html","title":"Batch","text":"This module defines the DeltaTableWriter class, which is used to write both batch and streaming dataframes to Delta tables.
DeltaTableWriter supports two output modes: MERGEALL
and MERGE
.
- The
MERGEALL
mode merges all incoming data with existing data in the table based on certain conditions. - The
MERGE
mode allows for more custom merging behavior using the DeltaMergeBuilder class from the delta.tables
library.
The output_mode_params
dictionary is used to specify conditions for merging, updating, and inserting data. The target_alias
and source_alias
keys are used to specify the aliases for the target and source dataframes in the merge conditions.
Classes:
Name Description DeltaTableWriter
A class for writing data to Delta tables.
DeltaTableStreamWriter
A class for writing streaming data to Delta tables.
Example DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.MERGEALL,\n output_mode_params={\n \"merge_cond\": \"target.id=source.id\",\n \"update_cond\": \"target.col1_val>=source.col1_val\",\n \"insert_cond\": \"source.col_bk IS NOT NULL\",\n },\n)\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter","title":"koheesio.spark.writers.delta.batch.DeltaTableWriter","text":"Delta table Writer for both batch and streaming dataframes.
Example Parameters:
Name Type Description Default table
Union[DeltaTableStep, str]
The table to write to
required output_mode
Optional[Union[str, BatchOutputMode, StreamingOutputMode]]
The output mode to use. Default is BatchOutputMode.APPEND. For streaming, use StreamingOutputMode.
required params
Optional[dict]
Additional parameters to use for specific mode
required"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter--example-for-mergeall","title":"Example for MERGEALL
","text":"DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.MERGEALL,\n output_mode_params={\n \"merge_cond\": \"target.id=source.id\",\n \"update_cond\": \"target.col1_val>=source.col1_val\",\n \"insert_cond\": \"source.col_bk IS NOT NULL\",\n \"target_alias\": \"target\", # <------ DEFAULT, can be changed by providing custom value\n \"source_alias\": \"source\", # <------ DEFAULT, can be changed by providing custom value\n },\n)\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter--example-for-merge","title":"Example for MERGE
","text":"DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.MERGE,\n output_mode_params={\n 'merge_builder': (\n DeltaTable\n .forName(sparkSession=spark, tableOrViewName=<target_table_name>)\n .alias(target_alias)\n .merge(source=df, condition=merge_cond)\n .whenMatchedUpdateAll(condition=update_cond)\n .whenNotMatchedInsertAll(condition=insert_cond)\n )\n }\n )\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter--example-for-merge_1","title":"Example for MERGE
","text":"in case the table isn't created yet, first run will execute an APPEND operation
DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.MERGE,\n output_mode_params={\n \"merge_builder\": [\n {\n \"clause\": \"whenMatchedUpdate\",\n \"set\": {\"value\": \"source.value\"},\n \"condition\": \"<update_condition>\",\n },\n {\n \"clause\": \"whenNotMatchedInsert\",\n \"values\": {\"id\": \"source.id\", \"value\": \"source.value\"},\n \"condition\": \"<insert_condition>\",\n },\n ],\n \"merge_cond\": \"<merge_condition>\",\n },\n)\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter--example-for-append","title":"Example for APPEND","text":"dataframe writer options can be passed as keyword arguments
DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.APPEND,\n partitionOverwriteMode=\"dynamic\",\n mergeSchema=\"false\",\n)\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.format","title":"format class-attribute
instance-attribute
","text":"format: str = 'delta'\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.output_mode","title":"output_mode class-attribute
instance-attribute
","text":"output_mode: Optional[Union[BatchOutputMode, StreamingOutputMode]] = Field(default=APPEND, alias='outputMode', description=f'{__doc__}\n{__doc__}')\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.params","title":"params class-attribute
instance-attribute
","text":"params: Optional[dict] = Field(default_factory=dict, alias='output_mode_params', description='Additional parameters to use for specific mode')\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.partition_by","title":"partition_by class-attribute
instance-attribute
","text":"partition_by: Optional[List[str]] = Field(default=None, alias='partitionBy', description='The list of fields to partition the Delta table on')\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.table","title":"table class-attribute
instance-attribute
","text":"table: Union[DeltaTableStep, str] = Field(default=..., description='The table to write to')\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.writer","title":"writer property
","text":"writer: Union[DeltaMergeBuilder, DataFrameWriter]\n
The writer to use for this step: a DeltaMergeBuilder when a merge output mode is configured, otherwise a regular Spark DataFrameWriter.
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/writers/delta/batch.py
def execute(self):\n _writer = self.writer\n\n if self.table.create_if_not_exists and not self.table.exists:\n _writer = _writer.options(**self.table.default_create_properties)\n\n if isinstance(_writer, DeltaMergeBuilder):\n _writer.execute()\n else:\n if options := self.params:\n # should we add options only if mode is not merge?\n _writer = _writer.options(**options)\n _writer.saveAsTable(self.table.table_name)\n
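A minimal usage sketch (assuming, as with the other Koheesio writers, that the source DataFrame is passed in via a df field): from koheesio.spark.writers.delta.batch import DeltaTableWriter\n\n# df is an existing Spark DataFrame; output_mode defaults to APPEND\nDeltaTableWriter(df=df, table=\"test_table\").execute()\n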
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.get_output_mode","title":"get_output_mode classmethod
","text":"get_output_mode(choice: str, options: Set[Type]) -> Union[BatchOutputMode, StreamingOutputMode]\n
Retrieve an OutputMode by validating choice
against a set of option OutputModes.
Currently supported output modes can be found in:
- BatchOutputMode
- StreamingOutputMode
Source code in src/koheesio/spark/writers/delta/batch.py
@classmethod\ndef get_output_mode(cls, choice: str, options: Set[Type]) -> Union[BatchOutputMode, StreamingOutputMode]:\n \"\"\"Retrieve an OutputMode by validating `choice` against a set of option OutputModes.\n\n Currently supported output modes can be found in:\n\n - BatchOutputMode\n - StreamingOutputMode\n \"\"\"\n for enum_type in options:\n if choice.upper() in [om.value.upper() for om in enum_type]:\n return getattr(enum_type, choice.upper())\n raise AttributeError(\n f\"\"\"\n Invalid outputMode specified '{choice}'. Allowed values are:\n Batch Mode - {BatchOutputMode.__doc__}\n Streaming Mode - {StreamingOutputMode.__doc__}\n \"\"\"\n )\n
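For illustration, a hedged sketch of calling this classmethod directly (assuming BatchOutputMode has an APPEND member, as used elsewhere on this page): mode = DeltaTableWriter.get_output_mode(\"append\", options={BatchOutputMode, StreamingOutputMode})\n# mode is now BatchOutputMode.APPEND; an unknown choice raises AttributeError\n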
"},{"location":"api_reference/spark/writers/delta/scd.html","title":"Scd","text":"This module defines writers to write Slowly Changing Dimension (SCD) Type 2 data to a Delta table.
Slowly Changing Dimension (SCD) is a technique used in data warehousing to handle changes to dimension data over time. SCD Type 2 is one of the most common types of SCD, where historical changes are tracked by creating new records for each change.
Koheesio is a powerful data processing framework that provides advanced capabilities for working with Delta tables in Apache Spark. It offers a convenient and efficient way to handle SCD Type 2 operations on Delta tables.
To learn more about Slowly Changing Dimension and SCD Type 2, you can refer to the following resources: - Slowly Changing Dimension (SCD) - Wikipedia
By using Koheesio, you can benefit from its efficient merge logic, support for SCD Type 2 and SCD Type 1 attributes, and seamless integration with Delta tables in Spark.
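As a rough, illustrative sketch of the SCD2 writer documented below (column names are placeholders; the DeltaTableStep construction and the df input are assumptions based on the attributes listed further down): from koheesio.spark.writers.delta.scd import SCD2DeltaTableWriter\n\nSCD2DeltaTableWriter(\n    table=DeltaTableStep(table=\"customer_dim\"),  # assumed DeltaTableStep wiring\n    merge_key=\"customer_id\",\n    scd2_columns=[\"address\", \"email\"],  # tracked: changes create a new history record\n    scd1_columns=[\"phone\"],  # not tracked: changes simply overwrite the value\n    df=source_df,  # source DataFrame (assumed inherited writer input)\n).execute()\n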
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter","title":"koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter","text":"A class used to write Slowly Changing Dimension (SCD) Type 2 data to a Delta table.
Attributes:
Name Type Description table
InstanceOf[DeltaTableStep]
The table to merge to.
merge_key
str
The key used for merging data.
include_columns
List[str]
Columns to be merged. Will be selected from DataFrame. Default is all columns.
exclude_columns
List[str]
Columns to be excluded from DataFrame.
scd2_columns
List[str]
List of attributes for SCD2 type (track changes).
scd2_timestamp_col
Optional[Column]
Timestamp column for SCD2 type (track changes). Default to current_timestamp.
scd1_columns
List[str]
List of attributes for SCD1 type (just update).
meta_scd2_struct_col_name
str
SCD2 struct name.
meta_scd2_effective_time_col_name
str
Effective col name.
meta_scd2_is_current_col_name
str
Current col name.
meta_scd2_end_time_col_name
str
End time col name.
target_auto_generated_columns
List[str]
Auto generated columns from target Delta table. Will be used to exclude from merge logic.
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.exclude_columns","title":"exclude_columns class-attribute
instance-attribute
","text":"exclude_columns: List[str] = Field(default_factory=list, description='Columns to be excluded from DataFrame')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.include_columns","title":"include_columns class-attribute
instance-attribute
","text":"include_columns: List[str] = Field(default_factory=list, description='Columns to be merged. Will be selected from DataFrame.Default is all columns')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.merge_key","title":"merge_key class-attribute
instance-attribute
","text":"merge_key: str = Field(..., description='Merge key')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.meta_scd2_effective_time_col_name","title":"meta_scd2_effective_time_col_name class-attribute
instance-attribute
","text":"meta_scd2_effective_time_col_name: str = Field(default='effective_time', description='Effective col name')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.meta_scd2_end_time_col_name","title":"meta_scd2_end_time_col_name class-attribute
instance-attribute
","text":"meta_scd2_end_time_col_name: str = Field(default='end_time', description='End time col name')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.meta_scd2_is_current_col_name","title":"meta_scd2_is_current_col_name class-attribute
instance-attribute
","text":"meta_scd2_is_current_col_name: str = Field(default='is_current', description='Current col name')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.meta_scd2_struct_col_name","title":"meta_scd2_struct_col_name class-attribute
instance-attribute
","text":"meta_scd2_struct_col_name: str = Field(default='_scd2', description='SCD2 struct name')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.scd1_columns","title":"scd1_columns class-attribute
instance-attribute
","text":"scd1_columns: List[str] = Field(default_factory=list, description='List of attributes for scd1 type (just update)')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.scd2_columns","title":"scd2_columns class-attribute
instance-attribute
","text":"scd2_columns: List[str] = Field(default_factory=list, description='List of attributes for scd2 type (track changes)')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.scd2_timestamp_col","title":"scd2_timestamp_col class-attribute
instance-attribute
","text":"scd2_timestamp_col: Optional[Column] = Field(default=None, description='Timestamp column for SCD2 type (track changes). Default to current_timestamp')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.table","title":"table class-attribute
instance-attribute
","text":"table: InstanceOf[DeltaTableStep] = Field(..., description='The table to merge to')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.target_auto_generated_columns","title":"target_auto_generated_columns class-attribute
instance-attribute
","text":"target_auto_generated_columns: List[str] = Field(default_factory=list, description='Auto generated columns from target Delta table. Will be used to exclude from merge logic')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.execute","title":"execute","text":"execute() -> None\n
Execute the SCD Type 2 operation.
This method executes the SCD Type 2 operation on the DataFrame. It validates the existing Delta table, prepares the merge conditions, stages the data, and then performs the merge operation.
Raises:
Type Description TypeError
If the scd2_timestamp_col is not of date or timestamp type. If the source DataFrame is missing any of the required merge columns.
Source code in src/koheesio/spark/writers/delta/scd.py
def execute(self) -> None:\n \"\"\"\n Execute the SCD Type 2 operation.\n\n This method executes the SCD Type 2 operation on the DataFrame.\n It validates the existing Delta table, prepares the merge conditions, stages the data,\n and then performs the merge operation.\n\n Raises\n ------\n TypeError\n If the scd2_timestamp_col is not of date or timestamp type.\n If the source DataFrame is missing any of the required merge columns.\n\n \"\"\"\n self.df: DataFrame\n self.spark: SparkSession\n delta_table = DeltaTable.forName(sparkSession=self.spark, tableOrViewName=self.table.table_name)\n src_alias, cross_alias, dest_alias = \"src\", \"cross\", \"tgt\"\n\n # Prepare required merge columns\n required_merge_columns = [self.merge_key]\n\n if self.scd2_columns:\n required_merge_columns += self.scd2_columns\n\n if self.scd1_columns:\n required_merge_columns += self.scd1_columns\n\n if not all(c in self.df.columns for c in required_merge_columns):\n missing_columns = [c for c in required_merge_columns if c not in self.df.columns]\n raise TypeError(f\"The source DataFrame is missing the columns: {missing_columns!r}\")\n\n # Check that required columns are present in the source DataFrame\n if self.scd2_timestamp_col is not None:\n timestamp_col_type = self.df.select(self.scd2_timestamp_col).schema.fields[0].dataType\n\n if not isinstance(timestamp_col_type, (DateType, TimestampType)):\n raise TypeError(\n f\"The scd2_timestamp_col '{self.scd2_timestamp_col}' must be of date \"\n f\"or timestamp type.Current type is {timestamp_col_type}\"\n )\n\n # Prepare columns to process\n include_columns = self.include_columns if self.include_columns else self.df.columns\n exclude_columns = self.exclude_columns\n columns_to_process = [c for c in include_columns if c not in exclude_columns]\n\n # Constructing column names for SCD2 attributes\n meta_scd2_is_current_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_is_current_col_name}\"\n meta_scd2_effective_time_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_effective_time_col_name}\"\n meta_scd2_end_time_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_end_time_col_name}\"\n\n # Constructing system merge action logic\n system_merge_action = f\"CASE WHEN tgt.{self.merge_key} is NULL THEN 'I' \"\n\n if updates_attrs_scd2 := self._prepare_attr_clause(\n attrs=self.scd2_columns, src_alias=src_alias, dest_alias=dest_alias\n ):\n system_merge_action += f\" WHEN {updates_attrs_scd2} THEN 'UC' \"\n\n if updates_attrs_scd1 := self._prepare_attr_clause(\n attrs=self.scd1_columns, src_alias=src_alias, dest_alias=dest_alias\n ):\n system_merge_action += f\" WHEN {updates_attrs_scd1} THEN 'U' \"\n\n system_merge_action += \" ELSE NULL END\"\n\n # Prepare the staged DataFrame\n staged = (\n self.df.withColumn(\n \"__meta_scd2_timestamp\",\n self._scd2_timestamp(scd2_timestamp_col=self.scd2_timestamp_col, spark=self.spark),\n )\n .transform(\n func=self._prepare_staging,\n delta_table=delta_table,\n merge_action_logic=F.expr(system_merge_action),\n meta_scd2_is_current_col=meta_scd2_is_current_col,\n columns_to_process=columns_to_process,\n src_alias=src_alias,\n dest_alias=dest_alias,\n cross_alias=cross_alias,\n )\n .transform(\n func=self._preserve_existing_target_values,\n meta_scd2_struct_col_name=self.meta_scd2_struct_col_name,\n target_auto_generated_columns=self.target_auto_generated_columns,\n src_alias=src_alias,\n cross_alias=cross_alias,\n dest_alias=dest_alias,\n logger=self.log,\n )\n .withColumn(\"__meta_scd2_end_time\", 
self._scd2_end_time(meta_scd2_end_time_col=meta_scd2_end_time_col))\n .withColumn(\"__meta_scd2_is_current\", self._scd2_is_current())\n .withColumn(\n \"__meta_scd2_effective_time\",\n self._scd2_effective_time(meta_scd2_effective_time_col=meta_scd2_effective_time_col),\n )\n .transform(\n func=self._add_scd2_columns,\n meta_scd2_struct_col_name=self.meta_scd2_struct_col_name,\n meta_scd2_effective_time_col_name=self.meta_scd2_effective_time_col_name,\n meta_scd2_end_time_col_name=self.meta_scd2_end_time_col_name,\n meta_scd2_is_current_col_name=self.meta_scd2_is_current_col_name,\n )\n )\n\n self._prepare_merge_builder(\n delta_table=delta_table,\n dest_alias=dest_alias,\n staged=staged,\n merge_key=self.merge_key,\n columns_to_process=columns_to_process,\n meta_scd2_effective_time_col=meta_scd2_effective_time_col,\n ).execute()\n
"},{"location":"api_reference/spark/writers/delta/stream.html","title":"Stream","text":"This module defines the DeltaTableStreamWriter class, which is used to write streaming dataframes to Delta tables.
"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter","title":"koheesio.spark.writers.delta.stream.DeltaTableStreamWriter","text":"Delta table stream writer
"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter.Options","title":"Options","text":"Options for DeltaTableStreamWriter
"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter.Options.allow_population_by_field_name","title":"allow_population_by_field_name class-attribute
instance-attribute
","text":"allow_population_by_field_name: bool = Field(default=True, description=' To do convert to Field and pass as .options(**config)')\n
"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter.Options.maxBytesPerTrigger","title":"maxBytesPerTrigger class-attribute
instance-attribute
","text":"maxBytesPerTrigger: Optional[str] = Field(default=None, description='How much data to be processed per trigger. The default is 1GB')\n
"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter.Options.maxFilesPerTrigger","title":"maxFilesPerTrigger class-attribute
instance-attribute
","text":"maxFilesPerTrigger: int = Field(default == 1000, description='The maximum number of new files to be considered in every trigger (default: 1000).')\n
"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/writers/delta/stream.py
def execute(self):\n if self.batch_function:\n self.streaming_query = self.writer.start()\n else:\n self.streaming_query = self.writer.toTable(tableName=self.table.table_name)\n
"},{"location":"api_reference/spark/writers/delta/utils.html","title":"Utils","text":"This module provides utility functions while working with delta framework.
"},{"location":"api_reference/spark/writers/delta/utils.html#koheesio.spark.writers.delta.utils.log_clauses","title":"koheesio.spark.writers.delta.utils.log_clauses","text":"log_clauses(clauses: JavaObject, source_alias: str, target_alias: str) -> Optional[str]\n
Prepare log message for clauses of DeltaMergePlan statement.
Parameters:
Name Type Description Default clauses
JavaObject
The clauses of the DeltaMergePlan statement.
required source_alias
str
The source alias.
required target_alias
str
The target alias.
required Returns:
Type Description Optional[str]
The log message if there are clauses, otherwise None.
Notes This function prepares a log message for the clauses of a DeltaMergePlan statement. It iterates over the clauses, processes the conditions, and constructs the log message based on the clause type and columns.
If the condition is a value, it replaces the source and target aliases in the condition string. If the condition is None, it sets the condition_clause to \"No conditions required\".
The log message includes the clauses type, the clause type, the columns, and the condition.
Source code in src/koheesio/spark/writers/delta/utils.py
def log_clauses(clauses: JavaObject, source_alias: str, target_alias: str) -> Optional[str]:\n \"\"\"\n Prepare log message for clauses of DeltaMergePlan statement.\n\n Parameters\n ----------\n clauses : JavaObject\n The clauses of the DeltaMergePlan statement.\n source_alias : str\n The source alias.\n target_alias : str\n The target alias.\n\n Returns\n -------\n Optional[str]\n The log message if there are clauses, otherwise None.\n\n Notes\n -----\n This function prepares a log message for the clauses of a DeltaMergePlan statement. It iterates over the clauses,\n processes the conditions, and constructs the log message based on the clause type and columns.\n\n If the condition is a value, it replaces the source and target aliases in the condition string. If the condition is\n None, it sets the condition_clause to \"No conditions required\".\n\n The log message includes the clauses type, the clause type, the columns, and the condition.\n \"\"\"\n log_message = None\n\n if not clauses.isEmpty():\n clauses_type = clauses.last().nodeName().replace(\"DeltaMergeInto\", \"\")\n _processed_clauses = {}\n\n for i in range(0, clauses.length()):\n clause = clauses.apply(i)\n condition = clause.condition()\n\n if \"value\" in dir(condition):\n condition_clause = (\n condition.value()\n .toString()\n .replace(f\"'{source_alias}\", source_alias)\n .replace(f\"'{target_alias}\", target_alias)\n )\n elif condition.toString() == \"None\":\n condition_clause = \"No conditions required\"\n\n clause_type: str = clause.clauseType().capitalize()\n columns = \"ALL\" if clause_type == \"Delete\" else clause.actions().toList().apply(0).toString()\n\n if clause_type.lower() not in _processed_clauses:\n _processed_clauses[clause_type.lower()] = []\n\n log_message = (\n f\"{clauses_type} will perform action:{clause_type} columns ({columns}) if `{condition_clause}`\"\n )\n\n return log_message\n
"},{"location":"api_reference/sso/index.html","title":"Sso","text":""},{"location":"api_reference/sso/okta.html","title":"Okta","text":"This module contains Okta integration steps.
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.LoggerOktaTokenFilter","title":"koheesio.sso.okta.LoggerOktaTokenFilter","text":"LoggerOktaTokenFilter(okta_object: OktaAccessToken, name: str = 'OktaToken')\n
Filter which hides token value from log.
Source code in src/koheesio/sso/okta.py
def __init__(self, okta_object: OktaAccessToken, name: str = \"OktaToken\"):\n self.__okta_object = okta_object\n super().__init__(name=name)\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.LoggerOktaTokenFilter.filter","title":"filter","text":"filter(record)\n
Source code in src/koheesio/sso/okta.py
def filter(self, record):\n # noinspection PyUnresolvedReferences\n if token := self.__okta_object.output.token:\n token_value = token.get_secret_value()\n record.msg = record.msg.replace(token_value, \"<SECRET_TOKEN>\")\n\n return True\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.Okta","title":"koheesio.sso.okta.Okta","text":"Base Okta class
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.Okta.client_id","title":"client_id class-attribute
instance-attribute
","text":"client_id: str = Field(default=..., alias='okta_id', description='Okta account ID')\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.Okta.client_secret","title":"client_secret class-attribute
instance-attribute
","text":"client_secret: SecretStr = Field(default=..., alias='okta_secret', description='Okta account secret', repr=False)\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.Okta.data","title":"data class-attribute
instance-attribute
","text":"data: Optional[Union[Dict[str, str], str]] = Field(default={'grant_type': 'client_credentials'}, description='Data to be sent along with the token request')\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.OktaAccessToken","title":"koheesio.sso.okta.OktaAccessToken","text":"OktaAccessToken(**kwargs)\n
Get Okta authorization token
Example:
token = (\n OktaAccessToken(\n url=\"https://org.okta.com\",\n client_id=\"client\",\n client_secret=SecretStr(\"secret\"),\n params={\n \"p1\": \"foo\",\n \"p2\": \"bar\",\n },\n )\n .execute()\n .token\n)\n
Source code in src/koheesio/sso/okta.py
def __init__(self, **kwargs):\n _logger = LoggingFactory.get_logger(name=self.__class__.__name__, inherit_from_koheesio=True)\n logger_filter = LoggerOktaTokenFilter(okta_object=self)\n _logger.addFilter(logger_filter)\n super().__init__(**kwargs)\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.OktaAccessToken.Output","title":"Output","text":"Output class for OktaAccessToken.
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.OktaAccessToken.Output.token","title":"token class-attribute
instance-attribute
","text":"token: Optional[SecretStr] = Field(default=None, description='Okta authentication token')\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.OktaAccessToken.execute","title":"execute","text":"execute()\n
Execute an HTTP Post call to Okta service and retrieve the access token.
Source code in src/koheesio/sso/okta.py
def execute(self):\n \"\"\"\n Execute an HTTP Post call to Okta service and retrieve the access token.\n \"\"\"\n HttpPostStep.execute(self)\n\n # noinspection PyUnresolvedReferences\n status_code = self.output.status_code\n # noinspection PyUnresolvedReferences\n raw_payload = self.output.raw_payload\n\n if status_code != 200:\n raise HTTPError(f\"Request failed with '{status_code}' code. Payload: {raw_payload}\")\n\n # noinspection PyUnresolvedReferences\n json_payload = self.output.json_payload\n\n if token := json_payload.get(\"access_token\"):\n self.output.token = SecretStr(token)\n else:\n raise ValueError(f\"No 'access_token' found in the Okta response: {json_payload}\")\n
"},{"location":"api_reference/steps/index.html","title":"Steps","text":"Steps Module
This module contains the definition of the Step
class, which serves as the base class for custom units of logic that can be executed. It also includes the StepOutput
class, which defines the output data model for a Step
.
The Step
class is designed to be subclassed for creating new steps in a data pipeline. Each subclass should implement the execute
method, specifying the expected inputs and outputs.
This module also exports the SparkStep
class for steps that interact with Spark
Classes: - Step: Base class for a custom unit of logic that can be executed.
- StepOutput: Defines the output data model for a
Step
.
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step","title":"koheesio.steps.Step","text":"Base class for a step
A custom unit of logic that can be executed.
The Step class is designed to be subclassed. To create a new step, one would subclass Step and implement the def execute(self)
method, specifying the expected inputs and outputs.
Note: since the Step class is meta classed, the execute method is wrapped with the do_execute
function making it always return the Step's output. Hence, an explicit return is not needed when implementing execute.
Methods and Attributes The Step class has several attributes and methods.
Background A Step is an atomic operation and serves as the building block of data pipelines built with the framework. Tasks typically consist of a series of Steps.
A step can be seen as an operation on a set of inputs, that returns a set of outputs. This however does not imply that steps are stateless (e.g. data writes)!
The diagram serves to illustrate the concept of a Step:
\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 1 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 1 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u2502 \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 2 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Step \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 2 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u2502 \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 3 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 3 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n
Steps are built on top of Pydantic, a data validation and settings management library that uses Python type annotations. This allows for the automatic validation of the inputs and outputs of a Step.
- Step inherits from BaseModel, which is a Pydantic class used to define data models. This allows Step to automatically validate data against the defined fields and their types.
- Step is metaclassed by StepMetaClass, which is a custom metaclass that wraps the
execute
method of the Step class with the _execute_wrapper
function. This ensures that the execute
method always returns the output of the Step along with providing logging and validation of the output. - Step has an
Output
class, which is a subclass of StepOutput
. This class is used to validate the output of the Step. The Output
class is defined as an inner class of the Step class. The Output
class can be accessed through the Step.Output
attribute. - The
Output
class can be extended to add additional fields to the output of the Step.
Examples:
class MyStep(Step):\n a: str # input\n\n class Output(StepOutput): # output\n b: str\n\n def execute(self) -> MyStep.Output:\n self.output.b = f\"{self.a}-some-suffix\"\n
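Running such a step and reading its output could then look like this (a small sketch based on the fields defined above): step = MyStep(a=\"foo\")\nstep.execute()  # or step.run()\nprint(step.output.b)  # foo-some-suffix\n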
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step--input","title":"INPUT","text":"The following fields are available by default on the Step class: - name
: Name of the Step. If not set, the name of the class will be used. - description
: Description of the Step. If not set, the docstring of the class will be used. If the docstring contains multiple lines, only the first line will be used.
When subclassing a Step, any additional pydantic field will be treated as input
to the Step. See also the explanation on the .execute()
method below.
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step--output","title":"OUTPUT","text":"Every Step has an Output
class, which is a subclass of StepOutput
. This class is used to validate the output of the Step. The Output
class is defined as an inner class of the Step class. The Output
class can be accessed through the Step.Output
attribute. The Output
class can be extended to add additional fields to the output of the Step. See also the explanation on the .execute()
.
Output
: A nested class representing the output of the Step used to validate the output of the Step and based on the StepOutput class. output
: Allows you to interact with the Output of the Step lazily (see above and StepOutput)
When subclassing a Step, any additional pydantic field added to the nested Output
class will be treated as output
of the Step. See also the description of StepOutput
for more information.
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step--methods","title":"Methods:","text":" execute
: Abstract method to implement for new steps. - The Inputs of the step can be accessed, using
self.input_name
. - The output of the step can be accessed, using
self.output.output_name
.
run
: Alias to .execute() method. You can use this to run the step, but execute is preferred. to_yaml
: YAML dump the step get_description
: Get the description of the Step
When subclassing a Step, execute
is the only method that needs to be implemented. Any additional method added to the class will be treated as a method of the Step.
Note: since the Step class is meta-classed, the execute method is automatically wrapped with the do_execute
function making it always return a StepOutput. See also the explanation on the do_execute
function.
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step--class-methods","title":"class methods:","text":" from_step
: Returns a new Step instance based on the data of another Step instance. for example: MyStep.from_step(other_step, a=\"foo\")
get_description
: Get the description of the Step
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step--dunder-methods","title":"dunder methods:","text":" __getattr__
: Allows input to be accessed through self.input_name
__repr__
and __str__
: String representation of a step
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.output","title":"output property
writable
","text":"output: Output\n
Interact with the output of the Step
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.Output","title":"Output","text":"Output class for Step
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.execute","title":"execute abstractmethod
","text":"execute()\n
Abstract method to implement for new steps.
The Inputs of the step can be accessed, using self.input_name
Note: since the Step class is meta-classed, the execute method is wrapped with the do_execute
function making it always return the Steps output
Source code in src/koheesio/steps/__init__.py
@abstractmethod\ndef execute(self):\n \"\"\"Abstract method to implement for new steps.\n\n The Inputs of the step can be accessed, using `self.input_name`\n\n Note: since the Step class is meta-classed, the execute method is wrapped with the `do_execute` function making\n it always return the Steps output\n \"\"\"\n raise NotImplementedError\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.from_step","title":"from_step classmethod
","text":"from_step(step: Step, **kwargs)\n
Returns a new Step instance based on the data of another Step or BaseModel instance
Source code in src/koheesio/steps/__init__.py
@classmethod\ndef from_step(cls, step: Step, **kwargs):\n \"\"\"Returns a new Step instance based on the data of another Step or BaseModel instance\"\"\"\n return cls.from_basemodel(step, **kwargs)\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.repr_json","title":"repr_json","text":"repr_json(simple=False) -> str\n
dump the step to json, meant for representation
Note: use to_json if you want to dump the step to json for serialization This method is meant for representation purposes only!
Examples:
>>> step = MyStep(a=\"foo\")\n>>> print(step.repr_json())\n{\"input\": {\"a\": \"foo\"}}\n
Parameters:
Name Type Description Default simple
When toggled to True, a briefer output will be produced. This is friendlier for logging purposes
False
Returns:
Type Description str
A string, which is valid json
Source code in src/koheesio/steps/__init__.py
def repr_json(self, simple=False) -> str:\n \"\"\"dump the step to json, meant for representation\n\n Note: use to_json if you want to dump the step to json for serialization\n This method is meant for representation purposes only!\n\n Examples\n --------\n ```python\n >>> step = MyStep(a=\"foo\")\n >>> print(step.repr_json())\n {\"input\": {\"a\": \"foo\"}}\n ```\n\n Parameters\n ----------\n simple: bool\n When toggled to True, a briefer output will be produced. This is friendlier for logging purposes\n\n Returns\n -------\n str\n A string, which is valid json\n \"\"\"\n model_dump_options = dict(warnings=\"none\", exclude_unset=True)\n\n _result = {}\n\n # extract input\n _input = self.model_dump(**model_dump_options)\n\n # remove name and description from input and add to result if simple is not set\n name = _input.pop(\"name\", None)\n description = _input.pop(\"description\", None)\n if not simple:\n if name:\n _result[\"name\"] = name\n if description:\n _result[\"description\"] = description\n else:\n model_dump_options[\"exclude\"] = {\"name\", \"description\"}\n\n # extract output\n _output = self.output.model_dump(**model_dump_options)\n\n # add output to result\n if _output:\n _result[\"output\"] = _output\n\n # add input to result\n _result[\"input\"] = _input\n\n class MyEncoder(json.JSONEncoder):\n \"\"\"Custom JSON Encoder to handle non-serializable types\"\"\"\n\n def default(self, o: Any) -> Any:\n try:\n return super().default(o)\n except TypeError:\n return o.__class__.__name__\n\n # Use MyEncoder when converting the dictionary to a JSON string\n json_str = json.dumps(_result, cls=MyEncoder)\n\n return json_str\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.repr_yaml","title":"repr_yaml","text":"repr_yaml(simple=False) -> str\n
dump the step to yaml, meant for representation
Note: use to_yaml if you want to dump the step to yaml for serialization This method is meant for representation purposes only!
Examples:
>>> step = MyStep(a=\"foo\")\n>>> print(step.repr_yaml())\ninput:\n a: foo\n
Parameters:
Name Type Description Default simple
When toggled to True, a briefer output will be produced. This is friendlier for logging purposes
False
Returns:
Type Description str
A string, which is valid yaml
Source code in src/koheesio/steps/__init__.py
def repr_yaml(self, simple=False) -> str:\n \"\"\"dump the step to yaml, meant for representation\n\n Note: use to_yaml if you want to dump the step to yaml for serialization\n This method is meant for representation purposes only!\n\n Examples\n --------\n ```python\n >>> step = MyStep(a=\"foo\")\n >>> print(step.repr_yaml())\n input:\n a: foo\n ```\n\n Parameters\n ----------\n simple: bool\n When toggled to True, a briefer output will be produced. This is friendlier for logging purposes\n\n Returns\n -------\n str\n A string, which is valid yaml\n \"\"\"\n json_str = self.repr_json(simple=simple)\n\n # Parse the JSON string back into a dictionary\n _result = json.loads(json_str)\n\n return yaml.dump(_result)\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.run","title":"run","text":"run()\n
Alias to .execute()
Source code in src/koheesio/steps/__init__.py
def run(self):\n \"\"\"Alias to .execute()\"\"\"\n return self.execute()\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.StepMetaClass","title":"koheesio.steps.StepMetaClass","text":"StepMetaClass has to be set up as a Metaclass extending ModelMetaclass to allow Pydantic to be unaffected while allowing for the execute method to be auto-decorated with do_execute
"},{"location":"api_reference/steps/index.html#koheesio.steps.StepOutput","title":"koheesio.steps.StepOutput","text":"Class for the StepOutput model
Usage Setting up the StepOutputs class is done like this:
class YourOwnOutput(StepOutput):\n a: str\n b: int\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.StepOutput.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(validate_default=False, defer_build=True)\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.StepOutput.validate_output","title":"validate_output","text":"validate_output() -> StepOutput\n
Validate the output of the Step
Essentially, this method is a wrapper around the validate method of the BaseModel class
Source code in src/koheesio/steps/__init__.py
def validate_output(self) -> StepOutput:\n \"\"\"Validate the output of the Step\n\n Essentially, this method is a wrapper around the validate method of the BaseModel class\n \"\"\"\n validated_model = self.validate()\n return StepOutput.from_basemodel(validated_model)\n
"},{"location":"api_reference/steps/dummy.html","title":"Dummy","text":"Dummy step for testing purposes.
This module contains a dummy step for testing purposes. It is used to test the Koheesio framework or to provide a simple example of how to create a new step.
Example s = DummyStep(a=\"a\", b=2)\ns.execute()\n
In this case, s.output
will be equivalent to the following dictionary: {\"a\": \"a\", \"b\": 2, \"c\": \"aa\"}\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyOutput","title":"koheesio.steps.dummy.DummyOutput","text":"Dummy output for testing purposes.
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyOutput.a","title":"a instance-attribute
","text":"a: str\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyOutput.b","title":"b instance-attribute
","text":"b: int\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep","title":"koheesio.steps.dummy.DummyStep","text":"Dummy step for testing purposes.
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep.a","title":"a instance-attribute
","text":"a: str\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep.b","title":"b instance-attribute
","text":"b: int\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep.Output","title":"Output","text":"Dummy output for testing purposes.
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep.Output.c","title":"c instance-attribute
","text":"c: str\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep.execute","title":"execute","text":"execute()\n
Dummy execute for testing purposes.
Source code in src/koheesio/steps/dummy.py
def execute(self):\n \"\"\"Dummy execute for testing purposes.\"\"\"\n self.output.a = self.a\n self.output.b = self.b\n self.output.c = self.a * self.b\n
"},{"location":"api_reference/steps/http.html","title":"Http","text":"This module contains a few simple HTTP Steps that can be used to perform API Calls to HTTP endpoints
Example from koheesio.steps.http import HttpGetStep\n\nresponse = HttpGetStep(url=\"https://google.com\").execute().json_payload\n
In the above example, the response
variable will contain the JSON response from the HTTP request.
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpDeleteStep","title":"koheesio.steps.http.HttpDeleteStep","text":"send DELETE requests
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpDeleteStep.method","title":"method class-attribute
instance-attribute
","text":"method: HttpMethod = DELETE\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpGetStep","title":"koheesio.steps.http.HttpGetStep","text":"send GET requests
Example response = HttpGetStep(url=\"https://google.com\").execute().json_payload\n
In the above example, the response
variable will contain the JSON response from the HTTP request."},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpGetStep.method","title":"method class-attribute
instance-attribute
","text":"method: HttpMethod = GET\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod","title":"koheesio.steps.http.HttpMethod","text":"Enumeration of allowed http methods
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod.DELETE","title":"DELETE class-attribute
instance-attribute
","text":"DELETE = 'delete'\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod.GET","title":"GET class-attribute
instance-attribute
","text":"GET = 'get'\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod.POST","title":"POST class-attribute
instance-attribute
","text":"POST = 'post'\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod.PUT","title":"PUT class-attribute
instance-attribute
","text":"PUT = 'put'\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod.from_string","title":"from_string classmethod
","text":"from_string(value: str)\n
Allows for getting the right Method Enum by simply passing a string value This method is not case-sensitive
Source code in src/koheesio/steps/http.py
@classmethod\ndef from_string(cls, value: str):\n \"\"\"Allows for getting the right Method Enum by simply passing a string value\n This method is not case-sensitive\n \"\"\"\n return getattr(cls, value.upper())\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpPostStep","title":"koheesio.steps.http.HttpPostStep","text":"send POST requests
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpPostStep.method","title":"method class-attribute
instance-attribute
","text":"method: HttpMethod = POST\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpPutStep","title":"koheesio.steps.http.HttpPutStep","text":"send PUT requests
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpPutStep.method","title":"method class-attribute
instance-attribute
","text":"method: HttpMethod = PUT\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep","title":"koheesio.steps.http.HttpStep","text":"Can be used to perform API Calls to HTTP endpoints
Understanding Retries This class includes a built-in retry mechanism for handling temporary issues, such as network errors or server downtime, that might cause the HTTP request to fail. The retry mechanism is controlled by three parameters: max_retries
, initial_delay
, and backoff
.
-
max_retries
determines the number of retries after the initial request. For example, if max_retries
is set to 4, the request will be attempted a total of 5 times (1 initial attempt + 4 retries). If max_retries
is set to 0, no retries will be attempted, and the request will be tried only once.
-
initial_delay
sets the waiting period before the first retry. If initial_delay
is set to 3, the delay before the first retry will be 3 seconds. Changing the initial_delay
value directly affects the amount of delay before each retry.
-
backoff
controls the rate at which the delay increases for each subsequent retry. If backoff
is set to 2 (the default), the delay will double with each retry. If backoff
is set to 1, the delay between retries will remain constant. Changing the backoff
value affects how quickly the delay increases.
Given the default values of max_retries=3
, initial_delay=2
, and backoff=2
, the delays between retries would be 2 seconds, 4 seconds, and 8 seconds, respectively. This results in a total delay of 14 seconds before all retries are exhausted.
For example, if you set initial_delay=3
and backoff=2
, the delays before the retries would be 3 seconds
, 6 seconds
, and 12 seconds
. If you set initial_delay=2
and backoff=3
, the delays before the retries would be 2 seconds
, 6 seconds
, and 18 seconds
. If you set initial_delay=2
and backoff=1
, the delays before the retries would be 2 seconds
, 2 seconds
, and 2 seconds
.
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.data","title":"data class-attribute
instance-attribute
","text":"data: Optional[Union[Dict[str, str], str]] = Field(default_factory=dict, description='[Optional] Data to be sent along with the request', alias='body')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.headers","title":"headers class-attribute
instance-attribute
","text":"headers: Optional[Dict[str, Union[str, SecretStr]]] = Field(default_factory=dict, description='Request headers', alias='header')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.method","title":"method class-attribute
instance-attribute
","text":"method: Union[str, HttpMethod] = Field(default=GET, description=\"What type of Http call to perform. One of 'get', 'post', 'put', 'delete'. Defaults to 'get'.\")\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.params","title":"params class-attribute
instance-attribute
","text":"params: Optional[Dict[str, Any]] = Field(default_factory=dict, description='[Optional] Set of extra parameters that should be passed to HTTP request')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.session","title":"session class-attribute
instance-attribute
","text":"session: Session = Field(default_factory=Session, description='Requests session object to be used for making HTTP requests', exclude=True, repr=False)\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.timeout","title":"timeout class-attribute
instance-attribute
","text":"timeout: Optional[int] = Field(default=3, description='[Optional] Request timeout')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.url","title":"url class-attribute
instance-attribute
","text":"url: str = Field(default=..., description='API endpoint URL', alias='uri')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output","title":"Output","text":"Output class for HttpStep
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output.json_payload","title":"json_payload property
","text":"json_payload\n
Alias for response_json
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output.raw_payload","title":"raw_payload class-attribute
instance-attribute
","text":"raw_payload: Optional[str] = Field(default=None, alias='response_text', description='The raw response for the request')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output.response_json","title":"response_json class-attribute
instance-attribute
","text":"response_json: Optional[Union[Dict, List]] = Field(default=None, alias='json_payload', description='The JSON response for the request')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output.response_raw","title":"response_raw class-attribute
instance-attribute
","text":"response_raw: Optional[Response] = Field(default=None, alias='response', description='The raw requests.Response object returned by the appropriate requests.request() call')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output.status_code","title":"status_code class-attribute
instance-attribute
","text":"status_code: Optional[int] = Field(default=None, description='The status return code of the request')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.decode_sensitive_headers","title":"decode_sensitive_headers","text":"decode_sensitive_headers(headers)\n
Authorization headers are being converted into SecretStr under the hood to avoid dumping any sensitive content into logs by the encode_sensitive_headers
method.
However, when calling the get_headers
method, the SecretStr should be converted back to string, otherwise sensitive info would have looked like '**********'.
This method decodes values of the headers
dictionary that are of type SecretStr into plain text.
Source code in src/koheesio/steps/http.py
@field_serializer(\"headers\", when_used=\"json\")\ndef decode_sensitive_headers(self, headers):\n \"\"\"\n Authorization headers are being converted into SecretStr under the hood to avoid dumping any\n sensitive content into logs by the `encode_sensitive_headers` method.\n\n However, when calling the `get_headers` method, the SecretStr should be converted back to\n string, otherwise sensitive info would have looked like '**********'.\n\n This method decodes values of the `headers` dictionary that are of type SecretStr into plain text.\n \"\"\"\n for k, v in headers.items():\n headers[k] = v.get_secret_value() if isinstance(v, SecretStr) else v\n return headers\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.delete","title":"delete","text":"delete() -> Response\n
Execute an HTTP DELETE call
Source code in src/koheesio/steps/http.py
def delete(self) -> requests.Response:\n \"\"\"Execute an HTTP DELETE call\"\"\"\n self.method = HttpMethod.DELETE\n return self.request()\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.encode_sensitive_headers","title":"encode_sensitive_headers","text":"encode_sensitive_headers(headers)\n
Encode potentially sensitive data into pydantic.SecretStr class to prevent them being displayed as plain text in logs.
Source code in src/koheesio/steps/http.py
@field_validator(\"headers\", mode=\"before\")\ndef encode_sensitive_headers(cls, headers):\n \"\"\"\n Encode potentially sensitive data into pydantic.SecretStr class to prevent them\n being displayed as plain text in logs.\n \"\"\"\n if auth := headers.get(\"Authorization\"):\n headers[\"Authorization\"] = auth if isinstance(auth, SecretStr) else SecretStr(auth)\n return headers\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.execute","title":"execute","text":"execute() -> Output\n
Executes the HTTP request.
This method simply calls self.request()
, which includes the retry logic. If self.request()
raises an exception, it will be propagated to the caller of this method.
Raises:
Type Description (RequestException, HTTPError)
The last exception that was caught if self.request()
fails after self.max_retries
attempts.
Source code in src/koheesio/steps/http.py
def execute(self) -> Output:\n \"\"\"\n Executes the HTTP request.\n\n This method simply calls `self.request()`, which includes the retry logic. If `self.request()` raises an\n exception, it will be propagated to the caller of this method.\n\n Raises\n ------\n requests.RequestException, requests.HTTPError\n The last exception that was caught if `self.request()` fails after `self.max_retries` attempts.\n \"\"\"\n self.request()\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.get","title":"get","text":"get() -> Response\n
Execute an HTTP GET call
Source code in src/koheesio/steps/http.py
def get(self) -> requests.Response:\n \"\"\"Execute an HTTP GET call\"\"\"\n self.method = HttpMethod.GET\n return self.request()\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.get_headers","title":"get_headers","text":"get_headers()\n
Dump headers into JSON without SecretStr masking.
Source code in src/koheesio/steps/http.py
def get_headers(self):\n \"\"\"\n Dump headers into JSON without SecretStr masking.\n \"\"\"\n return json.loads(self.model_dump_json()).get(\"headers\")\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.get_options","title":"get_options","text":"get_options()\n
options to be passed to requests.request()
Source code in src/koheesio/steps/http.py
def get_options(self):\n \"\"\"options to be passed to requests.request()\"\"\"\n return {\n \"url\": self.url,\n \"headers\": self.get_headers(),\n \"data\": self.data,\n \"timeout\": self.timeout,\n **self.params, # type: ignore\n }\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.get_proper_http_method_from_str_value","title":"get_proper_http_method_from_str_value","text":"get_proper_http_method_from_str_value(method_value)\n
Converts string value to HttpMethod enum value
Source code in src/koheesio/steps/http.py
@field_validator(\"method\")\ndef get_proper_http_method_from_str_value(cls, method_value):\n \"\"\"Converts string value to HttpMethod enum value\"\"\"\n if isinstance(method_value, str):\n try:\n method_value = HttpMethod.from_string(method_value)\n except AttributeError as e:\n raise AttributeError(\n \"Only values from HttpMethod class are allowed! \"\n f\"Provided value: '{method_value}', allowed values: {', '.join(HttpMethod.__members__.keys())}\"\n ) from e\n\n return method_value\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.post","title":"post","text":"post() -> Response\n
Execute an HTTP POST call
Source code in src/koheesio/steps/http.py
def post(self) -> requests.Response:\n \"\"\"Execute an HTTP POST call\"\"\"\n self.method = HttpMethod.POST\n return self.request()\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.put","title":"put","text":"put() -> Response\n
Execute an HTTP PUT call
Source code in src/koheesio/steps/http.py
def put(self) -> requests.Response:\n \"\"\"Execute an HTTP PUT call\"\"\"\n self.method = HttpMethod.PUT\n return self.request()\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.request","title":"request","text":"request(method: Optional[HttpMethod] = None) -> Response\n
Executes the HTTP request with retry logic.
Actual http_method execution is abstracted into this method. This is to avoid unnecessary code duplication. Allows to centrally log, set outputs, and validated.
This method will try to execute requests.request
up to self.max_retries
times. If self.request()
raises an exception, it logs a warning message and the error message, then waits for self.initial_delay * (self.backoff ** i)
seconds before retrying. The delay increases exponentially after each failed attempt due to the self.backoff ** i
term.
If self.request()
still fails after self.max_retries
attempts, it logs an error message and re-raises the last exception that was caught.
This is a good way to handle temporary issues that might cause self.request()
to fail, such as network errors or server downtime. The exponential backoff ensures that you're not constantly bombarding a server with requests if it's struggling to respond.
Parameters:
Name Type Description Default method
HttpMethod
Optional parameter that allows calls to different HTTP methods and bypassing class level method
parameter.
None
Raises:
Type Description (RequestException, HTTPError)
The last exception that was caught if requests.request()
fails after self.max_retries
attempts.
Source code in src/koheesio/steps/http.py
def request(self, method: Optional[HttpMethod] = None) -> requests.Response:\n \"\"\"\n Executes the HTTP request with retry logic.\n\n Actual http_method execution is abstracted into this method.\n This is to avoid unnecessary code duplication. Allows to centrally log, set outputs, and validated.\n\n This method will try to execute `requests.request` up to `self.max_retries` times. If `self.request()` raises\n an exception, it logs a warning message and the error message, then waits for\n `self.initial_delay * (self.backoff ** i)` seconds before retrying. The delay increases exponentially\n after each failed attempt due to the `self.backoff ** i` term.\n\n If `self.request()` still fails after `self.max_retries` attempts, it logs an error message and re-raises the\n last exception that was caught.\n\n This is a good way to handle temporary issues that might cause `self.request()` to fail, such as network errors\n or server downtime. The exponential backoff ensures that you're not constantly bombarding a server with\n requests if it's struggling to respond.\n\n Parameters\n ----------\n method : HttpMethod\n Optional parameter that allows calls to different HTTP methods and bypassing class level `method`\n parameter.\n\n Raises\n ------\n requests.RequestException, requests.HTTPError\n The last exception that was caught if `requests.request()` fails after `self.max_retries` attempts.\n \"\"\"\n _method = (method or self.method).value.upper()\n options = self.get_options()\n\n self.log.debug(f\"Making {_method} request to {options['url']} with headers {options['headers']}\")\n\n response = self.session.request(method=_method, **options)\n response.raise_for_status()\n\n self.log.debug(f\"Received response with status code {response.status_code} and body {response.text}\")\n self.set_outputs(response)\n\n return response\n
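The backoff schedule described above works out as follows: with initial_delay=2 and backoff=2, the waits between attempts are 2, 4, 8, ... seconds. As an illustration only, a minimal sketch of such a retry loop (not the library's exact implementation):

```python
import time

import requests


def request_with_retries(do_request, max_retries=3, initial_delay=2, backoff=2):
    """Illustrative retry loop with exponential backoff (a sketch, not Koheesio's exact code)."""
    last_exception = None
    for i in range(max_retries):
        try:
            return do_request()
        except requests.RequestException as e:  # includes requests.HTTPError
            last_exception = e
            delay = initial_delay * (backoff**i)  # 2, 4, 8, ... seconds with the defaults
            print(f"Attempt {i + 1} failed: {e}. Retrying in {delay} seconds...")
            time.sleep(delay)
    # re-raise the last exception once all attempts are exhausted
    raise last_exception


# Hypothetical usage against an illustrative endpoint:
# request_with_retries(lambda: requests.get("https://api.example.com/data", timeout=30))
```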
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.set_outputs","title":"set_outputs","text":"set_outputs(response)\n
Types of response output
Source code in src/koheesio/steps/http.py
def set_outputs(self, response):\n \"\"\"\n Types of response output\n \"\"\"\n self.output.response_raw = response\n self.output.raw_payload = response.text\n self.output.status_code = response.status_code\n\n # Only decode non empty payloads to avoid triggering decoding error unnecessarily.\n if self.output.raw_payload:\n try:\n self.output.response_json = response.json()\n\n except json.decoder.JSONDecodeError as e:\n self.log.info(f\"An error occurred while processing the JSON payload. Error message:\\n{e.msg}\")\n
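Putting the pieces above together, a minimal usage sketch of HttpStep. The endpoint and token are hypothetical; the fields used (method, url, headers) follow from get_options and the method validator shown above, and the remaining fields (data, timeout, params) are assumed to have defaults:

```python
from koheesio.steps.http import HttpStep

step = HttpStep(
    method="get",                                  # coerced to the HttpMethod enum by the validator
    url="https://api.example.com/data",            # hypothetical endpoint
    headers={"Authorization": "Bearer <token>"},   # hypothetical token
)
response = step.request()       # performs the call and populates the outputs via set_outputs
print(step.output.status_code)
print(step.output.response_json)
```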
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep","title":"koheesio.steps.http.PaginatedHtppGetStep","text":"Represents a paginated HTTP GET step.
Parameters:
Name Type Description Default paginate
bool
Whether to paginate the API response. Defaults to False.
required pages
int
Number of pages to paginate. Defaults to 1.
required offset
int
Offset for paginated API calls. Offset determines the starting page. Defaults to 1.
required limit
int
Limit for paginated API calls. Defaults to 100.
required"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.limit","title":"limit class-attribute
instance-attribute
","text":"limit: Optional[int] = Field(default=100, description='Limit for paginated API calls. The url should (optionally) contain a named limit parameter, for example: api.example.com/data?limit={limit}')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.offset","title":"offset class-attribute
instance-attribute
","text":"offset: Optional[int] = Field(default=1, description=\"Offset for paginated API calls. Offset determines the starting page. Defaults to 1. The url can (optionally) contain a named 'offset' parameter, for example: api.example.com/data?offset={offset}\")\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.pages","title":"pages class-attribute
instance-attribute
","text":"pages: Optional[int] = Field(default=1, description='Number of pages to paginate. Defaults to 1')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.paginate","title":"paginate class-attribute
instance-attribute
","text":"paginate: Optional[bool] = Field(default=False, description=\"Whether to paginate the API response. Defaults to False. When set to True, the API response will be paginated. The url should contain a named 'page' parameter for example: api.example.com/data?page={page}\")\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.execute","title":"execute","text":"execute() -> Output\n
Executes the HTTP GET request and handles pagination.
Returns:
Type Description Output
The output of the HTTP GET request.
Source code in src/koheesio/steps/http.py
def execute(self) -> HttpGetStep.Output:\n \"\"\"\n Executes the HTTP GET request and handles pagination.\n\n Returns\n -------\n HttpGetStep.Output\n The output of the HTTP GET request.\n \"\"\"\n # Set up pagination parameters\n offset, pages = (self.offset, self.pages + 1) if self.paginate else (1, 1) # type: ignore\n data = []\n _basic_url = self.url\n\n for page in range(offset, pages):\n if self.paginate:\n self.log.info(f\"Fetching page {page} of {pages - 1}\")\n\n self.url = self._url(basic_url=_basic_url, page=page)\n self.request()\n\n if isinstance(self.output.response_json, list):\n data += self.output.response_json\n else:\n data.append(self.output.response_json)\n\n self.url = _basic_url\n self.output.response_json = data\n self.output.response_raw = None\n self.output.raw_payload = None\n self.output.status_code = None\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.get_options","title":"get_options","text":"get_options()\n
Returns the options to be passed to the requests.request() function.
Returns:
Type Description dict
The options.
Source code in src/koheesio/steps/http.py
def get_options(self):\n \"\"\"\n Returns the options to be passed to the requests.request() function.\n\n Returns\n -------\n dict\n The options.\n \"\"\"\n options = {\n \"url\": self.url,\n \"headers\": self.get_headers(),\n \"data\": self.data,\n \"timeout\": self.timeout,\n **self._adjust_params(), # type: ignore\n }\n\n return options\n
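For illustration, a minimal usage sketch of PaginatedHtppGetStep against a hypothetical endpoint. The url carries the named page and limit placeholders described by the paginate and limit fields above; how the placeholders are substituted is implied by those field descriptions:

```python
from koheesio.steps.http import PaginatedHtppGetStep

step = PaginatedHtppGetStep(
    url="https://api.example.com/data?page={page}&limit={limit}",  # hypothetical endpoint
    paginate=True,   # substitute {page} for each page in the range
    pages=3,         # fetch pages 1 through 3 (offset .. offset + pages - 1)
    offset=1,        # starting page
    limit=100,       # value substituted into {limit}
)
step.execute()
print(len(step.output.response_json))  # combined results from all fetched pages
```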
"},{"location":"community/approach-documentation.html","title":"Approach documentation","text":"","tags":["doctype/explanation"]},{"location":"community/approach-documentation.html#scope","title":"Scope","text":"","tags":["doctype/explanation"]},{"location":"community/approach-documentation.html#the-system","title":"The System","text":"We will be adopting \"The Documentation System\".
From documentation.divio.com:
There is a secret that needs to be understood in order to write good software documentation: there isn\u2019t one thing called documentation, there are four.
They are: tutorials, how-to guides, technical reference and explanation. They represent four different purposes or functions, and require four different approaches to their creation. Understanding the implications of this will help improve most documentation - often immensely.
About the system The documentation system outlined here is a simple, comprehensive and nearly universally-applicable scheme. It is proven in practice across a wide variety of fields and applications.
There are some very simple principles that govern documentation that are very rarely if ever spelled out. They seem to be a secret, though they shouldn\u2019t be.
If you can put these principles into practice, it will make your documentation better and your project, product or team more successful - that\u2019s a promise.
The system is widely adopted for large and small, open and proprietary documentation projects.
Video Presentation on YouTube:
","tags":["doctype/explanation"]},{"location":"community/contribute.html","title":"Contribute","text":""},{"location":"community/contribute.html#how-to-contribute","title":"How to contribute","text":"There are a few guidelines that we need contributors to follow so that we are able to process requests as efficiently as possible. If you have any questions or concerns please feel free to contact us at opensource@nike.com.
"},{"location":"community/contribute.html#getting-started","title":"Getting Started","text":" - Review our Code of Conduct
- Make sure you have a GitHub account
- Submit a ticket for your issue, assuming one does not already exist.
- Clearly describe the issue including steps to reproduce when it is a bug.
- Make sure you fill in the earliest version that you know has the issue.
- Fork the repository on GitHub
"},{"location":"community/contribute.html#making-changes","title":"Making Changes","text":" - Create a feature branch off of
main
before you start your work. - Please avoid working directly on the
main
branch.
- Setup the required package manager hatch
- Setup the dev environment see below
- Make commits of logical units.
- You may be asked to squash unnecessary commits down to logical units.
- Check for unnecessary whitespace with
git diff --check
before committing. - Write meaningful, descriptive commit messages.
- Please follow existing code conventions when working on a file
- Make sure to check the standards on the code, see below
- Make sure to test the code before you push changes see below
"},{"location":"community/contribute.html#submitting-changes","title":"\ud83e\udd1d Submitting Changes","text":" - Push your changes to a topic branch in your fork of the repository.
- Submit a pull request to the repository in the Nike-Inc organization.
- After feedback has been given we expect responses within two weeks. After two weeks we may close the pull request if it isn't showing any activity.
- Bug fixes or features that lack appropriate tests may not be considered for merge.
- Changes that lower test coverage may not be considered for merge.
"},{"location":"community/contribute.html#make-commands","title":"\ud83d\udd28 Make commands","text":"We use make
for managing different steps of setup and maintenance in the project. You can install make by following the instructions here
For a full list of available make commands, you can run:
make help\n
"},{"location":"community/contribute.html#package-manager","title":"\ud83d\udce6 Package manager","text":"We use hatch
as our package manager.
Note: Please DO NOT use pip or conda to install the dependencies. Instead, use hatch.
To install hatch, run the following command:
make init\n
or,
make hatch-install\n
This will install hatch using brew if you are on a Mac.
If you are on a different OS, you can follow the instructions here
"},{"location":"community/contribute.html#dev-environment-setup","title":"\ud83d\udccc Dev Environment Setup","text":"To ensure our standards, make sure to install the required packages.
make dev\n
This will install all the required packages for development in the project under the .venv
directory. Use this virtual environment to run the code and tests during local development.
"},{"location":"community/contribute.html#linting-and-standards","title":"\ud83e\uddf9 Linting and Standards","text":"We use ruff
, pylint
, isort
, black
and mypy
to maintain standards in the codebase.
Run the following two commands to check the codebase for any issues:
make check\n
This will run all the checks including pylint and mypy. make fmt\n
This will format the codebase using black, isort, and ruff. Make sure that the linters and formatters do not report any errors or warnings before submitting a pull request.
"},{"location":"community/contribute.html#testing","title":"\ud83e\uddea Testing","text":"We use pytest
to test our code.
You can run the tests by running one of the following commands:
make cov # to run the tests and check the coverage\nmake all-tests # to run all the tests\nmake spark-tests # to run the spark tests\nmake non-spark-tests # to run the non-spark tests\n
Make sure that all tests pass and that you have adequate coverage before submitting a pull request.
"},{"location":"community/contribute.html#additional-resources","title":"Additional Resources","text":" - General GitHub documentation
- GitHub pull request documentation
- Nike's Code of Conduct
- Nike's Individual Contributor License Agreement
- Nike OSS
"},{"location":"includes/glossary.html","title":"Glossary","text":""},{"location":"includes/glossary.html#pydantic","title":"Pydantic","text":"Pydantic is a Python library for data validation and settings management using Python type annotations. It allows Koheesio to bring in strong typing and a high level of type safety. Essentially, it allows Koheesio to consider configurations of a pipeline (i.e. the settings used inside Steps, Tasks, etc.) as data that can be validated and structured.
"},{"location":"includes/glossary.html#pyspark","title":"PySpark","text":"PySpark is a Python library for Apache Spark, a powerful open-source data processing engine. It allows Koheesio to handle large-scale data processing tasks efficiently.
"},{"location":"misc/info.html","title":"Info","text":"{{ macros_info() }}
"},{"location":"reference/concepts/concepts.html","title":"Concepts","text":"The framework architecture is built from a set of core components. Each of the implementations that the framework provides out of the box, can be swapped out for custom implementations as long as they match the API.
The core components are the following:
Note: click on the 'Concept' to take you to the corresponding module. The module documentation will have greater detail on the specifics of the implementation
"},{"location":"reference/concepts/concepts.html#step","title":"Step","text":"A custom unit of logic that can be executed. A Step is an atomic operation and serves as the building block of data pipelines built with the framework. A step can be seen as an operation on a set of inputs, and returns a set of outputs. This does not imply that steps are stateless (e.g. data writes)! This concept is visualized in the figure below.
\nflowchart LR\n\n%% Should render like this\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 1 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 1 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n%% \u2502 \u2502\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 2 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Step \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 2 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n%% \u2502 \u2502\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 3 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 3 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n\n%% is for increasing the box size without having to mess with CSS settings\nStep[\"\n \n \n \nStep\n \n \n \n\"]\n\nI1[\"Input 1\"] ---> Step\nI2[\"Input 2\"] ---> Step\nI3[\"Input 3\"] ---> Step\n\nStep ---> O1[\"Output 1\"]\nStep ---> O2[\"Output 2\"]\nStep ---> O3[\"Output 3\"]\n
Step is the core abstraction of the framework. Meaning, that it is the core building block of the framework and is used to define all the operations that can be executed.
Please see the Step documentation for more details.
"},{"location":"reference/concepts/concepts.html#task","title":"Task","text":"The unit of work of one execution of the framework.
An execution usually consists of an Extract - Transform - Load
approach of one data object. Tasks typically consist of a series of Steps.
Please see the Task documentation for more details.
"},{"location":"reference/concepts/concepts.html#context","title":"Context","text":"The Context is used to configure the environment where a Task or Step runs.
It is often based on configuration files and can be used to adapt behaviour of a Task or Step based on the environment it runs in.
Please see the Context documentation for more details.
"},{"location":"reference/concepts/concepts.html#logger","title":"logger","text":"A logger object to log messages with different levels.
Please see the Logging documentation for more details.
The interactions between the base concepts of the model is visible in the below diagram:
---\ntitle: Koheesio Class Diagram\n---\nclassDiagram\n Step .. Task\n Step .. Transformation\n Step .. Reader\n Step .. Writer\n\n class Context\n\n class LoggingFactory\n\n class Task{\n <<abstract>>\n + List~Step~ steps\n ...\n + execute() Output\n }\n\n class Step{\n <<abstract>>\n ...\n Output: ...\n + execute() Output\n }\n\n class Transformation{\n <<abstract>>\n + df: DataFrame\n ...\n Output:\n + df: DataFrame\n + transform(df: DataFrame) DataFrame\n }\n\n class Reader{\n <<abstract>>\n ...\n Output:\n + df: DataFrame\n + read() DataFrame\n }\n\n class Writer{\n <<abstract>>\n + df: DataFrame\n ...\n + write(df: DataFrame)\n }
"},{"location":"reference/concepts/context.html","title":"Context in Koheesio","text":"In the Koheesio framework, the Context
class plays a pivotal role. It serves as a flexible and powerful tool for managing configuration data and shared variables across tasks and steps in your application.
Context
behaves much like a Python dictionary, but with additional features that enhance its usability and flexibility. It allows you to store and retrieve values, including complex Python objects, with ease. You can access these values using dictionary-like methods or as class attributes, providing a simple and intuitive interface.
Moreover, Context
supports nested keys and recursive merging of contexts, making it a versatile tool for managing complex configurations. It also provides serialization and deserialization capabilities, allowing you to easily save and load configurations in JSON, YAML, or TOML formats.
Whether you're setting up the environment for a Task or Step, or managing variables shared across multiple tasks, Context
provides a robust and efficient solution.
This document will guide you through its key features and show you how to leverage its capabilities in your Koheesio applications.
"},{"location":"reference/concepts/context.html#api-reference","title":"API Reference","text":"See API Reference for a detailed description of the Context
class and its methods.
"},{"location":"reference/concepts/context.html#key-features","title":"Key Features","text":" -
Accessing Values: Context
simplifies accessing configuration values. You can access them using dictionary-like methods or as class attributes. This allows for a more intuitive interaction with the Context
object. For example:
context = Context({\"bronze_table\": \"catalog.schema.table_name\"})\nprint(context.bronze_table) # Outputs: catalog.schema.table_name\n
-
Nested Keys: Context
supports nested keys, allowing you to access and add nested keys in a straightforward way. This is useful when dealing with complex configurations that require a hierarchical structure. For example:
context = Context({\"bronze\": {\"table\": \"catalog.schema.table_name\"}})\nprint(context.bronze.table) # Outputs: catalog.schema.table_name\n
-
Merging Contexts: You can merge two Contexts
together, with the incoming Context
having priority. Recursive merging is also supported. This is particularly useful when you want to update a Context
with new data without losing the existing values. For example:
context1 = Context({\"bronze_table\": \"catalog.schema.table_name\"})\ncontext2 = Context({\"silver_table\": \"catalog.schema.table_name\"})\ncontext1.merge(context2)\nprint(context1.silver_table) # Outputs: catalog.schema.table_name\n
-
Adding Keys: You can add keys to a Context by using the add
method. This allows you to dynamically update the Context
as needed. For example:
context.add(\"silver_table\", \"catalog.schema.table_name\")\n
-
Checking Key Existence: You can check if a key exists in a Context by using the contains
method. This is useful when you want to ensure a key is present before attempting to access its value. For example:
context.contains(\"silver_table\") # Returns: True\n
-
Getting Key-Value Pair: You can get a key-value pair from a Context by using the get_item
method. This can be useful when you want to extract a specific piece of data from the Context
. For example:
context.get_item(\"silver_table\") # Returns: {\"silver_table\": \"catalog.schema.table_name\"}\n
-
Converting to Dictionary: You can convert a Context to a dictionary by using the to_dict
method. This can be useful when you need to interact with code that expects a standard Python dictionary. For example:
context_dict = context.to_dict()\n
-
Creating from Dictionary: You can create a Context from a dictionary by using the from_dict
method. This allows you to easily convert existing data structures into a Context
. For example:
context = Context.from_dict({\"bronze_table\": \"catalog.schema.table_name\"})\n
"},{"location":"reference/concepts/context.html#advantages-over-a-dictionary","title":"Advantages over a Dictionary","text":"While a dictionary can be used to store configuration values, Context
provides several advantages:
-
Support for nested keys: Unlike a standard Python dictionary, Context
allows you to access nested keys as if they were attributes. This makes it easier to work with complex, hierarchical data.
-
Recursive merging of two Contexts
: Context
allows you to merge two Contexts
together, with the incoming Context
having priority. This is useful when you want to update a Context
with new data without losing the existing values.
-
Accessing keys as if they were class attributes: This provides a more intuitive way to interact with the Context
, as you can use dot notation to access values.
-
Code completion in IDEs: Because you can access keys as if they were attributes, IDEs can provide code completion for Context
keys. This can make your coding process more efficient and less error-prone.
-
Easy creation from a YAML, JSON, or TOML file: Context
provides methods to easily load data from YAML or JSON files, making it a great tool for managing configuration data.
"},{"location":"reference/concepts/context.html#data-formats-and-serialization","title":"Data Formats and Serialization","text":"Context
leverages JSON, YAML, and TOML for serialization and deserialization. These formats are widely used in the industry and provide a balance between readability and ease of use.
-
JSON: A lightweight data-interchange format that is easy for humans to read and write and easy for machines to parse and generate. It's widely used for APIs and web-based applications.
-
YAML: A human-friendly data serialization standard often used for configuration files. It's more readable than JSON and supports complex data structures.
-
TOML: A minimal configuration file format that's easy to read due to its clear and simple syntax. It's often used for configuration files in Python applications.
"},{"location":"reference/concepts/context.html#examples","title":"Examples","text":"In this section, we provide a variety of examples to demonstrate the capabilities of the Context
class in Koheesio.
"},{"location":"reference/concepts/context.html#basic-operations","title":"Basic Operations","text":"Here are some basic operations you can perform with Context
. These operations form the foundation of how you interact with a Context
object:
# Create a Context\ncontext = Context({\"bronze_table\": \"catalog.schema.table_name\"})\n\n# Access a value\nvalue = context.bronze_table\n\n# Add a key\ncontext.add(\"silver_table\", \"catalog.schema.table_name\")\n\n# Merge two Contexts\ncontext.merge(Context({\"silver_table\": \"catalog.schema.table_name\"}))\n
"},{"location":"reference/concepts/context.html#serialization-and-deserialization","title":"Serialization and Deserialization","text":"Context
supports serialization and deserialization to and from JSON, YAML, and TOML formats. This allows you to easily save and load Context
data:
# Load context from a JSON file\ncontext = Context.from_json(\"path/to/context.json\")\n\n# Save context to a JSON file\ncontext.to_json(\"path/to/context.json\")\n\n# Load context from a YAML file\ncontext = Context.from_yaml(\"path/to/context.yaml\")\n\n# Save context to a YAML file\ncontext.to_yaml(\"path/to/context.yaml\")\n\n# Load context from a TOML file\ncontext = Context.from_toml(\"path/to/context.toml\")\n\n# Save context to a TOML file\ncontext.to_toml(\"path/to/context.toml\")\n
"},{"location":"reference/concepts/context.html#nested-keys","title":"Nested Keys","text":"Context
supports nested keys, allowing you to create hierarchical configurations. This is useful when dealing with complex data structures:
# Create a Context with nested keys\ncontext = Context({\n \"database\": {\n \"bronze_table\": \"catalog.schema.bronze_table\",\n \"silver_table\": \"catalog.schema.silver_table\",\n \"gold_table\": \"catalog.schema.gold_table\"\n }\n})\n\n# Access a nested key\nprint(context.database.bronze_table) # Outputs: catalog.schema.bronze_table\n
"},{"location":"reference/concepts/context.html#recursive-merging","title":"Recursive Merging","text":"Context
also supports recursive merging, allowing you to merge two Contexts
together at all levels of their hierarchy. This is particularly useful when you want to update a Context
with new data without losing the existing values:
# Create two Contexts with nested keys\ncontext1 = Context({\n \"database\": {\n \"bronze_table\": \"catalog.schema.bronze_table\",\n \"silver_table\": \"catalog.schema.silver_table\"\n }\n})\n\ncontext2 = Context({\n \"database\": {\n \"silver_table\": \"catalog.schema.new_silver_table\",\n \"gold_table\": \"catalog.schema.gold_table\"\n }\n})\n\n# Merge the two Contexts\ncontext1.merge(context2)\n\n# Print the merged Context\nprint(context1.to_dict()) \n# Outputs: \n# {\n# \"database\": {\n# \"bronze_table\": \"catalog.schema.bronze_table\",\n# \"silver_table\": \"catalog.schema.new_silver_table\",\n# \"gold_table\": \"catalog.schema.gold_table\"\n# }\n# }\n
"},{"location":"reference/concepts/context.html#jsonpickle-and-complex-python-objects","title":"Jsonpickle and Complex Python Objects","text":"The Context
class in Koheesio also uses jsonpickle
for serialization and deserialization of complex Python objects to and from JSON. This allows you to convert complex Python objects, including custom classes, into a format that can be easily stored and transferred.
Here's an example of how this works:
# Import necessary modules\nfrom koheesio.context import Context\n\n# Initialize SnowflakeReader and store in a Context\nsnowflake_reader = SnowflakeReader(...) # fill in with necessary arguments\ncontext = Context({\"snowflake_reader\": snowflake_reader})\n\n# Serialize the Context to a JSON string\njson_str = context.to_json()\n\n# Print the serialized Context\nprint(json_str)\n\n# Deserialize the JSON string back into a Context\ndeserialized_context = Context.from_json(json_str)\n\n# Access the deserialized SnowflakeReader\ndeserialized_snowflake_reader = deserialized_context.snowflake_reader\n\n# Now you can use the deserialized SnowflakeReader as you would the original\n
This feature is particularly useful when you need to save the state of your application, transfer it over a network, or store it in a database. When you're ready to use the stored data, you can easily convert it back into the original Python objects.
However, there are a few things to keep in mind:
-
The classes you're serializing must be importable (i.e., they must be in the Python path) when you're deserializing the JSON. jsonpickle
needs to be able to import the class to reconstruct the object. This holds true for most Koheesio classes, as they are designed to be importable and reconstructible.
-
Not all Python objects can be serialized. For example, objects that hold a reference to a file or a network connection can't be serialized because their state can't be easily captured in a static file.
-
As mentioned in the code comments, jsonpickle
is not secure against malicious data. You should only deserialize data that you trust.
So, while the Context
class provides a powerful tool for handling complex Python objects, it's important to be aware of these limitations.
"},{"location":"reference/concepts/context.html#conclusion","title":"Conclusion","text":"In this document, we've covered the key features of the Context
class in the Koheesio framework, including its ability to handle complex Python objects, support for nested keys and recursive merging, and its serialization and deserialization capabilities.
Whether you're setting up the environment for a Task or Step, or managing variables shared across multiple tasks, Context
provides a robust and efficient solution.
"},{"location":"reference/concepts/context.html#further-reading","title":"Further Reading","text":"For more information, you can refer to the following resources:
- Python jsonpickle Documentation
- Python JSON Documentation
- Python YAML Documentation
- Python TOML Documentation
Refer to the API documentation for more details on the Context
class and its methods.
"},{"location":"reference/concepts/logger.html","title":"Python Logger Code Instructions","text":"Here you can find instructions on how to use the Koheesio Logging Factory.
"},{"location":"reference/concepts/logger.html#logging-factory","title":"Logging Factory","text":"The LoggingFactory
class is a factory for creating and configuring loggers. To use it, follow these steps:
-
Import the necessary modules:
from koheesio.logger import LoggingFactory\n
-
Initialize logging factory for koheesio modules:
factory = LoggingFactory(name=\"replace_koheesio_parent_name\", env=\"local\", logger_id=\"your_run_id\")\n# Or use default \nfactory = LoggingFactory()\n# Or just specify log level for koheesio modules\nfactory = LoggingFactory(level=\"DEBUG\")\n
-
Create a logger by calling the create_logger
method of the LoggingFactory
class, you can inherit from koheesio logger:
python logger = LoggingFactory.get_logger(name=factory.LOGGER_NAME) # Or for koheesio modules logger = LoggingFactory.get_logger(name=factory.LOGGER_NAME,inherit_from_koheesio=True)
-
You can now use the logger
object to log messages:
logger.debug(\"Debug message\")\nlogger.info(\"Info message\")\nlogger.warning(\"Warning message\")\nlogger.error(\"Error message\")\nlogger.critical(\"Critical message\")\n
-
(Optional) You can add additional handlers to the logger by calling the add_handlers
method of the LoggingFactory
class:
handlers = [\n (\"your_handler_module.YourHandlerClass\", {\"level\": \"INFO\"}),\n # Add more handlers if needed\n]\nfactory.add_handlers(handlers)\n
-
(Optional) You can create child loggers based on the parent logger by calling the get_logger
method of the LoggingFactory
class:
child_logger = factory.get_logger(name=\"your_child_logger_name\")\n
-
(Optional) Get an independent logger without inheritance
If you need an independent logger without inheriting from the LoggingFactory
logger, you can use the get_logger
method:
your_logger = factory.get_logger(name=\"your_logger_name\", inherit=False)\n
By setting inherit
to False
, you will obtain a logger that is not tied to the LoggingFactory
logger hierarchy, only format of message will be the same, but you can also change it. This allows you to have an independent logger with its own configuration. You can use the your_logger
object to log messages:
```python\nyour_logger.debug(\"Debug message\")\nyour_logger.info(\"Info message\")\nyour_logger.warning(\"Warning message\")\nyour_logger.error(\"Error message\")\nyour_logger.critical(\"Critical message\")\n```\n
-
(Optional) You can use Masked types to masked secrets/tokens/passwords in output. The Masked types are special types provided by the koheesio library to handle sensitive data that should not be logged or printed in plain text. They are used to wrap sensitive data and override their string representation to prevent accidental exposure of the data.Here are some examples of how to use Masked types:
import logging\nfrom koheesio.logger import MaskedString, MaskedInt, MaskedFloat, MaskedDict\n\n# Set up logging\nlogger = logging.getLogger(__name__)\nlogging.basicConfig(level=logging.INFO)\n\n# Using MaskedString\nmasked_string = MaskedString(\"my secret string\")\nlogger.info(masked_string) # This will not log the actual string\n\n# Using MaskedInt\nmasked_int = MaskedInt(12345)\nlogger.info(masked_int) # This will not log the actual integer\n\n# Using MaskedFloat\nmasked_float = MaskedFloat(3.14159)\nlogger.info(masked_float) # This will not log the actual float\n\n# Using MaskedDict\nmasked_dict = MaskedDict({\"key\": \"value\"})\nlogger.info(masked_dict) # This will not log the actual dictionary\n
Please make sure to replace \"your_logger_name\", \"your_run_id\", \"your_handler_module.YourHandlerClass\", \"your_child_logger_name\", and other placeholders with your own values according to your application's requirements.
By following these steps, you can obtain an independent logger without inheriting from the LoggingFactory
logger. This allows you to customize the logger configuration and use it separately in your code.
Note: Ensure that you have imported the necessary modules, instantiated the LoggingFactory
class, and customized the logger name and other parameters according to your application's requirements.
"},{"location":"reference/concepts/logger.html#example","title":"Example","text":"import logging\n\n# Step 2: Instantiate the LoggingFactory class\nfactory = LoggingFactory(env=\"local\")\n\n# Step 3: Create an independent logger with a custom log level\nyour_logger = factory.get_logger(\"your_logger\", inherit_from_koheesio=False)\nyour_logger.setLevel(logging.DEBUG)\n\n# Step 4: Create a logger using the create_logger method from LoggingFactory with a different log level\nfactory_logger = LoggingFactory(level=\"WARNING\").get_logger(name=factory.LOGGER_NAME)\n\n# Step 5: Create a child logger with a debug level\nchild_logger = factory.get_logger(name=\"child\")\nchild_logger.setLevel(logging.DEBUG)\n\nchild2_logger = factory.get_logger(name=\"child2\")\nchild2_logger.setLevel(logging.INFO)\n\n# Step 6: Log messages at different levels for both loggers\nyour_logger.debug(\"Debug message\") # This message will be displayed\nyour_logger.info(\"Info message\") # This message will be displayed\nyour_logger.warning(\"Warning message\") # This message will be displayed\nyour_logger.error(\"Error message\") # This message will be displayed\nyour_logger.critical(\"Critical message\") # This message will be displayed\n\nfactory_logger.debug(\"Debug message\") # This message will not be displayed\nfactory_logger.info(\"Info message\") # This message will not be displayed\nfactory_logger.warning(\"Warning message\") # This message will be displayed\nfactory_logger.error(\"Error message\") # This message will be displayed\nfactory_logger.critical(\"Critical message\") # This message will be displayed\n\nchild_logger.debug(\"Debug message\") # This message will be displayed\nchild_logger.info(\"Info message\") # This message will be displayed\nchild_logger.warning(\"Warning message\") # This message will be displayed\nchild_logger.error(\"Error message\") # This message will be displayed\nchild_logger.critical(\"Critical message\") # This message will be displayed\n\nchild2_logger.debug(\"Debug message\") # This message will be displayed\nchild2_logger.info(\"Info message\") # This message will be displayed\nchild2_logger.warning(\"Warning message\") # This message will be displayed\nchild2_logger.error(\"Error message\") # This message will be displayed\nchild2_logger.critical(\"Critical message\") # This message will be displayed\n
Output:
[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [DEBUG] [your_logger] {__init__.py:<module>:118} - Debug message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [INFO] [your_logger] {__init__.py:<module>:119} - Info message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [WARNING] [your_logger] {__init__.py:<module>:120} - Warning message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [ERROR] [your_logger] {__init__.py:<module>:121} - Error message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [CRITICAL] [your_logger] {__init__.py:<module>:122} - Critical message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [WARNING] [koheesio] {__init__.py:<module>:126} - Warning message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [ERROR] [koheesio] {__init__.py:<module>:127} - Error message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [CRITICAL] [koheesio] {__init__.py:<module>:128} - Critical message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [DEBUG] [koheesio.child] {__init__.py:<module>:130} - Debug message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [INFO] [koheesio.child] {__init__.py:<module>:131} - Info message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [WARNING] [koheesio.child] {__init__.py:<module>:132} - Warning message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [ERROR] [koheesio.child] {__init__.py:<module>:133} - Error message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [CRITICAL] [koheesio.child] {__init__.py:<module>:134} - Critical message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [INFO] [koheesio.child2] {__init__.py:<module>:137} - Info message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [WARNING] [koheesio.child2] {__init__.py:<module>:138} - Warning message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [ERROR] [koheesio.child2] {__init__.py:<module>:139} - Error message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [CRITICAL] [koheesio.child2] {__init__.py:<module>:140} - Critical message\n
"},{"location":"reference/concepts/logger.html#loggeridfilter-class","title":"LoggerIDFilter Class","text":"The LoggerIDFilter
class is a filter that injects run_id
information into the log. To use it, follow these steps:
-
Import the necessary modules:
import logging\n
-
Create an instance of the LoggerIDFilter
class:
logger_filter = LoggerIDFilter()\n
-
Set the LOGGER_ID
attribute of the LoggerIDFilter
class to the desired run ID:
LoggerIDFilter.LOGGER_ID = \"your_run_id\"\n
-
Add the logger_filter
to your logger or handler:
logger = logging.getLogger(\"your_logger_name\")\nlogger.addFilter(logger_filter)\n
"},{"location":"reference/concepts/logger.html#loggingfactory-set-up-optional","title":"LoggingFactory Set Up (Optional)","text":" -
Import the LoggingFactory
class in your application code.
-
Set the value for the LOGGER_FILTER
variable:
- If you want to assign a specific
logging.Filter
instance, replace None
with your desired filter instance. -
If you want to keep the default value of None
, leave it unchanged.
-
Set the value for the LOGGER_LEVEL
variable:
- If you want to use the value from the
\"KOHEESIO_LOGGING_LEVEL\"
environment variable, leave the code as is. -
If you want to use a different environment variable or a specific default value, modify the code accordingly.
-
Set the value for the LOGGER_ENV
variable:
-
Replace \"local\"
with your desired environment name.
-
Set the value for the LOGGER_FORMAT
variable:
- If you want to customize the log message format, modify the value within the double quotes.
-
The format should follow the desired log message format pattern.
-
Set the value for the LOGGER_FORMATTER
variable:
- If you want to assign a specific
Formatter
instance, replace Formatter(LOGGER_FORMAT)
with your desired formatter instance. -
If you want to keep the default formatter with the defined log message format, leave it unchanged.
-
Set the value for the CONSOLE_HANDLER
variable:
- If you want to assign a specific
logging.Handler
instance, replace None
with your desired handler instance. - If you want to keep the default value of
None
, leave it unchanged.
-
Set the value for the ENV
variable:
- Replace
None
with your desired environment value if applicable. - If you don't need to set this variable, leave it as
None
.
-
Save the changes to the file.
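A hedged sketch of what such a set-up could look like, assuming the class-level variables named above can be assigned before the factory is instantiated (all values are illustrative):

```python
from logging import Formatter

from koheesio.logger import LoggingFactory

# Illustrative values only; adjust to your application's requirements.
LoggingFactory.LOGGER_ENV = "dev"
LoggingFactory.LOGGER_LEVEL = "INFO"  # instead of reading the KOHEESIO_LOGGING_LEVEL environment variable
LoggingFactory.LOGGER_FORMAT = "[%(asctime)s] [%(levelname)s] [%(name)s] - %(message)s"
LoggingFactory.LOGGER_FORMATTER = Formatter(LoggingFactory.LOGGER_FORMAT)

factory = LoggingFactory(name="my_app", env=LoggingFactory.LOGGER_ENV)
logger = factory.get_logger(name="my_app.pipeline")
logger.info("Logger configured with customized defaults")
```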
"},{"location":"reference/concepts/step.html","title":"Steps in Koheesio","text":"In the Koheesio framework, the Step
class and its derivatives play a crucial role. They serve as the building blocks for creating data pipelines, allowing you to define custom units of logic that can be executed. This document will guide you through its key features and show you how to leverage its capabilities in your Koheesio applications.
Several type of Steps are available in Koheesio, including Reader
, Transformation
, Writer
, and Task
.
"},{"location":"reference/concepts/step.html#what-is-a-step","title":"What is a Step?","text":"A Step
is an atomic operation serving as the building block of data pipelines built with the Koheesio framework. Tasks typically consist of a series of Steps.
A step can be seen as an operation on a set of inputs, that returns a set of outputs. This does not imply that steps are stateless (e.g. data writes)! This concept is visualized in the figure below.
\nflowchart LR\n\n%% Should render like this\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 1 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 1 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n%% \u2502 \u2502\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 2 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Step \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 2 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n%% \u2502 \u2502\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 3 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 3 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n\n%% is for increasing the box size without having to mess with CSS settings\nStep[\"\n \n \n \nStep\n \n \n \n\"]\n\nI1[\"Input 1\"] ---> Step\nI2[\"Input 2\"] ---> Step\nI3[\"Input 3\"] ---> Step\n\nStep ---> O1[\"Output 1\"]\nStep ---> O2[\"Output 2\"]\nStep ---> O3[\"Output 3\"]\n
"},{"location":"reference/concepts/step.html#how-to-read-a-step","title":"How to Read a Step?","text":"A Step
in Koheesio is a class that represents a unit of work in a data pipeline. It's similar to a Python built-in data class, but with additional features for execution, validation, and logging.
When you look at a Step
, you'll typically see the following components:
-
Class Definition: The Step
is defined as a class that inherits from the base Step
class in Koheesio. For example, class MyStep(Step):
.
-
Input Fields: These are defined as class attributes with type annotations, similar to attributes in a Python data class. These fields represent the inputs to the Step
. For example, a: str
defines an input field a
of type str
. Additionally, you will often see these fields defined using Pydantic's Field
class, which allows for more detailed validation and documentation as well as default values and aliasing.
-
Output Fields: These are defined in a nested class called Output
that inherits from StepOutput
. This class represents the output of the Step
. For example, class Output(StepOutput): b: str
defines an output field b
of type str
.
-
Execute Method: This is a method that you need to implement when you create a new Step
. It contains the logic of the Step
and is where you use the input fields and populate the output fields. For example, def execute(self): self.output.b = f\"{self.a}-some-suffix\"
.
Here's an example of a Step
:
class MyStep(Step):\n a: str # input\n\n class Output(StepOutput): # output\n b: str\n\n def execute(self) -> MyStep.Output:\n self.output.b = f\"{self.a}-some-suffix\"\n
In this Step
, a
is an input field of type str
, b
is an output field of type str
, and the execute
method appends -some-suffix
to the input a
and assigns it to the output b
.
When you see a Step
, you can think of it as a function where the class attributes are the inputs, the Output
class defines the outputs, and the execute
method is the function body. The main difference is that a Step
also includes automatic validation of inputs and outputs (thanks to Pydantic), logging, and error handling.
"},{"location":"reference/concepts/step.html#understanding-inheritance-in-steps","title":"Understanding Inheritance in Steps","text":"Inheritance is a core concept in object-oriented programming where a class (child or subclass) inherits properties and methods from another class (parent or superclass). In the context of Koheesio, when you create a new Step
, you're creating a subclass that inherits from the base Step
class.
When a new Step is defined (like class MyStep(Step):
), it inherits all the properties and methods from the Step
class. This includes the execute
method, which is then overridden to provide the specific functionality for that Step.
Here's a simple breakdown:
-
Parent Class (Superclass): This is the Step
class in Koheesio. It provides the basic structure and functionalities of a Step, including input and output validation, logging, and error handling.
-
Child Class (Subclass): This is the new Step you define, like MyStep
. It inherits all the properties and methods from the Step
class and can add or override them as needed.
-
Inheritance: This is the process where MyStep
inherits the properties and methods from the Step
class. In Python, this is done by mentioning the parent class in parentheses when defining the child class, like class MyStep(Step):
.
-
Overriding: This is when you provide a new implementation of a method in the child class that is already defined in the parent class. In the case of Steps, you override the execute
method to define the specific logic of your Step.
Understanding inheritance is key to understanding how Steps work in Koheesio. It allows you to leverage the functionalities provided by the Step
class and focus on implementing the specific logic of your Step.
"},{"location":"reference/concepts/step.html#benefits-of-using-steps-in-data-pipelines","title":"Benefits of Using Steps in Data Pipelines","text":"The concept of a Step
is beneficial when creating Data Pipelines or Data Products for several reasons:
-
Modularity: Each Step
represents a self-contained unit of work, which makes the pipeline modular. This makes it easier to understand, test, and maintain the pipeline. If a problem arises, you can pinpoint which step is causing the issue.
-
Reusability: Steps can be reused across different pipelines. Once a Step
is defined, it can be used in any number of pipelines. This promotes code reuse and consistency across projects.
-
Readability: Steps make the pipeline code more readable. Each Step
has a clear input, output, and execution logic, which makes it easier to understand what each part of the pipeline is doing.
-
Validation: Steps automatically validate their inputs and outputs. This ensures that the data flowing into and out of each step is of the expected type and format, which can help catch errors early.
-
Logging: Steps automatically log the start and end of their execution, along with the input and output data. This can be very useful for debugging and understanding the flow of data through the pipeline.
-
Error Handling: Steps provide built-in error handling. If an error occurs during the execution of a step, it is caught, logged, and then re-raised. This provides a clear indication of where the error occurred.
-
Scalability: Steps can be easily parallelized or distributed, which is crucial for processing large datasets. This is especially true for steps that are designed to work with distributed computing frameworks like Apache Spark.
By using the concept of a Step
, you can create data pipelines that are modular, reusable, readable, and robust, while also being easier to debug and scale.
"},{"location":"reference/concepts/step.html#compared-to-a-regular-pydantic-basemodel","title":"Compared to a regular Pydantic Basemodel","text":"A Step
in Koheesio, while built on top of Pydantic's BaseModel
, provides additional features specifically designed for creating data pipelines. Here are some key differences:
-
Execution Method: A Step
includes an execute
method that needs to be implemented. This method contains the logic of the step and is automatically decorated with functionalities such as logging and output validation.
-
Input and Output Validation: A Step
uses Pydantic models to define and validate its inputs and outputs. This ensures that the data flowing into and out of the step is of the expected type and format.
-
Automatic Logging: A Step
automatically logs the start and end of its execution, along with the input and output data. This is done through the do_execute
decorator applied to the execute
method.
-
Error Handling: A Step
provides built-in error handling. If an error occurs during the execution of the step, it is caught, logged, and then re-raised. This should help in debugging and understanding the flow of data.
-
Serialization: A Step
can be serialized to a YAML string using the to_yaml
method. This can be useful for saving and loading steps.
-
Lazy Mode Support: The StepOutput
class in a Step
supports lazy mode, which allows validation of the items stored in the class to be called at will instead of being forced to run it upfront.
In contrast, a regular Pydantic BaseModel
is a simple data validation model that doesn't include these additional features. It's used for data parsing and validation, but doesn't include methods for execution, automatic logging, error handling, or serialization to YAML.
"},{"location":"reference/concepts/step.html#key-features-of-a-step","title":"Key Features of a Step","text":""},{"location":"reference/concepts/step.html#defining-a-step","title":"Defining a Step","text":"To define a new step, you subclass the Step
class and implement the execute
method. The inputs of the step can be accessed using self.input_name
. The output of the step can be accessed using self.output.output_name
. For example:
class MyStep(Step):\n input1: str = Field(...)\n input2: int = Field(...)\n\n class Output(StepOutput):\n output1: str = Field(...)\n\n def execute(self):\n # Your logic here\n self.output.output1 = \"result\"\n
"},{"location":"reference/concepts/step.html#running-a-step","title":"Running a Step","text":"To run a step, you can call the execute
method. You can also use the run
method, which is an alias to execute
. For example:
step = MyStep(input1=\"value1\", input2=2)\nstep.execute()\n
"},{"location":"reference/concepts/step.html#accessing-step-output","title":"Accessing Step Output","text":"The output of a step can be accessed using self.output.output_name
. For example:
step = MyStep(input1=\"value1\", input2=2)\nstep.execute()\nprint(step.output.output1) # Outputs: \"result\"\n
"},{"location":"reference/concepts/step.html#serializing-a-step","title":"Serializing a Step","text":"You can serialize a step to a YAML string using the to_yaml
method. For example:
step = MyStep(input1=\"value1\", input2=2)\nyaml_str = step.to_yaml()\n
"},{"location":"reference/concepts/step.html#getting-step-description","title":"Getting Step Description","text":"You can get the description of a step using the get_description
method. For example:
step = MyStep(input1=\"value1\", input2=2)\ndescription = step.get_description()\n
"},{"location":"reference/concepts/step.html#defining-a-step-with-multiple-inputs-and-outputs","title":"Defining a Step with Multiple Inputs and Outputs","text":"Here's an example of how to define a new step with multiple inputs and outputs:
class MyStep(Step):\n input1: str = Field(...)\n input2: int = Field(...)\n input3: int = Field(...)\n\n class Output(StepOutput):\n output1: str = Field(...)\n output2: int = Field(...)\n\n def execute(self):\n # Your logic here\n self.output.output1 = \"result\"\n self.output.output2 = self.input2 + self.input3\n
"},{"location":"reference/concepts/step.html#running-a-step-with-multiple-inputs","title":"Running a Step with Multiple Inputs","text":"To run a step with multiple inputs, you can do the following:
step = MyStep(input1=\"value1\", input2=2, input3=3)\nstep.execute()\n
"},{"location":"reference/concepts/step.html#accessing-multiple-step-outputs","title":"Accessing Multiple Step Outputs","text":"The outputs of a step can be accessed using self.output.output_name
. For example:
step = MyStep(input1=\"value1\", input2=2, input3=3)\nstep.execute()\nprint(step.output.output1) # Outputs: \"result\"\nprint(step.output.output2) # Outputs: 5\n
"},{"location":"reference/concepts/step.html#special-features","title":"Special Features","text":""},{"location":"reference/concepts/step.html#the-execute-method","title":"The Execute method","text":"The execute
method in the Step
class is automatically decorated with the StepMetaClass._execute_wrapper
function due to the metaclass StepMetaClass
. This provides several advantages:
-
Automatic Output Validation: The decorator ensures that the output of the execute
method is always a StepOutput
instance. This means that the output is automatically validated against the defined output model, ensuring data integrity and consistency.
-
Logging: The decorator provides automatic logging at the start and end of the execute
method. This includes logging the input and output of the step, which can be useful for debugging and understanding the flow of data.
-
Error Handling: If an error occurs during the execution of the Step
, the decorator catches the exception and logs an error message before re-raising the exception. This provides a clear indication of where the error occurred.
-
Simplifies Step Implementation: Since the decorator handles output validation, logging, and error handling, the user can focus on implementing the logic of the execute
method without worrying about these aspects.
-
Consistency: By automatically decorating the execute
method, the library ensures that these features are consistently applied across all steps, regardless of who implements them or how they are used. This makes the behavior of steps predictable and consistent.
-
Prevents Double Wrapping: The decorator checks if the function is already wrapped with StepMetaClass._execute_wrapper
and prevents double wrapping. This ensures that the decorator doesn't interfere with itself if execute
is overridden in subclasses.
Notice that you never have to explicitly return anything from the execute
method. The StepMetaClass._execute_wrapper
decorator takes care of that for you.
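To see this in practice, here is a small self-contained sketch that mirrors the MyStep pattern shown earlier. The Multiply class and its fields are illustrative, and the nested Output follows the same convention as the SparkStep.Output annotation used later in these docs:
from koheesio import Step\n\nclass Multiply(Step):\n    a: int\n    b: int\n\n    class Output(Step.Output):  # assumes the base Step exposes its output model as Step.Output\n        product: int\n\n    def execute(self):\n        # no explicit return: the metaclass wrapper validates and exposes the Output for us\n        self.output.product = self.a * self.b\n\nstep = Multiply(a=6, b=7)\nstep.execute()\nprint(step.output.product)  # 42\n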
Below are implementation examples of a custom metaclass that can be used to override the default behavior of StepMetaClass._execute_wrapper
:
class MyMetaClass(StepMetaClass):\n    @classmethod\n    def _log_end_message(cls, step: Step, skip_logging: bool = False, *args, **kwargs):\n        print(\"It's me from custom meta class\")\n        super()._log_end_message(step, skip_logging, *args, **kwargs)\n\nclass MyMetaClass2(StepMetaClass):\n    @classmethod\n    def _validate_output(cls, step: Step, skip_validating: bool = False, *args, **kwargs):\n        # always add a dummy value to the output\n        step.output.dummy_value = \"dummy\"\n\nclass YourClassWithCustomMeta(Step, metaclass=MyMetaClass):\n    def execute(self):\n        self.log.info(f\"This is from the execute method of {self.__class__.__name__}\")\n\nclass YourClassWithCustomMeta2(Step, metaclass=MyMetaClass2):\n    def execute(self):\n        self.log.info(f\"This is from the execute method of {self.__class__.__name__}\")\n
"},{"location":"reference/concepts/step.html#sparkstep","title":"SparkStep","text":"The SparkStep
class is a subclass of Step
that is designed for steps that interact with Spark. It extends the Step
class with SparkSession support. Spark steps are expected to return a Spark DataFrame as output. The spark
property is available to access the active SparkSession instance. The Output of a SparkStep is expected to contain a DataFrame, although this is optional.
"},{"location":"reference/concepts/step.html#using-a-sparkstep","title":"Using a SparkStep","text":"Here's an example of how to use a SparkStep
:
class MySparkStep(SparkStep):\n input1: str = Field(...)\n\n class Output(StepOutput):\n output1: DataFrame = Field(...)\n\n def execute(self):\n # Your logic here\n df = self.spark.read.text(self.input1)\n self.output.output1 = df\n
To run a SparkStep
, you can do the following:
step = MySparkStep(input1=\"path/to/textfile\")\nstep.execute()\n
To access the output of a SparkStep
, you can do the following:
step = MySparkStep(input1=\"path/to/textfile\")\nstep.execute()\ndf = step.output.output1\ndf.show()\n
"},{"location":"reference/concepts/step.html#conclusion","title":"Conclusion","text":"In this document, we've covered the key features of the Step
class in the Koheesio framework, including its ability to define custom units of logic, manage inputs and outputs, and support for serialization. The automatic decoration of the execute
method provides several advantages that simplify step implementation and ensure consistency across all steps.
Whether you're defining a new operation in your data pipeline or managing the flow of data between steps, Step
provides a robust and efficient solution.
"},{"location":"reference/concepts/step.html#further-reading","title":"Further Reading","text":"For more information, you can refer to the following resources:
- Python Pydantic Documentation
- Python YAML Documentation
Refer to the API documentation for more details on the Step
class and its methods.
"},{"location":"reference/spark/readers.html","title":"Reader Module","text":"The Reader
module in Koheesio provides a set of classes for reading data from various sources. A Reader
is a type of SparkStep
that reads data from a source based on the input parameters and stores the result in self.output.df
for subsequent steps.
"},{"location":"reference/spark/readers.html#what-is-a-reader","title":"What is a Reader?","text":"A Reader
is a subclass of SparkStep
that reads data from a source and stores the result. The source could be a file, a database, a web API, or any other data source. The result is stored in a DataFrame, which is accessible through the df
property of the Reader
.
"},{"location":"reference/spark/readers.html#api-reference","title":"API Reference","text":"See API Reference for a detailed description of the Reader
class and its methods.
"},{"location":"reference/spark/readers.html#key-features-of-a-reader","title":"Key Features of a Reader","text":" - Read Method: The
Reader
class provides a read
method that calls the execute
method and returns the result. Essentially, calling .read()
is a shorthand for calling .execute().output.df
. This allows you to read data from a Reader
without having to call the execute
method directly. This is a convenience method that simplifies the usage of a Reader
.
Here's an example of how to use the .read()
method:
# Instantiate the Reader\nmy_reader = MyReader()\n\n# Use the .read() method to get the data as a DataFrame\ndf = my_reader.read()\n\n# Now df is a DataFrame with the data read by MyReader\n
In this example, MyReader
is a subclass of Reader
that you've defined. After creating an instance of MyReader
, you call the .read()
method to read the data and get it back as a DataFrame. The DataFrame df
now contains the data read by MyReader
.
- DataFrame Property: The
Reader
class provides a df
property as a shorthand for accessing self.output.df
. If self.output.df
is None
, the execute
method is run first. This property ensures that the data is loaded and ready to be used, even if the execute
method hasn't been explicitly called.
Here's an example of how to use the df
property:
# Instantiate the Reader\nmy_reader = MyReader()\n\n# Use the df property to get the data as a DataFrame\ndf = my_reader.df\n\n# Now df is a DataFrame with the data read by MyReader\n
In this example, MyReader
is a subclass of Reader
that you've defined. After creating an instance of MyReader
, you access the df
property to get the data as a DataFrame. The DataFrame df
now contains the data read by MyReader
.
- SparkSession: Every
Reader
has a SparkSession
available as self.spark
. This is the currently active SparkSession
, which can be used to perform distributed data processing tasks.
Here's an example of how to use the spark
property:
# Instantiate the Reader\nmy_reader = MyReader()\n\n# Use the spark property to get the SparkSession\nspark = my_reader.spark\n\n# Now spark is the SparkSession associated with MyReader\n
In this example, MyReader
is a subclass of Reader
that you've defined. After creating an instance of MyReader
, you access the spark
property to get the SparkSession
. The SparkSession
spark
can now be used to perform distributed data processing tasks.
"},{"location":"reference/spark/readers.html#how-to-define-a-reader","title":"How to Define a Reader?","text":"To define a Reader
, you create a subclass of the Reader
class and implement the execute
method. The execute
method should read from the source and store the result in self.output.df
. This is an abstract method, which means it must be implemented in any subclass of Reader
.
Here's an example of a Reader
:
class MyReader(Reader):\n def execute(self):\n # read data from source\n data = read_from_source()\n # store result in self.output.df\n self.output.df = data\n
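In practice, read_from_source would be replaced by real logic. As a minimal, hedged sketch (the CsvFileReader name and its path field are illustrative; the import follows the koheesio.steps.readers module referenced elsewhere in these docs), a simple file-based reader could look like this:
from koheesio.steps.readers import Reader\n\nclass CsvFileReader(Reader):\n    # illustrative input field; declared like any other Step input\n    path: str\n\n    def execute(self):\n        # use the active SparkSession to read the file and store the result\n        self.output.df = self.spark.read.csv(self.path, header=True, inferSchema=True)\n\ndf = CsvFileReader(path=\"data/input.csv\").read()\n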
"},{"location":"reference/spark/readers.html#understanding-inheritance-in-readers","title":"Understanding Inheritance in Readers","text":"Just like a Step
, a Reader
is defined as a subclass that inherits from the base Reader
class. This means it inherits all the properties and methods from the Reader
class and can add or override them as needed. The main method that needs to be overridden is the execute
method, which should implement the logic for reading data from the source and storing it in self.output.df
.
"},{"location":"reference/spark/readers.html#benefits-of-using-readers-in-data-pipelines","title":"Benefits of Using Readers in Data Pipelines","text":"Using Reader
classes in your data pipelines has several benefits:
-
Simplicity: Readers abstract away the details of reading data from various sources, allowing you to focus on the logic of your pipeline.
-
Consistency: By using Readers, you ensure that data is read in a consistent manner across different parts of your pipeline.
-
Flexibility: Readers can be easily swapped out for different data sources without changing the rest of your pipeline.
-
Efficiency: Readers automatically manage resources like connections and file handles, ensuring efficient use of resources.
By using the concept of a Reader
, you can create data pipelines that are simple, consistent, flexible, and efficient.
"},{"location":"reference/spark/readers.html#examples-of-reader-classes-in-koheesio","title":"Examples of Reader Classes in Koheesio","text":"Koheesio provides a variety of Reader
subclasses for reading data from different sources. Here are just a few examples:
-
Teradata Reader: A Reader
subclass for reading data from Teradata databases. It's defined in the koheesio/steps/readers/teradata.py
file.
-
Snowflake Reader: A Reader
subclass for reading data from Snowflake databases. It's defined in the koheesio/steps/readers/snowflake.py
file.
-
Box Reader: A Reader
subclass for reading data from Box. It's defined in the koheesio/steps/integrations/box.py
file.
These are just a few examples of the many Reader
subclasses available in Koheesio. Each Reader
subclass is designed to read data from a specific source. They all inherit from the base Reader
class and implement the execute
method to read data from their respective sources and store it in self.output.df
.
Please note that this is not an exhaustive list. Koheesio provides many more Reader
subclasses for a wide range of data sources. For a complete list, please refer to the Koheesio documentation or the source code.
More readers can be found in the koheesio/steps/readers
module.
"},{"location":"reference/spark/transformations.html","title":"Transformation Module","text":"The Transformation
module in Koheesio provides a set of classes for transforming data within a DataFrame. A Transformation
is a type of SparkStep
that takes a DataFrame as input, applies a transformation, and returns a DataFrame as output. The transformation logic is implemented in the execute
method of each Transformation
subclass.
"},{"location":"reference/spark/transformations.html#what-is-a-transformation","title":"What is a Transformation?","text":"A Transformation
is a subclass of SparkStep
that applies a transformation to a DataFrame and stores the result. The transformation could be any operation that modifies the data or structure of the DataFrame, such as adding a new column, filtering rows, or aggregating data.
Using Transformation
classes ensures that data is transformed in a consistent manner across different parts of your pipeline. This can help avoid errors and make your code easier to understand and maintain.
"},{"location":"reference/spark/transformations.html#api-reference","title":"API Reference","text":"See API Reference for a detailed description of the Transformation
classes and their methods.
"},{"location":"reference/spark/transformations.html#types-of-transformations","title":"Types of Transformations","text":"There are three main types of transformations in Koheesio:
-
Transformation
: This is the base class for all transformations. It takes a DataFrame as input and returns a DataFrame as output. The transformation logic is implemented in the execute
method.
-
ColumnsTransformation
: This is an extended Transformation
class with a preset validator for handling column(s) data. It standardizes the input for a single column or multiple columns. If more than one column is passed, the transformation will be run in a loop against all the given columns.
-
ColumnsTransformationWithTarget
: This is an extended ColumnsTransformation
class with an additional target_column
field. This field can be used to store the result of the transformation in a new column. If the target_column
is not provided, the result will be stored in the source column.
Each type of transformation has its own use cases and advantages. The right one to use depends on the specific requirements of your data pipeline.
"},{"location":"reference/spark/transformations.html#how-to-define-a-transformation","title":"How to Define a Transformation","text":"To define a Transformation
, you create a subclass of the Transformation
class and implement the execute
method. The execute
method should take a DataFrame from self.input.df
, apply a transformation, and store the result in self.output.df
.
Transformation
classes abstract away some of the details of transforming data, allowing you to focus on the logic of your pipeline. This can make your code cleaner and easier to read.
Here's an example of a Transformation
:
class MyTransformation(Transformation):\n def execute(self):\n # get data from self.input.df\n data = self.input.df\n # apply transformation\n transformed_data = apply_transformation(data)\n # store result in self.output.df\n self.output.df = transformed_data\n
In this example, MyTransformation
is a subclass of Transformation
that you've defined. The execute
method gets the data from self.input.df
, applies a transformation called apply_transformation
(undefined in this example), and stores the result in self.output.df
.
"},{"location":"reference/spark/transformations.html#how-to-define-a-columnstransformation","title":"How to Define a ColumnsTransformation","text":"To define a ColumnsTransformation
, you create a subclass of the ColumnsTransformation
class and implement the execute
method. The execute
method should apply a transformation to the specified columns of the DataFrame.
ColumnsTransformation
classes can be easily swapped out for different data transformations without changing the rest of your pipeline. This can make your pipeline more flexible and easier to modify or extend.
Here's an example of a ColumnsTransformation
:
from pyspark.sql import functions as f\nfrom koheesio.steps.transformations import ColumnsTransformation\n\nclass AddOne(ColumnsTransformation):\n def execute(self):\n for column in self.get_columns():\n self.output.df = self.df.withColumn(column, f.col(column) + 1)\n
In this example, AddOne
is a subclass of ColumnsTransformation
that you've defined. The execute
method adds 1 to each column in self.get_columns()
.
The ColumnsTransformation
class has a ColumnConfig
class that can be used to configure the behavior of the class. This class has the following fields:
run_for_all_data_type
: Runs the transformation for all columns of a given type. limit_data_type
: Limits the transformation to a specific data type. data_type_strict_mode
: Toggles strict mode for data type validation. Will only work if limit_data_type
is set.
Note that data types need to be specified as a SparkDatatype
enum. Users should not have to interact with the ColumnConfig
class directly.
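For authors of a transformation subclass, a hedged sketch of how these fields might be set is shown below. It assumes ColumnConfig can be overridden as a nested class and that the SparkDatatype enum is importable; the exact import path may differ between Koheesio versions:
from pyspark.sql import functions as f\nfrom koheesio.steps.transformations import ColumnsTransformation\nfrom koheesio.spark.utils import SparkDatatype  # assumed import path; may differ per version\n\nclass AddOneToIntegers(ColumnsTransformation):\n    class ColumnConfig(ColumnsTransformation.ColumnConfig):\n        # run against all integer columns and reject other data types\n        run_for_all_data_type = [SparkDatatype.INTEGER]\n        limit_data_type = [SparkDatatype.INTEGER]\n        data_type_strict_mode = True\n\n    def execute(self):\n        for column in self.get_columns():\n            self.output.df = self.df.withColumn(column, f.col(column) + 1)\n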
"},{"location":"reference/spark/transformations.html#how-to-define-a-columnstransformationwithtarget","title":"How to Define a ColumnsTransformationWithTarget","text":"To define a ColumnsTransformationWithTarget
, you create a subclass of the ColumnsTransformationWithTarget
class and implement the func
method. The func
method should return the transformation that will be applied to the column(s). The execute
method, which is already preset, will use the get_columns_with_target
method to loop over all the columns and apply this function to transform the DataFrame.
Here's an example of a ColumnsTransformationWithTarget
:
from pyspark.sql import Column\nfrom koheesio.steps.transformations import ColumnsTransformationWithTarget\n\nclass AddOneWithTarget(ColumnsTransformationWithTarget):\n def func(self, col: Column):\n return col + 1\n
In this example, AddOneWithTarget
is a subclass of ColumnsTransformationWithTarget
that you've defined. The func
method adds 1 to the values of a given column.
The ColumnsTransformationWithTarget
class has an additional target_column
field. This field can be used to store the result of the transformation in a new column. If the target_column
is not provided, the result will be stored in the source column. If more than one column is passed, the target_column
will be used as a suffix. Leaving this blank will result in the original columns being renamed.
The ColumnsTransformationWithTarget
class also has a get_columns_with_target
method. This method returns an iterator of the columns and handles the target_column
as well.
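Building on the AddOneWithTarget example above, a usage sketch could look like the following. The df, column, and target_column parameter names follow the patterns used in the other examples on this page, but exact aliases may vary:
from pyspark.sql import Column, SparkSession\nfrom koheesio.steps.transformations import ColumnsTransformationWithTarget\n\nclass AddOneWithTarget(ColumnsTransformationWithTarget):\n    def func(self, col: Column):\n        return col + 1\n\nspark = SparkSession.builder.getOrCreate()\ndf = spark.createDataFrame([(18,), (21,)], [\"age\"])\n\n# add 1 to the age column and store the result in a new column named age_plus_one\noutput_df = AddOneWithTarget(df=df, column=\"age\", target_column=\"age_plus_one\").execute().df\noutput_df.show()\n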
"},{"location":"reference/spark/transformations.html#key-features-of-a-transformation","title":"Key Features of a Transformation","text":" -
Execute Method: The Transformation
class provides an execute
method to implement in your subclass. This method should take a DataFrame from self.input.df
, apply a transformation, and store the result in self.output.df
.
For ColumnsTransformation
and ColumnsTransformationWithTarget
, the execute
method is already implemented in the base class. Instead of overriding execute
, you implement a func
method in your subclass. This func
method should return the transformation to be applied to each column. The execute
method will then apply this func to each column in a loop.
-
DataFrame Property: The Transformation
class provides a df
property as a shorthand for accessing self.input.df
. This property ensures that the data is ready to be transformed, even if the execute
method hasn't been explicitly called. This is useful for 'early validation' of the input data.
-
SparkSession: Every Transformation
has a SparkSession
available as self.spark
. This is the currently active SparkSession
, which can be used to perform distributed data processing tasks.
-
Columns Property: The ColumnsTransformation
and ColumnsTransformationWithTarget
classes provide a columns
property. This property standardizes the input for a single column or multiple columns. If more than one column is passed, the transformation will be run in a loop against all the given columns.
-
Target Column Property: The ColumnsTransformationWithTarget
class provides a target_column
property. This field can be used to store the result of the transformation in a new column. If the target_column
is not provided, the result will be stored in the source column.
"},{"location":"reference/spark/transformations.html#examples-of-transformation-classes-in-koheesio","title":"Examples of Transformation Classes in Koheesio","text":"Koheesio provides a variety of Transformation
subclasses for transforming data in different ways. Here are some examples:
-
DataframeLookup
: This transformation joins two dataframes together based on a list of join mappings. It allows you to specify the join type and join hint, and it supports selecting specific target columns from the right dataframe.
Here's an example of how to use the DataframeLookup
transformation:
from pyspark.sql import SparkSession\nfrom koheesio.steps.transformations import DataframeLookup, JoinMapping, TargetColumn, JoinType\n\nspark = SparkSession.builder.getOrCreate()\nleft_df = spark.createDataFrame([(1, \"A\"), (2, \"B\")], [\"id\", \"value\"])\nright_df = spark.createDataFrame([(1, \"A\"), (3, \"C\")], [\"id\", \"value\"])\n\nlookup = DataframeLookup(\n df=left_df,\n other=right_df,\n on=JoinMapping(source_column=\"id\", other_column=\"id\"),\n targets=TargetColumn(target_column=\"value\", target_column_alias=\"right_value\"),\n how=JoinType.LEFT,\n)\n\noutput_df = lookup.execute().df\n
-
HashUUID5
: This transformation is a subclass of Transformation
and provides an interface to generate a UUID5 hash for each row in the DataFrame. The hash is generated based on the values of the specified source columns.
Here's an example of how to use the HashUUID5
transformation:
from pyspark.sql import SparkSession\nfrom koheesio.steps.transformations import HashUUID5\n\nspark = SparkSession.builder.getOrCreate()\ndf = spark.createDataFrame([(1, \"A\"), (2, \"B\")], [\"id\", \"value\"])\n\nhash_transform = HashUUID5(\n df=df,\n source_columns=[\"id\", \"value\"],\n target_column=\"hash\"\n)\n\noutput_df = hash_transform.execute().df\n
In this example, HashUUID5
is a subclass of Transformation
. After creating an instance of HashUUID5
, you call the execute
method to apply the transformation. The execute
method generates a UUID5 hash for each row in the DataFrame based on the values of the id
and value
columns and stores the result in a new column named hash
.
"},{"location":"reference/spark/transformations.html#benefits-of-using-koheesio-transformations","title":"Benefits of using Koheesio Transformations","text":"Using a Koheesio Transformation
over plain Spark provides several benefits:
-
Consistency: By using Transformation
classes, you ensure that data is transformed in a consistent manner across different parts of your pipeline. This can help avoid errors and make your code easier to understand and maintain.
-
Abstraction: Transformation
classes abstract away the details of transforming data, allowing you to focus on the logic of your pipeline. This can make your code cleaner and easier to read.
-
Flexibility: Transformation
classes can be easily swapped out for different data transformations without changing the rest of your pipeline. This can make your pipeline more flexible and easier to modify or extend.
-
Early Input Validation: As a Transformation
is a type of SparkStep
, which in turn is a Step
and a type of Pydantic BaseModel
, all inputs are validated when an instance of a Transformation
class is created. This early validation helps catch errors related to invalid input, such as an invalid column name, before the PySpark pipeline starts executing. This can help avoid unnecessary computation and make your data pipelines more robust and reliable.
-
Ease of Testing: Transformation
classes are designed to be easily testable. This can make it easier to write unit tests for your data pipeline, helping to ensure its correctness and reliability.
-
Robustness: Koheesio has been extensively tested with hundreds of unit tests, ensuring that the Transformation
classes work as expected under a wide range of conditions. This makes your data pipelines more robust and less likely to fail due to unexpected inputs or edge cases.
By using the concept of a Transformation
, you can create data pipelines that are simple, consistent, flexible, efficient, and reliable.
"},{"location":"reference/spark/transformations.html#advanced-usage-of-transformations","title":"Advanced Usage of Transformations","text":"Transformations can be combined and chained together to create complex data processing pipelines. Here's an example of how to chain transformations:
from pyspark.sql import SparkSession\nfrom koheesio.steps.transformations import HashUUID5\nfrom koheesio.steps.transformations import DataframeLookup, JoinMapping, TargetColumn, JoinType\n\n# Create a SparkSession\nspark = SparkSession.builder.getOrCreate()\n\n# Define two DataFrames\ndf1 = spark.createDataFrame([(1, \"A\"), (2, \"B\")], [\"id\", \"value\"])\ndf2 = spark.createDataFrame([(1, \"C\"), (3, \"D\")], [\"id\", \"value\"])\n\n# Define the first transformation\nlookup = DataframeLookup(\n other=df2,\n on=JoinMapping(source_column=\"id\", other_column=\"id\"),\n targets=TargetColumn(target_column=\"value\", target_column_alias=\"right_value\"),\n how=JoinType.LEFT,\n)\n\n# Apply the first transformation\noutput_df = lookup.transform(df1)\n\n# Define the second transformation\nhash_transform = HashUUID5(\n source_columns=[\"id\", \"value\", \"right_value\"],\n target_column=\"hash\"\n)\n\n# Apply the second transformation\noutput_df2 = hash_transform.transform(output_df)\n
In this example, DataframeLookup
is a subclass of ColumnsTransformation
and HashUUID5
is a subclass of Transformation
. After creating instances of DataframeLookup
and HashUUID5
, you call the transform
method to apply each transformation. The transform
method of DataframeLookup
performs a left join with df2
on the id
column and adds the value
column from df2
to the result DataFrame as right_value
. The transform
method of HashUUID5
generates a UUID5 hash for each row in the DataFrame based on the values of the id
, value
, and right_value
columns and stores the result in a new column named hash
.
"},{"location":"reference/spark/transformations.html#troubleshooting-transformations","title":"Troubleshooting Transformations","text":"If you encounter an error when using a transformation, here are some steps you can take to troubleshoot:
-
Check the Input Data: Make sure the input DataFrame to the transformation is correct. You can use the show
method of the DataFrame to print the first few rows of the DataFrame.
-
Check the Transformation Parameters: Make sure the parameters passed to the transformation are correct. For example, if you're using a DataframeLookup
, make sure the join mappings and target columns are correctly specified.
-
Check the Transformation Logic: If the input data and parameters are correct, there might be an issue with the transformation logic. You can use PySpark's logging utilities to log intermediate results and debug the transformation logic.
-
Check the Output Data: If the transformation executes without errors but the output data is not as expected, you can use the show
method of the DataFrame to print the first few rows of the output DataFrame. This can help you identify any issues with the transformation logic.
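For example, the checks above can often be as simple as a few lines of standard PySpark around the transformation, where input_df and my_transformation are placeholders for your own objects:
# inspect the input before running the transformation\ninput_df.show(5, truncate=False)\ninput_df.printSchema()\n\n# run the transformation and inspect the output\noutput_df = my_transformation.transform(input_df)\noutput_df.show(5, truncate=False)\n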
"},{"location":"reference/spark/transformations.html#conclusion","title":"Conclusion","text":"The Transformation
module in Koheesio provides a powerful and flexible way to transform data in a DataFrame. By using Transformation
classes, you can create data pipelines that are simple, consistent, flexible, efficient, and reliable. Whether you're performing simple transformations like adding a new column, or complex transformations like joining multiple DataFrames, the Transformation
module has you covered.
"},{"location":"reference/spark/writers.html","title":"Writer Module","text":"The Writer
module in Koheesio provides a set of classes for writing data to various destinations. A Writer
is a type of SparkStep
that takes data from self.input.df
and writes it to a destination based on the output parameters.
"},{"location":"reference/spark/writers.html#what-is-a-writer","title":"What is a Writer?","text":"A Writer
is a subclass of SparkStep
that writes data to a destination. The data to be written is taken from a DataFrame, which is accessible through the df
property of the Writer
.
"},{"location":"reference/spark/writers.html#how-to-define-a-writer","title":"How to Define a Writer?","text":"To define a Writer
, you create a subclass of the Writer
class and implement the execute
method. The execute
method should take data from self.input.df
and write it to the destination.
Here's an example of a Writer
:
class MyWriter(Writer):\n def execute(self):\n # get data from self.input.df\n data = self.input.df\n # write data to destination\n write_to_destination(data)\n
"},{"location":"reference/spark/writers.html#key-features-of-a-writer","title":"Key Features of a Writer","text":" -
Write Method: The Writer
class provides a write
method that calls the execute
method and writes the data to the destination. Essentially, calling .write()
is a shorthand for calling .execute()
, which performs the actual write. This allows you to use a Writer
without having to call the execute
method directly. This is a convenience method that simplifies the usage of a Writer
.
Here's an example of how to use the .write()
method:
# Instantiate the Writer\nmy_writer = MyWriter()\n\n# Use the .write() method to write the data\nmy_writer.write()\n\n# The data from MyWriter's DataFrame is now written to the destination\n
In this example, MyWriter
is a subclass of Writer
that you've defined. After creating an instance of MyWriter
, you call the .write()
method to write the data to the destination. The data from MyWriter
's DataFrame is now written to the destination.
-
DataFrame Property: The Writer
class provides a df
property as a shorthand for accessing self.input.df
. This property ensures that the data is ready to be written, even if the execute
method hasn't been explicitly called.
Here's an example of how to use the df
property:
# Instantiate the Writer\nmy_writer = MyWriter()\n\n# Use the df property to get the data as a DataFrame\ndf = my_writer.df\n\n# Now df is a DataFrame with the data that will be written by MyWriter\n
In this example, MyWriter
is a subclass of Writer
that you've defined. After creating an instance of MyWriter
, you access the df
property to get the data as a DataFrame. The DataFrame df
now contains the data that will be written by MyWriter
.
-
SparkSession: Every Writer
has a SparkSession
available as self.spark
. This is the currently active SparkSession
, which can be used to perform distributed data processing tasks.
Here's an example of how to use the spark
property:
# Instantiate the Writer\nmy_writer = MyWriter()\n\n# Use the spark property to get the SparkSession\nspark = my_writer.spark\n\n# Now spark is the SparkSession associated with MyWriter\n
In this example, MyWriter
is a subclass of Writer
that you've defined. After creating an instance of MyWriter
, you access the spark
property to get the SparkSession
. The SparkSession
spark
can now be used to perform distributed data processing tasks.
"},{"location":"reference/spark/writers.html#understanding-inheritance-in-writers","title":"Understanding Inheritance in Writers","text":"Just like a Step
, a Writer
is defined as a subclass that inherits from the base Writer
class. This means it inherits all the properties and methods from the Writer
class and can add or override them as needed. The main method that needs to be overridden is the execute
method, which should implement the logic for writing data from self.input.df
to the destination.
"},{"location":"reference/spark/writers.html#examples-of-writer-classes-in-koheesio","title":"Examples of Writer Classes in Koheesio","text":"Koheesio provides a variety of Writer
subclasses for writing data to different destinations. Here are just a few examples:
BoxFileWriter
DeltaTableStreamWriter
DeltaTableWriter
DummyWriter
ForEachBatchStreamWriter
KafkaWriter
SnowflakeWriter
StreamWriter
Please note that this is not an exhaustive list. Koheesio provides many more Writer
subclasses for a wide range of data destinations. For a complete list, please refer to the Koheesio documentation or the source code.
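As an illustration only (not one of the built-in writers listed above), a minimal Parquet writer could be built on the base Writer contract; the ParquetFileWriter name and its path and mode fields are assumptions for this sketch, while the import path matches the testing examples elsewhere in these docs:
from koheesio.steps.writers import Writer\n\nclass ParquetFileWriter(Writer):\n    # illustrative fields describing where and how to write\n    path: str\n    mode: str = \"overwrite\"\n\n    def execute(self):\n        # self.df is the documented shorthand for self.input.df\n        self.df.write.mode(self.mode).parquet(self.path)\n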
"},{"location":"reference/spark/writers.html#benefits-of-using-writers-in-data-pipelines","title":"Benefits of Using Writers in Data Pipelines","text":"Using Writer
classes in your data pipelines has several benefits:
- Simplicity: Writers abstract away the details of writing data to various destinations, allowing you to focus on the logic of your pipeline.
- Consistency: By using Writers, you ensure that data is written in a consistent manner across different parts of your pipeline.
- Flexibility: Writers can be easily swapped out for different data destinations without changing the rest of your pipeline.
- Efficiency: Writers automatically manage resources like connections and file handles, ensuring efficient use of resources.
- Early Input Validation: As a
Writer
is a type of SparkStep
, which in turn is a Step
and a type of Pydantic BaseModel
, all inputs are validated when an instance of a Writer
class is created. This early validation helps catch errors related to invalid input, such as an invalid URL for a database, before the PySpark pipeline starts executing. This can help avoid unnecessary computation and make your data pipelines more robust and reliable.
By using the concept of a Writer
, you can create data pipelines that are simple, consistent, flexible, efficient, and reliable.
"},{"location":"tutorials/advanced-data-processing.html","title":"Advanced Data Processing with Koheesio","text":"In this guide, we will explore some advanced data processing techniques using Koheesio. We will cover topics such as complex transformations, handling large datasets, and optimizing performance.
"},{"location":"tutorials/advanced-data-processing.html#complex-transformations","title":"Complex Transformations","text":"Koheesio provides a variety of built-in transformations, but sometimes you may need to perform more complex operations on your data. In such cases, you can create custom transformations.
Here's an example of a custom transformation that normalizes a column in a DataFrame:
from pyspark.sql import DataFrame\nfrom koheesio.spark.transformations.transform import Transform\n\ndef normalize_column(df: DataFrame, column: str) -> DataFrame:\n max_value = df.agg({column: \"max\"}).collect()[0][0]\n min_value = df.agg({column: \"min\"}).collect()[0][0]\n return df.withColumn(column, (df[column] - min_value) / (max_value - min_value))\n\n\nclass NormalizeColumnTransform(Transform):\n column: str\n\n def transform(self, df: DataFrame) -> DataFrame:\n return normalize_column(df, self.column)\n
"},{"location":"tutorials/advanced-data-processing.html#handling-large-datasets","title":"Handling Large Datasets","text":"When working with large datasets, it's important to manage resources effectively to ensure good performance. Koheesio provides several features to help with this.
"},{"location":"tutorials/advanced-data-processing.html#partitioning","title":"Partitioning","text":"Partitioning is a technique that divides your data into smaller, more manageable pieces, called partitions. Koheesio allows you to specify the partitioning scheme for your data when writing it to a target.
from koheesio.steps.writers.delta import DeltaTableWriter\nfrom koheesio.tasks.etl_task import EtlTask\n\nclass MyTask(EtlTask):\n target = DeltaTableWriter(table=\"my_table\", partitionBy=[\"column1\", \"column2\"])\n
"},{"location":"tutorials/getting-started.html","title":"Getting Started with Koheesio","text":""},{"location":"tutorials/getting-started.html#requirements","title":"Requirements","text":" - Python 3.9+
"},{"location":"tutorials/getting-started.html#installation","title":"Installation","text":""},{"location":"tutorials/getting-started.html#poetry","title":"Poetry","text":"If you're using Poetry, add the following entry to the pyproject.toml
file:
pyproject.toml[[tool.poetry.source]]\nname = \"nike\"\nurl = \"https://artifactory.nike.com/artifactory/api/pypi/python-virtual/simple\"\nsecondary = true\n
poetry add koheesio\n
"},{"location":"tutorials/getting-started.html#pip","title":"pip","text":"If you're using pip, run the following command to install Koheesio:
Requires pip.
pip install koheesio\n
"},{"location":"tutorials/getting-started.html#basic-usage","title":"Basic Usage","text":"Once you've installed Koheesio, you can start using it in your Python scripts. Here's a basic example:
from koheesio import Step\n\n# Define a step\nclass MyStep(Step):\n    def execute(self):\n        # Your step logic here\n        ...\n\n# Create an instance of the step\nstep = MyStep()\n\n# Run the step\nstep.execute()\n
"},{"location":"tutorials/getting-started.html#advanced-usage","title":"Advanced Usage","text":"from pyspark.sql.functions import lit\nfrom pyspark.sql import DataFrame, SparkSession\n\n# Step 1: import Koheesio dependencies\nfrom koheesio.context import Context\nfrom koheesio.steps.readers.dummy import DummyReader\nfrom koheesio.steps.transformations.camel_to_snake import CamelToSnakeTransformation\nfrom koheesio.steps.writers.dummy import DummyWriter\nfrom koheesio.tasks.etl_task import EtlTask\n\n# Step 2: Set up a SparkSession\nspark = SparkSession.builder.getOrCreate()\n\n# Step 3: Configure your Context\ncontext = Context({\n \"source\": DummyReader(),\n \"transformations\": [CamelToSnakeTransformation()],\n \"target\": DummyWriter(),\n \"my_favorite_movie\": \"inception\",\n})\n\n# Step 4: Create a Task\nclass MyFavoriteMovieTask(EtlTask):\n my_favorite_movie: str\n\n def transform(self, df: DataFrame = None) -> DataFrame:\n df = df.withColumn(\"MyFavoriteMovie\", lit(self.my_favorite_movie))\n return super().transform(df)\n\n# Step 5: Run your Task\ntask = MyFavoriteMovieTask(**context)\ntask.run()\n
"},{"location":"tutorials/getting-started.html#contributing","title":"Contributing","text":"If you want to contribute to Koheesio, check out the CONTRIBUTING.md
file in this repository. It contains guidelines for contributing, including how to submit issues and pull requests.
"},{"location":"tutorials/getting-started.html#testing","title":"Testing","text":"To run the tests for Koheesio, use the following command:
make dev-test\n
This will run all the tests in the tests
directory.
"},{"location":"tutorials/hello-world.html","title":"Simple Examples","text":""},{"location":"tutorials/hello-world.html#creating-a-custom-step","title":"Creating a Custom Step","text":"This example demonstrates how to use the SparkStep
class from the koheesio
library to create a custom step named HelloWorldStep
.
"},{"location":"tutorials/hello-world.html#code","title":"Code","text":"from koheesio.steps.step import SparkStep\n\nclass HelloWorldStep(SparkStep):\n message: str\n\n def execute(self) -> SparkStep.Output:\n # create a DataFrame with a single row containing the message\n self.output.df = self.spark.createDataFrame([(1, self.message)], [\"id\", \"message\"])\n
"},{"location":"tutorials/hello-world.html#usage","title":"Usage","text":"hello_world_step = HelloWorldStep(message=\"Hello, World!\")\nhello_world_step.execute()\n\nhello_world_step.output.df.show()\n
"},{"location":"tutorials/hello-world.html#understanding-the-code","title":"Understanding the Code","text":"The HelloWorldStep
class is a SparkStep
in Koheesio, designed to generate a DataFrame with a single row containing a custom message. Here's a more detailed overview:
HelloWorldStep
inherits from SparkStep
, a fundamental building block in Koheesio for creating data processing steps with Apache Spark. - It has a
message
attribute. When creating an instance of HelloWorldStep
, you can pass a custom message that will be used in the DataFrame. SparkStep
has a spark
attribute, which is the active SparkSession. This is the entry point for any Spark functionality, allowing the step to interact with the Spark cluster. SparkStep
also includes an Output
class, used to store the output of the step. In this case, Output
has a df
attribute to store the output DataFrame. - The
execute
method creates a DataFrame with the custom message and stores it in output.df
. It doesn't return a value explicitly; instead, the output DataFrame can be accessed via output.df
. - Koheesio uses pydantic for automatic validation of the step's input and output, ensuring they are correctly defined and of the correct types.
Note: Pydantic is a data validation library that provides a way to validate that the data (in this case, the input and output of the step) conforms to the expected format.
"},{"location":"tutorials/hello-world.html#creating-a-custom-task","title":"Creating a Custom Task","text":"This example demonstrates how to use the EtlTask
from the koheesio
library to create a custom task named MyFavoriteMovieTask
.
"},{"location":"tutorials/hello-world.html#code_1","title":"Code","text":"from typing import Any\nfrom pyspark.sql import DataFrame, functions as f\nfrom koheesio.steps.transformations import Transform\nfrom koheesio.tasks.etl_task import EtlTask\n\n\ndef add_column(df: DataFrame, target_column: str, value: Any):\n return df.withColumn(target_column, f.lit(value))\n\n\nclass MyFavoriteMovieTask(EtlTask):\n my_favorite_movie: str\n\n def transform(self, df: Optional[DataFrame] = None) -> DataFrame:\n df = df or self.extract()\n\n # pre-transformations specific to this class\n pre_transformations = [\n Transform(add_column, target_column=\"myFavoriteMovie\", value=self.my_favorite_movie)\n ]\n\n # execute transformations one by one\n for t in pre_transformations:\n df = t.transform(df)\n\n self.output.transform_df = df\n return df\n
"},{"location":"tutorials/hello-world.html#configuration","title":"Configuration","text":"Here is the sample.yaml
configuration file used in this example:
raw_layer:\n catalog: development\n schema: my_favorite_team\n table: some_random_table\nmovies:\n favorite: Office Space\nhash_settings:\n source_columns:\n - id\n - foo\n target_column: hash_uuid5\nsource:\n range: 4\n
"},{"location":"tutorials/hello-world.html#usage_1","title":"Usage","text":"from pyspark.sql import SparkSession\nfrom koheesio.context import Context\nfrom koheesio.steps.readers import DummyReader\nfrom koheesio.steps.writers.dummy import DummyWriter\n\ncontext = Context.from_yaml(\"sample.yaml\")\n\nSparkSession.builder.getOrCreate()\n\nmy_fav_mov_task = MyFavoriteMovieTask(\n source=DummyReader(**context.raw_layer),\n target=DummyWriter(truncate=False),\n my_favorite_movie=context.movies.favorite,\n)\nmy_fav_mov_task.execute()\n
"},{"location":"tutorials/hello-world.html#understanding-the-code_1","title":"Understanding the Code","text":"This example creates a MyFavoriteMovieTask
that adds a column named myFavoriteMovie
to the DataFrame. The value for this column is provided when the task is instantiated.
The MyFavoriteMovieTask
class is a custom task that extends the EtlTask
from the koheesio
library. It demonstrates how to add a custom transformation to a DataFrame. Here's a detailed breakdown:
-
MyFavoriteMovieTask
inherits from EtlTask
, a base class in Koheesio for creating Extract-Transform-Load (ETL) tasks with Apache Spark.
-
It has a my_favorite_movie
attribute. When creating an instance of MyFavoriteMovieTask
, you can pass a custom movie title that will be used in the DataFrame.
-
The transform
method is where the main logic of the task is implemented. It first extracts the data (if not already provided), then applies a series of transformations to the DataFrame.
-
In this case, the transformation is adding a new column to the DataFrame named myFavoriteMovie
, with the value set to the my_favorite_movie
attribute. This is done using the add_column
function and the Transform
class from Koheesio.
-
The transformed DataFrame is then stored in self.output.transform_df
.
-
The sample.yaml
configuration file is used to provide the context for the task, including the source data and the favorite movie title.
-
In the usage example, an instance of MyFavoriteMovieTask
is created with a DummyReader
as the source, a DummyWriter
as the target, and the favorite movie title from the context. The task is then executed, which runs the transformations and stores the result in self.output.transform_df
.
"},{"location":"tutorials/learn-koheesio.html","title":"Learn Koheesio","text":"Koheesio is designed to simplify the development of data engineering pipelines. It provides a structured way to define and execute data processing tasks, making it easier to build, test, and maintain complex data workflows.
"},{"location":"tutorials/learn-koheesio.html#core-concepts","title":"Core Concepts","text":"Koheesio is built around several core concepts:
- Step: The fundamental unit of work in Koheesio. It represents a single operation in a data pipeline, taking in inputs and producing outputs.
See the Step documentation for more information.
- Context: A configuration class used to set up the environment for a Task. It can be used to share variables across tasks and adapt the behavior of a Task based on its environment.
See the Context documentation for more information.
- Logger: A class for logging messages at different levels.
See the Logger documentation for more information.
The Logger and Context classes provide support, enabling detailed logging of the pipeline's execution and customization of the pipeline's behavior based on the environment, respectively.
"},{"location":"tutorials/learn-koheesio.html#implementations","title":"Implementations","text":"In the context of Koheesio, an implementation refers to a specific way of executing Steps, the fundamental units of work in Koheesio. Each implementation uses a different technology or approach to process data along with its own set of Steps, designed to work with the specific technology or approach used by the implementation.
For example, the Spark implementation includes Steps for reading data from a Spark DataFrame, transforming the data using Spark operations, and writing the data to a Spark-supported destination.
Currently, Koheesio supports two implementations: Spark, and AsyncIO.
"},{"location":"tutorials/learn-koheesio.html#spark","title":"Spark","text":"Requires: Apache Spark (pyspark) Installation: pip install koheesio[spark]
Module: koheesio.spark
This implementation uses Apache Spark, a powerful open-source unified analytics engine for large-scale data processing.
Steps that use this implementation can leverage Spark's capabilities for distributed data processing, making it suitable for handling large volumes of data. The Spark implementation includes the following types of Steps:
-
Reader: from koheesio.spark.readers import Reader
A type of Step that reads data from a source and stores the result (to make it available for subsequent steps). For more information, see the Reader documentation.
-
Writer: from koheesio.spark.writers import Writer
This controls how data is written to the output in both batch and streaming contexts. For more information, see the Writer documentation.
-
Transformation: from koheesio.spark.transformations import Transformation
A type of Step that takes a DataFrame as input and returns a DataFrame as output. For more information, see the Transformation documentation.
In any given pipeline, you can expect to use Readers, Writers, and Transformations to express the ETL logic. Readers are responsible for extracting data from various sources, such as databases, files, or APIs. Transformations then process this data, performing operations like filtering, aggregation, or conversion. Finally, Writers handle the loading of the transformed data to the desired destination, which could be a database, a file, or a data stream.
"},{"location":"tutorials/learn-koheesio.html#async","title":"Async","text":"Module: koheesio.asyncio
This implementation uses Python's asyncio library for writing single-threaded concurrent code using coroutines, multiplexing I/O access over sockets and other resources, running network clients and servers, and other related primitives. Steps that use this implementation can perform data processing tasks asynchronously, which can be beneficial for IO-bound tasks.
"},{"location":"tutorials/learn-koheesio.html#best-practices","title":"Best Practices","text":"Here are some best practices for using Koheesio:
-
Use Context: The Context
class in Koheesio is designed to behave like a dictionary, but with added features. It's a good practice to use Context
to customize the behavior of a task. This allows you to share variables across tasks and adapt the behavior of a task based on its environment; for example, by changing the source or target of the data between development and production environments.
-
Modular Design: Each step in the pipeline (reading, transformation, writing) should be encapsulated in its own class, making the code easier to understand and maintain. This also promotes re-usability as steps can be reused across different tasks.
-
Error Handling: Koheesio provides a consistent way to handle errors and exceptions in data processing tasks. Make sure to leverage this feature to make your pipelines robust and fault-tolerant.
-
Logging: Use the built-in logging feature in Koheesio to log information and errors in data processing tasks. This can be very helpful for debugging and monitoring the pipeline. Koheesio sets the log level to WARNING
by default, but you can change it to INFO
or DEBUG
as needed.
-
Testing: Each step can be tested independently, making it easier to write unit tests. It's a good practice to write tests for your steps to ensure they are working as expected.
-
Use Transformations: The Transform
class in Koheesio allows you to define transformations on your data. It's a good practice to encapsulate your transformation logic in Transform
classes for better readability and maintainability.
-
Consistent Structure: Koheesio enforces a consistent structure for data processing tasks. Stick to this structure to make your codebase easier to understand for new developers.
-
Use Readers and Writers: Use the built-in Reader
and Writer
classes in Koheesio to handle data extraction and loading. This not only simplifies your code but also makes it more robust and efficient.
Remember, these are general best practices and might need to be adapted based on your specific use case and requirements.
"},{"location":"tutorials/learn-koheesio.html#pydantic","title":"Pydantic","text":"Koheesio Steps are Pydantic models, which means they can be validated and serialized. This makes it easy to define the inputs and outputs of a Step, and to validate them before running the Step. Pydantic models also provide a consistent way to define the schema of the data that a Step expects and produces, making it easier to understand and maintain the code.
Learn more about Pydantic here.
"},{"location":"tutorials/onboarding.html","title":"Onboarding","text":"tags: - doctype/how-to
"},{"location":"tutorials/onboarding.html#onboarding-to-koheesio","title":"Onboarding to Koheesio","text":"Koheesio is a Python library that simplifies the development of data engineering pipelines. It provides a structured way to define and execute data processing tasks, making it easier to build, test, and maintain complex data workflows.
This guide will walk you through the process of transforming a traditional Spark application into a Koheesio pipeline along with explaining the advantages of using Koheesio over raw Spark.
"},{"location":"tutorials/onboarding.html#traditional-spark-application","title":"Traditional Spark Application","text":"First let's create a simple Spark application that you might use to process data.
The following Spark application reads a CSV file, performs a transformation, and writes the result to a Delta table. The transformation includes filtering data where age is greater than 18 and performing an aggregation to calculate the average salary per country. The result is then written to a Delta table partitioned by country.
from pyspark.sql import SparkSession\nfrom pyspark.sql.functions import col, avg\n\nspark = SparkSession.builder.getOrCreate()\n\n# Read data from CSV file\ndf = spark.read.csv(\"input.csv\", header=True, inferSchema=True)\n\n# Filter data where age is greater than 18\ndf = df.filter(col(\"age\") > 18)\n\n# Perform aggregation\ndf = df.groupBy(\"country\").agg(avg(\"salary\").alias(\"average_salary\"))\n\n# Write data to Delta table with partitioning\ndf.write.format(\"delta\").partitionBy(\"country\").save(\"/path/to/delta_table\")\n
"},{"location":"tutorials/onboarding.html#transforming-to-koheesio","title":"Transforming to Koheesio","text":"The same pipeline can be rewritten using Koheesio's EtlTask
. In this version, each step (reading, transformations, writing) is encapsulated in its own class, making the code easier to understand and maintain.
First, a CsvReader
is defined to read the input CSV file. Then, a DeltaTableWriter
is defined to write the result to a Delta table partitioned by country.
Two transformations are defined: 1. one to filter data where age is greater than 18 2. and, another to calculate the average salary per country.
These transformations are then passed to an EtlTask
along with the reader and writer. Finally, the EtlTask
is executed to run the pipeline.
from koheesio.spark.etl_task import EtlTask\nfrom koheesio.spark.readers.file_loader import CsvReader\nfrom koheesio.spark.writers.delta.batch import DeltaTableWriter\nfrom koheesio.spark.transformations.transform import Transform\nfrom pyspark.sql.functions import col, avg\n\n# Define reader\nreader = CsvReader(path=\"input.csv\", header=True, inferSchema=True)\n\n# Define writer\nwriter = DeltaTableWriter(table=\"delta_table\", partition_by=[\"country\"])\n\n# Define transformations\nage_transformation = Transform(\n func=lambda df: df.filter(col(\"age\") > 18)\n)\navg_salary_per_country = Transform(\n func=lambda df: df.groupBy(\"country\").agg(avg(\"salary\").alias(\"average_salary\"))\n)\n\n# Define and execute EtlTask\ntask = EtlTask(\n source=reader, \n target=writer, \n transformations=[\n age_transformation,\n avg_salary_per_country\n ]\n)\ntask.execute()\n
This approach with Koheesio provides several advantages. It makes the code more modular and easier to test. Each step can be tested independently and reused across different tasks. It also makes the pipeline more readable and easier to maintain."},{"location":"tutorials/onboarding.html#advantages-of-koheesio","title":"Advantages of Koheesio","text":"Using Koheesio instead of raw Spark has several advantages:
- Modularity: Each step in the pipeline (reading, transformation, writing) is encapsulated in its own class, making the code easier to understand and maintain.
- Reusability: Steps can be reused across different tasks, reducing code duplication.
- Testability: Each step can be tested independently, making it easier to write unit tests.
- Flexibility: The behavior of a task can be customized using a
Context
class. - Consistency: Koheesio enforces a consistent structure for data processing tasks, making it easier for new developers to understand the codebase.
- Error Handling: Koheesio provides a consistent way to handle errors and exceptions in data processing tasks.
- Logging: Koheesio provides a consistent way to log information and errors in data processing tasks.
In contrast, using the plain PySpark API for transformations can lead to more verbose and less structured code, which can be harder to understand, maintain, and test. It also doesn't provide the same level of error handling, logging, and flexibility as the Koheesio Transform class.
"},{"location":"tutorials/onboarding.html#using-a-context-class","title":"Using a Context Class","text":"Here's a simple example of how to use a Context
class to customize the behavior of a task. The Context class in Koheesio is designed to behave like a dictionary, but with added features.
from koheesio import Context\nfrom koheesio.spark.etl_task import EtlTask\nfrom koheesio.spark.readers.file_loader import CsvReader\nfrom koheesio.spark.writers.delta import DeltaTableWriter\nfrom koheesio.spark.transformations.transform import Transform\n\ncontext = Context({ # this could be stored in a JSON or YAML\n \"age_threshold\": 18,\n \"reader_options\": {\n \"path\": \"input.csv\",\n \"header\": True,\n \"inferSchema\": True\n },\n \"writer_options\": {\n \"table\": \"delta_table\",\n \"partition_by\": [\"country\"]\n }\n})\n\ntask = EtlTask(\n source = CsvReader(**context.reader_options),\n target = DeltaTableWriter(**context.writer_options),\n transformations = [\n Transform(func=lambda df: df.filter(df[\"age\"] > context.age_threshold))\n ]\n)\n\ntask.execute()\n
In this example, we're using CsvReader
to read the input data, DeltaTableWriter
to write the output data, and a Transform
step to filter the data based on the age threshold. The options for the reader and writer are stored in a Context
object, which can be easily updated or loaded from a JSON or YAML file.
"},{"location":"tutorials/testing-koheesio-steps.html","title":"Testing Koheesio Tasks","text":"Testing is a crucial part of any software development process. Koheesio provides a structured way to define and execute data processing tasks, which makes it easier to build, test, and maintain complex data workflows. This guide will walk you through the process of testing Koheesio tasks.
"},{"location":"tutorials/testing-koheesio-steps.html#unit-testing","title":"Unit Testing","text":"Unit testing involves testing individual components of the software in isolation. In the context of Koheesio, this means testing individual tasks or steps.
Here's an example of how to unit test a Koheesio task:
from koheesio.tasks.etl_task import EtlTask\nfrom koheesio.steps.readers import DummyReader\nfrom koheesio.steps.writers.dummy import DummyWriter\nfrom koheesio.steps.transformations import Transform\nfrom pyspark.sql import SparkSession, DataFrame\nfrom pyspark.sql.functions import col\n\n\ndef filter_age(df: DataFrame) -> DataFrame:\n return df.filter(col(\"Age\") > 18)\n\n\ndef test_etl_task():\n # Initialize SparkSession\n spark = SparkSession.builder.getOrCreate()\n\n # Create a DataFrame for testing\n data = [(\"John\", 19), (\"Anna\", 20), (\"Tom\", 18)]\n df = spark.createDataFrame(data, [\"Name\", \"Age\"])\n\n # Define the task\n task = EtlTask(\n source=DummyReader(df=df),\n target=DummyWriter(),\n transformations=[\n Transform(filter_age)\n ]\n )\n\n # Execute the task\n task.execute()\n\n # Assert the result\n result_df = task.output.df\n assert result_df.count() == 2\n assert result_df.filter(\"Name == 'Tom'\").count() == 0\n
In this example, we're testing an EtlTask that reads data from a DataFrame, applies a filter transformation, and writes the result to another DataFrame. The test asserts that the task correctly filters out rows where the age is less than or equal to 18.
"},{"location":"tutorials/testing-koheesio-steps.html#integration-testing","title":"Integration Testing","text":"Integration testing involves testing the interactions between different components of the software. In the context of Koheesio, this means testing the entirety of data flowing through one or more tasks.
We'll create a simple test for a hypothetical EtlTask that uses DeltaReader and DeltaWriter. We'll use pytest and unittest.mock to mock the responses of the reader and writer. First, let's assume that you have an EtlTask defined in a module named my_module. This task reads data from a Delta table, applies some transformations, and writes the result to another Delta table.
Here's an example of how to write an integration test for this task:
# my_module.py\nfrom koheesio.tasks.etl_task import EtlTask\nfrom koheesio.spark.readers.delta import DeltaReader\nfrom koheesio.steps.writers.delta import DeltaWriter\nfrom koheesio.steps.transformations import Transform\nfrom koheesio.context import Context\nfrom pyspark.sql.functions import col\n\n\ndef filter_age(df):\n return df.filter(col(\"Age\") > 18)\n\n\ncontext = Context({\n \"reader_options\": {\n \"table\": \"input_table\"\n },\n \"writer_options\": {\n \"table\": \"output_table\"\n }\n})\n\ntask = EtlTask(\n source=DeltaReader(**context.reader_options),\n target=DeltaWriter(**context.writer_options),\n transformations=[\n Transform(filter_age)\n ]\n)\n
Now, let's create a test for this task. We'll patch the read and write methods of the reader and writer so that no real Delta tables are needed, and we'll use pytest fixtures to provide a SparkSession and a test DataFrame.
# test_my_module.py\nimport pytest\nfrom unittest.mock import patch\nfrom pyspark.sql import SparkSession\nfrom koheesio.steps.readers import Reader\nfrom koheesio.steps.writers import Writer\n\nfrom my_module import task\n\n@pytest.fixture(scope=\"module\")\ndef spark():\n return SparkSession.builder.getOrCreate()\n\n@pytest.fixture(scope=\"module\")\ndef test_df(spark):\n data = [(\"John\", 19), (\"Anna\", 20), (\"Tom\", 18)]\n return spark.createDataFrame(data, [\"Name\", \"Age\"])\n\ndef test_etl_task(spark, test_df):\n # Mock the read method of the Reader class so no input Delta table is needed\n with patch.object(Reader, \"read\", return_value=test_df) as mock_read:\n # Mock the write method of the Writer class so nothing is written out\n with patch.object(Writer, \"write\") as mock_write:\n # Execute the task\n task.execute()\n\n # Assert the result\n result_df = task.output.df\n assert result_df.count() == 2\n assert result_df.filter(\"Name == 'Tom'\").count() == 0\n\n # Assert that the reader and writer were each invoked exactly once\n mock_read.assert_called_once()\n mock_write.assert_called_once()\n
In this test, we're patching the read method of the reader to return a test DataFrame and the write method of the writer so that no Delta tables are touched, and we check that each method is invoked exactly once. We're also asserting that the task correctly filters out rows where the age is less than or equal to 18.
"},{"location":"misc/tags.html","title":"{{ page.title }}","text":""},{"location":"misc/tags.html#doctypeexplanation","title":"doctype/explanation","text":" - Approach documentation
"},{"location":"misc/tags.html#doctypehow-to","title":"doctype/how-to","text":" - How to
"}]}
\ No newline at end of file