From 0dcfac777f632071df65687a2e7504bb50d21053 Mon Sep 17 00:00:00 2001 From: Danny Meijer <10511979+dannymeijer@users.noreply.github.com> Date: Thu, 30 May 2024 19:10:56 +0100 Subject: [PATCH] Deployed e6bc5b5 to dev with MkDocs 1.6.0 and mike 2.1.1 --- dev/index.html | 2 +- dev/search/search_index.json | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/dev/index.html b/dev/index.html index 80d5d06..fdc5c93 100644 --- a/dev/index.html +++ b/dev/index.html @@ -3744,7 +3744,7 @@
┌─────────┐ ┌──────────────────┐ ┌──────────┐
│ Input 1 │───────▶│ ├───────▶│ Output 1 │
-└─────────┘ │ │ └────√─────┘
+└─────────┘ │ │ └──────────┘
│ │
┌─────────┐ │ │ ┌──────────┐
│ Input 2 │───────▶│ Step │───────▶│ Output 2 │
diff --git a/dev/search/search_index.json b/dev/search/search_index.json
index c4cd013..007d764 100644
--- a/dev/search/search_index.json
+++ b/dev/search/search_index.json
@@ -1 +1 @@
-{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"index.html","title":"Home","text":""},{"location":"index.html#koheesio","title":"Koheesio","text":"CI/CD Package Meta Koheesio, named after the Finnish word for cohesion, is a robust Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.
The framework is versatile, aiming to support multiple implementations and working seamlessly with various data processing libraries or frameworks. This ensures that Koheesio can handle any data processing task, regardless of the underlying technology or data scale.
Koheesio uses Pydantic for strong typing, data validation, and settings management, ensuring a high level of type safety and structured configurations within pipeline components.
Koheesio's goal is to ensure predictable pipeline execution through a solid foundation of well-tested code and a rich set of features, making it an excellent choice for developers and organizations seeking to build robust and adaptable Data Pipelines.
"},{"location":"index.html#what-sets-koheesio-apart-from-other-libraries","title":"What sets Koheesio apart from other libraries?\"","text":"Koheesio encapsulates years of data engineering expertise, fostering a collaborative and innovative community. While similar libraries exist, Koheesio's focus on data pipelines, integration with PySpark, and specific design for tasks like data transformation, ETL jobs, data validation, and large-scale data processing sets it apart.
Koheesio aims to provide a rich set of features, including readers, writers, and transformations, for any type of data processing. Koheesio is not in competition with other libraries; its aim is to offer wide-ranging support and focus on utility in a multitude of scenarios. Our preference is for integration, not competition.
We invite contributions from all, promoting collaboration and innovation in the data engineering community.
"},{"location":"index.html#koheesio-core-components","title":"Koheesio Core Components","text":"Here are the key components included in Koheesio:
- Step: This is the fundamental unit of work in Koheesio. It represents a single operation in a data pipeline, taking in inputs and producing outputs.
\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 1 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Output 1 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u221a\u2500\u2500\u2500\u2500\u2500\u2518\n \u2502 \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 2 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Step \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Output 2 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u2502 \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 3 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Output 3 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n
- Context: This is a configuration class used to set up the environment for a Task. It can be used to share variables across tasks and adapt the behavior of a Task based on its environment.
- Logger: This is a class for logging messages at different levels.
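As a minimal sketch of how these components fit together (the top-level imports and the nested Output pattern are assumptions based on the API reference below, not verbatim from this page):
from koheesio import Context, Step\n\n\nclass GreetStep(Step):\n    greeting: str  # typed input, validated by Pydantic\n\n    class Output(Step.Output):\n        message: str  # typed output\n\n    def execute(self):\n        self.log.info(\"Running GreetStep\")  # every Koheesio model exposes a logger via .log\n        self.output.message = f\"{self.greeting}, world!\"\n\n\n# a Context can hold shared configuration for the step\ncontext = Context({\"greeting\": \"Hello\"})\nstep = GreetStep(**context.to_dict())\nprint(step.execute().message)  # \"Hello, world!\"\n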
"},{"location":"index.html#installation","title":"Installation","text":"You can install Koheesio using either pip or poetry.
"},{"location":"index.html#using-pip","title":"Using Pip","text":"To install Koheesio using pip, run the following command in your terminal:
pip install koheesio\n
"},{"location":"index.html#using-hatch","title":"Using Hatch","text":"If you're using Hatch for package management, you can add Koheesio to your project by simply adding koheesio to your pyproject.toml
.
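For example, a minimal PEP 621-style entry could look like the following (the version pin is left to you):
[project]\ndependencies = [\n    \"koheesio\",\n]\n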
"},{"location":"index.html#using-poetry","title":"Using Poetry","text":"If you're using poetry for package management, you can add Koheesio to your project with the following command:
poetry add koheesio\n
or add the following line to your pyproject.toml
(under [tool.poetry.dependencies]
), making sure to replace ...
with the version you want to have installed:
koheesio = {version = \"...\"}\n
"},{"location":"index.html#extras","title":"Extras","text":"Koheesio also provides some additional features that can be useful in certain scenarios. These include:
-
Spark Expectations: Available through the koheesio.steps.integration.spark.dq.spark_expectations
module; installable through the se
extra.
- Spark Expectations provides data quality checks for Spark DataFrames.
- For more information, refer to the Spark Expectations docs.
-
Box: Available through the koheesio.steps.integration.box
module; installable through the box
extra.
- Box is a cloud content management and file sharing service for businesses.
-
SFTP: Available through the koheesio.steps.integration.spark.sftp
module; installable through the sftp
extra.
- SFTP is a network protocol used for secure file transfer over Secure Shell (SSH).
Note: Some of the steps require extra dependencies. See the Extras section for additional info. Extras can be added to Poetry by adding extras=['name_of_the_extra']
to the toml entry mentioned above.
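For example, to pull in one or more extras with pip (pick only the extras you need):
pip install \"koheesio[se,box,sftp]\"\n
or, for Poetry, extend the toml entry shown earlier:
koheesio = {version = \"...\", extras = [\"se\"]}\n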
"},{"location":"index.html#contributing","title":"Contributing","text":""},{"location":"index.html#how-to-contribute","title":"How to Contribute","text":"We welcome contributions to our project! Here's a brief overview of our development process:
-
Code Standards: We use pylint
, black
, and mypy
to maintain code standards. Please ensure your code passes these checks by running make check
. The linter should report no errors or warnings before you submit a pull request.
-
Testing: We use pytest
for testing. Run the tests with make test
and ensure all tests pass before submitting a pull request.
-
Release Process: We aim for frequent releases. Typically when we have a new feature or bugfix, a developer with admin rights will create a new release on GitHub and publish the new version to PyPI.
For more detailed information, please refer to our contribution guidelines. We also adhere to Nike's Code of Conduct and Nike's Individual Contributor License Agreement.
"},{"location":"index.html#additional-resources","title":"Additional Resources","text":" - General GitHub documentation
- GitHub pull request documentation
- Nike OSS
"},{"location":"api_reference/index.html","title":"API Reference","text":""},{"location":"api_reference/index.html#koheesio.ABOUT","title":"koheesio.ABOUT module-attribute
","text":"ABOUT = _about()\n
"},{"location":"api_reference/index.html#koheesio.VERSION","title":"koheesio.VERSION module-attribute
","text":"VERSION = __version__\n
"},{"location":"api_reference/index.html#koheesio.BaseModel","title":"koheesio.BaseModel","text":"Base model for all models.
Extends pydantic BaseModel with some additional configuration. To be used as a base class for all models in Koheesio instead of pydantic.BaseModel.
Additional methods and properties: Different Modes This BaseModel class supports lazy mode. This means that validation of the items stored in the class can be called at will instead of being forced to run it upfront.
-
Normal mode: you need to know the values ahead of time
normal_mode = YourOwnModel(a=\"foo\", b=42)\n
-
Lazy mode: being able to defer the validation until later
lazy_mode = YourOwnModel.lazy()\nlazy_mode.a = \"foo\"\nlazy_mode.b = 42\nlazy_mode.validate_output()\n
The prime advantage of using lazy mode is that you don't have to know all your outputs up front and can add them as they become available, while still being able to validate that you have collected all your output at the end. -
With statements: With statements are also allowed. The validate_output
method from the earlier example will run upon exit of the with-statement.
with YourOwnModel.lazy() as with_output:\n with_output.a = \"foo\"\n with_output.b = 42\n
Note that a lazy mode BaseModel object is required to work with a with-statement.
Examples:
from koheesio.models import BaseModel\n\n\nclass Person(BaseModel):\n name: str\n age: int\n\n\n# Using the lazy method to create an instance without immediate validation\nperson = Person.lazy()\n\n# Setting attributes\nperson.name = \"John Doe\"\nperson.age = 30\n\n# Now we validate the instance\nperson.validate_output()\n\nprint(person)\n
In this example, the Person instance is created without immediate validation. The attributes name and age are set afterward. The validate_output
method is then called to validate the instance.
Koheesio specific configuration: Koheesio models are configured differently from Pydantic defaults. The following configuration is used:
-
extra=\"allow\"
This setting allows for extra fields that are not specified in the model definition. If a field is present in the data but not in the model, it will not raise an error. Pydantic default is \"ignore\", which means that extra attributes are ignored.
-
arbitrary_types_allowed=True
This setting allows for fields in the model to be of any type. This is useful when you want to include fields in your model that are not standard Python types. Pydantic default is False, which means that fields must be of a standard Python type.
-
populate_by_name=True
This setting allows an aliased field to be populated by its name as given by the model attribute, as well as the alias. This was known as allow_population_by_field_name in pydantic v1. Pydantic default is False, which means that fields can only be populated by their alias.
-
validate_assignment=False
This setting determines whether the model should be revalidated when the data is changed. If set to True
, every time a field is assigned a new value, the entire model is validated again.
Pydantic default is (also) False
, which means that the model is not revalidated when the data is changed. The default behavior of Pydantic is to validate the data when the model is created. In case the user changes the data after the model is created, the model is not revalidated.
-
revalidate_instances=\"subclass-instances\"
This setting determines whether to revalidate models during validation if the instance is a subclass of the model. This is important as inheritance is used a lot in Koheesio. Pydantic default is never
, which means that the model and dataclass instances are not revalidated during validation.
-
validate_default=True
This setting determines whether to validate default values during validation. When set to True, default values are checked during the validation process. We opt to set this to True, as we are attempting to make sure that the data is valid prior to running or executing any Step. Pydantic default is False, which means that default values are not validated during validation.
-
frozen=False
This setting determines whether the model is immutable. If set to True, once a model is created, its fields cannot be changed. Pydantic default is also False, which means that the model is mutable.
-
coerce_numbers_to_str=True
This setting determines whether to convert number fields to strings. When set to True, enables automatic coercion of any Number
type to str
. Pydantic doesn't allow number types (int
, float
, Decimal
) to be coerced as type str
by default.
-
use_enum_values=True
This setting determines whether to use the values of Enum fields. If set to True, the actual value of the Enum is used instead of the reference. Pydantic default is False, which means that the reference to the Enum is used.
"},{"location":"api_reference/index.html#koheesio.BaseModel--fields","title":"Fields","text":"Every Koheesio BaseModel has two fields: name
and description
. These fields are used to provide a name and a description to the model.
-
name
: This is the name of the Model. If not provided, it defaults to the class name.
-
description
: This is the description of the Model. It has several default behaviors:
- If not provided, it defaults to the docstring of the class.
- If the docstring is not provided, it defaults to the name of the class.
- For multi-line descriptions, it has the following behaviors:
- Only the first non-empty line is used.
- Empty lines are removed.
- Only the first 3 lines are considered.
- Only the first 120 characters are considered.
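A short sketch of these defaults (reusing the import shown in the examples above; the field name is arbitrary):
from koheesio.models import BaseModel\n\n\nclass Person(BaseModel):\n    \"\"\"A person with an age.\"\"\"\n\n    age: int\n\n\nperson = Person(age=30)\nprint(person.name)  # 'Person' - defaults to the class name\nprint(person.description)  # 'A person with an age.' - defaults to the docstring\n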
"},{"location":"api_reference/index.html#koheesio.BaseModel--validators","title":"Validators","text":" _set_name_and_description
: Set the name and description of the Model as per the rules mentioned above.
"},{"location":"api_reference/index.html#koheesio.BaseModel--properties","title":"Properties","text":" log
: Returns a logger with the name of the class.
"},{"location":"api_reference/index.html#koheesio.BaseModel--class-methods","title":"Class Methods","text":" from_basemodel
: Returns a new BaseModel instance based on the data of another BaseModel. from_context
: Creates BaseModel instance from a given Context. from_dict
: Creates BaseModel instance from a given dictionary. from_json
: Creates BaseModel instance from a given JSON string. from_toml
: Creates BaseModel object from a given toml file. from_yaml
: Creates BaseModel object from a given yaml file. lazy
: Constructs the model without doing validation.
"},{"location":"api_reference/index.html#koheesio.BaseModel--dunder-methods","title":"Dunder Methods","text":" __add__
: Allows adding two BaseModel instances together. __enter__
: Allows for using the model in a with-statement. __exit__
: Allows for using the model in a with-statement. __setitem__
: Set Item dunder method for BaseModel. __getitem__
: Get Item dunder method for BaseModel.
"},{"location":"api_reference/index.html#koheesio.BaseModel--instance-methods","title":"Instance Methods","text":" hasattr
: Check if given key is present in the model. get
: Get an attribute of the model, but don't fail if not present. merge
: Merge key,value map with self. set
: Allows for subscribing / assigning to class[key]
. to_context
: Converts the BaseModel instance to a Context object. to_dict
: Converts the BaseModel instance to a dictionary. to_json
: Converts the BaseModel instance to a JSON string. to_yaml
: Converts the BaseModel instance to a YAML string.
"},{"location":"api_reference/index.html#koheesio.BaseModel.description","title":"description class-attribute
instance-attribute
","text":"description: Optional[str] = Field(default=None, description='Description of the Model')\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.log","title":"log property
","text":"log: Logger\n
Returns a logger with the name of the class
"},{"location":"api_reference/index.html#koheesio.BaseModel.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(extra='allow', arbitrary_types_allowed=True, populate_by_name=True, validate_assignment=False, revalidate_instances='subclass-instances', validate_default=True, frozen=False, coerce_numbers_to_str=True, use_enum_values=True)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.name","title":"name class-attribute
instance-attribute
","text":"name: Optional[str] = Field(default=None, description='Name of the Model')\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_basemodel","title":"from_basemodel classmethod
","text":"from_basemodel(basemodel: BaseModel, **kwargs)\n
Returns a new BaseModel instance based on the data of another BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_basemodel(cls, basemodel: BaseModel, **kwargs):\n \"\"\"Returns a new BaseModel instance based on the data of another BaseModel\"\"\"\n kwargs = {**basemodel.model_dump(), **kwargs}\n return cls(**kwargs)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_context","title":"from_context classmethod
","text":"from_context(context: Context) -> BaseModel\n
Creates BaseModel instance from a given Context
You have to make sure that the Context object has the necessary attributes to create the model.
Examples:
class SomeStep(BaseModel):\n foo: str\n\n\ncontext = Context(foo=\"bar\")\nsome_step = SomeStep.from_context(context)\nprint(some_step.foo) # prints 'bar'\n
Parameters:
Name Type Description Default context
Context
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_context(cls, context: Context) -> BaseModel:\n \"\"\"Creates BaseModel instance from a given Context\n\n You have to make sure that the Context object has the necessary attributes to create the model.\n\n Examples\n --------\n ```python\n class SomeStep(BaseModel):\n foo: str\n\n\n context = Context(foo=\"bar\")\n some_step = SomeStep.from_context(context)\n print(some_step.foo) # prints 'bar'\n ```\n\n Parameters\n ----------\n context: Context\n\n Returns\n -------\n BaseModel\n \"\"\"\n return cls(**context)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_dict","title":"from_dict classmethod
","text":"from_dict(data: Dict[str, Any]) -> BaseModel\n
Creates BaseModel instance from a given dictionary
Parameters:
Name Type Description Default data
Dict[str, Any]
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_dict(cls, data: Dict[str, Any]) -> BaseModel:\n \"\"\"Creates BaseModel instance from a given dictionary\n\n Parameters\n ----------\n data: Dict[str, Any]\n\n Returns\n -------\n BaseModel\n \"\"\"\n return cls(**data)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_json","title":"from_json classmethod
","text":"from_json(json_file_or_str: Union[str, Path]) -> BaseModel\n
Creates BaseModel instance from a given JSON string
BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored in the BaseModel object, which is not possible with the standard json library.
See Also Context.from_json : Deserializes a JSON string to a Context object
Parameters:
Name Type Description Default json_file_or_str
Union[str, Path]
Pathlike string or Path that points to the json file or string containing json
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_json(cls, json_file_or_str: Union[str, Path]) -> BaseModel:\n \"\"\"Creates BaseModel instance from a given JSON string\n\n BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses\n jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored\n in the BaseModel object, which is not possible with the standard json library.\n\n See Also\n --------\n Context.from_json : Deserializes a JSON string to a Context object\n\n Parameters\n ----------\n json_file_or_str : Union[str, Path]\n Pathlike string or Path that points to the json file or string containing json\n\n Returns\n -------\n BaseModel\n \"\"\"\n _context = Context.from_json(json_file_or_str)\n return cls.from_context(_context)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_toml","title":"from_toml classmethod
","text":"from_toml(toml_file_or_str: Union[str, Path]) -> BaseModel\n
Creates BaseModel object from a given toml file
Note: BaseModel offloads the serialization and deserialization of the TOML string to Context class.
Parameters:
Name Type Description Default toml_file_or_str
Union[str, Path]
Pathlike string or Path that points to the toml file, or string containing toml
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_toml(cls, toml_file_or_str: Union[str, Path]) -> BaseModel:\n \"\"\"Creates BaseModel object from a given toml file\n\n Note: BaseModel offloads the serialization and deserialization of the TOML string to Context class.\n\n Parameters\n ----------\n toml_file_or_str: str or Path\n Pathlike string or Path that points to the toml file, or string containing toml\n\n Returns\n -------\n BaseModel\n \"\"\"\n _context = Context.from_toml(toml_file_or_str)\n return cls.from_context(_context)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_yaml","title":"from_yaml classmethod
","text":"from_yaml(yaml_file_or_str: str) -> BaseModel\n
Creates BaseModel object from a given yaml file
Note: BaseModel offloads the serialization and deserialization of the YAML string to Context class.
Parameters:
Name Type Description Default yaml_file_or_str
str
Pathlike string or Path that points to the yaml file, or string containing yaml
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_yaml(cls, yaml_file_or_str: str) -> BaseModel:\n \"\"\"Creates BaseModel object from a given yaml file\n\n Note: BaseModel offloads the serialization and deserialization of the YAML string to Context class.\n\n Parameters\n ----------\n yaml_file_or_str: str or Path\n Pathlike string or Path that points to the yaml file, or string containing yaml\n\n Returns\n -------\n BaseModel\n \"\"\"\n _context = Context.from_yaml(yaml_file_or_str)\n return cls.from_context(_context)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.get","title":"get","text":"get(key: str, default: Optional[Any] = None)\n
Get an attribute of the model, but don't fail if not present
Similar to dict.get()
Examples:
step_output = StepOutput(foo=\"bar\")\nstep_output.get(\"foo\") # returns 'bar'\nstep_output.get(\"non_existent_key\", \"oops\") # returns 'oops'\n
Parameters:
Name Type Description Default key
str
name of the key to get
required default
Optional[Any]
Default value in case the attribute does not exist
None
Returns:
Type Description Any
The value of the attribute
Source code in src/koheesio/models/__init__.py
def get(self, key: str, default: Optional[Any] = None):\n \"\"\"Get an attribute of the model, but don't fail if not present\n\n Similar to dict.get()\n\n Examples\n --------\n ```python\n step_output = StepOutput(foo=\"bar\")\n step_output.get(\"foo\") # returns 'bar'\n step_output.get(\"non_existent_key\", \"oops\") # returns 'oops'\n ```\n\n Parameters\n ----------\n key: str\n name of the key to get\n default: Optional[Any]\n Default value in case the attribute does not exist\n\n Returns\n -------\n Any\n The value of the attribute\n \"\"\"\n if self.hasattr(key):\n return self.__getitem__(key)\n return default\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.hasattr","title":"hasattr","text":"hasattr(key: str) -> bool\n
Check if given key is present in the model
Parameters:
Name Type Description Default key
str
required Returns:
Type Description bool
Source code in src/koheesio/models/__init__.py
def hasattr(self, key: str) -> bool:\n \"\"\"Check if given key is present in the model\n\n Parameters\n ----------\n key: str\n\n Returns\n -------\n bool\n \"\"\"\n return hasattr(self, key)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.lazy","title":"lazy classmethod
","text":"lazy()\n
Constructs the model without doing validation
Essentially an alias to BaseModel.construct()
Source code in src/koheesio/models/__init__.py
@classmethod\ndef lazy(cls):\n \"\"\"Constructs the model without doing validation\n\n Essentially an alias to BaseModel.construct()\n \"\"\"\n return cls.model_construct()\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.merge","title":"merge","text":"merge(other: Union[Dict, BaseModel])\n
Merge key,value map with self
Functionally similar to adding two dicts together; like running {**dict_a, **dict_b}
.
Examples:
step_output = StepOutput(foo=\"bar\")\nstep_output.merge({\"lorem\": \"ipsum\"}) # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n
Parameters:
Name Type Description Default other
Union[Dict, BaseModel]
Dict or another instance of a BaseModel class that will be added to self
required Source code in src/koheesio/models/__init__.py
def merge(self, other: Union[Dict, BaseModel]):\n \"\"\"Merge key,value map with self\n\n Functionally similar to adding two dicts together; like running `{**dict_a, **dict_b}`.\n\n Examples\n --------\n ```python\n step_output = StepOutput(foo=\"bar\")\n step_output.merge({\"lorem\": \"ipsum\"}) # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n ```\n\n Parameters\n ----------\n other: Union[Dict, BaseModel]\n Dict or another instance of a BaseModel class that will be added to self\n \"\"\"\n if isinstance(other, BaseModel):\n other = other.model_dump() # ensures we really have a dict\n\n for k, v in other.items():\n self.set(k, v)\n\n return self\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.set","title":"set","text":"set(key: str, value: Any)\n
Allows for subscribing / assigning to class[key]
.
Examples:
step_output = StepOutput(foo=\"bar\")\nstep_output.set(foo\", \"baz\") # overwrites 'foo' to be 'baz'\n
Parameters:
Name Type Description Default key
str
The key of the attribute to assign to
required value
Any
Value that should be assigned to the given key
required Source code in src/koheesio/models/__init__.py
def set(self, key: str, value: Any):\n    \"\"\"Allows for subscribing / assigning to `class[key]`.\n\n    Examples\n    --------\n    ```python\n    step_output = StepOutput(foo=\"bar\")\n    step_output.set(\"foo\", \"baz\") # overwrites 'foo' to be 'baz'\n    ```\n\n    Parameters\n    ----------\n    key: str\n        The key of the attribute to assign to\n    value: Any\n        Value that should be assigned to the given key\n    \"\"\"\n    self.__setitem__(key, value)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.to_context","title":"to_context","text":"to_context() -> Context\n
Converts the BaseModel instance to a Context object
Returns:
Type Description Context
Source code in src/koheesio/models/__init__.py
def to_context(self) -> Context:\n \"\"\"Converts the BaseModel instance to a Context object\n\n Returns\n -------\n Context\n \"\"\"\n return Context(**self.to_dict())\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.to_dict","title":"to_dict","text":"to_dict() -> Dict[str, Any]\n
Converts the BaseModel instance to a dictionary
Returns:
Type Description Dict[str, Any]
Source code in src/koheesio/models/__init__.py
def to_dict(self) -> Dict[str, Any]:\n \"\"\"Converts the BaseModel instance to a dictionary\n\n Returns\n -------\n Dict[str, Any]\n \"\"\"\n return self.model_dump()\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.to_json","title":"to_json","text":"to_json(pretty: bool = False)\n
Converts the BaseModel instance to a JSON string
BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored in the BaseModel object, which is not possible with the standard json library.
See Also Context.to_json : Serializes a Context object to a JSON string
Parameters:
Name Type Description Default pretty
bool
Toggles whether to return a pretty json string or not
False
Returns:
Type Description str
containing all parameters of the BaseModel instance
Source code in src/koheesio/models/__init__.py
def to_json(self, pretty: bool = False):\n \"\"\"Converts the BaseModel instance to a JSON string\n\n BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses\n jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored\n in the BaseModel object, which is not possible with the standard json library.\n\n See Also\n --------\n Context.to_json : Serializes a Context object to a JSON string\n\n Parameters\n ----------\n pretty : bool, optional, default=False\n Toggles whether to return a pretty json string or not\n\n Returns\n -------\n str\n containing all parameters of the BaseModel instance\n \"\"\"\n _context = self.to_context()\n return _context.to_json(pretty=pretty)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.to_yaml","title":"to_yaml","text":"to_yaml(clean: bool = False) -> str\n
Converts the BaseModel instance to a YAML string
BaseModel offloads the serialization and deserialization of the YAML string to Context class.
Parameters:
Name Type Description Default clean
bool
Toggles whether to remove !!python/object:...
from yaml or not. Default: False
False
Returns:
Type Description str
containing all parameters of the BaseModel instance
Source code in src/koheesio/models/__init__.py
def to_yaml(self, clean: bool = False) -> str:\n \"\"\"Converts the BaseModel instance to a YAML string\n\n BaseModel offloads the serialization and deserialization of the YAML string to Context class.\n\n Parameters\n ----------\n clean: bool\n Toggles whether to remove `!!python/object:...` from yaml or not.\n Default: False\n\n Returns\n -------\n str\n containing all parameters of the BaseModel instance\n \"\"\"\n _context = self.to_context()\n return _context.to_yaml(clean=clean)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.validate","title":"validate","text":"validate() -> BaseModel\n
Validate the BaseModel instance
This method is used to validate the BaseModel instance. It is used in conjunction with the lazy method to validate the instance after all the attributes have been set.
This method is intended to be used with the lazy
method. The lazy
method is used to create an instance of the BaseModel without immediate validation. The validate
method is then used to validate the instance after.
Note: in the Pydantic BaseModel, the validate
method throws a deprecated warning. This is because Pydantic recommends using the validate_model
method instead. However, we are using the validate
method here in a different context and a slightly different way.
Examples:
class FooModel(BaseModel):\n foo: str\n lorem: str\n\n\nfoo_model = FooModel.lazy()\nfoo_model.foo = \"bar\"\nfoo_model.lorem = \"ipsum\"\nfoo_model.validate()\n
In this example, the foo_model
instance is created without immediate validation. The attributes foo and lorem are set afterward. The validate
method is then called to validate the instance. Returns:
Type Description BaseModel
The BaseModel instance
Source code in src/koheesio/models/__init__.py
def validate(self) -> BaseModel:\n \"\"\"Validate the BaseModel instance\n\n This method is used to validate the BaseModel instance. It is used in conjunction with the lazy method to\n validate the instance after all the attributes have been set.\n\n This method is intended to be used with the `lazy` method. The `lazy` method is used to create an instance of\n the BaseModel without immediate validation. The `validate` method is then used to validate the instance after.\n\n > Note: in the Pydantic BaseModel, the `validate` method throws a deprecated warning. This is because Pydantic\n recommends using the `validate_model` method instead. However, we are using the `validate` method here in a\n different context and a slightly different way.\n\n Examples\n --------\n ```python\n class FooModel(BaseModel):\n foo: str\n lorem: str\n\n\n foo_model = FooModel.lazy()\n foo_model.foo = \"bar\"\n foo_model.lorem = \"ipsum\"\n foo_model.validate()\n ```\n In this example, the `foo_model` instance is created without immediate validation. The attributes foo and lorem\n are set afterward. The `validate` method is then called to validate the instance.\n\n Returns\n -------\n BaseModel\n The BaseModel instance\n \"\"\"\n return self.model_validate(self.model_dump())\n
"},{"location":"api_reference/index.html#koheesio.Context","title":"koheesio.Context","text":"Context(*args, **kwargs)\n
The Context class is a key component of the Koheesio framework, designed to manage configuration data and shared variables across tasks and steps in your application. It behaves much like a dictionary, but with added functionalities.
Key Features - Nested keys: Supports accessing and adding nested keys similar to dictionary keys.
- Recursive merging: Merges two Contexts together, with the incoming Context having priority.
- Serialization/Deserialization: Easily created from a yaml, toml, or json file, or a dictionary, and can be converted back to a dictionary.
- Handling complex Python objects: Uses jsonpickle for serialization and deserialization of complex Python objects to and from JSON.
For a comprehensive guide on the usage, examples, and additional features of the Context class, please refer to the reference/concepts/context section of the Koheesio documentation.
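A brief sketch of nested keys and recursive merging (the top-level import is an assumption based on this reference; see the method docs below for the exact signatures):
from koheesio import Context\n\nctx = Context({\"db\": {\"host\": \"localhost\", \"port\": 5432}})\nctx.get(\"db.host\")  # 'localhost' - nested keys use dotted notation\n\noverride = Context({\"db\": {\"host\": \"prod-server\"}})\nmerged = ctx.merge(override, recursive=True)\nmerged.get(\"db.host\")  # 'prod-server' - the incoming context has priority\nmerged.get(\"db.port\")  # 5432 - preserved by the recursive merge\n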
Methods:
Name Description add
Add a key/value pair to the context.
get
Get value of a given key.
get_item
Acts just like .get
, except that it returns the key also.
contains
Check if the context contains a given key.
merge
Merge this context with the context of another, where the incoming context has priority.
to_dict
Returns all parameters of the context as a dict.
from_dict
Creates Context object from the given dict.
from_yaml
Creates Context object from a given yaml file.
from_json
Creates Context object from a given json file.
Dunder methods - _
_iter__()
: Allows for iteration across a Context. __len__()
: Returns the length of the Context. __getitem__(item)
: Makes class subscriptable.
Inherited from Mapping items()
: Returns all items of the Context. keys()
: Returns all keys of the Context. values()
: Returns all values of the Context.
Source code in src/koheesio/context.py
def __init__(self, *args, **kwargs):\n    \"\"\"Initializes the Context object with given arguments.\"\"\"\n    for arg in args:\n        if isinstance(arg, dict):\n            kwargs.update(arg)\n        if isinstance(arg, Context):\n            kwargs.update(arg.to_dict())\n\n    for key, value in kwargs.items():\n        self.__dict__[key] = self.process_value(value)\n
"},{"location":"api_reference/index.html#koheesio.Context.add","title":"add","text":"add(key: str, value: Any) -> Context\n
Add a key/value pair to the context
Source code in src/koheesio/context.py
def add(self, key: str, value: Any) -> Context:\n \"\"\"Add a key/value pair to the context\"\"\"\n self.__dict__[key] = value\n return self\n
"},{"location":"api_reference/index.html#koheesio.Context.contains","title":"contains","text":"contains(key: str) -> bool\n
Check if the context contains a given key
Parameters:
Name Type Description Default key
str
required Returns:
Type Description bool
Source code in src/koheesio/context.py
def contains(self, key: str) -> bool:\n \"\"\"Check if the context contains a given key\n\n Parameters\n ----------\n key: str\n\n Returns\n -------\n bool\n \"\"\"\n try:\n self.get(key, safe=False)\n return True\n except KeyError:\n return False\n
"},{"location":"api_reference/index.html#koheesio.Context.from_dict","title":"from_dict classmethod
","text":"from_dict(kwargs: dict) -> Context\n
Creates Context object from the given dict
Parameters:
Name Type Description Default kwargs
dict
required Returns:
Type Description Context
Source code in src/koheesio/context.py
@classmethod\ndef from_dict(cls, kwargs: dict) -> Context:\n \"\"\"Creates Context object from the given dict\n\n Parameters\n ----------\n kwargs: dict\n\n Returns\n -------\n Context\n \"\"\"\n return cls(kwargs)\n
"},{"location":"api_reference/index.html#koheesio.Context.from_json","title":"from_json classmethod
","text":"from_json(json_file_or_str: Union[str, Path]) -> Context\n
Creates Context object from a given json file
Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be stored in the Context object, which is not possible with the standard json library.
Why jsonpickle? (from https://jsonpickle.github.io/)
Data serialized with python\u2019s pickle (or cPickle or dill) is not easily readable outside of python. Using the json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports json.
Security (from https://jsonpickle.github.io/)
jsonpickle should be treated the same as the Python stdlib pickle module from a security perspective.
Parameters:
Name Type Description Default json_file_or_str
Union[str, Path]
Pathlike string or Path that points to the json file or string containing json
required Returns:
Type Description Context
Source code in src/koheesio/context.py
@classmethod\ndef from_json(cls, json_file_or_str: Union[str, Path]) -> Context:\n \"\"\"Creates Context object from a given json file\n\n Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be\n stored in the Context object, which is not possible with the standard json library.\n\n Why jsonpickle?\n ---------------\n (from https://jsonpickle.github.io/)\n\n > Data serialized with python\u2019s pickle (or cPickle or dill) is not easily readable outside of python. Using the\n json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex\n data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports\n json.\n\n Security\n --------\n (from https://jsonpickle.github.io/)\n\n > jsonpickle should be treated the same as the Python stdlib pickle module from a security perspective.\n\n ### ! Warning !\n > The jsonpickle module is not secure. Only unpickle data you trust.\n It is possible to construct malicious pickle data which will execute arbitrary code during unpickling.\n Never unpickle data that could have come from an untrusted source, or that could have been tampered with.\n Consider signing data with an HMAC if you need to ensure that it has not been tampered with.\n Safer deserialization approaches, such as reading JSON directly, may be more appropriate if you are processing\n untrusted data.\n\n Parameters\n ----------\n json_file_or_str : Union[str, Path]\n Pathlike string or Path that points to the json file or string containing json\n\n Returns\n -------\n Context\n \"\"\"\n json_str = json_file_or_str\n\n # check if json_str is pathlike\n if (json_file := Path(json_file_or_str)).exists():\n json_str = json_file.read_text(encoding=\"utf-8\")\n\n json_dict = jsonpickle.loads(json_str)\n return cls.from_dict(json_dict)\n
"},{"location":"api_reference/index.html#koheesio.Context.from_json--warning","title":"! Warning !","text":"The jsonpickle module is not secure. Only unpickle data you trust. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with. Consider signing data with an HMAC if you need to ensure that it has not been tampered with. Safer deserialization approaches, such as reading JSON directly, may be more appropriate if you are processing untrusted data.
"},{"location":"api_reference/index.html#koheesio.Context.from_toml","title":"from_toml classmethod
","text":"from_toml(toml_file_or_str: Union[str, Path]) -> Context\n
Creates Context object from a given toml file
Parameters:
Name Type Description Default toml_file_or_str
Union[str, Path]
Pathlike string or Path that points to the toml file or string containing toml
required Returns:
Type Description Context
Source code in src/koheesio/context.py
@classmethod\ndef from_toml(cls, toml_file_or_str: Union[str, Path]) -> Context:\n \"\"\"Creates Context object from a given toml file\n\n Parameters\n ----------\n toml_file_or_str: Union[str, Path]\n Pathlike string or Path that points to the toml file or string containing toml\n\n Returns\n -------\n Context\n \"\"\"\n toml_str = toml_file_or_str\n\n # check if toml_str is pathlike\n if (toml_file := Path(toml_file_or_str)).exists():\n toml_str = toml_file.read_text(encoding=\"utf-8\")\n\n toml_dict = tomli.loads(toml_str)\n return cls.from_dict(toml_dict)\n
"},{"location":"api_reference/index.html#koheesio.Context.from_yaml","title":"from_yaml classmethod
","text":"from_yaml(yaml_file_or_str: str) -> Context\n
Creates Context object from a given yaml file
Parameters:
Name Type Description Default yaml_file_or_str
str
Pathlike string or Path that points to the yaml file, or string containing yaml
required Returns:
Type Description Context
Source code in src/koheesio/context.py
@classmethod\ndef from_yaml(cls, yaml_file_or_str: str) -> Context:\n \"\"\"Creates Context object from a given yaml file\n\n Parameters\n ----------\n yaml_file_or_str: str or Path\n Pathlike string or Path that points to the yaml file, or string containing yaml\n\n Returns\n -------\n Context\n \"\"\"\n yaml_str = yaml_file_or_str\n\n # check if yaml_str is pathlike\n if (yaml_file := Path(yaml_file_or_str)).exists():\n yaml_str = yaml_file.read_text(encoding=\"utf-8\")\n\n # Bandit: disable yaml.load warning\n yaml_dict = yaml.load(yaml_str, Loader=yaml.Loader) # nosec B506: yaml_load\n\n return cls.from_dict(yaml_dict)\n
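For instance, a yaml string (or a path to a yaml file) can be loaded and serialized back; a sketch with illustrative values only:
from koheesio import Context\n\nyaml_str = \"env: dev\\ndb:\\n  host: localhost\\n\"\nctx = Context.from_yaml(yaml_str)\nctx.get(\"db.host\")  # 'localhost'\nprint(ctx.to_yaml())  # serializes the Context back to a yaml string\n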
"},{"location":"api_reference/index.html#koheesio.Context.get","title":"get","text":"get(key: str, default: Any = None, safe: bool = True) -> Any\n
Get value of a given key
The key can either be an actual key (top level) or the key of a nested value. Behaves a lot like a dict's .get()
method otherwise.
Parameters:
Name Type Description Default key
str
Can be a real key, or can be a dotted notation of a nested key
required default
Any
Default value to return
None
safe
bool
Toggles whether to fail or not when item cannot be found
True
Returns:
Type Description Any
Value of the requested item
Example Example of a nested call:
context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\ncontext.get(\"a.b\")\n
Returns c
Source code in src/koheesio/context.py
def get(self, key: str, default: Any = None, safe: bool = True) -> Any:\n \"\"\"Get value of a given key\n\n The key can either be an actual key (top level) or the key of a nested value.\n Behaves a lot like a dict's `.get()` method otherwise.\n\n Parameters\n ----------\n key:\n Can be a real key, or can be a dotted notation of a nested key\n default:\n Default value to return\n safe:\n Toggles whether to fail or not when item cannot be found\n\n Returns\n -------\n Any\n Value of the requested item\n\n Example\n -------\n Example of a nested call:\n\n ```python\n context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\n context.get(\"a.b\")\n ```\n\n Returns `c`\n \"\"\"\n try:\n if \".\" not in key:\n return self.__dict__[key]\n\n # handle nested keys\n nested_keys = key.split(\".\")\n value = self # parent object\n for k in nested_keys:\n value = value[k] # iterate through nested values\n return value\n\n except (AttributeError, KeyError, TypeError) as e:\n if not safe:\n raise KeyError(f\"requested key '{key}' does not exist in {self}\") from e\n return default\n
"},{"location":"api_reference/index.html#koheesio.Context.get_all","title":"get_all","text":"get_all() -> dict\n
alias to to_dict()
Source code in src/koheesio/context.py
def get_all(self) -> dict:\n \"\"\"alias to to_dict()\"\"\"\n return self.to_dict()\n
"},{"location":"api_reference/index.html#koheesio.Context.get_item","title":"get_item","text":"get_item(key: str, default: Any = None, safe: bool = True) -> Dict[str, Any]\n
Acts just like .get
, except that it returns the key also
Returns:
Type Description Dict[str, Any]
key/value-pair of the requested item
Example Example of a nested call:
context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\ncontext.get_item(\"a.b\")\n
Returns {'a.b': 'c'}
Source code in src/koheesio/context.py
def get_item(self, key: str, default: Any = None, safe: bool = True) -> Dict[str, Any]:\n \"\"\"Acts just like `.get`, except that it returns the key also\n\n Returns\n -------\n Dict[str, Any]\n key/value-pair of the requested item\n\n Example\n -------\n Example of a nested call:\n\n ```python\n context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\n context.get_item(\"a.b\")\n ```\n\n Returns `{'a.b': 'c'}`\n \"\"\"\n value = self.get(key, default, safe)\n return {key: value}\n
"},{"location":"api_reference/index.html#koheesio.Context.merge","title":"merge","text":"merge(context: Context, recursive: bool = False) -> Context\n
Merge this context with the context of another, where the incoming context has priority.
Parameters:
Name Type Description Default context
Context
Another Context class
required recursive
bool
Recursively merge two dictionaries to an arbitrary depth
False
Returns:
Type Description Context
updated context
Source code in src/koheesio/context.py
def merge(self, context: Context, recursive: bool = False) -> Context:\n \"\"\"Merge this context with the context of another, where the incoming context has priority.\n\n Parameters\n ----------\n context: Context\n Another Context class\n recursive: bool\n Recursively merge two dictionaries to an arbitrary depth\n\n Returns\n -------\n Context\n updated context\n \"\"\"\n if recursive:\n return Context.from_dict(self._recursive_merge(target_context=self, merge_context=context).to_dict())\n\n # just merge on the top level keys\n return Context.from_dict({**self.to_dict(), **context.to_dict()})\n
"},{"location":"api_reference/index.html#koheesio.Context.process_value","title":"process_value","text":"process_value(value: Any) -> Any\n
Processes the given value, converting dictionaries to Context objects as needed.
Source code in src/koheesio/context.py
def process_value(self, value: Any) -> Any:\n \"\"\"Processes the given value, converting dictionaries to Context objects as needed.\"\"\"\n if isinstance(value, dict):\n return self.from_dict(value)\n\n if isinstance(value, (list, set)):\n return [self.from_dict(v) if isinstance(v, dict) else v for v in value]\n\n return value\n
"},{"location":"api_reference/index.html#koheesio.Context.to_dict","title":"to_dict","text":"to_dict() -> Dict[str, Any]\n
Returns all parameters of the context as a dict
Returns:
Type Description dict
containing all parameters of the context
Source code in src/koheesio/context.py
def to_dict(self) -> Dict[str, Any]:\n \"\"\"Returns all parameters of the context as a dict\n\n Returns\n -------\n dict\n containing all parameters of the context\n \"\"\"\n result = {}\n\n for key, value in self.__dict__.items():\n if isinstance(value, Context):\n result[key] = value.to_dict()\n elif isinstance(value, list):\n result[key] = [e.to_dict() if isinstance(e, Context) else e for e in value]\n else:\n result[key] = value\n\n return result\n
"},{"location":"api_reference/index.html#koheesio.Context.to_json","title":"to_json","text":"to_json(pretty: bool = False) -> str\n
Returns all parameters of the context as a json string
Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be stored in the Context object, which is not possible with the standard json library.
Why jsonpickle? (from https://jsonpickle.github.io/)
Data serialized with python's pickle (or cPickle or dill) is not easily readable outside of python. Using the json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports json.
Parameters:
Name Type Description Default pretty
bool
Toggles whether to return a pretty json string or not
False
Returns:
Type Description str
containing all parameters of the context
Source code in src/koheesio/context.py
def to_json(self, pretty: bool = False) -> str:\n \"\"\"Returns all parameters of the context as a json string\n\n Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be\n stored in the Context object, which is not possible with the standard json library.\n\n Why jsonpickle?\n ---------------\n (from https://jsonpickle.github.io/)\n\n > Data serialized with python's pickle (or cPickle or dill) is not easily readable outside of python. Using the\n json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex\n data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports\n json.\n\n Parameters\n ----------\n pretty : bool, optional, default=False\n Toggles whether to return a pretty json string or not\n\n Returns\n -------\n str\n containing all parameters of the context\n \"\"\"\n d = self.to_dict()\n return jsonpickle.dumps(d, indent=4) if pretty else jsonpickle.dumps(d)\n
"},{"location":"api_reference/index.html#koheesio.Context.to_yaml","title":"to_yaml","text":"to_yaml(clean: bool = False) -> str\n
Returns all parameters of the context as a yaml string
Parameters:
Name Type Description Default clean
bool
Toggles whether to remove !!python/object:...
from yaml or not. Default: False
False
Returns:
Type Description str
containing all parameters of the context
Source code in src/koheesio/context.py
def to_yaml(self, clean: bool = False) -> str:\n \"\"\"Returns all parameters of the context as a yaml string\n\n Parameters\n ----------\n clean: bool\n Toggles whether to remove `!!python/object:...` from yaml or not.\n Default: False\n\n Returns\n -------\n str\n containing all parameters of the context\n \"\"\"\n # sort_keys=False to preserve order of keys\n yaml_str = yaml.dump(self.to_dict(), sort_keys=False)\n\n # remove `!!python/object:...` from yaml\n if clean:\n remove_pattern = re.compile(r\"!!python/object:.*?\\n\")\n yaml_str = re.sub(remove_pattern, \"\\n\", yaml_str)\n\n return yaml_str\n
"},{"location":"api_reference/index.html#koheesio.ExtraParamsMixin","title":"koheesio.ExtraParamsMixin","text":"Mixin class that adds support for arbitrary keyword arguments to Pydantic models.
The keyword arguments are extracted from the model's values
and moved to a params
dictionary.
"},{"location":"api_reference/index.html#koheesio.ExtraParamsMixin.extra_params","title":"extra_params cached
property
","text":"extra_params: Dict[str, Any]\n
Extract params (passed as arbitrary kwargs) from values and move them to params dict
"},{"location":"api_reference/index.html#koheesio.ExtraParamsMixin.params","title":"params class-attribute
instance-attribute
","text":"params: Dict[str, Any] = Field(default_factory=dict)\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory","title":"koheesio.LoggingFactory","text":"LoggingFactory(name: Optional[str] = None, env: Optional[str] = None, level: Optional[str] = None, logger_id: Optional[str] = None)\n
Logging factory to be used to generate logger instances.
Parameters:
Name Type Description Default name
Optional[str]
None
env
Optional[str]
None
logger_id
Optional[str]
None
Source code in src/koheesio/logger.py
def __init__(\n self,\n name: Optional[str] = None,\n env: Optional[str] = None,\n level: Optional[str] = None,\n logger_id: Optional[str] = None,\n):\n \"\"\"Logging factory to be used in pipeline.Prepare logger instance.\n\n Parameters\n ----------\n name logger name.\n env environment (\"local\", \"qa\", \"prod).\n logger_id unique identifier for the logger.\n \"\"\"\n\n LoggingFactory.LOGGER_NAME = name or LoggingFactory.LOGGER_NAME\n LoggerIDFilter.LOGGER_ID = logger_id or LoggerIDFilter.LOGGER_ID\n LoggingFactory.LOGGER_FILTER = LoggingFactory.LOGGER_FILTER or LoggerIDFilter()\n LoggingFactory.ENV = env or LoggingFactory.ENV\n\n console_handler = logging.StreamHandler(sys.stdout if LoggingFactory.ENV == \"local\" else sys.stderr)\n console_handler.setFormatter(LoggingFactory.LOGGER_FORMATTER)\n console_handler.addFilter(LoggingFactory.LOGGER_FILTER)\n # WARNING is default level for root logger in python\n logging.basicConfig(level=logging.WARNING, handlers=[console_handler], force=True)\n\n LoggingFactory.CONSOLE_HANDLER = console_handler\n\n logger = getLogger(LoggingFactory.LOGGER_NAME)\n logger.setLevel(level or LoggingFactory.LOGGER_LEVEL)\n LoggingFactory.LOGGER = logger\n
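A usage sketch (the module path follows the source reference above; the name and level shown are placeholders):
from koheesio.logger import LoggingFactory\n\nLoggingFactory(name=\"my_pipeline\", env=\"local\", level=\"INFO\")\nlogger = LoggingFactory.get_logger(\"ingest\", inherit_from_koheesio=True)\nlogger.info(\"Logger configured via LoggingFactory\")\n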
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.CONSOLE_HANDLER","title":"CONSOLE_HANDLER class-attribute
instance-attribute
","text":"CONSOLE_HANDLER: Optional[Handler] = None\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.ENV","title":"ENV class-attribute
instance-attribute
","text":"ENV: Optional[str] = None\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER","title":"LOGGER class-attribute
instance-attribute
","text":"LOGGER: Optional[Logger] = None\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_ENV","title":"LOGGER_ENV class-attribute
instance-attribute
","text":"LOGGER_ENV: str = 'local'\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_FILTER","title":"LOGGER_FILTER class-attribute
instance-attribute
","text":"LOGGER_FILTER: Optional[Filter] = None\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_FORMAT","title":"LOGGER_FORMAT class-attribute
instance-attribute
","text":"LOGGER_FORMAT: str = '[%(logger_id)s] [%(asctime)s] [%(levelname)s] [%(name)s] {%(module)s.py:%(funcName)s:%(lineno)d} - %(message)s'\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_FORMATTER","title":"LOGGER_FORMATTER class-attribute
instance-attribute
","text":"LOGGER_FORMATTER: Formatter = Formatter(LOGGER_FORMAT)\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_LEVEL","title":"LOGGER_LEVEL class-attribute
instance-attribute
","text":"LOGGER_LEVEL: str = get('KOHEESIO_LOGGING_LEVEL', 'WARNING')\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_NAME","title":"LOGGER_NAME class-attribute
instance-attribute
","text":"LOGGER_NAME: str = 'koheesio'\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.add_handlers","title":"add_handlers staticmethod
","text":"add_handlers(handlers: List[Tuple[str, Dict]]) -> None\n
Add handlers to existing root logger.
Parameters:
Name Type Description Default handler_class
required handlers_config
required Source code in src/koheesio/logger.py
@staticmethod\ndef add_handlers(handlers: List[Tuple[str, Dict]]) -> None:\n \"\"\"Add handlers to existing root logger.\n\n Parameters\n ----------\n handler_class handler module and class for importing.\n handlers_config configuration for handler.\n\n \"\"\"\n for handler_module_class, handler_conf in handlers:\n handler_class: logging.Handler = import_class(handler_module_class)\n handler_level = handler_conf.pop(\"level\") if \"level\" in handler_conf else \"WARNING\"\n # noinspection PyCallingNonCallable\n handler = handler_class(**handler_conf)\n handler.setLevel(handler_level)\n handler.addFilter(LoggingFactory.LOGGER_FILTER)\n handler.setFormatter(LoggingFactory.LOGGER_FORMATTER)\n LoggingFactory.LOGGER.addHandler(handler)\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.get_logger","title":"get_logger staticmethod
","text":"get_logger(name: str, inherit_from_koheesio: bool = False) -> Logger\n
Provide logger. If inherit_from_koheesio then inherit from LoggingFactory.PIPELINE_LOGGER_NAME.
Parameters:
Name Type Description Default name
str
required inherit_from_koheesio
bool
False
Returns:
Name Type Description logger
Logger
Source code in src/koheesio/logger.py
@staticmethod\ndef get_logger(name: str, inherit_from_koheesio: bool = False) -> Logger:\n \"\"\"Provide logger. If inherit_from_koheesio then inherit from LoggingFactory.PIPELINE_LOGGER_NAME.\n\n Parameters\n ----------\n name: Name of logger.\n inherit_from_koheesio: Inherit logger from koheesio\n\n Returns\n -------\n logger: Logger\n\n \"\"\"\n if inherit_from_koheesio:\n LoggingFactory.__check_koheesio_logger_initialized()\n name = f\"{LoggingFactory.LOGGER_NAME}.{name}\"\n\n return getLogger(name)\n
"},{"location":"api_reference/index.html#koheesio.Step","title":"koheesio.Step","text":"Base class for a step
A custom unit of logic that can be executed.
The Step class is designed to be subclassed. To create a new step, one would subclass Step and implement the def execute(self) method, specifying the expected inputs and outputs.
Note: since the Step class is metaclassed, the execute method is wrapped with the do_execute function, making it always return the Step's output. Hence, an explicit return is not needed when implementing execute.
Methods and Attributes The Step class has several attributes and methods.
Background A Step is an atomic operation and serves as the building block of data pipelines built with the framework. Tasks typically consist of a series of Steps.
A step can be seen as an operation on a set of inputs that returns a set of outputs. This does not, however, imply that steps are stateless (e.g. data writes)!
The diagram serves to illustrate the concept of a Step:
\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 1 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 1 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u2502 \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 2 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Step \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 2 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u2502 \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 3 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 3 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n
Steps are built on top of Pydantic, a data validation and settings management library that uses Python type annotations. This allows for the automatic validation of the inputs and outputs of a Step.
- Step inherits from BaseModel, which is a Pydantic class used to define data models. This allows Step to automatically validate data against the defined fields and their types.
- Step is metaclassed by StepMetaClass, which is a custom metaclass that wraps the
execute
method of the Step class with the _execute_wrapper
function. This ensures that the execute
method always returns the output of the Step along with providing logging and validation of the output. - Step has an
Output
class, which is a subclass of StepOutput
. This class is used to validate the output of the Step. The Output
class is defined as an inner class of the Step class. The Output
class can be accessed through the Step.Output
attribute. - The
Output
class can be extended to add additional fields to the output of the Step.
Examples:
class MyStep(Step):\n a: str # input\n\n class Output(StepOutput): # output\n b: str\n\n def execute(self) -> MyStep.Output:\n self.output.b = f\"{self.a}-some-suffix\"\n
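For illustration, a step defined this way could be used as follows (a minimal sketch; per the notes above, execute returns the validated Output, which is also reachable lazily via step.output):
step = MyStep(a=\"foo\")\nresult = step.execute() # the metaclass wrapper makes execute return MyStep.Output\nprint(result.b) # foo-some-suffix\nprint(step.output.b) # the same output is also available through the lazy output attribute\n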
"},{"location":"api_reference/index.html#koheesio.Step--input","title":"INPUT","text":"The following fields are available by default on the Step class: - name
: Name of the Step. If not set, the name of the class will be used. - description
: Description of the Step. If not set, the docstring of the class will be used. If the docstring contains multiple lines, only the first line will be used.
When subclassing a Step, any additional pydantic field will be treated as input
to the Step. See also the explanation on the .execute()
method below.
"},{"location":"api_reference/index.html#koheesio.Step--output","title":"OUTPUT","text":"Every Step has an Output
class, which is a subclass of StepOutput
. This class is used to validate the output of the Step. The Output
class is defined as an inner class of the Step class. The Output
class can be accessed through the Step.Output
attribute. The Output
class can be extended to add additional fields to the output of the Step. See also the explanation on the .execute()
.
Output
: A nested class representing the output of the Step used to validate the output of the Step and based on the StepOutput class. output
: Allows you to interact with the Output of the Step lazily (see above and StepOutput)
When subclassing a Step, any additional pydantic field added to the nested Output
class will be treated as output
of the Step. See also the description of StepOutput
for more information.
"},{"location":"api_reference/index.html#koheesio.Step--methods","title":"Methods:","text":" execute
: Abstract method to implement for new steps. - The Inputs of the step can be accessed using
self.input_name
. - The output of the step can be accessed using
self.output.output_name
.
run
: Alias to .execute() method. You can use this to run the step, but execute is preferred. to_yaml
: YAML dump the step get_description
: Get the description of the Step
When subclassing a Step, execute
is the only method that needs to be implemented. Any additional method added to the class will be treated as a method of the Step.
Note: since the Step class is meta-classed, the execute method is automatically wrapped with the do_execute
function, making it always return a StepOutput. See also the explanation on the do_execute
function.
"},{"location":"api_reference/index.html#koheesio.Step--class-methods","title":"class methods:","text":" from_step
: Returns a new Step instance based on the data of another Step instance. for example: MyStep.from_step(other_step, a=\"foo\")
get_description
: Get the description of the Step
"},{"location":"api_reference/index.html#koheesio.Step--dunder-methods","title":"dunder methods:","text":" __getattr__
: Allows input to be accessed through self.input_name
__repr__
and __str__
: String representation of a step
"},{"location":"api_reference/index.html#koheesio.Step.output","title":"output property
writable
","text":"output: Output\n
Interact with the output of the Step
"},{"location":"api_reference/index.html#koheesio.Step.Output","title":"Output","text":"Output class for Step
"},{"location":"api_reference/index.html#koheesio.Step.execute","title":"execute abstractmethod
","text":"execute()\n
Abstract method to implement for new steps.
The Inputs of the step can be accessed using self.input_name
Note: since the Step class is meta-classed, the execute method is wrapped with the do_execute
function, making it always return the Step's output
Source code in src/koheesio/steps/__init__.py
@abstractmethod\ndef execute(self):\n \"\"\"Abstract method to implement for new steps.\n\n The Inputs of the step can be accessed using `self.input_name`\n\n Note: since the Step class is meta-classed, the execute method is wrapped with the `do_execute` function, making\n it always return the Step's output\n \"\"\"\n raise NotImplementedError\n
"},{"location":"api_reference/index.html#koheesio.Step.from_step","title":"from_step classmethod
","text":"from_step(step: Step, **kwargs)\n
Returns a new Step instance based on the data of another Step or BaseModel instance
Source code in src/koheesio/steps/__init__.py
@classmethod\ndef from_step(cls, step: Step, **kwargs):\n \"\"\"Returns a new Step instance based on the data of another Step or BaseModel instance\"\"\"\n return cls.from_basemodel(step, **kwargs)\n
"},{"location":"api_reference/index.html#koheesio.Step.repr_json","title":"repr_json","text":"repr_json(simple=False) -> str\n
dump the step to json, meant for representation
Note: use to_json if you want to dump the step to json for serialization. This method is meant for representation purposes only!
Examples:
>>> step = MyStep(a=\"foo\")\n>>> print(step.repr_json())\n{\"input\": {\"a\": \"foo\"}}\n
Parameters:
Name Type Description Default simple
When toggled to True, a briefer output will be produced. This is friendlier for logging purposes
False
Returns:
Type Description str
A string, which is valid json
Source code in src/koheesio/steps/__init__.py
def repr_json(self, simple=False) -> str:\n \"\"\"dump the step to json, meant for representation\n\n Note: use to_json if you want to dump the step to json for serialization\n This method is meant for representation purposes only!\n\n Examples\n --------\n ```python\n >>> step = MyStep(a=\"foo\")\n >>> print(step.repr_json())\n {\"input\": {\"a\": \"foo\"}}\n ```\n\n Parameters\n ----------\n simple: bool\n When toggled to True, a briefer output will be produced. This is friendlier for logging purposes\n\n Returns\n -------\n str\n A string, which is valid json\n \"\"\"\n model_dump_options = dict(warnings=\"none\", exclude_unset=True)\n\n _result = {}\n\n # extract input\n _input = self.model_dump(**model_dump_options)\n\n # remove name and description from input and add to result if simple is not set\n name = _input.pop(\"name\", None)\n description = _input.pop(\"description\", None)\n if not simple:\n if name:\n _result[\"name\"] = name\n if description:\n _result[\"description\"] = description\n else:\n model_dump_options[\"exclude\"] = {\"name\", \"description\"}\n\n # extract output\n _output = self.output.model_dump(**model_dump_options)\n\n # add output to result\n if _output:\n _result[\"output\"] = _output\n\n # add input to result\n _result[\"input\"] = _input\n\n class MyEncoder(json.JSONEncoder):\n \"\"\"Custom JSON Encoder to handle non-serializable types\"\"\"\n\n def default(self, o: Any) -> Any:\n try:\n return super().default(o)\n except TypeError:\n return o.__class__.__name__\n\n # Use MyEncoder when converting the dictionary to a JSON string\n json_str = json.dumps(_result, cls=MyEncoder)\n\n return json_str\n
"},{"location":"api_reference/index.html#koheesio.Step.repr_yaml","title":"repr_yaml","text":"repr_yaml(simple=False) -> str\n
dump the step to yaml, meant for representation
Note: use to_yaml if you want to dump the step to yaml for serialization. This method is meant for representation purposes only!
Examples:
>>> step = MyStep(a=\"foo\")\n>>> print(step.repr_yaml())\ninput:\n a: foo\n
Parameters:
Name Type Description Default simple
When toggled to True, a briefer output will be produced. This is friendlier for logging purposes
False
Returns:
Type Description str
A string, which is valid yaml
Source code in src/koheesio/steps/__init__.py
def repr_yaml(self, simple=False) -> str:\n \"\"\"dump the step to yaml, meant for representation\n\n Note: use to_yaml if you want to dump the step to yaml for serialization\n This method is meant for representation purposes only!\n\n Examples\n --------\n ```python\n >>> step = MyStep(a=\"foo\")\n >>> print(step.repr_yaml())\n input:\n a: foo\n ```\n\n Parameters\n ----------\n simple: bool\n When toggled to True, a briefer output will be produced. This is friendlier for logging purposes\n\n Returns\n -------\n str\n A string, which is valid yaml\n \"\"\"\n json_str = self.repr_json(simple=simple)\n\n # Parse the JSON string back into a dictionary\n _result = json.loads(json_str)\n\n return yaml.dump(_result)\n
"},{"location":"api_reference/index.html#koheesio.Step.run","title":"run","text":"run()\n
Alias to .execute()
Source code in src/koheesio/steps/__init__.py
def run(self):\n \"\"\"Alias to .execute()\"\"\"\n return self.execute()\n
"},{"location":"api_reference/index.html#koheesio.StepOutput","title":"koheesio.StepOutput","text":"Class for the StepOutput model
Usage Setting up the StepOutput class is done like this:
class YourOwnOutput(StepOutput):\n a: str\n b: int\n
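A brief, illustrative sketch of using such an output class; the field values behave like regular Pydantic attributes and validate_output() runs the underlying model validation:
out = YourOwnOutput(a=\"foo\", b=42)\nout.b = 43 # fields can be set like regular attributes\nvalidated = out.validate_output() # wraps BaseModel validation and returns a StepOutput\n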
"},{"location":"api_reference/index.html#koheesio.StepOutput.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(validate_default=False, defer_build=True)\n
"},{"location":"api_reference/index.html#koheesio.StepOutput.validate_output","title":"validate_output","text":"validate_output() -> StepOutput\n
Validate the output of the Step
Essentially, this method is a wrapper around the validate method of the BaseModel class
Source code in src/koheesio/steps/__init__.py
def validate_output(self) -> StepOutput:\n \"\"\"Validate the output of the Step\n\n Essentially, this method is a wrapper around the validate method of the BaseModel class\n \"\"\"\n validated_model = self.validate()\n return StepOutput.from_basemodel(validated_model)\n
"},{"location":"api_reference/index.html#koheesio.print_logo","title":"koheesio.print_logo","text":"print_logo()\n
Source code in src/koheesio/__init__.py
def print_logo():\n global _logo_printed\n global _koheesio_print_logo\n\n if not _logo_printed and _koheesio_print_logo:\n print(ABOUT)\n _logo_printed = True\n
"},{"location":"api_reference/context.html","title":"Context","text":"The Context module is a part of the Koheesio framework and is primarily used for managing the environment configuration where a Task or Step runs. It helps in adapting the behavior of a Task/Step based on the environment it operates in, thereby avoiding the repetition of configuration values across different tasks.
The Context class, which is a key component of this module, functions similarly to a dictionary but with additional features. It supports operations like handling nested keys, recursive merging of contexts, and serialization/deserialization to and from various formats like JSON, YAML, and TOML.
For a comprehensive guide on the usage, examples, and additional features of the Context class, please refer to the reference/concepts/context section of the Koheesio documentation.
"},{"location":"api_reference/context.html#koheesio.context.Context","title":"koheesio.context.Context","text":"Context(*args, **kwargs)\n
The Context class is a key component of the Koheesio framework, designed to manage configuration data and shared variables across tasks and steps in your application. It behaves much like a dictionary, but with added functionalities.
Key Features - Nested keys: Supports accessing and adding nested keys similar to dictionary keys.
- Recursive merging: Merges two Contexts together, with the incoming Context having priority.
- Serialization/Deserialization: Easily created from a yaml, toml, or json file, or a dictionary, and can be converted back to a dictionary.
- Handling complex Python objects: Uses jsonpickle for serialization and deserialization of complex Python objects to and from JSON.
For a comprehensive guide on the usage, examples, and additional features of the Context class, please refer to the reference/concepts/context section of the Koheesio documentation.
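To illustrate the features listed above, a minimal sketch of creating a Context, adding values, and reading nested keys:
context = Context({\"env\": \"dev\", \"db\": {\"host\": \"localhost\", \"port\": 5432}})\ncontext.add(\"retries\", 3) # add a key/value pair\nprint(context.get(\"db.host\")) # nested keys use dotted notation -> localhost\nprint(context.to_dict()) # convert the Context (and nested Contexts) back to a plain dict\n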
Methods:
Name Description add
Add a key/value pair to the context.
get
Get value of a given key.
get_item
Acts just like .get
, except that it returns the key also.
contains
Check if the context contains a given key.
merge
Merge this context with the context of another, where the incoming context has priority.
to_dict
Returns all parameters of the context as a dict.
from_dict
Creates Context object from the given dict.
from_yaml
Creates Context object from a given yaml file.
from_json
Creates Context object from a given json file.
Dunder methods - __iter__()
: Allows for iteration across a Context. __len__()
: Returns the length of the Context. __getitem__(item)
: Makes class subscriptable.
Inherited from Mapping items()
: Returns all items of the Context. keys()
: Returns all keys of the Context. values()
: Returns all values of the Context.
Source code in src/koheesio/context.py
def __init__(self, *args, **kwargs):\n \"\"\"Initializes the Context object with given arguments.\"\"\"\n for arg in args:\n if isinstance(arg, dict):\n kwargs.update(arg)\n if isinstance(arg, Context):\n kwargs = kwargs.update(arg.to_dict())\n\n for key, value in kwargs.items():\n self.__dict__[key] = self.process_value(value)\n
"},{"location":"api_reference/context.html#koheesio.context.Context.add","title":"add","text":"add(key: str, value: Any) -> Context\n
Add a key/value pair to the context
Source code in src/koheesio/context.py
def add(self, key: str, value: Any) -> Context:\n \"\"\"Add a key/value pair to the context\"\"\"\n self.__dict__[key] = value\n return self\n
"},{"location":"api_reference/context.html#koheesio.context.Context.contains","title":"contains","text":"contains(key: str) -> bool\n
Check if the context contains a given key
Parameters:
Name Type Description Default key
str
required Returns:
Type Description bool
Source code in src/koheesio/context.py
def contains(self, key: str) -> bool:\n \"\"\"Check if the context contains a given key\n\n Parameters\n ----------\n key: str\n\n Returns\n -------\n bool\n \"\"\"\n try:\n self.get(key, safe=False)\n return True\n except KeyError:\n return False\n
"},{"location":"api_reference/context.html#koheesio.context.Context.from_dict","title":"from_dict classmethod
","text":"from_dict(kwargs: dict) -> Context\n
Creates Context object from the given dict
Parameters:
Name Type Description Default kwargs
dict
required Returns:
Type Description Context
Source code in src/koheesio/context.py
@classmethod\ndef from_dict(cls, kwargs: dict) -> Context:\n \"\"\"Creates Context object from the given dict\n\n Parameters\n ----------\n kwargs: dict\n\n Returns\n -------\n Context\n \"\"\"\n return cls(kwargs)\n
"},{"location":"api_reference/context.html#koheesio.context.Context.from_json","title":"from_json classmethod
","text":"from_json(json_file_or_str: Union[str, Path]) -> Context\n
Creates Context object from a given json file
Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be stored in the Context object, which is not possible with the standard json library.
Why jsonpickle? (from https://jsonpickle.github.io/)
Data serialized with python\u2019s pickle (or cPickle or dill) is not easily readable outside of python. Using the json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports json.
Security (from https://jsonpickle.github.io/)
jsonpickle should be treated the same as the Python stdlib pickle module from a security perspective.
Parameters:
Name Type Description Default json_file_or_str
Union[str, Path]
Pathlike string or Path that points to the json file or string containing json
required Returns:
Type Description Context
Source code in src/koheesio/context.py
@classmethod\ndef from_json(cls, json_file_or_str: Union[str, Path]) -> Context:\n \"\"\"Creates Context object from a given json file\n\n Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be\n stored in the Context object, which is not possible with the standard json library.\n\n Why jsonpickle?\n ---------------\n (from https://jsonpickle.github.io/)\n\n > Data serialized with python\u2019s pickle (or cPickle or dill) is not easily readable outside of python. Using the\n json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex\n data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports\n json.\n\n Security\n --------\n (from https://jsonpickle.github.io/)\n\n > jsonpickle should be treated the same as the Python stdlib pickle module from a security perspective.\n\n ### ! Warning !\n > The jsonpickle module is not secure. Only unpickle data you trust.\n It is possible to construct malicious pickle data which will execute arbitrary code during unpickling.\n Never unpickle data that could have come from an untrusted source, or that could have been tampered with.\n Consider signing data with an HMAC if you need to ensure that it has not been tampered with.\n Safer deserialization approaches, such as reading JSON directly, may be more appropriate if you are processing\n untrusted data.\n\n Parameters\n ----------\n json_file_or_str : Union[str, Path]\n Pathlike string or Path that points to the json file or string containing json\n\n Returns\n -------\n Context\n \"\"\"\n json_str = json_file_or_str\n\n # check if json_str is pathlike\n if (json_file := Path(json_file_or_str)).exists():\n json_str = json_file.read_text(encoding=\"utf-8\")\n\n json_dict = jsonpickle.loads(json_str)\n return cls.from_dict(json_dict)\n
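As an illustration, from_json accepts either a path to a json file or a json string directly:
context = Context.from_json('{\"a\": {\"b\": \"c\"}}')\nprint(context.get(\"a.b\")) # c\n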
"},{"location":"api_reference/context.html#koheesio.context.Context.from_json--warning","title":"! Warning !","text":"The jsonpickle module is not secure. Only unpickle data you trust. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with. Consider signing data with an HMAC if you need to ensure that it has not been tampered with. Safer deserialization approaches, such as reading JSON directly, may be more appropriate if you are processing untrusted data.
"},{"location":"api_reference/context.html#koheesio.context.Context.from_toml","title":"from_toml classmethod
","text":"from_toml(toml_file_or_str: Union[str, Path]) -> Context\n
Creates Context object from a given toml file
Parameters:
Name Type Description Default toml_file_or_str
Union[str, Path]
Pathlike string or Path that points to the toml file or string containing toml
required Returns:
Type Description Context
Source code in src/koheesio/context.py
@classmethod\ndef from_toml(cls, toml_file_or_str: Union[str, Path]) -> Context:\n \"\"\"Creates Context object from a given toml file\n\n Parameters\n ----------\n toml_file_or_str: Union[str, Path]\n Pathlike string or Path that points to the toml file or string containing toml\n\n Returns\n -------\n Context\n \"\"\"\n toml_str = toml_file_or_str\n\n # check if toml_str is pathlike\n if (toml_file := Path(toml_file_or_str)).exists():\n toml_str = toml_file.read_text(encoding=\"utf-8\")\n\n toml_dict = tomli.loads(toml_str)\n return cls.from_dict(toml_dict)\n
"},{"location":"api_reference/context.html#koheesio.context.Context.from_yaml","title":"from_yaml classmethod
","text":"from_yaml(yaml_file_or_str: str) -> Context\n
Creates Context object from a given yaml file
Parameters:
Name Type Description Default yaml_file_or_str
str
Pathlike string or Path that points to the yaml file, or string containing yaml
required Returns:
Type Description Context
Source code in src/koheesio/context.py
@classmethod\ndef from_yaml(cls, yaml_file_or_str: str) -> Context:\n \"\"\"Creates Context object from a given yaml file\n\n Parameters\n ----------\n yaml_file_or_str: str or Path\n Pathlike string or Path that points to the yaml file, or string containing yaml\n\n Returns\n -------\n Context\n \"\"\"\n yaml_str = yaml_file_or_str\n\n # check if yaml_str is pathlike\n if (yaml_file := Path(yaml_file_or_str)).exists():\n yaml_str = yaml_file.read_text(encoding=\"utf-8\")\n\n # Bandit: disable yaml.load warning\n yaml_dict = yaml.load(yaml_str, Loader=yaml.Loader) # nosec B506: yaml_load\n\n return cls.from_dict(yaml_dict)\n
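As an illustration, from_yaml likewise accepts a path or a yaml string:
yaml_str = \"\"\"\ndb:\n  host: localhost\n  port: 5432\n\"\"\"\ncontext = Context.from_yaml(yaml_str)\nprint(context.get(\"db.port\")) # 5432\n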
"},{"location":"api_reference/context.html#koheesio.context.Context.get","title":"get","text":"get(key: str, default: Any = None, safe: bool = True) -> Any\n
Get value of a given key
The key can either be an actual key (top level) or the key of a nested value. Behaves a lot like a dict's .get()
method otherwise.
Parameters:
Name Type Description Default key
str
Can be a real key, or can be a dotted notation of a nested key
required default
Any
Default value to return
None
safe
bool
Toggles whether to fail or not when item cannot be found
True
Returns:
Type Description Any
Value of the requested item
Example Example of a nested call:
context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\ncontext.get(\"a.b\")\n
Returns c
Source code in src/koheesio/context.py
def get(self, key: str, default: Any = None, safe: bool = True) -> Any:\n \"\"\"Get value of a given key\n\n The key can either be an actual key (top level) or the key of a nested value.\n Behaves a lot like a dict's `.get()` method otherwise.\n\n Parameters\n ----------\n key:\n Can be a real key, or can be a dotted notation of a nested key\n default:\n Default value to return\n safe:\n Toggles whether to fail or not when item cannot be found\n\n Returns\n -------\n Any\n Value of the requested item\n\n Example\n -------\n Example of a nested call:\n\n ```python\n context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\n context.get(\"a.b\")\n ```\n\n Returns `c`\n \"\"\"\n try:\n if \".\" not in key:\n return self.__dict__[key]\n\n # handle nested keys\n nested_keys = key.split(\".\")\n value = self # parent object\n for k in nested_keys:\n value = value[k] # iterate through nested values\n return value\n\n except (AttributeError, KeyError, TypeError) as e:\n if not safe:\n raise KeyError(f\"requested key '{key}' does not exist in {self}\") from e\n return default\n
"},{"location":"api_reference/context.html#koheesio.context.Context.get_all","title":"get_all","text":"get_all() -> dict\n
alias to to_dict()
Source code in src/koheesio/context.py
def get_all(self) -> dict:\n \"\"\"alias to to_dict()\"\"\"\n return self.to_dict()\n
"},{"location":"api_reference/context.html#koheesio.context.Context.get_item","title":"get_item","text":"get_item(key: str, default: Any = None, safe: bool = True) -> Dict[str, Any]\n
Acts just like .get
, except that it returns the key also
Returns:
Type Description Dict[str, Any]
key/value-pair of the requested item
Example Example of a nested call:
context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\ncontext.get_item(\"a.b\")\n
Returns {'a.b': 'c'}
Source code in src/koheesio/context.py
def get_item(self, key: str, default: Any = None, safe: bool = True) -> Dict[str, Any]:\n \"\"\"Acts just like `.get`, except that it returns the key also\n\n Returns\n -------\n Dict[str, Any]\n key/value-pair of the requested item\n\n Example\n -------\n Example of a nested call:\n\n ```python\n context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\n context.get_item(\"a.b\")\n ```\n\n Returns `{'a.b': 'c'}`\n \"\"\"\n value = self.get(key, default, safe)\n return {key: value}\n
"},{"location":"api_reference/context.html#koheesio.context.Context.merge","title":"merge","text":"merge(context: Context, recursive: bool = False) -> Context\n
Merge this context with the context of another, where the incoming context has priority.
Parameters:
Name Type Description Default context
Context
Another Context class
required recursive
bool
Recursively merge two dictionaries to an arbitrary depth
False
Returns:
Type Description Context
updated context
Source code in src/koheesio/context.py
def merge(self, context: Context, recursive: bool = False) -> Context:\n \"\"\"Merge this context with the context of another, where the incoming context has priority.\n\n Parameters\n ----------\n context: Context\n Another Context class\n recursive: bool\n Recursively merge two dictionaries to an arbitrary depth\n\n Returns\n -------\n Context\n updated context\n \"\"\"\n if recursive:\n return Context.from_dict(self._recursive_merge(target_context=self, merge_context=context).to_dict())\n\n # just merge on the top level keys\n return Context.from_dict({**self.to_dict(), **context.to_dict()})\n
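A short sketch of merging two contexts, where the incoming context wins on conflicts and recursive=True also merges nested keys:
base = Context({\"db\": {\"host\": \"localhost\", \"port\": 5432}})\noverride = Context({\"db\": {\"host\": \"db.prod.internal\"}})\nmerged = base.merge(override, recursive=True)\nprint(merged.get(\"db.host\")) # db.prod.internal (incoming context has priority)\nprint(merged.get(\"db.port\")) # 5432 is kept because the merge is recursive\n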
"},{"location":"api_reference/context.html#koheesio.context.Context.process_value","title":"process_value","text":"process_value(value: Any) -> Any\n
Processes the given value, converting dictionaries to Context objects as needed.
Source code in src/koheesio/context.py
def process_value(self, value: Any) -> Any:\n \"\"\"Processes the given value, converting dictionaries to Context objects as needed.\"\"\"\n if isinstance(value, dict):\n return self.from_dict(value)\n\n if isinstance(value, (list, set)):\n return [self.from_dict(v) if isinstance(v, dict) else v for v in value]\n\n return value\n
"},{"location":"api_reference/context.html#koheesio.context.Context.to_dict","title":"to_dict","text":"to_dict() -> Dict[str, Any]\n
Returns all parameters of the context as a dict
Returns:
Type Description dict
containing all parameters of the context
Source code in src/koheesio/context.py
def to_dict(self) -> Dict[str, Any]:\n \"\"\"Returns all parameters of the context as a dict\n\n Returns\n -------\n dict\n containing all parameters of the context\n \"\"\"\n result = {}\n\n for key, value in self.__dict__.items():\n if isinstance(value, Context):\n result[key] = value.to_dict()\n elif isinstance(value, list):\n result[key] = [e.to_dict() if isinstance(e, Context) else e for e in value]\n else:\n result[key] = value\n\n return result\n
"},{"location":"api_reference/context.html#koheesio.context.Context.to_json","title":"to_json","text":"to_json(pretty: bool = False) -> str\n
Returns all parameters of the context as a json string
Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be stored in the Context object, which is not possible with the standard json library.
Why jsonpickle? (from https://jsonpickle.github.io/)
Data serialized with python's pickle (or cPickle or dill) is not easily readable outside of python. Using the json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports json.
Parameters:
Name Type Description Default pretty
bool
Toggles whether to return a pretty json string or not
False
Returns:
Type Description str
containing all parameters of the context
Source code in src/koheesio/context.py
def to_json(self, pretty: bool = False) -> str:\n \"\"\"Returns all parameters of the context as a json string\n\n Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be\n stored in the Context object, which is not possible with the standard json library.\n\n Why jsonpickle?\n ---------------\n (from https://jsonpickle.github.io/)\n\n > Data serialized with python's pickle (or cPickle or dill) is not easily readable outside of python. Using the\n json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex\n data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports\n json.\n\n Parameters\n ----------\n pretty : bool, optional, default=False\n Toggles whether to return a pretty json string or not\n\n Returns\n -------\n str\n containing all parameters of the context\n \"\"\"\n d = self.to_dict()\n return jsonpickle.dumps(d, indent=4) if pretty else jsonpickle.dumps(d)\n
"},{"location":"api_reference/context.html#koheesio.context.Context.to_yaml","title":"to_yaml","text":"to_yaml(clean: bool = False) -> str\n
Returns all parameters of the context as a yaml string
Parameters:
Name Type Description Default clean
bool
Toggles whether to remove !!python/object:...
from yaml or not. Default: False
False
Returns:
Type Description str
containing all parameters of the context
Source code in src/koheesio/context.py
def to_yaml(self, clean: bool = False) -> str:\n \"\"\"Returns all parameters of the context as a yaml string\n\n Parameters\n ----------\n clean: bool\n Toggles whether to remove `!!python/object:...` from yaml or not.\n Default: False\n\n Returns\n -------\n str\n containing all parameters of the context\n \"\"\"\n # sort_keys=False to preserve order of keys\n yaml_str = yaml.dump(self.to_dict(), sort_keys=False)\n\n # remove `!!python/object:...` from yaml\n if clean:\n remove_pattern = re.compile(r\"!!python/object:.*?\\n\")\n yaml_str = re.sub(remove_pattern, \"\\n\", yaml_str)\n\n return yaml_str\n
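For illustration, dumping a Context to yaml, optionally stripping the !!python/object tags that complex values can produce:
context = Context({\"name\": \"my_pipeline\", \"settings\": {\"retries\": 3}})\nprint(context.to_yaml()) # yaml string, keys in insertion order\nprint(context.to_yaml(clean=True)) # same, with any !!python/object:... tags removed\n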
"},{"location":"api_reference/intro_api.html","title":"Intro api","text":""},{"location":"api_reference/intro_api.html#api-reference","title":"API Reference","text":"You can navigate the API by clicking on the modules listed on the left to access the documentation.
"},{"location":"api_reference/logger.html","title":"Logger","text":"Loggers are used to log messages from your application.
For a comprehensive guide on the usage, examples, and additional features of the logging classes, please refer to the reference/concepts/logging section of the Koheesio documentation.
Classes:
Name Description LoggingFactory
Logging factory to be used to generate logger instances.
Masked
Represents a masked value.
MaskedString
Represents a masked string value.
MaskedInt
Represents a masked integer value.
MaskedFloat
Represents a masked float value.
MaskedDict
Represents a masked dictionary value.
LoggerIDFilter
Filter which injects run_id information into the log.
Functions:
Name Description warn
Issue a warning.
"},{"location":"api_reference/logger.html#koheesio.logger.T","title":"koheesio.logger.T module-attribute
","text":"T = TypeVar('T')\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggerIDFilter","title":"koheesio.logger.LoggerIDFilter","text":"Filter which injects run_id information into the log.
"},{"location":"api_reference/logger.html#koheesio.logger.LoggerIDFilter.LOGGER_ID","title":"LOGGER_ID class-attribute
instance-attribute
","text":"LOGGER_ID: str = str(uuid4())\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggerIDFilter.filter","title":"filter","text":"filter(record)\n
Source code in src/koheesio/logger.py
def filter(self, record):\n record.logger_id = LoggerIDFilter.LOGGER_ID\n\n return True\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory","title":"koheesio.logger.LoggingFactory","text":"LoggingFactory(name: Optional[str] = None, env: Optional[str] = None, level: Optional[str] = None, logger_id: Optional[str] = None)\n
Logging factory to be used to generate logger instances.
Parameters:
Name Type Description Default name
Optional[str]
None
env
Optional[str]
None
logger_id
Optional[str]
None
Source code in src/koheesio/logger.py
def __init__(\n self,\n name: Optional[str] = None,\n env: Optional[str] = None,\n level: Optional[str] = None,\n logger_id: Optional[str] = None,\n):\n \"\"\"Logging factory to be used in pipeline. Prepare logger instance.\n\n Parameters\n ----------\n name logger name.\n env environment (\"local\", \"qa\", \"prod\").\n level logging level.\n logger_id unique identifier for the logger.\n \"\"\"\n\n LoggingFactory.LOGGER_NAME = name or LoggingFactory.LOGGER_NAME\n LoggerIDFilter.LOGGER_ID = logger_id or LoggerIDFilter.LOGGER_ID\n LoggingFactory.LOGGER_FILTER = LoggingFactory.LOGGER_FILTER or LoggerIDFilter()\n LoggingFactory.ENV = env or LoggingFactory.ENV\n\n console_handler = logging.StreamHandler(sys.stdout if LoggingFactory.ENV == \"local\" else sys.stderr)\n console_handler.setFormatter(LoggingFactory.LOGGER_FORMATTER)\n console_handler.addFilter(LoggingFactory.LOGGER_FILTER)\n # WARNING is default level for root logger in python\n logging.basicConfig(level=logging.WARNING, handlers=[console_handler], force=True)\n\n LoggingFactory.CONSOLE_HANDLER = console_handler\n\n logger = getLogger(LoggingFactory.LOGGER_NAME)\n logger.setLevel(level or LoggingFactory.LOGGER_LEVEL)\n LoggingFactory.LOGGER = logger\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.CONSOLE_HANDLER","title":"CONSOLE_HANDLER class-attribute
instance-attribute
","text":"CONSOLE_HANDLER: Optional[Handler] = None\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.ENV","title":"ENV class-attribute
instance-attribute
","text":"ENV: Optional[str] = None\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER","title":"LOGGER class-attribute
instance-attribute
","text":"LOGGER: Optional[Logger] = None\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_ENV","title":"LOGGER_ENV class-attribute
instance-attribute
","text":"LOGGER_ENV: str = 'local'\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_FILTER","title":"LOGGER_FILTER class-attribute
instance-attribute
","text":"LOGGER_FILTER: Optional[Filter] = None\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_FORMAT","title":"LOGGER_FORMAT class-attribute
instance-attribute
","text":"LOGGER_FORMAT: str = '[%(logger_id)s] [%(asctime)s] [%(levelname)s] [%(name)s] {%(module)s.py:%(funcName)s:%(lineno)d} - %(message)s'\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_FORMATTER","title":"LOGGER_FORMATTER class-attribute
instance-attribute
","text":"LOGGER_FORMATTER: Formatter = Formatter(LOGGER_FORMAT)\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_LEVEL","title":"LOGGER_LEVEL class-attribute
instance-attribute
","text":"LOGGER_LEVEL: str = get('KOHEESIO_LOGGING_LEVEL', 'WARNING')\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_NAME","title":"LOGGER_NAME class-attribute
instance-attribute
","text":"LOGGER_NAME: str = 'koheesio'\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.add_handlers","title":"add_handlers staticmethod
","text":"add_handlers(handlers: List[Tuple[str, Dict]]) -> None\n
Add handlers to existing root logger.
Parameters:
Name Type Description Default handlers
List[Tuple[str, Dict]] List of tuples, each pairing a handler module and class path with its configuration dict.
required Source code in src/koheesio/logger.py
@staticmethod\ndef add_handlers(handlers: List[Tuple[str, Dict]]) -> None:\n \"\"\"Add handlers to existing root logger.\n\n Parameters\n ----------\n handler_class handler module and class for importing.\n handlers_config configuration for handler.\n\n \"\"\"\n for handler_module_class, handler_conf in handlers:\n handler_class: logging.Handler = import_class(handler_module_class)\n handler_level = handler_conf.pop(\"level\") if \"level\" in handler_conf else \"WARNING\"\n # noinspection PyCallingNonCallable\n handler = handler_class(**handler_conf)\n handler.setLevel(handler_level)\n handler.addFilter(LoggingFactory.LOGGER_FILTER)\n handler.setFormatter(LoggingFactory.LOGGER_FORMATTER)\n LoggingFactory.LOGGER.addHandler(handler)\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.get_logger","title":"get_logger staticmethod
","text":"get_logger(name: str, inherit_from_koheesio: bool = False) -> Logger\n
Provide logger. If inherit_from_koheesio then inherit from LoggingFactory.PIPELINE_LOGGER_NAME.
Parameters:
Name Type Description Default name
str
required inherit_from_koheesio
bool
False
Returns:
Name Type Description logger
Logger
Source code in src/koheesio/logger.py
@staticmethod\ndef get_logger(name: str, inherit_from_koheesio: bool = False) -> Logger:\n \"\"\"Provide logger. If inherit_from_koheesio then inherit from LoggingFactory.PIPELINE_LOGGER_NAME.\n\n Parameters\n ----------\n name: Name of logger.\n inherit_from_koheesio: Inherit logger from koheesio\n\n Returns\n -------\n logger: Logger\n\n \"\"\"\n if inherit_from_koheesio:\n LoggingFactory.__check_koheesio_logger_initialized()\n name = f\"{LoggingFactory.LOGGER_NAME}.{name}\"\n\n return getLogger(name)\n
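A minimal usage sketch based on the signatures above (the names and values shown are illustrative):
LoggingFactory(name=\"my_pipeline\", env=\"local\", level=\"INFO\")\nlogger = LoggingFactory.get_logger(name=\"ingest\", inherit_from_koheesio=True)\nlogger.info(\"hello\") # emitted through the 'my_pipeline.ingest' logger using the configured handler and format\n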
"},{"location":"api_reference/logger.html#koheesio.logger.Masked","title":"koheesio.logger.Masked","text":"Masked(value: T)\n
Represents a masked value.
Parameters:
Name Type Description Default value
T
The value to be masked.
required Attributes:
Name Type Description _value
T
The original value.
Methods:
Name Description __repr__
Returns a string representation of the masked value.
__str__
Returns a string representation of the masked value.
__get_validators__
Returns a generator of validators for the masked value.
validate
Validates the masked value.
Source code in src/koheesio/logger.py
def __init__(self, value: T):\n self._value = value\n
"},{"location":"api_reference/logger.html#koheesio.logger.Masked.validate","title":"validate classmethod
","text":"validate(v: Any, _values)\n
Validate the input value and return an instance of the class.
Parameters:
Name Type Description Default v
Any
The input value to validate.
required _values
Any
Additional values used for validation.
required Returns:
Name Type Description instance
cls
An instance of the class.
Source code in src/koheesio/logger.py
@classmethod\ndef validate(cls, v: Any, _values):\n \"\"\"\n Validate the input value and return an instance of the class.\n\n Parameters\n ----------\n v : Any\n The input value to validate.\n _values : Any\n Additional values used for validation.\n\n Returns\n -------\n instance : cls\n An instance of the class.\n\n \"\"\"\n return cls(v)\n
"},{"location":"api_reference/logger.html#koheesio.logger.MaskedDict","title":"koheesio.logger.MaskedDict","text":"MaskedDict(value: T)\n
Represents a masked dictionary value.
Source code in src/koheesio/logger.py
def __init__(self, value: T):\n self._value = value\n
"},{"location":"api_reference/logger.html#koheesio.logger.MaskedFloat","title":"koheesio.logger.MaskedFloat","text":"MaskedFloat(value: T)\n
Represents a masked float value.
Source code in src/koheesio/logger.py
def __init__(self, value: T):\n self._value = value\n
"},{"location":"api_reference/logger.html#koheesio.logger.MaskedInt","title":"koheesio.logger.MaskedInt","text":"MaskedInt(value: T)\n
Represents a masked integer value.
Source code in src/koheesio/logger.py
def __init__(self, value: T):\n self._value = value\n
"},{"location":"api_reference/logger.html#koheesio.logger.MaskedString","title":"koheesio.logger.MaskedString","text":"MaskedString(value: T)\n
Represents a masked string value.
Source code in src/koheesio/logger.py
def __init__(self, value: T):\n self._value = value\n
"},{"location":"api_reference/utils.html","title":"Utils","text":"Utility functions
"},{"location":"api_reference/utils.html#koheesio.utils.convert_str_to_bool","title":"koheesio.utils.convert_str_to_bool","text":"convert_str_to_bool(value) -> Any\n
Converts a string to a boolean if the string is either 'true' or 'false'
Source code in src/koheesio/utils.py
def convert_str_to_bool(value) -> Any:\n \"\"\"Converts a string to a boolean if the string is either 'true' or 'false'\"\"\"\n if isinstance(value, str) and (v := value.lower()) in [\"true\", \"false\"]:\n value = v == \"true\"\n return value\n
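For illustration:
convert_str_to_bool(\"True\") # True (matching is case-insensitive)\nconvert_str_to_bool(\"false\") # False\nconvert_str_to_bool(\"yes\") # returned unchanged: 'yes'\n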
"},{"location":"api_reference/utils.html#koheesio.utils.get_args_for_func","title":"koheesio.utils.get_args_for_func","text":"get_args_for_func(func: Callable, params: Dict) -> Tuple[Callable, Dict[str, Any]]\n
Helper function that matches keyword arguments (params) on a given function
This function uses inspect to extract the signature on the passed Callable, and then uses functools.partial to construct a new Callable (partial) function on which the input was mapped.
Example input_dict = {\"a\": \"foo\", \"b\": \"bar\"}\n\n\ndef example_func(a: str):\n return a\n\n\nfunc, kwargs = get_args_for_func(example_func, input_dict)\n
In this example, - func
would be a callable with the input mapped toward it (i.e. can be called like any normal function) - kwargs
would be a dict holding just the output needed to be able to run the function (e.g. {\"a\": \"foo\"})
Parameters:
Name Type Description Default func
Callable
The function to inspect
required params
Dict
Dictionary with keyword values that will be mapped on the 'func'
required Returns:
Type Description Tuple[Callable, Dict[str, Any]]
- Callable a partial() func with the found keyword values mapped toward it
- Dict[str, Any] the keyword args that match the func
Source code in src/koheesio/utils.py
def get_args_for_func(func: Callable, params: Dict) -> Tuple[Callable, Dict[str, Any]]:\n \"\"\"Helper function that matches keyword arguments (params) on a given function\n\n This function uses inspect to extract the signature on the passed Callable, and then uses functools.partial to\n construct a new Callable (partial) function on which the input was mapped.\n\n Example\n -------\n ```python\n input_dict = {\"a\": \"foo\", \"b\": \"bar\"}\n\n\n def example_func(a: str):\n return a\n\n\n func, kwargs = get_args_for_func(example_func, input_dict)\n ```\n\n In this example,\n - `func` would be a callable with the input mapped toward it (i.e. can be called like any normal function)\n - `kwargs` would be a dict holding just the output needed to be able to run the function (e.g. {\"a\": \"foo\"})\n\n Parameters\n ----------\n func: Callable\n The function to inspect\n params: Dict\n Dictionary with keyword values that will be mapped on the 'func'\n\n Returns\n -------\n Tuple[Callable, Dict[str, Any]]\n - Callable\n a partial() func with the found keyword values mapped toward it\n - Dict[str, Any]\n the keyword args that match the func\n \"\"\"\n _kwargs = {k: v for k, v in params.items() if k in inspect.getfullargspec(func).args}\n return (\n partial(func, **_kwargs),\n _kwargs,\n )\n
"},{"location":"api_reference/utils.html#koheesio.utils.get_project_root","title":"koheesio.utils.get_project_root","text":"get_project_root() -> Path\n
Returns project root path.
Source code in src/koheesio/utils.py
def get_project_root() -> Path:\n \"\"\"Returns project root path.\"\"\"\n cmd = Path(__file__)\n return Path([i for i in cmd.parents if i.as_uri().endswith(\"src\")][0]).parent\n
"},{"location":"api_reference/utils.html#koheesio.utils.get_random_string","title":"koheesio.utils.get_random_string","text":"get_random_string(length: int = 64, prefix: Optional[str] = None) -> str\n
Generate a random string of specified length
Source code in src/koheesio/utils.py
def get_random_string(length: int = 64, prefix: Optional[str] = None) -> str:\n \"\"\"Generate a random string of specified length\"\"\"\n if prefix:\n return f\"{prefix}_{uuid.uuid4().hex}\"[0:length]\n return f\"{uuid.uuid4().hex}\"[0:length]\n
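For illustration (output values are random, shown here as examples only):
get_random_string(length=8) # e.g. 'd41d8cd9'\nget_random_string(length=12, prefix=\"tmp\") # e.g. 'tmp_9e107d9d'\n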
"},{"location":"api_reference/utils.html#koheesio.utils.import_class","title":"koheesio.utils.import_class","text":"import_class(module_class: str) -> Any\n
Import class and module based on provided string.
Parameters:
Name Type Description Default module_class
str
required Returns:
Type Description object Class from specified input string.
Source code in src/koheesio/utils.py
def import_class(module_class: str) -> Any:\n \"\"\"Import class and module based on provided string.\n\n Parameters\n ----------\n module_class module+class to be imported.\n\n Returns\n -------\n object Class from specified input string.\n\n \"\"\"\n module_path, class_name = module_class.rsplit(\".\", 1)\n module = import_module(module_path)\n\n return getattr(module, class_name)\n
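For illustration, the dotted path is split into a module and a class name, which is then imported and returned:
handler_class = import_class(\"logging.handlers.RotatingFileHandler\")\nhandler = handler_class(\"app.log\", maxBytes=1024, backupCount=1) # instantiate the imported class\n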
"},{"location":"api_reference/asyncio/index.html","title":"Asyncio","text":"This module provides classes for asynchronous steps in the koheesio package.
"},{"location":"api_reference/asyncio/index.html#koheesio.asyncio.AsyncStep","title":"koheesio.asyncio.AsyncStep","text":"Asynchronous step class that inherits from Step and uses the AsyncStepMetaClass metaclass.
Attributes:
Name Type Description Output
AsyncStepOutput
The output class for the asynchronous step.
"},{"location":"api_reference/asyncio/index.html#koheesio.asyncio.AsyncStep.Output","title":"Output","text":"Output class for asyncio step.
This class represents the output of the asyncio step. It inherits from the AsyncStepOutput class.
"},{"location":"api_reference/asyncio/index.html#koheesio.asyncio.AsyncStepMetaClass","title":"koheesio.asyncio.AsyncStepMetaClass","text":"Metaclass for asynchronous steps.
This metaclass is used to define asynchronous steps in the Koheesio framework. It inherits from the StepMetaClass and provides additional functionality for executing asynchronous steps.
Attributes: None
Methods: _execute_wrapper: Wrapper method for executing asynchronous steps.
"},{"location":"api_reference/asyncio/index.html#koheesio.asyncio.AsyncStepOutput","title":"koheesio.asyncio.AsyncStepOutput","text":"Represents the output of an asynchronous step.
This class extends the base Step.Output
class and provides additional functionality for merging key-value maps.
Attributes:
Name Type Description ...
Methods:
Name Description merge
Merge key-value map with self.
"},{"location":"api_reference/asyncio/index.html#koheesio.asyncio.AsyncStepOutput.merge","title":"merge","text":"merge(other: Union[Dict, StepOutput])\n
Merge key,value map with self
Examples:
step_output = StepOutput(foo=\"bar\")\nstep_output.merge({\"lorem\": \"ipsum\"}) # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n
Functionally similar to adding two dicts together; like running {**dict_a, **dict_b}
.
Parameters:
Name Type Description Default other
Union[Dict, StepOutput]
Dict or another instance of a StepOutputs class that will be added to self
required Source code in src/koheesio/asyncio/__init__.py
def merge(self, other: Union[Dict, StepOutput]):\n \"\"\"Merge key,value map with self\n\n Examples\n --------\n ```python\n step_output = StepOutput(foo=\"bar\")\n step_output.merge({\"lorem\": \"ipsum\"}) # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n ```\n\n Functionally similar to adding two dicts together; like running `{**dict_a, **dict_b}`.\n\n Parameters\n ----------\n other: Union[Dict, StepOutput]\n Dict or another instance of a StepOutputs class that will be added to self\n \"\"\"\n if isinstance(other, StepOutput):\n other = other.model_dump() # ensures we really have a dict\n\n if not iscoroutine(other):\n for k, v in other.items():\n self.set(k, v)\n\n return self\n
"},{"location":"api_reference/asyncio/http.html","title":"Http","text":"This module contains async implementation of HTTP step.
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpGetStep","title":"koheesio.asyncio.http.AsyncHttpGetStep","text":"Represents an asynchronous HTTP GET step.
This class inherits from the AsyncHttpStep class and specifies the HTTP method as GET.
Attributes: method (HttpMethod): The HTTP method for the step, set to HttpMethod.GET.
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpGetStep.method","title":"method class-attribute
instance-attribute
","text":"method: HttpMethod = GET\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep","title":"koheesio.asyncio.http.AsyncHttpStep","text":"Asynchronous HTTP step for making HTTP requests using aiohttp.
Parameters:
Name Type Description Default client_session
Optional[ClientSession]
Aiohttp ClientSession.
required url
List[URL]
List of yarl.URL.
required retry_options
Optional[RetryOptionsBase]
Retry options for the request.
required connector
Optional[BaseConnector]
Connector for the aiohttp request.
required headers
Optional[Dict[str, Union[str, SecretStr]]]
Request headers.
required Output responses_urls : Optional[List[Tuple[Dict[str, Any], yarl.URL]]] List of responses from the API and request URL.
Examples:
>>> import asyncio\n>>> from aiohttp import ClientSession\n>>> from aiohttp.connector import TCPConnector\n>>> from aiohttp_retry import ExponentialRetry\n>>> from koheesio.steps.async.http import AsyncHttpStep\n>>> from yarl import URL\n>>> from typing import Dict, Any, Union, List, Tuple\n>>>\n>>> # Initialize the AsyncHttpStep\n>>> async def main():\n>>> session = ClientSession()\n>>> urls = [URL('https://example.com/api/1'), URL('https://example.com/api/2')]\n>>> retry_options = ExponentialRetry()\n>>> connector = TCPConnector(limit=10)\n>>> headers = {'Content-Type': 'application/json'}\n>>> step = AsyncHttpStep(\n>>> client_session=session,\n>>> url=urls,\n>>> retry_options=retry_options,\n>>> connector=connector,\n>>> headers=headers\n>>> )\n>>>\n>>> # Execute the step\n>>> responses_urls= await step.get()\n>>>\n>>> return responses_urls\n>>>\n>>> # Run the main function\n>>> responses_urls = asyncio.run(main())\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.client_session","title":"client_session class-attribute
instance-attribute
","text":"client_session: Optional[ClientSession] = Field(default=None, description='Aiohttp ClientSession', exclude=True)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.connector","title":"connector class-attribute
instance-attribute
","text":"connector: Optional[BaseConnector] = Field(default=None, description='Connector for the aiohttp request', exclude=True)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.headers","title":"headers class-attribute
instance-attribute
","text":"headers: Dict[str, Union[str, SecretStr]] = Field(default_factory=dict, description='Request headers', alias='header', exclude=True)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.method","title":"method class-attribute
instance-attribute
","text":"method: Union[str, HttpMethod] = Field(default=GET, description=\"What type of Http call to perform. One of 'get', 'post', 'put', 'delete'. Defaults to 'get'.\")\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.retry_options","title":"retry_options class-attribute
instance-attribute
","text":"retry_options: Optional[RetryOptionsBase] = Field(default=None, description='Retry options for the request', exclude=True)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.timeout","title":"timeout class-attribute
instance-attribute
","text":"timeout: None = Field(default=None, description='[Optional] Request timeout')\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.url","title":"url class-attribute
instance-attribute
","text":"url: List[URL] = Field(default=None, alias='urls', description='Expecting list, as there is no value in executing async request for one value.\\n yarl.URL is preferable, because params/data can be injected into URL instance', exclude=True)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.Output","title":"Output","text":"Output class for Step
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.Output.responses_urls","title":"responses_urls class-attribute
instance-attribute
","text":"responses_urls: Optional[List[Tuple[Dict[str, Any], URL]]] = Field(default=None, description='List of responses from the API and request URL', repr=False)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.delete","title":"delete async
","text":"delete() -> List[Tuple[Dict[str, Any], URL]]\n
Make DELETE requests.
Returns:
Type Description List[Tuple[Dict[str, Any], URL]]
A list of response data and corresponding request URLs.
Source code in src/koheesio/asyncio/http.py
async def delete(self) -> List[Tuple[Dict[str, Any], yarl.URL]]:\n \"\"\"\n Make DELETE requests.\n\n Returns\n -------\n List[Tuple[Dict[str, Any], yarl.URL]]\n A list of response data and corresponding request URLs.\n \"\"\"\n tasks = self.__tasks_generator(method=HttpMethod.DELETE)\n responses_urls = await self._execute(tasks=tasks)\n\n return responses_urls\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.execute","title":"execute","text":"execute() -> Output\n
Execute the step.
Raises:
Type Description ValueError
If the specified HTTP method is not implemented in AsyncHttpStep.
Source code in src/koheesio/asyncio/http.py
def execute(self) -> AsyncHttpStep.Output:\n \"\"\"\n Execute the step.\n\n Raises\n ------\n ValueError\n If the specified HTTP method is not implemented in AsyncHttpStep.\n \"\"\"\n # By design asyncio does not allow its event loop to be nested. This presents a practical problem:\n # When in an environment where the event loop is already running\n # it\u2019s impossible to run tasks and wait for the result.\n # Trying to do so will give the error \u201cRuntimeError: This event loop is already running\u201d.\n # The issue pops up in various environments, such as web servers, GUI applications and in\n # Jupyter/DataBricks notebooks.\n nest_asyncio.apply()\n\n map_method_func = {\n HttpMethod.GET: self.get,\n HttpMethod.POST: self.post,\n HttpMethod.PUT: self.put,\n HttpMethod.DELETE: self.delete,\n }\n\n if self.method not in map_method_func:\n raise ValueError(f\"Method {self.method} not implemented in AsyncHttpStep.\")\n\n self.output.responses_urls = asyncio.run(map_method_func[self.method]())\n\n return self.output\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.get","title":"get async
","text":"get() -> List[Tuple[Dict[str, Any], URL]]\n
Make GET requests.
Returns:
Type Description List[Tuple[Dict[str, Any], URL]]
A list of response data and corresponding request URLs.
Source code in src/koheesio/asyncio/http.py
async def get(self) -> List[Tuple[Dict[str, Any], yarl.URL]]:\n \"\"\"\n Make GET requests.\n\n Returns\n -------\n List[Tuple[Dict[str, Any], yarl.URL]]\n A list of response data and corresponding request URLs.\n \"\"\"\n tasks = self.__tasks_generator(method=HttpMethod.GET)\n responses_urls = await self._execute(tasks=tasks)\n\n return responses_urls\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.get_headers","title":"get_headers","text":"get_headers()\n
Get the request headers.
Returns:
Type Description Optional[Dict[str, Union[str, SecretStr]]]
The request headers.
Source code in src/koheesio/asyncio/http.py
def get_headers(self):\n \"\"\"\n Get the request headers.\n\n Returns\n -------\n Optional[Dict[str, Union[str, SecretStr]]]\n The request headers.\n \"\"\"\n _headers = None\n\n if self.headers:\n _headers = {k: v.get_secret_value() if isinstance(v, SecretStr) else v for k, v in self.headers.items()}\n\n for k, v in self.headers.items():\n if isinstance(v, SecretStr):\n self.headers[k] = v.get_secret_value()\n\n return _headers or self.headers\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.get_options","title":"get_options","text":"get_options()\n
Get the options of the step.
Source code in src/koheesio/asyncio/http.py
def get_options(self):\n \"\"\"\n Get the options of the step.\n \"\"\"\n warnings.warn(\"get_options is not implemented in AsyncHttpStep.\")\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.post","title":"post async
","text":"post() -> List[Tuple[Dict[str, Any], URL]]\n
Make POST requests.
Returns:
Type Description List[Tuple[Dict[str, Any], URL]]
A list of response data and corresponding request URLs.
Source code in src/koheesio/asyncio/http.py
async def post(self) -> List[Tuple[Dict[str, Any], yarl.URL]]:\n \"\"\"\n Make POST requests.\n\n Returns\n -------\n List[Tuple[Dict[str, Any], yarl.URL]]\n A list of response data and corresponding request URLs.\n \"\"\"\n tasks = self.__tasks_generator(method=HttpMethod.POST)\n responses_urls = await self._execute(tasks=tasks)\n\n return responses_urls\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.put","title":"put async
","text":"put() -> List[Tuple[Dict[str, Any], URL]]\n
Make PUT requests.
Returns:
Type Description List[Tuple[Dict[str, Any], URL]]
A list of response data and corresponding request URLs.
Source code in src/koheesio/asyncio/http.py
async def put(self) -> List[Tuple[Dict[str, Any], yarl.URL]]:\n \"\"\"\n Make PUT requests.\n\n Returns\n -------\n List[Tuple[Dict[str, Any], yarl.URL]]\n A list of response data and corresponding request URLs.\n \"\"\"\n tasks = self.__tasks_generator(method=HttpMethod.PUT)\n responses_urls = await self._execute(tasks=tasks)\n\n return responses_urls\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.request","title":"request async
","text":"request(method: HttpMethod, url: URL, **kwargs) -> Tuple[Dict[str, Any], URL]\n
Make an HTTP request.
Parameters:
Name Type Description Default method
HttpMethod
The HTTP method to use for the request.
required url
URL
The URL to make the request to.
required kwargs
Any
Additional keyword arguments to pass to the request.
{}
Returns:
Type Description Tuple[Dict[str, Any], URL]
A tuple containing the response data and the request URL.
Source code in src/koheesio/asyncio/http.py
async def request(\n self,\n method: HttpMethod,\n url: yarl.URL,\n **kwargs,\n) -> Tuple[Dict[str, Any], yarl.URL]:\n \"\"\"\n Make an HTTP request.\n\n Parameters\n ----------\n method : HttpMethod\n The HTTP method to use for the request.\n url : yarl.URL\n The URL to make the request to.\n kwargs : Any\n Additional keyword arguments to pass to the request.\n\n Returns\n -------\n Tuple[Dict[str, Any], yarl.URL]\n A tuple containing the response data and the request URL.\n \"\"\"\n async with self.__retry_client.request(method=method, url=url, **kwargs) as response:\n res = await response.json()\n\n return (res, response.request_info.url)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.set_outputs","title":"set_outputs","text":"set_outputs(response)\n
Set the outputs of the step.
Parameters:
Name Type Description Default response
Any
The response data.
required Source code in src/koheesio/asyncio/http.py
def set_outputs(self, response):\n \"\"\"\n Set the outputs of the step.\n\n Parameters\n ----------\n response : Any\n The response data.\n \"\"\"\n warnings.warn(\"set outputs is not implemented in AsyncHttpStep.\")\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.validate_timeout","title":"validate_timeout","text":"validate_timeout(timeout)\n
Validate the 'timeout' field.
Parameters:
Name Type Description Default timeout
Any
The value of the 'timeout' field.
required Raises:
Type Description ValueError
If a 'timeout' value is provided, as it is not allowed in AsyncHttpStep.
Source code in src/koheesio/asyncio/http.py
@field_validator(\"timeout\")\ndef validate_timeout(cls, timeout):\n \"\"\"\n Validate the 'data' field.\n\n Parameters\n ----------\n data : Any\n The value of the 'timeout' field.\n\n Raises\n ------\n ValueError\n If 'data' is not allowed in AsyncHttpStep.\n \"\"\"\n if timeout:\n raise ValueError(\"timeout is not allowed in AsyncHttpStep. Provide timeout through retry_options.\")\n
"},{"location":"api_reference/integrations/index.html","title":"Integrations","text":"Nothing to see here, move along.
"},{"location":"api_reference/integrations/box.html","title":"Box","text":"Box Module
This module is used to facilitate various interactions with the Box service. The implementation is based on the functionality available in the Box Python SDK: https://github.com/box/box-python-sdk
Prerequisites - Box Application is created in the developer portal using the JWT auth method (Developer Portal - My Apps - Create)
- Application is authorized for the enterprise (Developer Portal - MyApp - Authorization)
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box","title":"koheesio.integrations.box.Box","text":"Box(**data)\n
Configuration details required for the authentication can be obtained in the Box Developer Portal by generating the Public / Private key pair in \"Application Name -> Configuration -> Add and Manage Public Keys\".
The downloaded JSON file will look like this:
{\n \"boxAppSettings\": {\n \"clientID\": \"client_id\",\n \"clientSecret\": \"client_secret\",\n \"appAuth\": {\n \"publicKeyID\": \"public_key_id\",\n \"privateKey\": \"private_key\",\n \"passphrase\": \"pass_phrase\"\n }\n },\n \"enterpriseID\": \"123456\"\n}\n
This class is used as a base for the rest of the Box integrations; however, it can also be used separately to obtain the Box client, which is created at class initialization. Examples:
b = Box(\n client_id=\"client_id\",\n client_secret=\"client_secret\",\n enterprise_id=\"enterprise_id\",\n jwt_key_id=\"jwt_key_id\",\n rsa_private_key_data=\"rsa_private_key_data\",\n rsa_private_key_passphrase=\"rsa_private_key_passphrase\",\n)\nb.client\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
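Because the field aliases match the keys of that JSON file, the client can also be configured straight from it. A sketch, assuming the downloaded configuration is saved locally as box_config.json (a hypothetical path):
import json\n\nfrom koheesio.integrations.box import Box\n\nwith open(\"box_config.json\") as f:  # hypothetical path to the downloaded JWT config\n    conf = json.load(f)\n\napp = conf[\"boxAppSettings\"]\nb = Box(\n    clientID=app[\"clientID\"],\n    clientSecret=app[\"clientSecret\"],\n    publicKeyID=app[\"appAuth\"][\"publicKeyID\"],\n    privateKey=app[\"appAuth\"][\"privateKey\"],\n    passphrase=app[\"appAuth\"][\"passphrase\"],\n    enterpriseID=conf[\"enterpriseID\"],\n)\nb.client\n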
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.auth_options","title":"auth_options property
","text":"auth_options\n
Get a dictionary of authentication options that can be readily used in the child classes
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.client","title":"client class-attribute
instance-attribute
","text":"client: SkipValidation[Client] = None\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.client_id","title":"client_id class-attribute
instance-attribute
","text":"client_id: Union[SecretStr, SecretBytes] = Field(default=..., alias='clientID', description='Client ID from the Box Developer console.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.client_secret","title":"client_secret class-attribute
instance-attribute
","text":"client_secret: Union[SecretStr, SecretBytes] = Field(default=..., alias='clientSecret', description='Client Secret from the Box Developer console.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.enterprise_id","title":"enterprise_id class-attribute
instance-attribute
","text":"enterprise_id: Union[SecretStr, SecretBytes] = Field(default=..., alias='enterpriseID', description='Enterprise ID from the Box Developer console.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.jwt_key_id","title":"jwt_key_id class-attribute
instance-attribute
","text":"jwt_key_id: Union[SecretStr, SecretBytes] = Field(default=..., alias='publicKeyID', description='PublicKeyID for the public/private generated key pair.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.rsa_private_key_data","title":"rsa_private_key_data class-attribute
instance-attribute
","text":"rsa_private_key_data: Union[SecretStr, SecretBytes] = Field(default=..., alias='privateKey', description='Private key generated in the app management console.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.rsa_private_key_passphrase","title":"rsa_private_key_passphrase class-attribute
instance-attribute
","text":"rsa_private_key_passphrase: Union[SecretStr, SecretBytes] = Field(default=..., alias='passphrase', description='Private key passphrase generated in the app management console.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/integrations/box.py
def execute(self):\n # Plug to be able to unit test ABC\n pass\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.init_client","title":"init_client","text":"init_client()\n
Set up the Box client.
Source code in src/koheesio/integrations/box.py
def init_client(self):\n \"\"\"Set up the Box client.\"\"\"\n if not self.client:\n self.client = Client(JWTAuth(**self.auth_options))\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvFileReader","title":"koheesio.integrations.box.BoxCsvFileReader","text":"BoxCsvFileReader(**data)\n
This class facilitates reading one or multiple CSV files with the same structure directly from Box and producing a Spark DataFrame.
Notes To manually identify the ID of the file in Box, open the file through Web UI, and copy ID from the page URL, e.g. https://foo.ent.box.com/file/1234567890 , where 1234567890 is the ID.
Examples:
from koheesio.steps.integrations.box import BoxCsvFileReader\nfrom pyspark.sql.types import StructType\n\nschema = StructType(...)\nb = BoxCsvFileReader(\n client_id=\"\",\n client_secret=\"\",\n enterprise_id=\"\",\n jwt_key_id=\"\",\n rsa_private_key_data=\"\",\n rsa_private_key_passphrase=\"\",\n file=[\"1\", \"2\"],\n schema=schema,\n).execute()\nb.df.show()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvFileReader.file","title":"file class-attribute
instance-attribute
","text":"file: Union[str, list[str]] = Field(default=..., description='ID or list of IDs for the files to read.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvFileReader.execute","title":"execute","text":"execute()\n
Loop through the list of provided file identifiers and load data into dataframe. For traceability purposes the following columns will be added to the dataframe: * meta_file_id: the identifier of the file on Box * meta_file_name: name of the file
Returns:
Type Description DataFrame
Source code in src/koheesio/integrations/box.py
def execute(self):\n \"\"\"\n Loop through the list of provided file identifiers and load data into dataframe.\n For traceability purposes the following columns will be added to the dataframe:\n * meta_file_id: the identifier of the file on Box\n * meta_file_name: name of the file\n\n Returns\n -------\n DataFrame\n \"\"\"\n df = None\n for f in self.file:\n self.log.debug(f\"Reading contents of file with the ID '{f}' into Spark DataFrame\")\n file = self.client.file(file_id=f)\n data = file.content().decode(\"utf-8\").splitlines()\n rdd = self.spark.sparkContext.parallelize(data)\n temp_df = self.spark.read.csv(rdd, header=True, schema=self.schema_, **self.params)\n temp_df = (\n temp_df\n # fmt: off\n .withColumn(\"meta_file_id\", lit(file.object_id))\n .withColumn(\"meta_file_name\", lit(file.get().name))\n .withColumn(\"meta_load_timestamp\", expr(\"to_utc_timestamp(current_timestamp(), current_timezone())\"))\n # fmt: on\n )\n\n df = temp_df if not df else df.union(temp_df)\n\n self.output.df = df\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvPathReader","title":"koheesio.integrations.box.BoxCsvPathReader","text":"BoxCsvPathReader(**data)\n
Read all CSV files from the specified path into the dataframe. Files can be filtered using the regular expression in the 'filter' parameter. The default behavior is to read all CSV / TXT files from the specified path.
Notes The class does not contain archival capability as it is presumed that the user wants to make sure that the full pipeline is successful (for example, the source data was transformed and saved) prior to moving the source files. Use BoxToBoxFileMove class instead and provide the list of IDs from 'file_id' output.
Examples:
from koheesio.steps.integrations.box import BoxCsvPathReader\n\nauth_params = {...}\nb = BoxCsvPathReader(**auth_params, path=\"foo/bar/\").execute()\nb.df # Spark Dataframe\n... # do something with the dataframe\nfrom koheesio.steps.integrations.box import BoxToBoxFileMove\n\nbm = BoxToBoxFileMove(**auth_params, file=b.file_id, path=\"/foo/bar/archive\")\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvPathReader.filter","title":"filter class-attribute
instance-attribute
","text":"filter: Optional[str] = Field(default='.csv|.txt$', description='[Optional] Regexp to filter folder contents')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvPathReader.path","title":"path class-attribute
instance-attribute
","text":"path: str = Field(default=..., description='Box path')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvPathReader.execute","title":"execute","text":"execute()\n
Identify the list of files from the source Box path that match the desired filter and load them into a DataFrame
Source code in src/koheesio/integrations/box.py
def execute(self):\n \"\"\"\n Identify the list of files from the source Box path that match the desired filter and load them into a DataFrame\n \"\"\"\n folder = BoxFolderGet.from_step(self).execute().folder\n\n # Identify the list of files that should be processed\n files = [item for item in folder.get_items() if item.type == \"file\" and re.search(self.filter, item.name)]\n\n if len(files) > 0:\n self.log.info(\n f\"A total of {len(files)} files that match the filter '{self.filter}' have been detected in {self.path}.\"\n f\" They will be loaded into a Spark DataFrame: {files}\"\n )\n else:\n raise BoxPathIsEmptyError(f\"Path '{self.path}' is empty or none of the files match the filter '{self.filter}'\")\n\n file = [file_id.object_id for file_id in files]\n self.output.df = BoxCsvFileReader.from_step(self, file=file).read()\n self.output.file = file # e.g. if files should be archived after pipeline is successful\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase","title":"koheesio.integrations.box.BoxFileBase","text":"BoxFileBase(**data)\n
Generic class to facilitate interactions with Box files.
The Box SDK provides a File class that has various properties and methods to interact with Box files. The object can be obtained in multiple ways: * provide a Box file identifier to the file
parameter (the identifier can be obtained, for example, from URL) * provide existing object to file
parameter (boxsdk.object.file.File)
Notes Refer to BoxFolderBase for more info about the folder
and path
parameters
See Also boxsdk.object.file.File
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
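Subclasses only need to implement action; execute resolves the folder (from either folder or path) and each file identifier before calling it. A hypothetical sketch (the subclass name and log message are made up for illustration, and auth_params follows the same pattern as the other examples on this page):
from boxsdk.object.file import File\nfrom boxsdk.object.folder import Folder\n\nfrom koheesio.integrations.box import BoxFileBase\n\n\nclass LogFileNames(BoxFileBase):\n    \"\"\"Hypothetical example: log the name of every file resolved by execute.\"\"\"\n\n    def action(self, file: File, folder: Folder):\n        self.log.info(f\"Would process '{file.get().name}' in folder '{folder.get().name}'\")\n\n\nauth_params = {...}\nLogFileNames(**auth_params, file=[\"1\", \"2\"], path=\"/foo/bar\").execute()\n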
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase.files","title":"files class-attribute
instance-attribute
","text":"files: conlist(Union[File, str], min_length=1) = Field(default=..., alias='file', description='List of Box file objects or identifiers')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase.folder","title":"folder class-attribute
instance-attribute
","text":"folder: Optional[Union[Folder, str]] = Field(default=None, description='Existing folder object or folder identifier')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase.path","title":"path class-attribute
instance-attribute
","text":"path: Optional[str] = Field(default=None, description='Path to the Box folder, for example: `folder/sub-folder/lz')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase.action","title":"action","text":"action(file: File, folder: Folder)\n
Abstract class for File level actions.
Source code in src/koheesio/integrations/box.py
def action(self, file: File, folder: Folder):\n \"\"\"\n Abstract class for File level actions.\n \"\"\"\n raise NotImplementedError\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase.execute","title":"execute","text":"execute()\n
Generic execute method for all BoxToBox interactions. Deals with getting the correct folder and file objects from various parameter inputs
Source code in src/koheesio/integrations/box.py
def execute(self):\n \"\"\"\n Generic execute method for all BoxToBox interactions. Deals with getting the correct folder and file objects\n from various parameter inputs\n \"\"\"\n if self.path:\n _folder = BoxFolderGet.from_step(self).execute().folder\n else:\n _folder = self.client.folder(folder_id=self.folder) if isinstance(self.folder, str) else self.folder\n\n for _file in self.files:\n _file = self.client.file(file_id=_file) if isinstance(_file, str) else _file\n self.action(file=_file, folder=_folder)\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter","title":"koheesio.integrations.box.BoxFileWriter","text":"BoxFileWriter(**data)\n
Write file or a file-like object to Box.
Examples:
from koheesio.steps.integrations.box import BoxFileWriter\n\nauth_params = {...}\nf1 = BoxFileWriter(**auth_params, path=\"/foo/bar\", file=\"path/to/my/file.ext\").execute()\n# or\nimport io\n\nb = io.BytesIO(b\"my-sample-data\")\nf2 = BoxFileWriter(**auth_params, path=\"/foo/bar\", file=b, name=\"file.ext\").execute()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.description","title":"description class-attribute
instance-attribute
","text":"description: Optional[str] = Field(None, description='Optional description to add to the file in Box')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.file","title":"file class-attribute
instance-attribute
","text":"file: Union[str, BytesIO] = Field(default=..., description='Path to file or a file-like object')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.file_name","title":"file_name class-attribute
instance-attribute
","text":"file_name: Optional[str] = Field(default=None, description=\"When file path or name is provided to 'file' parameter, this will override the original name.When binary stream is provided, the 'name' should be used to set the desired name for the Box file.\")\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.Output","title":"Output","text":"Output class for BoxFileWriter.
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.Output.file","title":"file class-attribute
instance-attribute
","text":"file: File = Field(default=..., description='File object in Box')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.Output.shared_link","title":"shared_link class-attribute
instance-attribute
","text":"shared_link: str = Field(default=..., description='Shared link for the Box file')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.action","title":"action","text":"action()\n
Source code in src/koheesio/integrations/box.py
def action(self):\n _file = self.file\n _name = self.file_name\n\n if isinstance(_file, str):\n _name = _name if _name else PurePath(_file).name\n with open(_file, \"rb\") as f:\n _file = BytesIO(f.read())\n\n folder: Folder = BoxFolderGet.from_step(self, create_sub_folders=True).execute().folder\n folder.preflight_check(size=0, name=_name)\n\n self.log.info(f\"Uploading file '{_name}' to Box folder '{folder.get().name}'...\")\n _box_file: File = folder.upload_stream(file_stream=_file, file_name=_name, file_description=self.description)\n\n self.output.file = _box_file\n self.output.shared_link = _box_file.get_shared_link()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.execute","title":"execute","text":"execute() -> Output\n
Source code in src/koheesio/integrations/box.py
def execute(self) -> Output:\n self.action()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.validate_name_for_binary_data","title":"validate_name_for_binary_data","text":"validate_name_for_binary_data(values)\n
Validate 'file_name' parameter when providing a binary input for 'file'.
Source code in src/koheesio/integrations/box.py
@model_validator(mode=\"before\")\ndef validate_name_for_binary_data(cls, values):\n \"\"\"Validate 'file_name' parameter when providing a binary input for 'file'.\"\"\"\n file, file_name = values.get(\"file\"), values.get(\"file_name\")\n if not isinstance(file, str) and not file_name:\n raise AttributeError(\"The parameter 'file_name' is mandatory when providing a binary input for 'file'.\")\n\n return values\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase","title":"koheesio.integrations.box.BoxFolderBase","text":"BoxFolderBase(**data)\n
Generic class to facilitate interactions with Box folders.
The Box SDK provides a Folder class that has various properties and methods to interact with Box folders. The object can be obtained in multiple ways: * provide a Box folder identifier to the folder
parameter (the identifier can be obtained, for example, from URL) * provide existing object to folder
parameter (boxsdk.object.folder.Folder) * provide filesystem-like path to path
parameter
See Also boxsdk.object.folder.Folder
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.folder","title":"folder class-attribute
instance-attribute
","text":"folder: Optional[Union[Folder, str]] = Field(default=None, description='Existing folder object or folder identifier')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.path","title":"path class-attribute
instance-attribute
","text":"path: Optional[str] = Field(default=None, description='Path to the Box folder, for example: `folder/sub-folder/lz')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.root","title":"root class-attribute
instance-attribute
","text":"root: Optional[Union[Folder, str]] = Field(default='0', description='Folder object or identifier of the folder that should be used as root')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.Output","title":"Output","text":"Define outputs for the BoxFolderBase class
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.Output.folder","title":"folder class-attribute
instance-attribute
","text":"folder: Optional[Folder] = Field(default=None, description='Box folder object')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.action","title":"action","text":"action()\n
Placeholder for the 'action' method, which should be implemented in the child classes
Returns:
Type Description Folder or None
Source code in src/koheesio/integrations/box.py
def action(self):\n \"\"\"\n Placeholder for 'action' method, that should be implemented in the child classes\n\n Returns\n -------\n Folder or None\n \"\"\"\n raise NotImplementedError\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.execute","title":"execute","text":"execute() -> Output\n
Source code in src/koheesio/integrations/box.py
def execute(self) -> Output:\n self.output.folder = self.action()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.validate_folder_or_path","title":"validate_folder_or_path","text":"validate_folder_or_path()\n
Validations for 'folder' and 'path' parameter usage
Source code in src/koheesio/integrations/box.py
@model_validator(mode=\"after\")\ndef validate_folder_or_path(self):\n \"\"\"\n Validations for 'folder' and 'path' parameter usage\n \"\"\"\n folder_value = self.folder\n path_value = self.path\n\n if folder_value and path_value:\n raise AttributeError(\"Cannot user 'folder' and 'path' parameter at the same time\")\n\n if not folder_value and not path_value:\n raise AttributeError(\"Neither 'folder' nor 'path' parameters are set\")\n\n return self\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderCreate","title":"koheesio.integrations.box.BoxFolderCreate","text":"BoxFolderCreate(**data)\n
Explicitly create the new Box folder object and parent directories.
Examples:
from koheesio.steps.integrations.box import BoxFolderCreate\n\nauth_params = {...}\nfolder = BoxFolderCreate(**auth_params, path=\"/foo/bar\").execute()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderCreate.create_sub_folders","title":"create_sub_folders class-attribute
instance-attribute
","text":"create_sub_folders: bool = Field(default=True, description='Create sub-folders recursively if the path does not exist.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderCreate.validate_folder","title":"validate_folder","text":"validate_folder(folder)\n
Validate 'folder' parameter
Source code in src/koheesio/integrations/box.py
@field_validator(\"folder\")\ndef validate_folder(cls, folder):\n \"\"\"\n Validate 'folder' parameter\n \"\"\"\n if folder:\n raise AttributeError(\"Only 'path' parameter is allowed in the context of folder creation.\")\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderDelete","title":"koheesio.integrations.box.BoxFolderDelete","text":"BoxFolderDelete(**data)\n
Delete existing Box folder based on object, identifier or path.
Examples:
from koheesio.steps.integrations.box import BoxFolderDelete\n\nauth_params = {...}\nBoxFolderDelete(**auth_params, path=\"/foo/bar\").execute()\n# or\nBoxFolderDelete(**auth_params, folder=\"1\").execute()\n# or\nfolder = BoxFolderGet(**auth_params, path=\"/foo/bar\").execute().folder\nBoxFolderDelete(**auth_params, folder=folder).execute()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderDelete.action","title":"action","text":"action()\n
Delete folder action
Returns:
Type Description None
Source code in src/koheesio/integrations/box.py
def action(self):\n \"\"\"\n Delete folder action\n\n Returns\n -------\n None\n \"\"\"\n if self.folder:\n folder = self._obj_from_id\n else: # path\n folder = BoxFolderGet.from_step(self).action()\n\n self.log.info(f\"Deleting Box folder '{folder}'...\")\n folder.delete()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderGet","title":"koheesio.integrations.box.BoxFolderGet","text":"BoxFolderGet(**data)\n
Get the Box folder object for an existing folder or create a new folder and parent directories.
Examples:
from koheesio.steps.integrations.box import BoxFolderGet\n\nauth_params = {...}\nfolder = BoxFolderGet(**auth_params, path=\"/foo/bar\").execute().folder\n# or\nfolder = BoxFolderGet(**auth_params, path=\"1\").execute().folder\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderGet.create_sub_folders","title":"create_sub_folders class-attribute
instance-attribute
","text":"create_sub_folders: Optional[bool] = Field(False, description='Create sub-folders recursively if the path does not exist.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderGet.action","title":"action","text":"action()\n
Get folder action
Returns:
Name Type Description folder
Folder
Box Folder object as specified in Box SDK
Source code in src/koheesio/integrations/box.py
def action(self):\n \"\"\"\n Get folder action\n\n Returns\n -------\n folder: Folder\n Box Folder object as specified in Box SDK\n \"\"\"\n current_folder_object = None\n\n if self.folder:\n current_folder_object = self._obj_from_id\n\n if self.path:\n cleaned_path_parts = [p for p in PurePath(self.path).parts if p.strip() not in [None, \"\", \" \", \"/\"]]\n current_folder_object = self.client.folder(folder_id=self.root) if isinstance(self.root, str) else self.root\n\n for next_folder_name in cleaned_path_parts:\n current_folder_object = self._get_or_create_folder(current_folder_object, next_folder_name)\n\n self.log.info(f\"Folder identified or created: {current_folder_object}\")\n return current_folder_object\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderNotFoundError","title":"koheesio.integrations.box.BoxFolderNotFoundError","text":"Error when a provided box path does not exist.
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxPathIsEmptyError","title":"koheesio.integrations.box.BoxPathIsEmptyError","text":"Exception when provided Box path is empty or no files matched the mask.
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase","title":"koheesio.integrations.box.BoxReaderBase","text":"BoxReaderBase(**data)\n
Base class for Box readers.
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase.params","title":"params class-attribute
instance-attribute
","text":"params: Optional[Dict[str, Any]] = Field(default_factory=dict, description='[Optional] Set of extra parameters that should be passed to the Spark reader.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase.schema_","title":"schema_ class-attribute
instance-attribute
","text":"schema_: Optional[StructType] = Field(None, alias='schema', description='[Optional] Schema that will be applied during the creation of Spark DataFrame')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase.Output","title":"Output","text":"Make default reader output optional to gracefully handle 'no-files / folder' cases.
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase.Output.df","title":"df class-attribute
instance-attribute
","text":"df: Optional[DataFrame] = Field(default=None, description='The Spark DataFrame')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase.execute","title":"execute abstractmethod
","text":"execute() -> Output\n
Source code in src/koheesio/integrations/box.py
@abstractmethod\ndef execute(self) -> Output:\n raise NotImplementedError\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxToBoxFileCopy","title":"koheesio.integrations.box.BoxToBoxFileCopy","text":"BoxToBoxFileCopy(**data)\n
Copy one or multiple files to the target Box path.
Examples:
from koheesio.steps.integrations.box import BoxToBoxFileCopy\n\nauth_params = {...}\nBoxToBoxFileCopy(**auth_params, file=[\"1\", \"2\"], path=\"/foo/bar\").execute()\n# or\nBoxToBoxFileCopy(**auth_params, file=[\"1\", \"2\"], folder=\"1\").execute()\n# or\nfolder = BoxFolderGet(**auth_params, path=\"/foo/bar\").execute().folder\nBoxToBoxFileCopy(**auth_params, file=[File(), File()], folder=folder).execute()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxToBoxFileCopy.action","title":"action","text":"action(file: File, folder: Folder)\n
Copy file to the desired destination and extend file description with the processing info
Parameters:
Name Type Description Default file
File
File object as specified in Box SDK
required folder
Folder
Folder object as specified in Box SDK
required Source code in src/koheesio/integrations/box.py
def action(self, file: File, folder: Folder):\n \"\"\"\n Copy file to the desired destination and extend file description with the processing info\n\n Parameters\n ----------\n file: File\n File object as specified in Box SDK\n folder: Folder\n Folder object as specified in Box SDK\n \"\"\"\n self.log.info(f\"Copying '{file.get()}' to '{folder.get()}'...\")\n file.copy(parent_folder=folder).update_info(\n data={\"description\": \"\\n\".join([f\"File processed on {datetime.utcnow()}\", file.get()[\"description\"]])}\n )\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxToBoxFileMove","title":"koheesio.integrations.box.BoxToBoxFileMove","text":"BoxToBoxFileMove(**data)\n
Move one or multiple files to the target Box path
Examples:
from koheesio.steps.integrations.box import BoxToBoxFileMove\n\nauth_params = {...}\nBoxToBoxFileMove(**auth_params, file=[\"1\", \"2\"], path=\"/foo/bar\").execute()\n# or\nBoxToBoxFileMove(**auth_params, file=[\"1\", \"2\"], folder=\"1\").execute()\n# or\nfolder = BoxFolderGet(**auth_params, path=\"/foo/bar\").execute().folder\nBoxToBoxFileMove(**auth_params, file=[File(), File()], folder=folder).execute()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxToBoxFileMove.action","title":"action","text":"action(file: File, folder: Folder)\n
Move file to the desired destination and extend file description with the processing info
Parameters:
Name Type Description Default file
File
File object as specified in Box SDK
required folder
Folder
Folder object as specified in Box SDK
required Source code in src/koheesio/integrations/box.py
def action(self, file: File, folder: Folder):\n \"\"\"\n Move file to the desired destination and extend file description with the processing info\n\n Parameters\n ----------\n file: File\n File object as specified in Box SDK\n folder: Folder\n Folder object as specified in Box SDK\n \"\"\"\n self.log.info(f\"Moving '{file.get()}' to '{folder.get()}'...\")\n file.move(parent_folder=folder).update_info(\n data={\"description\": \"\\n\".join([f\"File processed on {datetime.utcnow()}\", file.get()[\"description\"]])}\n )\n
"},{"location":"api_reference/integrations/spark/index.html","title":"Spark","text":""},{"location":"api_reference/integrations/spark/sftp.html","title":"Sftp","text":"This module contains the SFTPWriter class and the SFTPWriteMode enum.
The SFTPWriter class is used to write data to a file on an SFTP server. It uses the Paramiko library to establish an SFTP connection and write data to the server. The data to be written is provided by a BufferWriter, which generates the data in a buffer. See the docstring of the SFTPWriter class for more details. Refer to koheesio.spark.writers.buffer for more details on the BufferWriter interface.
The SFTPWriteMode enum defines the different write modes that the SFTPWriter can use. These modes determine how the SFTPWriter behaves when the file it is trying to write to already exists on the server. For more details on each mode, see the docstring of the SFTPWriteMode enum.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode","title":"koheesio.integrations.spark.sftp.SFTPWriteMode","text":"The different write modes for the SFTPWriter.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--overwrite","title":"OVERWRITE:","text":" - If the file exists, it will be overwritten.
- If it does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--append","title":"APPEND:","text":" - If the file exists, the new data will be appended to it.
- If it does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--ignore","title":"IGNORE:","text":" - If the file exists, the method will return without writing anything.
- If it does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--exclusive","title":"EXCLUSIVE:","text":" - If the file exists, an error will be raised.
- If it does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--backup","title":"BACKUP:","text":" - If the file exists and the new data is different from the existing data, a backup will be created and the file will be overwritten.
- If it does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--update","title":"UPDATE:","text":" - If the file exists and the new data is different from the existing data, the file will be overwritten.
- If the file exists and the new data is the same as the existing data, the method will return without writing anything.
- If the file does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.APPEND","title":"APPEND class-attribute
instance-attribute
","text":"APPEND = 'append'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.BACKUP","title":"BACKUP class-attribute
instance-attribute
","text":"BACKUP = 'backup'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.EXCLUSIVE","title":"EXCLUSIVE class-attribute
instance-attribute
","text":"EXCLUSIVE = 'exclusive'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.IGNORE","title":"IGNORE class-attribute
instance-attribute
","text":"IGNORE = 'ignore'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.OVERWRITE","title":"OVERWRITE class-attribute
instance-attribute
","text":"OVERWRITE = 'overwrite'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.UPDATE","title":"UPDATE class-attribute
instance-attribute
","text":"UPDATE = 'update'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.write_mode","title":"write_mode property
","text":"write_mode\n
Return the write mode for the given SFTPWriteMode.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.from_string","title":"from_string classmethod
","text":"from_string(mode: str)\n
Return the SFTPWriteMode for the given string.
Source code in src/koheesio/integrations/spark/sftp.py
@classmethod\ndef from_string(cls, mode: str):\n \"\"\"Return the SFTPWriteMode for the given string.\"\"\"\n return cls[mode.upper()]\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter","title":"koheesio.integrations.spark.sftp.SFTPWriter","text":"Write a Dataframe to SFTP through a BufferWriter
Concept - This class uses Paramiko to connect to an SFTP server and write the contents of a buffer to a file on the server.
- This implementation takes inspiration from https://github.com/springml/spark-sftp
Parameters:
Name Type Description Default path
Union[str, Path]
Path to the folder to write to
required file_name
Optional[str]
Name of the file. If not provided, the file name is expected to be part of the path. Make sure to add the desired file extension.
None
host
str
SFTP Host
required port
int
SFTP Port
required username
SecretStr
SFTP Server Username
None
password
SecretStr
SFTP Server Password
None
buffer_writer
BufferWriter
This is the writer that will generate the body of the file that will be written to the specified file through SFTP. Details on how the DataFrame is written to the buffer should be implemented in the implementation of the BufferWriter class. Any BufferWriter can be used here, as long as it implements the BufferWriter interface.
required mode
Write mode: overwrite, append, ignore, exclusive, backup, or update. See the docstring of SFTPWriteMode for more details.
required"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.buffer_writer","title":"buffer_writer class-attribute
instance-attribute
","text":"buffer_writer: InstanceOf[BufferWriter] = Field(default=..., description='This is the writer that will generate the body of the file that will be written to the specified file through SFTP. Details on how the DataFrame is written to the buffer should be implemented in the implementation of the BufferWriter class. Any BufferWriter can be used here, as long as it implements the BufferWriter interface.')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.client","title":"client property
","text":"client: SFTPClient\n
Return the SFTP client. If it doesn't exist, create it.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.file_name","title":"file_name class-attribute
instance-attribute
","text":"file_name: Optional[str] = Field(default=None, description='Name of the file. If not provided, the file name is expected to be part of the path. Make sure to add the desired file extension!', alias='filename')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.host","title":"host class-attribute
instance-attribute
","text":"host: str = Field(default=..., description='SFTP Host')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.mode","title":"mode class-attribute
instance-attribute
","text":"mode: SFTPWriteMode = Field(default=OVERWRITE, description='Write mode: overwrite, append, ignore, exclusive, backup, or update.' + __doc__)\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.password","title":"password class-attribute
instance-attribute
","text":"password: Optional[SecretStr] = Field(default=None, description='SFTP Server Password')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.path","title":"path class-attribute
instance-attribute
","text":"path: Union[str, Path] = Field(default=..., description='Path to the folder to write to', alias='prefix')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.port","title":"port class-attribute
instance-attribute
","text":"port: int = Field(default=..., description='SFTP Port')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.transport","title":"transport property
","text":"transport\n
Return the transport for the SFTP connection. If it doesn't exist, create it.
If the username and password are provided, use them to connect to the SFTP server.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.username","title":"username class-attribute
instance-attribute
","text":"username: Optional[SecretStr] = Field(default=None, description='SFTP Server Username')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.write_mode","title":"write_mode property
","text":"write_mode\n
Return the write mode for the given SFTPWriteMode.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.check_file_exists","title":"check_file_exists","text":"check_file_exists(file_path: str) -> bool\n
Check if a file exists on the SFTP server.
Source code in src/koheesio/integrations/spark/sftp.py
def check_file_exists(self, file_path: str) -> bool:\n \"\"\"\n Check if a file exists on the SFTP server.\n \"\"\"\n try:\n self.client.stat(file_path)\n return True\n except IOError:\n return False\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/integrations/spark/sftp.py
def execute(self):\n buffer_output: InstanceOf[BufferWriter.Output] = self.buffer_writer.write(self.df)\n\n # write buffer to the SFTP server\n try:\n self._handle_write_mode(self.path.as_posix(), buffer_output)\n finally:\n self._close_client()\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.validate_path_and_file_name","title":"validate_path_and_file_name","text":"validate_path_and_file_name(data: dict) -> dict\n
Validate the path, make sure path and file_name are Path objects.
Source code in src/koheesio/integrations/spark/sftp.py
@model_validator(mode=\"before\")\ndef validate_path_and_file_name(cls, data: dict) -> dict:\n \"\"\"Validate the path, make sure path and file_name are Path objects.\"\"\"\n path_or_str = data.get(\"path\")\n\n if isinstance(path_or_str, str):\n # make sure the path is a Path object\n path_or_str = Path(path_or_str)\n\n if not isinstance(path_or_str, Path):\n raise ValueError(f\"Invalid path: {path_or_str}\")\n\n if file_name := data.get(\"file_name\", data.get(\"filename\")):\n path_or_str = path_or_str / file_name\n try:\n del data[\"filename\"]\n except KeyError:\n pass\n data[\"file_name\"] = file_name\n\n data[\"path\"] = path_or_str\n return data\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.validate_sftp_host","title":"validate_sftp_host","text":"validate_sftp_host(v) -> str\n
Validate the host
Source code in src/koheesio/integrations/spark/sftp.py
@field_validator(\"host\")\ndef validate_sftp_host(cls, v) -> str:\n \"\"\"Validate the host\"\"\"\n # remove the sftp:// prefix if present\n if v.startswith(\"sftp://\"):\n v = v.replace(\"sftp://\", \"\")\n\n # remove the trailing slash if present\n if v.endswith(\"/\"):\n v = v[:-1]\n\n return v\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.write_file","title":"write_file","text":"write_file(file_path: str, buffer_output: InstanceOf[Output])\n
Using Paramiko, write the data in the buffer to SFTP.
Source code in src/koheesio/integrations/spark/sftp.py
def write_file(self, file_path: str, buffer_output: InstanceOf[BufferWriter.Output]):\n \"\"\"\n Using Paramiko, write the data in the buffer to SFTP.\n \"\"\"\n with self.client.open(file_path, self.write_mode) as file:\n self.log.debug(f\"Writing file {file_path} to SFTP...\")\n file.write(buffer_output.read())\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendCsvToSftp","title":"koheesio.integrations.spark.sftp.SendCsvToSftp","text":"Write a DataFrame to an SFTP server as a CSV file.
This class uses the PandasCsvBufferWriter to generate the CSV data and the SFTPWriter to write the data to the SFTP server.
Example from koheesio.spark.writers import SendCsvToSftp\n\nwriter = SendCsvToSftp(\n # SFTP Parameters\n host=\"sftp.example.com\",\n port=22,\n username=\"user\",\n password=\"password\",\n path=\"/path/to/folder\",\n file_name=\"file.tsv.gz\",\n # CSV Parameters\n header=True,\n sep=\" \",\n quote='\"',\n timestampFormat=\"%Y-%m-%d\",\n lineSep=os.linesep,\n compression=\"gzip\",\n index=False,\n)\n\nwriter.write(df)\n
In this example, the DataFrame df
is written to the file file.csv.gz
in the folder /path/to/folder
on the SFTP server. The file is written as a CSV file with a tab delimiter (TSV), double quotes as the quote character, and gzip compression.
Parameters:
Name Type Description Default path
Union[str, Path]
Path to the folder to write to.
required file_name
Optional[str]
Name of the file. If not provided, it's expected to be part of the path.
required host
str
SFTP Host.
required port
int
SFTP Port.
required username
SecretStr
SFTP Server Username.
required password
SecretStr
SFTP Server Password.
required mode
Write mode: overwrite, append, ignore, exclusive, backup, or update.
required header
Whether to write column names as the first line. Default is True.
required sep
Field delimiter for the output file. Default is ','.
required quote
Character used to quote fields. Default is '\"'.
required quoteAll
Whether all values should be enclosed in quotes. Default is False.
required escape
Character used to escape sep and quote when needed. Default is '\\'.
required timestampFormat
Date format for datetime objects. Default is '%Y-%m-%dT%H:%M:%S.%f'.
required lineSep
Character used as line separator. Default is os.linesep.
required compression
Compression to use for the output data. Default is None.
required For
required"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendCsvToSftp.buffer_writer","title":"buffer_writer class-attribute
instance-attribute
","text":"buffer_writer: PandasCsvBufferWriter = Field(default=None, validate_default=False)\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendCsvToSftp.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/integrations/spark/sftp.py
def execute(self):\n SFTPWriter.execute(self)\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendCsvToSftp.set_up_buffer_writer","title":"set_up_buffer_writer","text":"set_up_buffer_writer() -> SendCsvToSftp\n
Set up the buffer writer, passing all CSV related options to it.
Source code in src/koheesio/integrations/spark/sftp.py
@model_validator(mode=\"after\")\ndef set_up_buffer_writer(self) -> \"SendCsvToSftp\":\n \"\"\"Set up the buffer writer, passing all CSV related options to it.\"\"\"\n self.buffer_writer = PandasCsvBufferWriter(**self.get_options(options_type=\"kohesio_pandas_buffer_writer\"))\n return self\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendJsonToSftp","title":"koheesio.integrations.spark.sftp.SendJsonToSftp","text":"Write a DataFrame to an SFTP server as a JSON file.
This class uses the PandasJsonBufferWriter to generate the JSON data and the SFTPWriter to write the data to the SFTP server.
Example from koheesio.spark.writers import SendJsonToSftp\n\nwriter = SendJsonToSftp(\n # SFTP Parameters (Inherited from SFTPWriter)\n host=\"sftp.example.com\",\n port=22,\n username=\"user\",\n password=\"password\",\n path=\"/path/to/folder\",\n file_name=\"file.json.gz\",\n # JSON Parameters (Inherited from PandasJsonBufferWriter)\n orient=\"records\",\n date_format=\"iso\",\n double_precision=2,\n date_unit=\"ms\",\n lines=False,\n compression=\"gzip\",\n index=False,\n)\n\nwriter.write(df)\n
In this example, the DataFrame df
is written to the file file.json.gz
in the folder /path/to/folder
on the SFTP server. The file is written as a JSON file with gzip compression.
Parameters:
Name Type Description Default path
Union[str, Path]
Path to the folder on the SFTP server.
required file_name
Optional[str]
Name of the file, including extension. If not provided, expected to be part of the path.
required host
str
SFTP Host.
required port
int
SFTP Port.
required username
SecretStr
SFTP Server Username.
required password
SecretStr
SFTP Server Password.
required mode
Write mode: overwrite, append, ignore, exclusive, backup, or update.
required orient
Format of the JSON string. Default is 'records'.
required lines
If True, output is one JSON object per line. Only used when orient='records'. Default is True.
required date_format
Type of date conversion. Default is 'iso'.
required double_precision
Decimal places for encoding floating point values. Default is 10.
required force_ascii
If True, encoded string is ASCII. Default is True.
required compression
Compression to use for output data. Default is None.
required See Also For more details on the JSON parameters, refer to the PandasJsonBufferWriter class documentation.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendJsonToSftp.buffer_writer","title":"buffer_writer class-attribute
instance-attribute
","text":"buffer_writer: PandasJsonBufferWriter = Field(default=None, validate_default=False)\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendJsonToSftp.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/integrations/spark/sftp.py
def execute(self):\n SFTPWriter.execute(self)\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendJsonToSftp.set_up_buffer_writer","title":"set_up_buffer_writer","text":"set_up_buffer_writer() -> SendJsonToSftp\n
Set up the buffer writer, passing all JSON related options to it.
Source code in src/koheesio/integrations/spark/sftp.py
@model_validator(mode=\"after\")\ndef set_up_buffer_writer(self) -> \"SendJsonToSftp\":\n \"\"\"Set up the buffer writer, passing all JSON related options to it.\"\"\"\n self.buffer_writer = PandasJsonBufferWriter(\n **self.get_options(), compression=self.compression, columns=self.columns\n )\n return self\n
"},{"location":"api_reference/integrations/spark/dq/index.html","title":"Dq","text":""},{"location":"api_reference/integrations/spark/dq/spark_expectations.html","title":"Spark expectations","text":"Koheesio step for running data quality rules with Spark Expectations engine.
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation","title":"koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation","text":"Run DQ rules for an input dataframe with Spark Expectations engine.
References Spark Expectations: https://engineering.nike.com/spark-expectations/1.0.0/
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.drop_meta_column","title":"drop_meta_column class-attribute
instance-attribute
","text":"drop_meta_column: bool = Field(default=False, alias='drop_meta_columns', description='Whether to drop meta columns added by spark expectations on the output df')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.enable_debugger","title":"enable_debugger class-attribute
instance-attribute
","text":"enable_debugger: bool = Field(default=False, alias='debugger', description='...')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.error_writer_format","title":"error_writer_format class-attribute
instance-attribute
","text":"error_writer_format: Optional[str] = Field(default='delta', alias='dataframe_writer_format', description='The format used to write to the err table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.error_writer_mode","title":"error_writer_mode class-attribute
instance-attribute
","text":"error_writer_mode: Optional[Union[str, BatchOutputMode]] = Field(default=APPEND, alias='dataframe_writer_mode', description='The write mode that will be used to write to the err table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.error_writing_options","title":"error_writing_options class-attribute
instance-attribute
","text":"error_writing_options: Optional[Dict[str, str]] = Field(default_factory=dict, alias='error_writing_options', description='Options for writing to the error table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.format","title":"format class-attribute
instance-attribute
","text":"format: str = Field(default='delta', alias='dataframe_writer_format', description='The format used to write to the stats and err table. Separate output formats can be specified for each table using the error_writer_format and stats_writer_format params')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.mode","title":"mode class-attribute
instance-attribute
","text":"mode: Union[str, BatchOutputMode] = Field(default=APPEND, alias='dataframe_writer_mode', description='The write mode that will be used to write to the err and stats table. Separate output modes can be specified for each table using the error_writer_mode and stats_writer_mode params')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.product_id","title":"product_id class-attribute
instance-attribute
","text":"product_id: str = Field(default=..., description='Spark Expectations product identifier')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.rules_table","title":"rules_table class-attribute
instance-attribute
","text":"rules_table: str = Field(default=..., alias='product_rules_table', description='DQ rules table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.se_user_conf","title":"se_user_conf class-attribute
instance-attribute
","text":"se_user_conf: Dict[str, Any] = Field(default={se_notifications_enable_email: False, se_notifications_enable_slack: False}, alias='user_conf', description='SE user provided confs', validate_default=False)\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.statistics_streaming","title":"statistics_streaming class-attribute
instance-attribute
","text":"statistics_streaming: Dict[str, Any] = Field(default={se_enable_streaming: False}, alias='stats_streaming_options', description='SE stats streaming options ', validate_default=False)\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.statistics_table","title":"statistics_table class-attribute
instance-attribute
","text":"statistics_table: str = Field(default=..., alias='dq_stats_table_name', description='DQ stats table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.stats_writer_format","title":"stats_writer_format class-attribute
instance-attribute
","text":"stats_writer_format: Optional[str] = Field(default='delta', alias='stats_writer_format', description='The format used to write to the stats table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.stats_writer_mode","title":"stats_writer_mode class-attribute
instance-attribute
","text":"stats_writer_mode: Optional[Union[str, BatchOutputMode]] = Field(default=APPEND, alias='stats_writer_mode', description='The write mode that will be used to write to the stats table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.target_table","title":"target_table class-attribute
instance-attribute
","text":"target_table: str = Field(default=..., alias='target_table_name', description=\"The table that will contain good records. Won't write to it, but will write to the err table with same name plus _err suffix\")\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.Output","title":"Output","text":"Output of the SparkExpectationsTransformation step.
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.Output.error_table_writer","title":"error_table_writer class-attribute
instance-attribute
","text":"error_table_writer: WrappedDataFrameWriter = Field(default=..., description='Spark Expectations error table writer')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.Output.rules_df","title":"rules_df class-attribute
instance-attribute
","text":"rules_df: DataFrame = Field(default=..., description='Output dataframe')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.Output.se","title":"se class-attribute
instance-attribute
","text":"se: SparkExpectations = Field(default=..., description='Spark Expectations object')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.Output.stats_table_writer","title":"stats_table_writer class-attribute
instance-attribute
","text":"stats_table_writer: WrappedDataFrameWriter = Field(default=..., description='Spark Expectations stats table writer')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.execute","title":"execute","text":"execute() -> Output\n
Apply data quality rules to a dataframe using the out-of-the-box SE decorator
Source code in src/koheesio/integrations/spark/dq/spark_expectations.py
def execute(self) -> Output:\n \"\"\"\n Apply data quality rules to a dataframe using the out-of-the-box SE decorator\n \"\"\"\n # read rules table\n rules_df = self.spark.read.table(self.rules_table).cache()\n self.output.rules_df = rules_df\n\n @self._se.with_expectations(\n target_table=self.target_table,\n user_conf=self.se_user_conf,\n # Below params are `False` by default, however exposing them here for extra visibility\n # The writes can be handled by downstream Koheesio steps\n write_to_table=False,\n write_to_temp_table=False,\n )\n def inner(df: DataFrame) -> DataFrame:\n \"\"\"Just a wrapper to be able to use Spark Expectations decorator\"\"\"\n return df\n\n output_df = inner(self.df)\n\n if self.drop_meta_column:\n output_df = output_df.drop(\"meta_dq_run_id\", \"meta_dq_run_datetime\")\n\n self.output.df = output_df\n
"},{"location":"api_reference/models/index.html","title":"Models","text":"Models package creates models that can be used to base other classes on.
- Every model should be at least a pydantic BaseModel, but can also be a Step, or a StepOutput.
- Every model is expected to be an ABC (Abstract Base Class)
- Optionally a model can inherit ExtraParamsMixin, which unpacks kwargs into the extra_params dict property, removing the need to create a dict before passing kwargs to a model initializer.
A Model class can be exceptionally handy when you need similar Pydantic models in multiple places, for example across Transformation and Reader classes.
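A hedged illustration of that idea; the class and field names below are made up for the example.
from koheesio.models import BaseModel\n\n\nclass MyConnectionDetails(BaseModel):\n    \"\"\"Hypothetical shared model holding fields that several classes need.\"\"\"\n\n    url: str\n    username: str\n\n\nclass MyReader(MyConnectionDetails):\n    \"\"\"Hypothetical reader-style class reusing the shared fields.\"\"\"\n\n\nclass MyTransformation(MyConnectionDetails):\n    \"\"\"Hypothetical transformation-style class reusing the shared fields.\"\"\"\n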
"},{"location":"api_reference/models/index.html#koheesio.models.ListOfColumns","title":"koheesio.models.ListOfColumns module-attribute
","text":"ListOfColumns = Annotated[List[str], BeforeValidator(_list_of_columns_validation)]\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel","title":"koheesio.models.BaseModel","text":"Base model for all models.
Extends pydantic BaseModel with some additional configuration. To be used as a base class for all models in Koheesio instead of pydantic.BaseModel.
Additional methods and properties: Different Modes This BaseModel class supports lazy mode. This means that validation of the items stored in the class can be run on demand, instead of being forced to run upfront.
-
Normal mode: you need to know the values ahead of time
normal_mode = YourOwnModel(a=\"foo\", b=42)\n
-
Lazy mode: being able to defer the validation until later
lazy_mode = YourOwnModel.lazy()\nlazy_mode.a = \"foo\"\nlazy_mode.b = 42\nlazy_mode.validate_output()\n
The prime advantage of using lazy mode is that you don't have to know all your outputs up front and can add them as they become available, while still being able to validate that you have collected all your output at the end. -
With statements: With statements are also allowed. The validate_output
method from the earlier example will run upon exit of the with-statement.
with YourOwnModel.lazy() as with_output:\n with_output.a = \"foo\"\n with_output.b = 42\n
Note that a lazy-mode BaseModel object is required to work with a with-statement.
Examples:
from koheesio.models import BaseModel\n\n\nclass Person(BaseModel):\n name: str\n age: int\n\n\n# Using the lazy method to create an instance without immediate validation\nperson = Person.lazy()\n\n# Setting attributes\nperson.name = \"John Doe\"\nperson.age = 30\n\n# Now we validate the instance\nperson.validate_output()\n\nprint(person)\n
In this example, the Person instance is created without immediate validation. The attributes name and age are set afterward. The validate_output
method is then called to validate the instance.
Koheesio specific configuration: Koheesio models are configured differently from Pydantic defaults. The following configuration is used:
-
extra=\"allow\"
This setting allows for extra fields that are not specified in the model definition. If a field is present in the data but not in the model, it will not raise an error. Pydantic default is \"ignore\", which means that extra attributes are ignored.
-
arbitrary_types_allowed=True
This setting allows for fields in the model to be of any type. This is useful when you want to include fields in your model that are not standard Python types. Pydantic default is False, which means that fields must be of a standard Python type.
-
populate_by_name=True
This setting allows an aliased field to be populated by its name as given by the model attribute, as well as the alias. This was known as allow_population_by_field_name in pydantic v1. Pydantic default is False, which means that fields can only be populated by their alias.
-
validate_assignment=False
This setting determines whether the model should be revalidated when the data is changed. If set to True
, every time a field is assigned a new value, the entire model is validated again.
Pydantic default is (also) False
, which means that the model is not revalidated when the data is changed. The default behavior of Pydantic is to validate the data when the model is created. In case the user changes the data after the model is created, the model is not revalidated.
-
revalidate_instances=\"subclass-instances\"
This setting determines whether to revalidate models during validation if the instance is a subclass of the model. This is important as inheritance is used a lot in Koheesio. Pydantic default is never
, which means that the model and dataclass instances are not revalidated during validation.
-
validate_default=True
This setting determines whether to validate default values during validation. When set to True, default values are checked during the validation process. We opt to set this to True, as we are attempting to make sure that the data is valid prior to running / executing any Step. Pydantic default is False, which means that default values are not validated during validation.
-
frozen=False
This setting determines whether the model is immutable. If set to True, once a model is created, its fields cannot be changed. Pydantic default is also False, which means that the model is mutable.
-
coerce_numbers_to_str=True
This setting determines whether to convert number fields to strings. When set to True, enables automatic coercion of any Number
type to str
. Pydantic doesn't allow number types (int
, float
, Decimal
) to be coerced as type str
by default.
-
use_enum_values=True
This setting determines whether to use the values of Enum fields. If set to True, the actual value of the Enum is used instead of the reference. Pydantic default is False, which means that the reference to the Enum is used.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--fields","title":"Fields","text":"Every Koheesio BaseModel has two fields: name
and description
. These fields are used to provide a name and a description to the model.
-
name
: This is the name of the Model. If not provided, it defaults to the class name.
-
description
: This is the description of the Model. It has several default behaviors (illustrated in the sketch after this list):
- If not provided, it defaults to the docstring of the class.
- If the docstring is not provided, it defaults to the name of the class.
- For multi-line descriptions, it has the following behaviors:
- Only the first non-empty line is used.
- Empty lines are removed.
- Only the first 3 lines are considered.
- Only the first 120 characters are considered.
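A minimal sketch of these name and description defaults; the Person class is illustrative.
from koheesio.models import BaseModel\n\n\nclass Person(BaseModel):\n    \"\"\"Represents a person.\"\"\"\n\n\nperson = Person()\nprint(person.name)  # expected to fall back to the class name: 'Person'\nprint(person.description)  # expected to fall back to the docstring: 'Represents a person.'\n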
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--validators","title":"Validators","text":" _set_name_and_description
: Set the name and description of the Model as per the rules mentioned above.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--properties","title":"Properties","text":" log
: Returns a logger with the name of the class.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--class-methods","title":"Class Methods","text":" from_basemodel
: Returns a new BaseModel instance based on the data of another BaseModel. from_context
: Creates BaseModel instance from a given Context. from_dict
: Creates BaseModel instance from a given dictionary. from_json
: Creates BaseModel instance from a given JSON string. from_toml
: Creates BaseModel object from a given toml file. from_yaml
: Creates BaseModel object from a given yaml file. lazy
: Constructs the model without doing validation.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--dunder-methods","title":"Dunder Methods","text":" __add__
: Allows to add two BaseModel instances together. __enter__
: Allows for using the model in a with-statement. __exit__
: Allows for using the model in a with-statement. __setitem__
: Set Item dunder method for BaseModel. __getitem__
: Get Item dunder method for BaseModel.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--instance-methods","title":"Instance Methods","text":" hasattr
: Check if given key is present in the model. get
: Get an attribute of the model, but don't fail if not present. merge
: Merge key,value map with self. set
: Allows for subscribing / assigning to class[key]
. to_context
: Converts the BaseModel instance to a Context object. to_dict
: Converts the BaseModel instance to a dictionary. to_json
: Converts the BaseModel instance to a JSON string. to_yaml
: Converts the BaseModel instance to a YAML string.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.description","title":"description class-attribute
instance-attribute
","text":"description: Optional[str] = Field(default=None, description='Description of the Model')\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.log","title":"log property
","text":"log: Logger\n
Returns a logger with the name of the class
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(extra='allow', arbitrary_types_allowed=True, populate_by_name=True, validate_assignment=False, revalidate_instances='subclass-instances', validate_default=True, frozen=False, coerce_numbers_to_str=True, use_enum_values=True)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.name","title":"name class-attribute
instance-attribute
","text":"name: Optional[str] = Field(default=None, description='Name of the Model')\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_basemodel","title":"from_basemodel classmethod
","text":"from_basemodel(basemodel: BaseModel, **kwargs)\n
Returns a new BaseModel instance based on the data of another BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_basemodel(cls, basemodel: BaseModel, **kwargs):\n \"\"\"Returns a new BaseModel instance based on the data of another BaseModel\"\"\"\n kwargs = {**basemodel.model_dump(), **kwargs}\n return cls(**kwargs)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_context","title":"from_context classmethod
","text":"from_context(context: Context) -> BaseModel\n
Creates BaseModel instance from a given Context
You have to make sure that the Context object has the necessary attributes to create the model.
Examples:
class SomeStep(BaseModel):\n foo: str\n\n\ncontext = Context(foo=\"bar\")\nsome_step = SomeStep.from_context(context)\nprint(some_step.foo) # prints 'bar'\n
Parameters:
Name Type Description Default context
Context
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_context(cls, context: Context) -> BaseModel:\n \"\"\"Creates BaseModel instance from a given Context\n\n You have to make sure that the Context object has the necessary attributes to create the model.\n\n Examples\n --------\n ```python\n class SomeStep(BaseModel):\n foo: str\n\n\n context = Context(foo=\"bar\")\n some_step = SomeStep.from_context(context)\n print(some_step.foo) # prints 'bar'\n ```\n\n Parameters\n ----------\n context: Context\n\n Returns\n -------\n BaseModel\n \"\"\"\n return cls(**context)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_dict","title":"from_dict classmethod
","text":"from_dict(data: Dict[str, Any]) -> BaseModel\n
Creates BaseModel instance from a given dictionary
Parameters:
Name Type Description Default data
Dict[str, Any]
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_dict(cls, data: Dict[str, Any]) -> BaseModel:\n \"\"\"Creates BaseModel instance from a given dictionary\n\n Parameters\n ----------\n data: Dict[str, Any]\n\n Returns\n -------\n BaseModel\n \"\"\"\n return cls(**data)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_json","title":"from_json classmethod
","text":"from_json(json_file_or_str: Union[str, Path]) -> BaseModel\n
Creates BaseModel instance from a given JSON string
BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored in the BaseModel object, which is not possible with the standard json library.
See Also Context.from_json : Deserializes a JSON string to a Context object
Parameters:
Name Type Description Default json_file_or_str
Union[str, Path]
Pathlike string or Path that points to the json file or string containing json
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_json(cls, json_file_or_str: Union[str, Path]) -> BaseModel:\n \"\"\"Creates BaseModel instance from a given JSON string\n\n BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses\n jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored\n in the BaseModel object, which is not possible with the standard json library.\n\n See Also\n --------\n Context.from_json : Deserializes a JSON string to a Context object\n\n Parameters\n ----------\n json_file_or_str : Union[str, Path]\n Pathlike string or Path that points to the json file or string containing json\n\n Returns\n -------\n BaseModel\n \"\"\"\n _context = Context.from_json(json_file_or_str)\n return cls.from_context(_context)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_toml","title":"from_toml classmethod
","text":"from_toml(toml_file_or_str: Union[str, Path]) -> BaseModel\n
Creates BaseModel object from a given toml file
Note: BaseModel offloads the serialization and deserialization of the TOML string to Context class.
Parameters:
Name Type Description Default toml_file_or_str
Union[str, Path]
Pathlike string or Path that points to the toml file, or string containing toml
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_toml(cls, toml_file_or_str: Union[str, Path]) -> BaseModel:\n \"\"\"Creates BaseModel object from a given toml file\n\n Note: BaseModel offloads the serialization and deserialization of the TOML string to Context class.\n\n Parameters\n ----------\n toml_file_or_str: str or Path\n Pathlike string or Path that points to the toml file, or string containing toml\n\n Returns\n -------\n BaseModel\n \"\"\"\n _context = Context.from_toml(toml_file_or_str)\n return cls.from_context(_context)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_yaml","title":"from_yaml classmethod
","text":"from_yaml(yaml_file_or_str: str) -> BaseModel\n
Creates BaseModel object from a given yaml file
Note: BaseModel offloads the serialization and deserialization of the YAML string to Context class.
Parameters:
Name Type Description Default yaml_file_or_str
str
Pathlike string or Path that points to the yaml file, or string containing yaml
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_yaml(cls, yaml_file_or_str: str) -> BaseModel:\n \"\"\"Creates BaseModel object from a given yaml file\n\n Note: BaseModel offloads the serialization and deserialization of the YAML string to Context class.\n\n Parameters\n ----------\n yaml_file_or_str: str or Path\n Pathlike string or Path that points to the yaml file, or string containing yaml\n\n Returns\n -------\n BaseModel\n \"\"\"\n _context = Context.from_yaml(yaml_file_or_str)\n return cls.from_context(_context)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.get","title":"get","text":"get(key: str, default: Optional[Any] = None)\n
Get an attribute of the model, but don't fail if not present
Similar to dict.get()
Examples:
step_output = StepOutput(foo=\"bar\")\nstep_output.get(\"foo\") # returns 'bar'\nstep_output.get(\"non_existent_key\", \"oops\") # returns 'oops'\n
Parameters:
Name Type Description Default key
str
name of the key to get
required default
Optional[Any]
Default value in case the attribute does not exist
None
Returns:
Type Description Any
The value of the attribute
Source code in src/koheesio/models/__init__.py
def get(self, key: str, default: Optional[Any] = None):\n \"\"\"Get an attribute of the model, but don't fail if not present\n\n Similar to dict.get()\n\n Examples\n --------\n ```python\n step_output = StepOutput(foo=\"bar\")\n step_output.get(\"foo\") # returns 'bar'\n step_output.get(\"non_existent_key\", \"oops\") # returns 'oops'\n ```\n\n Parameters\n ----------\n key: str\n name of the key to get\n default: Optional[Any]\n Default value in case the attribute does not exist\n\n Returns\n -------\n Any\n The value of the attribute\n \"\"\"\n if self.hasattr(key):\n return self.__getitem__(key)\n return default\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.hasattr","title":"hasattr","text":"hasattr(key: str) -> bool\n
Check if given key is present in the model
Parameters:
Name Type Description Default key
str
required Returns:
Type Description bool
Source code in src/koheesio/models/__init__.py
def hasattr(self, key: str) -> bool:\n \"\"\"Check if given key is present in the model\n\n Parameters\n ----------\n key: str\n\n Returns\n -------\n bool\n \"\"\"\n return hasattr(self, key)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.lazy","title":"lazy classmethod
","text":"lazy()\n
Constructs the model without doing validation
Essentially an alias to BaseModel.construct()
Source code in src/koheesio/models/__init__.py
@classmethod\ndef lazy(cls):\n \"\"\"Constructs the model without doing validation\n\n Essentially an alias to BaseModel.construct()\n \"\"\"\n return cls.model_construct()\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.merge","title":"merge","text":"merge(other: Union[Dict, BaseModel])\n
Merge key,value map with self
Functionally similar to adding two dicts together; like running {**dict_a, **dict_b}
.
Examples:
step_output = StepOutput(foo=\"bar\")\nstep_output.merge({\"lorem\": \"ipsum\"}) # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n
Parameters:
Name Type Description Default other
Union[Dict, BaseModel]
Dict or another instance of a BaseModel class that will be added to self
required Source code in src/koheesio/models/__init__.py
def merge(self, other: Union[Dict, BaseModel]):\n \"\"\"Merge key,value map with self\n\n Functionally similar to adding two dicts together; like running `{**dict_a, **dict_b}`.\n\n Examples\n --------\n ```python\n step_output = StepOutput(foo=\"bar\")\n step_output.merge({\"lorem\": \"ipsum\"}) # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n ```\n\n Parameters\n ----------\n other: Union[Dict, BaseModel]\n Dict or another instance of a BaseModel class that will be added to self\n \"\"\"\n if isinstance(other, BaseModel):\n other = other.model_dump() # ensures we really have a dict\n\n for k, v in other.items():\n self.set(k, v)\n\n return self\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.set","title":"set","text":"set(key: str, value: Any)\n
Allows for subscribing / assigning to class[key]
.
Examples:
step_output = StepOutput(foo=\"bar\")\nstep_output.set(\"foo\", \"baz\") # overwrites 'foo' to be 'baz'\n
Parameters:
Name Type Description Default key
str
The key of the attribute to assign to
required value
Any
Value that should be assigned to the given key
required Source code in src/koheesio/models/__init__.py
def set(self, key: str, value: Any):\n    \"\"\"Allows for subscribing / assigning to `class[key]`.\n\n    Examples\n    --------\n    ```python\n    step_output = StepOutput(foo=\"bar\")\n    step_output.set(\"foo\", \"baz\")  # overwrites 'foo' to be 'baz'\n    ```\n\n    Parameters\n    ----------\n    key: str\n        The key of the attribute to assign to\n    value: Any\n        Value that should be assigned to the given key\n    \"\"\"\n    self.__setitem__(key, value)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.to_context","title":"to_context","text":"to_context() -> Context\n
Converts the BaseModel instance to a Context object
Returns:
Type Description Context
Source code in src/koheesio/models/__init__.py
def to_context(self) -> Context:\n \"\"\"Converts the BaseModel instance to a Context object\n\n Returns\n -------\n Context\n \"\"\"\n return Context(**self.to_dict())\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.to_dict","title":"to_dict","text":"to_dict() -> Dict[str, Any]\n
Converts the BaseModel instance to a dictionary
Returns:
Type Description Dict[str, Any]
Source code in src/koheesio/models/__init__.py
def to_dict(self) -> Dict[str, Any]:\n \"\"\"Converts the BaseModel instance to a dictionary\n\n Returns\n -------\n Dict[str, Any]\n \"\"\"\n return self.model_dump()\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.to_json","title":"to_json","text":"to_json(pretty: bool = False)\n
Converts the BaseModel instance to a JSON string
BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored in the BaseModel object, which is not possible with the standard json library.
See Also Context.to_json : Serializes a Context object to a JSON string
Parameters:
Name Type Description Default pretty
bool
Toggles whether to return a pretty json string or not
False
Returns:
Type Description str
containing all parameters of the BaseModel instance
Source code in src/koheesio/models/__init__.py
def to_json(self, pretty: bool = False):\n \"\"\"Converts the BaseModel instance to a JSON string\n\n BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses\n jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored\n in the BaseModel object, which is not possible with the standard json library.\n\n See Also\n --------\n Context.to_json : Serializes a Context object to a JSON string\n\n Parameters\n ----------\n pretty : bool, optional, default=False\n Toggles whether to return a pretty json string or not\n\n Returns\n -------\n str\n containing all parameters of the BaseModel instance\n \"\"\"\n _context = self.to_context()\n return _context.to_json(pretty=pretty)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.to_yaml","title":"to_yaml","text":"to_yaml(clean: bool = False) -> str\n
Converts the BaseModel instance to a YAML string
BaseModel offloads the serialization and deserialization of the YAML string to Context class.
Parameters:
Name Type Description Default clean
bool
Toggles whether to remove !!python/object:...
from yaml or not. Default: False
False
Returns:
Type Description str
containing all parameters of the BaseModel instance
Source code in src/koheesio/models/__init__.py
def to_yaml(self, clean: bool = False) -> str:\n \"\"\"Converts the BaseModel instance to a YAML string\n\n BaseModel offloads the serialization and deserialization of the YAML string to Context class.\n\n Parameters\n ----------\n clean: bool\n Toggles whether to remove `!!python/object:...` from yaml or not.\n Default: False\n\n Returns\n -------\n str\n containing all parameters of the BaseModel instance\n \"\"\"\n _context = self.to_context()\n return _context.to_yaml(clean=clean)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.validate","title":"validate","text":"validate() -> BaseModel\n
Validate the BaseModel instance
This method is used to validate the BaseModel instance. It is used in conjunction with the lazy method to validate the instance after all the attributes have been set.
This method is intended to be used with the lazy
method. The lazy
method is used to create an instance of the BaseModel without immediate validation. The validate
method is then used to validate the instance after.
Note: in the Pydantic BaseModel, the validate
method raises a deprecation warning. This is because Pydantic recommends using the validate_model
method instead. However, we are using the validate
method here in a different context and a slightly different way.
Examples:
class FooModel(BaseModel):\n foo: str\n lorem: str\n\n\nfoo_model = FooModel.lazy()\nfoo_model.foo = \"bar\"\nfoo_model.lorem = \"ipsum\"\nfoo_model.validate()\n
In this example, the foo_model
instance is created without immediate validation. The attributes foo and lorem are set afterward. The validate
method is then called to validate the instance. Returns:
Type Description BaseModel
The BaseModel instance
Source code in src/koheesio/models/__init__.py
def validate(self) -> BaseModel:\n \"\"\"Validate the BaseModel instance\n\n This method is used to validate the BaseModel instance. It is used in conjunction with the lazy method to\n validate the instance after all the attributes have been set.\n\n This method is intended to be used with the `lazy` method. The `lazy` method is used to create an instance of\n the BaseModel without immediate validation. The `validate` method is then used to validate the instance after.\n\n > Note: in the Pydantic BaseModel, the `validate` method throws a deprecated warning. This is because Pydantic\n recommends using the `validate_model` method instead. However, we are using the `validate` method here in a\n different context and a slightly different way.\n\n Examples\n --------\n ```python\n class FooModel(BaseModel):\n foo: str\n lorem: str\n\n\n foo_model = FooModel.lazy()\n foo_model.foo = \"bar\"\n foo_model.lorem = \"ipsum\"\n foo_model.validate()\n ```\n In this example, the `foo_model` instance is created without immediate validation. The attributes foo and lorem\n are set afterward. The `validate` method is then called to validate the instance.\n\n Returns\n -------\n BaseModel\n The BaseModel instance\n \"\"\"\n return self.model_validate(self.model_dump())\n
"},{"location":"api_reference/models/index.html#koheesio.models.ExtraParamsMixin","title":"koheesio.models.ExtraParamsMixin","text":"Mixin class that adds support for arbitrary keyword arguments to Pydantic models.
The keyword arguments are extracted from the model's values
and moved to a params
dictionary.
"},{"location":"api_reference/models/index.html#koheesio.models.ExtraParamsMixin.extra_params","title":"extra_params cached
property
","text":"extra_params: Dict[str, Any]\n
Extract params (passed as arbitrary kwargs) from values and move them to params dict
"},{"location":"api_reference/models/index.html#koheesio.models.ExtraParamsMixin.params","title":"params class-attribute
instance-attribute
","text":"params: Dict[str, Any] = Field(default_factory=dict)\n
"},{"location":"api_reference/models/sql.html","title":"Sql","text":"This module contains the base class for SQL steps.
"},{"location":"api_reference/models/sql.html#koheesio.models.sql.SqlBaseStep","title":"koheesio.models.sql.SqlBaseStep","text":"Base class for SQL steps
params
are used as placeholders for templating. These are identified with ${placeholder} in the SQL script.
Parameters:
Name Type Description Default sql_path
Path to a SQL file
required sql
SQL script to apply
required params
Placeholders (parameters) for templating. These are identified with ${placeholder}
in the SQL script.
Note: any arbitrary kwargs passed to the class will be added to params.
required"},{"location":"api_reference/models/sql.html#koheesio.models.sql.SqlBaseStep.params","title":"params class-attribute
instance-attribute
","text":"params: Dict[str, Any] = Field(default_factory=dict, description='Placeholders (parameters) for templating. These are identified with ${placeholder} in the SQL script. Note: any arbitrary kwargs passed to the class will be added to params.')\n
"},{"location":"api_reference/models/sql.html#koheesio.models.sql.SqlBaseStep.query","title":"query property
","text":"query\n
Returns the query while performing params replacement
"},{"location":"api_reference/models/sql.html#koheesio.models.sql.SqlBaseStep.sql","title":"sql class-attribute
instance-attribute
","text":"sql: Optional[str] = Field(default=None, description='SQL script to apply')\n
"},{"location":"api_reference/models/sql.html#koheesio.models.sql.SqlBaseStep.sql_path","title":"sql_path class-attribute
instance-attribute
","text":"sql_path: Optional[Union[Path, str]] = Field(default=None, description='Path to a SQL file')\n
"},{"location":"api_reference/notifications/index.html","title":"Notifications","text":"Notification module for sending messages to notification services (e.g. Slack, Email, etc.)
"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity","title":"koheesio.notifications.NotificationSeverity","text":"Enumeration of allowed message severities
"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity.ERROR","title":"ERROR class-attribute
instance-attribute
","text":"ERROR = 'error'\n
"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity.INFO","title":"INFO class-attribute
instance-attribute
","text":"INFO = 'info'\n
"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity.SUCCESS","title":"SUCCESS class-attribute
instance-attribute
","text":"SUCCESS = 'success'\n
"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity.WARN","title":"WARN class-attribute
instance-attribute
","text":"WARN = 'warn'\n
"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity.alert_icon","title":"alert_icon property
","text":"alert_icon: str\n
Return a colored circle in slack markup
"},{"location":"api_reference/notifications/slack.html","title":"Slack","text":"Classes to ease interaction with Slack
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification","title":"koheesio.notifications.slack.SlackNotification","text":"Generic Slack notification class via the Blocks
API
NOTE: channel
parameter is used only with Slack Web API: https://api.slack.com/messaging/sending If webhook is used, the channel specification is not required
Example:
s = SlackNotification(\n url=\"slack-webhook-url\",\n channel=\"channel\",\n message=\"Some *markdown* compatible text\",\n)\ns.execute()\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification.channel","title":"channel class-attribute
instance-attribute
","text":"channel: Optional[str] = Field(default=None, description='Slack channel id')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification.headers","title":"headers class-attribute
instance-attribute
","text":"headers: Optional[Dict[str, Any]] = {'Content-type': 'application/json'}\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification.message","title":"message class-attribute
instance-attribute
","text":"message: str = Field(default=..., description='The message that gets posted to Slack')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification.execute","title":"execute","text":"execute()\n
Generate payload and send post request
Source code in src/koheesio/notifications/slack.py
def execute(self):\n \"\"\"\n Generate payload and send post request\n \"\"\"\n self.data = self.get_payload()\n HttpPostStep.execute(self)\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification.get_payload","title":"get_payload","text":"get_payload()\n
Generate payload with Block Kit
. More details: https://api.slack.com/block-kit
Source code in src/koheesio/notifications/slack.py
def get_payload(self):\n \"\"\"\n Generate payload with `Block Kit`.\n More details: https://api.slack.com/block-kit\n \"\"\"\n payload = {\n \"attachments\": [\n {\n \"blocks\": [\n {\n \"type\": \"section\",\n \"text\": {\n \"type\": \"mrkdwn\",\n \"text\": self.message,\n },\n }\n ],\n }\n ]\n }\n\n if self.channel:\n payload[\"channel\"] = self.channel\n\n return json.dumps(payload)\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity","title":"koheesio.notifications.slack.SlackNotificationWithSeverity","text":"Slack notification class via the Blocks
API with etra severity information and predefined extra fields
Example: from koheesio.steps.integrations.notifications import NotificationSeverity
s = SlackNotificationWithSeverity(\n url=\"slack-webhook-url\",\n channel=\"channel\",\n message=\"Some *markdown* compatible text\"\n severity=NotificationSeverity.ERROR,\n title=\"Title\",\n environment=\"dev\",\n application=\"Application\"\n)\ns.execute()\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.application","title":"application class-attribute
instance-attribute
","text":"application: str = Field(default=..., description='Pipeline or application name')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.environment","title":"environment class-attribute
instance-attribute
","text":"environment: str = Field(default=..., description='Environment description, e.g. dev / qa /prod')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(use_enum_values=False)\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.severity","title":"severity class-attribute
instance-attribute
","text":"severity: NotificationSeverity = Field(default=..., description='Severity of the message')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.timestamp","title":"timestamp class-attribute
instance-attribute
","text":"timestamp: datetime = Field(default=utcnow(), alias='execution_timestamp', description='Pipeline or application execution timestamp')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.title","title":"title class-attribute
instance-attribute
","text":"title: str = Field(default=..., description='Title of your message')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.execute","title":"execute","text":"execute()\n
Generate payload and send post request
Source code in src/koheesio/notifications/slack.py
def execute(self):\n \"\"\"\n Generate payload and send post request\n \"\"\"\n self.message = self.get_payload_message()\n self.data = self.get_payload()\n HttpPostStep.execute(self)\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.get_payload_message","title":"get_payload_message","text":"get_payload_message()\n
Generate payload message based on the predefined set of parameters
Source code in src/koheesio/notifications/slack.py
def get_payload_message(self):\n \"\"\"\n Generate payload message based on the predefined set of parameters\n \"\"\"\n return dedent(\n f\"\"\"\n {self.severity.alert_icon} *{self.severity.name}:* {self.title}\n *Environment:* {self.environment}\n *Application:* {self.application}\n *Message:* {self.message}\n *Timestamp:* {self.timestamp}\n \"\"\"\n )\n
"},{"location":"api_reference/secrets/index.html","title":"Secrets","text":"Module for secret integrations.
Contains abstract class for various secret integrations also known as SecretContext.
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret","title":"koheesio.secrets.Secret","text":"Abstract class for various secret integrations. All secrets are wrapped into Context class for easy access. Either existing context can be provided, or new context will be created and returned at runtime.
Secrets are wrapped into the pydantic.SecretStr.
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.context","title":"context class-attribute
instance-attribute
","text":"context: Optional[Context] = Field(Context({}), description='Existing `Context` instance can be used for secrets, otherwise new empty context will be created.')\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.parent","title":"parent class-attribute
instance-attribute
","text":"parent: Optional[str] = Field(default=..., description='Group secrets from one secure path under this friendly name', pattern='^[a-zA-Z0-9_]+$')\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.root","title":"root class-attribute
instance-attribute
","text":"root: Optional[str] = Field(default='secrets', description='All secrets will be grouped under this root.')\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.Output","title":"Output","text":"Output class for Secret.
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.Output.context","title":"context class-attribute
instance-attribute
","text":"context: Context = Field(default=..., description='Koheesio context')\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.encode_secret_values","title":"encode_secret_values classmethod
","text":"encode_secret_values(data: dict)\n
Encode secret values in the dictionary.
Ensures that all values in the dictionary are wrapped in SecretStr.
Source code in src/koheesio/secrets/__init__.py
@classmethod\ndef encode_secret_values(cls, data: dict):\n \"\"\"Encode secret values in the dictionary.\n\n Ensures that all values in the dictionary are wrapped in SecretStr.\n \"\"\"\n encoded_dict = {}\n for key, value in data.items():\n if isinstance(value, dict):\n encoded_dict[key] = cls.encode_secret_values(value)\n else:\n encoded_dict[key] = SecretStr(value)\n return encoded_dict\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.execute","title":"execute","text":"execute()\n
Main method to handle secrets protection and context creation with \"root-parent-secrets\" structure.
Source code in src/koheesio/secrets/__init__.py
def execute(self):\n \"\"\"\n Main method to handle secrets protection and context creation with \"root-parent-secrets\" structure.\n \"\"\"\n context = Context(self.encode_secret_values(data={self.root: {self.parent: self._get_secrets()}}))\n self.output.context = self.context.merge(context=context)\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.get","title":"get","text":"get() -> Context\n
Convenience method to return context with secrets.
Source code in src/koheesio/secrets/__init__.py
def get(self) -> Context:\n \"\"\"\n Convenience method to return context with secrets.\n \"\"\"\n self.execute()\n return self.output.context\n
"},{"location":"api_reference/secrets/cerberus.html","title":"Cerberus","text":"Module for retrieving secrets from Cerberus.
Secrets are stored as SecretContext and can be accessed accordingly.
See CerberusSecret for more information.
"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret","title":"koheesio.secrets.cerberus.CerberusSecret","text":"Retrieve secrets from Cerberus and wrap them into Context class for easy access. All secrets are stored under the \"secret\" root and \"parent\". \"Parent\" either derived from the secure data path by replacing \"/\" and \"-\", or manually provided by the user. Secrets are wrapped into the pydantic.SecretStr.
Example:
context = {\n \"secrets\": {\n \"parent\": {\n \"webhook\": SecretStr(\"**********\"),\n \"description\": SecretStr(\"**********\"),\n }\n }\n}\n
Values can be decoded like this:
context.secrets.parent.webhook.get_secret_value()\n
or if working with dictionary is preferable: for key, value in context.get_all().items():\n value.get_secret_value()\n
"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret.aws_session","title":"aws_session class-attribute
instance-attribute
","text":"aws_session: Optional[Session] = Field(default=None, description='AWS Session to pass to Cerberus client, can be used for local execution.')\n
"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret.path","title":"path class-attribute
instance-attribute
","text":"path: str = Field(default=..., description=\"Secure data path, eg. 'app/my-sdb/my-secrets'\")\n
"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret.token","title":"token class-attribute
instance-attribute
","text":"token: Optional[SecretStr] = Field(default=get('CERBERUS_TOKEN', None), description='Cerberus token, can be used for local development without AWS auth mechanism.Note: Token has priority over AWS session.')\n
"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret.url","title":"url class-attribute
instance-attribute
","text":"url: str = Field(default=..., description='Cerberus URL, eg. https://cerberus.domain.com')\n
"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret.verbose","title":"verbose class-attribute
instance-attribute
","text":"verbose: bool = Field(default=False, description='Enable verbose for Cerberus client')\n
"},{"location":"api_reference/spark/index.html","title":"Spark","text":"Spark step module
"},{"location":"api_reference/spark/index.html#koheesio.spark.AnalysisException","title":"koheesio.spark.AnalysisException module-attribute
","text":"AnalysisException = AnalysisException\n
"},{"location":"api_reference/spark/index.html#koheesio.spark.DataFrame","title":"koheesio.spark.DataFrame module-attribute
","text":"DataFrame = DataFrame\n
"},{"location":"api_reference/spark/index.html#koheesio.spark.SparkSession","title":"koheesio.spark.SparkSession module-attribute
","text":"SparkSession = SparkSession\n
"},{"location":"api_reference/spark/index.html#koheesio.spark.SparkStep","title":"koheesio.spark.SparkStep","text":"Base class for a Spark step
Extends the Step class with SparkSession support. The following: - Spark steps are expected to return a Spark DataFrame as output. - spark property is available to access the active SparkSession instance.
"},{"location":"api_reference/spark/index.html#koheesio.spark.SparkStep.spark","title":"spark property
","text":"spark: Optional[SparkSession]\n
Get active SparkSession instance
"},{"location":"api_reference/spark/index.html#koheesio.spark.SparkStep.Output","title":"Output","text":"Output class for SparkStep
"},{"location":"api_reference/spark/index.html#koheesio.spark.SparkStep.Output.df","title":"df class-attribute
instance-attribute
","text":"df: Optional[DataFrame] = Field(default=None, description='The Spark DataFrame')\n
"},{"location":"api_reference/spark/index.html#koheesio.spark.current_timestamp_utc","title":"koheesio.spark.current_timestamp_utc","text":"current_timestamp_utc(spark: SparkSession) -> Column\n
Get the current timestamp in UTC
Source code in src/koheesio/spark/__init__.py
def current_timestamp_utc(spark: SparkSession) -> Column:\n \"\"\"Get the current timestamp in UTC\"\"\"\n return F.to_utc_timestamp(F.current_timestamp(), spark.conf.get(\"spark.sql.session.timeZone\"))\n
"},{"location":"api_reference/spark/delta.html","title":"Delta","text":"Module for creating and managing Delta tables.
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep","title":"koheesio.spark.delta.DeltaTableStep","text":"Class for creating and managing Delta tables.
DeltaTable aims to provide a simple interface to create and manage Delta tables. It is a wrapper around the Spark SQL API for Delta tables.
Example from koheesio.steps import DeltaTableStep\n\nDeltaTableStep(\n table=\"my_table\",\n database=\"my_database\",\n catalog=\"my_catalog\",\n create_if_not_exists=True,\n default_create_properties={\n \"delta.randomizeFilePrefixes\": \"true\",\n \"delta.checkpoint.writeStatsAsStruct\": \"true\",\n \"delta.minReaderVersion\": \"2\",\n \"delta.minWriterVersion\": \"5\",\n },\n)\n
Methods:
Name Description get_persisted_properties
Get persisted properties of table.
add_property
Alter table and set table property.
add_properties
Alter table and add properties.
execute
Nothing to execute on a Table.
max_version_ts_of_last_execution
Max version timestamp of last execution. If no timestamp is found, returns 1900-01-01 00:00:00. Note: will raise an error if column VERSION_TIMESTAMP
does not exist.
Properties - name -> str Deprecated. Use
.table_name
instead. - table_name -> str Table name.
- dataframe -> DataFrame Returns a DataFrame to be able to interact with this table.
- columns -> Optional[List[str]] Returns all column names as a list.
- has_change_type -> bool Checks if a column named
_change_type
is present in the table. - exists -> bool Check if table exists.
Parameters:
Name Type Description Default table
str
Table name.
required database
str
Database or Schema name.
None
catalog
str
Catalog name.
None
create_if_not_exists
bool
Force table creation if it doesn't exist. Note: Default properties will be applied to the table during CREATION.
False
default_create_properties
Dict[str, str]
Default table properties to be applied during CREATION if force_creation
True.
{\"delta.randomizeFilePrefixes\": \"true\", \"delta.checkpoint.writeStatsAsStruct\": \"true\", \"delta.minReaderVersion\": \"2\", \"delta.minWriterVersion\": \"5\"}
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.catalog","title":"catalog class-attribute
instance-attribute
","text":"catalog: Optional[str] = Field(default=None, description='Catalog name. Note: Can be ignored if using a SparkCatalog that does not support catalog notation (e.g. Hive)')\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.columns","title":"columns property
","text":"columns: Optional[List[str]]\n
Returns all column names as a list.
Example DeltaTableStep(...).columns\n
Would for example return ['age', 'name']
if the table has columns age
and name
."},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.create_if_not_exists","title":"create_if_not_exists class-attribute
instance-attribute
","text":"create_if_not_exists: bool = Field(default=False, alias='force_creation', description=\"Force table creation if it doesn't exist.Note: Default properties will be applied to the table during CREATION.\")\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.database","title":"database class-attribute
instance-attribute
","text":"database: Optional[str] = Field(default=None, description='Database or Schema name.')\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.dataframe","title":"dataframe property
","text":"dataframe: DataFrame\n
Returns a DataFrame to be able to interact with this table
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.default_create_properties","title":"default_create_properties class-attribute
instance-attribute
","text":"default_create_properties: Dict[str, Union[str, bool, int]] = Field(default={'delta.randomizeFilePrefixes': 'true', 'delta.checkpoint.writeStatsAsStruct': 'true', 'delta.minReaderVersion': '2', 'delta.minWriterVersion': '5'}, description='Default table properties to be applied during CREATION if `create_if_not_exists` True')\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.exists","title":"exists property
","text":"exists: bool\n
Check if table exists
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.has_change_type","title":"has_change_type property
","text":"has_change_type: bool\n
Checks if a column named _change_type
is present in the table
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.is_cdf_active","title":"is_cdf_active property
","text":"is_cdf_active: bool\n
Check if CDF property is set and activated
Returns:
Type Description bool
delta.enableChangeDataFeed property is set to 'true'
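A small sketch, assuming sufficient permissions on the table, of how this check might be combined with add_property to enable Change Data Feed:

dts = DeltaTableStep(table="my_table", database="my_database")
if not dts.is_cdf_active:
    # delta.enableChangeDataFeed is the table property this check inspects
    dts.add_property(key="delta.enableChangeDataFeed", value=True)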
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.table","title":"table instance-attribute
","text":"table: str\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.table_name","title":"table_name property
","text":"table_name: str\n
Fully qualified table name in the form of catalog.database.table
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.add_properties","title":"add_properties","text":"add_properties(properties: Dict[str, Union[str, bool, int]], override: bool = False)\n
Alter table and add properties.
Parameters:
Name Type Description Default properties
Dict[str, Union[str, int, bool]]
Properties to be added to table.
required override
bool
Enable override of existing value for property in table.
False
Source code in src/koheesio/spark/delta.py
def add_properties(self, properties: Dict[str, Union[str, bool, int]], override: bool = False):\n \"\"\"Alter table and add properties.\n\n Parameters\n ----------\n properties : Dict[str, Union[str, int, bool]]\n Properties to be added to table.\n override : bool, optional, default=False\n Enable override of existing value for property in table.\n\n \"\"\"\n for k, v in properties.items():\n v_str = str(v) if not isinstance(v, bool) else str(v).lower()\n self.add_property(key=k, value=v_str, override=override)\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.add_property","title":"add_property","text":"add_property(key: str, value: Union[str, int, bool], override: bool = False)\n
Alter table and set table property.
Parameters:
Name Type Description Default key
str
Property key(name).
required value
Union[str, int, bool]
Property value.
required override
bool
Enable override of existing value for property in table.
False
Source code in src/koheesio/spark/delta.py
def add_property(self, key: str, value: Union[str, int, bool], override: bool = False):\n \"\"\"Alter table and set table property.\n\n Parameters\n ----------\n key: str\n Property key(name).\n value: Union[str, int, bool]\n Property value.\n override: bool\n Enable override of existing value for property in table.\n\n \"\"\"\n persisted_properties = self.get_persisted_properties()\n v_str = str(value) if not isinstance(value, bool) else str(value).lower()\n\n def _alter_table() -> None:\n property_pair = f\"'{key}'='{v_str}'\"\n\n try:\n # noinspection SqlNoDataSourceInspection\n self.spark.sql(f\"ALTER TABLE {self.table_name} SET TBLPROPERTIES ({property_pair})\")\n self.log.debug(f\"Table `{self.table_name}` has been altered. Property `{property_pair}` added.\")\n except Py4JJavaError as e:\n msg = f\"Property `{key}` can not be applied to table `{self.table_name}`. Exception: {e}\"\n self.log.warning(msg)\n warnings.warn(msg)\n\n if self.exists:\n if key in persisted_properties and persisted_properties[key] != v_str:\n if override:\n self.log.debug(\n f\"Property `{key}` presents in `{self.table_name}` and has value `{persisted_properties[key]}`.\"\n f\"Override is enabled.The value will be changed to `{v_str}`.\"\n )\n _alter_table()\n else:\n self.log.debug(\n f\"Skipping adding property `{key}`, because it is already set \"\n f\"for table `{self.table_name}` to `{v_str}`. To override it, provide override=True\"\n )\n else:\n _alter_table()\n else:\n self.default_create_properties[key] = v_str\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.execute","title":"execute","text":"execute()\n
Nothing to execute on a Table
Source code in src/koheesio/spark/delta.py
def execute(self):\n \"\"\"Nothing to execute on a Table\"\"\"\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.get_column_type","title":"get_column_type","text":"get_column_type(column: str) -> Optional[DataType]\n
Get the type of a column in the table.
Parameters:
Name Type Description Default column
str
Column name.
required Returns:
Type Description Optional[DataType]
Column type.
Source code in src/koheesio/spark/delta.py
def get_column_type(self, column: str) -> Optional[DataType]:\n \"\"\"Get the type of a column in the table.\n\n Parameters\n ----------\n column : str\n Column name.\n\n Returns\n -------\n Optional[DataType]\n Column type.\n \"\"\"\n return self.dataframe.schema[column].dataType if self.columns and column in self.columns else None\n
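For illustration, assuming a table with an age column as in the columns example above:

age_type = DeltaTableStep(table="my_table").get_column_type("age")
# the column's Spark DataType, or None if the column does not exist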
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.get_persisted_properties","title":"get_persisted_properties","text":"get_persisted_properties() -> Dict[str, str]\n
Get persisted properties of table.
Returns:
Type Description Dict[str, str]
Persisted properties as a dictionary.
Source code in src/koheesio/spark/delta.py
def get_persisted_properties(self) -> Dict[str, str]:\n \"\"\"Get persisted properties of table.\n\n Returns\n -------\n Dict[str, str]\n Persisted properties as a dictionary.\n \"\"\"\n persisted_properties = {}\n raw_options = self.spark.sql(f\"SHOW TBLPROPERTIES {self.table_name}\").collect()\n\n for ro in raw_options:\n key, value = ro.asDict().values()\n persisted_properties[key] = value\n\n return persisted_properties\n
"},{"location":"api_reference/spark/etl_task.html","title":"Etl task","text":"ETL Task
Extract -> Transform -> Load
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask","title":"koheesio.spark.etl_task.EtlTask","text":"ETL Task
Etl stands for: Extract -> Transform -> Load
This task is a composition of a Reader (extract), a series of Transformations (transform) and a Writer (load). In other words, it reads data from a source, applies a series of transformations, and writes the result to a target.
Parameters:
Name Type Description Default name
str
Name of the task
required description
str
Description of the task
required source
Reader
Source to read from [extract]
required transformations
list[Transformation]
Series of transformations [transform]. The order of the transformations is important!
required target
Writer
Target to write to [load]
required Example from koheesio.tasks import EtlTask\n\nfrom koheesio.steps.readers import CsvReader\nfrom koheesio.steps.transformations.repartition import Repartition\nfrom koheesio.steps.writers.dummy import DummyWriter  # assumed import path for DummyWriter\n\netl_task = EtlTask(\n    name=\"My ETL Task\",\n    description=\"This is an example ETL task\",\n    source=CsvReader(path=\"path/to/source.csv\"),\n    transformations=[Repartition(num_partitions=2)],\n    target=DummyWriter(),\n)\n\netl_task.execute()\n
This code will read from a CSV file, repartition the DataFrame to 2 partitions, and write the result to the console.
Extending the EtlTask The EtlTask is designed to be a simple and flexible way to define ETL processes. It is not designed to be a one-size-fits-all solution, but rather a starting point for building more complex ETL processes. If you need more complex functionality, you can extend the EtlTask class and override the extract
, transform
and load
methods. You can also implement your own execute
method to define the entire ETL process from scratch should you need more flexibility.
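A minimal sketch of what such a subclass could look like; MyEtlTask and the added dropDuplicates step are purely illustrative and assume the transformations operate on a Spark DataFrame:

from pyspark.sql import DataFrame

class MyEtlTask(EtlTask):
    """Hypothetical EtlTask subclass, for illustration only."""

    def transform(self, df: DataFrame) -> DataFrame:
        # run the configured transformations first, then append a custom step
        df = super().transform(df)
        return df.dropDuplicates()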
Advantages of using the EtlTask - It is a simple way to define ETL processes
- It is easy to understand and extend
- It is easy to test and debug
- It is easy to maintain and refactor
- It is easy to integrate with other tools and libraries
- It is easy to use in a production environment
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.etl_date","title":"etl_date class-attribute
instance-attribute
","text":"etl_date: datetime = Field(default=utcnow(), description=\"Date time when this object was created as iso format. Example: '2023-01-24T09:39:23.632374'\")\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.source","title":"source class-attribute
instance-attribute
","text":"source: InstanceOf[Reader] = Field(default=..., description='Source to read from [extract]')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.target","title":"target class-attribute
instance-attribute
","text":"target: InstanceOf[Writer] = Field(default=..., description='Target to write to [load]')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.transformations","title":"transformations class-attribute
instance-attribute
","text":"transformations: conlist(min_length=0, item_type=InstanceOf[Transformation]) = Field(default_factory=list, description='Series of transformations', alias='transforms')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.Output","title":"Output","text":"Output class for EtlTask
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.Output.source_df","title":"source_df class-attribute
instance-attribute
","text":"source_df: DataFrame = Field(default=..., description='The Spark DataFrame produced by .extract() method')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.Output.target_df","title":"target_df class-attribute
instance-attribute
","text":"target_df: DataFrame = Field(default=..., description='The Spark DataFrame used by .load() method')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.Output.transform_df","title":"transform_df class-attribute
instance-attribute
","text":"transform_df: DataFrame = Field(default=..., description='The Spark DataFrame produced by .transform() method')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.execute","title":"execute","text":"execute()\n
Run the ETL process
Source code in src/koheesio/spark/etl_task.py
def execute(self):\n \"\"\"Run the ETL process\"\"\"\n self.log.info(f\"Task started at {self.etl_date}\")\n\n # extract from source\n self.output.source_df = self.extract()\n\n # transform\n self.output.transform_df = self.transform(self.output.source_df)\n\n # load to target\n self.output.target_df = self.load(self.output.transform_df)\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.extract","title":"extract","text":"extract() -> DataFrame\n
Read from Source
logging is handled by the Reader.execute()-method's @do_execute decorator
Source code in src/koheesio/spark/etl_task.py
def extract(self) -> DataFrame:\n \"\"\"Read from Source\n\n logging is handled by the Reader.execute()-method's @do_execute decorator\n \"\"\"\n reader: Reader = self.source\n return reader.read()\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.load","title":"load","text":"load(df: DataFrame) -> DataFrame\n
Write to Target
logging is handled by the Writer.execute()-method's @do_execute decorator
Source code in src/koheesio/spark/etl_task.py
def load(self, df: DataFrame) -> DataFrame:\n \"\"\"Write to Target\n\n logging is handled by the Writer.execute()-method's @do_execute decorator\n \"\"\"\n writer: Writer = self.target\n writer.write(df)\n return df\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.run","title":"run","text":"run()\n
alias of execute
Source code in src/koheesio/spark/etl_task.py
def run(self):\n \"\"\"alias of execute\"\"\"\n return self.execute()\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.transform","title":"transform","text":"transform(df: DataFrame) -> DataFrame\n
Transform recursively
logging is handled by the Transformation.execute()-method's @do_execute decorator
Source code in src/koheesio/spark/etl_task.py
def transform(self, df: DataFrame) -> DataFrame:\n \"\"\"Transform recursively\n\n logging is handled by the Transformation.execute()-method's @do_execute decorator\n \"\"\"\n for t in self.transformations:\n df = t.transform(df)\n return df\n
"},{"location":"api_reference/spark/snowflake.html","title":"Snowflake","text":"Snowflake steps and tasks for Koheesio
Every class in this module is a subclass of Step
or Task
and is used to perform operations on Snowflake.
Notes Every Step in this module is based on SnowflakeBaseModel. The following parameters are available for every Step.
Parameters:
Name Type Description Default url
str
Hostname for the Snowflake account, e.g. <account>.snowflakecomputing.com. Alias for sfURL
. required user
str
Login name for the Snowflake user. Alias for sfUser
.
required password
SecretStr
Password for the Snowflake user. Alias for sfPassword
.
required database
str
The database to use for the session after connecting. Alias for sfDatabase
.
required sfSchema
str
The schema to use for the session after connecting. Alias for schema
(\"schema\" is a reserved name in Pydantic, so we use sfSchema
as main name instead).
required role
str
The default security role to use for the session after connecting. Alias for sfRole
.
required warehouse
str
The default virtual warehouse to use for the session after connecting. Alias for sfWarehouse
.
required authenticator
Optional[str]
Authenticator for the Snowflake user. Example: \"okta.com\".
None
options
Optional[Dict[str, Any]]
Extra options to pass to the Snowflake connector.
{\"sfCompress\": \"on\", \"continue_on_error\": \"off\"}
format
str
The default snowflake
format can be used natively in Databricks, use net.snowflake.spark.snowflake
in other environments and make sure to install required JARs.
\"snowflake\"
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn","title":"koheesio.spark.snowflake.AddColumn","text":"Add an empty column to a Snowflake table with given name and DataType
Example AddColumn(\n database=\"MY_DB\",\n schema_=\"MY_SCHEMA\",\n warehouse=\"MY_WH\",\n user=\"gid.account@nike.com\",\n password=Secret(\"super-secret-password\"),\n role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n table=\"MY_TABLE\",\n col=\"MY_COL\",\n dataType=StringType(),\n).execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.column","title":"column class-attribute
instance-attribute
","text":"column: str = Field(default=..., description='The name of the new column')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.table","title":"table class-attribute
instance-attribute
","text":"table: str = Field(default=..., description='The name of the Snowflake table')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.type","title":"type class-attribute
instance-attribute
","text":"type: DataType = Field(default=..., description='The DataType represented as a Spark DataType')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.Output","title":"Output","text":"Output class for AddColumn
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.Output.query","title":"query class-attribute
instance-attribute
","text":"query: str = Field(default=..., description='Query that was executed to add the column')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n query = f\"ALTER TABLE {self.table} ADD COLUMN {self.column} {map_spark_type(self.type)}\".upper()\n self.output.query = query\n RunQuery(**self.get_options(), query=query).execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame","title":"koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame","text":"Create (or Replace) a Snowflake table which has the same schema as a Spark DataFrame
Can be used as any Transformation. The DataFrame is however left unchanged, and only used for determining the schema of the Snowflake Table that is to be created (or replaced).
Example CreateOrReplaceTableFromDataFrame(\n database=\"MY_DB\",\n schema=\"MY_SCHEMA\",\n warehouse=\"MY_WH\",\n user=\"gid.account@nike.com\",\n password=\"super-secret-password\",\n role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n table=\"MY_TABLE\",\n df=df,\n).execute()\n
Or, as a Transformation:
CreateOrReplaceTableFromDataFrame(\n ...\n table=\"MY_TABLE\",\n).transform(df)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.table","title":"table class-attribute
instance-attribute
","text":"table: str = Field(default=..., alias='table_name', description='The name of the (new) table')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.Output","title":"Output","text":"Output class for CreateOrReplaceTableFromDataFrame
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.Output.input_schema","title":"input_schema class-attribute
instance-attribute
","text":"input_schema: StructType = Field(default=..., description='The original schema from the input DataFrame')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.Output.query","title":"query class-attribute
instance-attribute
","text":"query: str = Field(default=..., description='Query that was executed to create the table')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.Output.snowflake_schema","title":"snowflake_schema class-attribute
instance-attribute
","text":"snowflake_schema: str = Field(default=..., description='Derived Snowflake table schema based on the input DataFrame')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n self.output.df = self.df\n\n input_schema = self.df.schema\n self.output.input_schema = input_schema\n\n snowflake_schema = \", \".join([f\"{c.name} {map_spark_type(c.dataType)}\" for c in input_schema])\n self.output.snowflake_schema = snowflake_schema\n\n table_name = f\"{self.database}.{self.sfSchema}.{self.table}\"\n query = f\"CREATE OR REPLACE TABLE {table_name} ({snowflake_schema})\"\n self.output.query = query\n\n RunQuery(**self.get_options(), query=query).execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.DbTableQuery","title":"koheesio.spark.snowflake.DbTableQuery","text":"Read table from Snowflake using the dbtable
option instead of query
Example DbTableQuery(\n database=\"MY_DB\",\n schema_=\"MY_SCHEMA\",\n warehouse=\"MY_WH\",\n user=\"user\",\n password=Secret(\"super-secret-password\"),\n role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n table=\"db.schema.table\",\n).execute().df\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.DbTableQuery.dbtable","title":"dbtable class-attribute
instance-attribute
","text":"dbtable: str = Field(default=..., alias='table', description='The name of the table')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GetTableSchema","title":"koheesio.spark.snowflake.GetTableSchema","text":"Get the schema from a Snowflake table as a Spark Schema
Notes - This Step will execute a
SELECT * FROM <table> LIMIT 1
query to get the schema of the table. - The schema will be stored in the
table_schema
attribute of the output. table_schema
is used as the attribute name to avoid conflicts with the schema
attribute of Pydantic's BaseModel.
Example schema = (\n GetTableSchema(\n database=\"MY_DB\",\n schema_=\"MY_SCHEMA\",\n warehouse=\"MY_WH\",\n user=\"gid.account@nike.com\",\n password=\"super-secret-password\",\n role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n table=\"MY_TABLE\",\n )\n .execute()\n .table_schema\n)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GetTableSchema.table","title":"table class-attribute
instance-attribute
","text":"table: str = Field(default=..., description='The Snowflake table name')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GetTableSchema.Output","title":"Output","text":"Output class for GetTableSchema
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GetTableSchema.Output.table_schema","title":"table_schema class-attribute
instance-attribute
","text":"table_schema: StructType = Field(default=..., serialization_alias='schema', description='The Spark Schema')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GetTableSchema.execute","title":"execute","text":"execute() -> Output\n
Source code in src/koheesio/spark/snowflake.py
def execute(self) -> Output:\n query = f\"SELECT * FROM {self.table} LIMIT 1\" # nosec B608: hardcoded_sql_expressions\n df = Query(**self.get_options(), query=query).execute().df\n self.output.table_schema = df.schema\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnFullyQualifiedObject","title":"koheesio.spark.snowflake.GrantPrivilegesOnFullyQualifiedObject","text":"Grant Snowflake privileges to a set of roles on a fully qualified object, i.e. database.schema.object_name
This class is a subclass of GrantPrivilegesOnObject
and is used to grant privileges on a fully qualified object. The advantage of using this class is that it sets the object name to be fully qualified, i.e. database.schema.object_name
.
Meaning, you can set the database
, schema
and object
separately and the object name will be set to be fully qualified, i.e. database.schema.object_name
.
Example GrantPrivilegesOnFullyQualifiedObject(\n database=\"MY_DB\",\n schema=\"MY_SCHEMA\",\n warehouse=\"MY_WH\",\n ...\n object=\"MY_TABLE\",\n type=\"TABLE\",\n ...\n)\n
In this example, the object name will be set to be fully qualified, i.e. MY_DB.MY_SCHEMA.MY_TABLE
. If you were to use GrantPrivilegesOnObject
instead, you would have to set the object name to be fully qualified yourself.
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnFullyQualifiedObject.set_object_name","title":"set_object_name","text":"set_object_name()\n
Set the object name to be fully qualified, i.e. database.schema.object_name
Source code in src/koheesio/spark/snowflake.py
@model_validator(mode=\"after\")\ndef set_object_name(self):\n \"\"\"Set the object name to be fully qualified, i.e. database.schema.object_name\"\"\"\n # database, schema, obj_name\n db = self.database\n schema = self.model_dump()[\"sfSchema\"] # since \"schema\" is a reserved name\n obj_name = self.object\n\n self.object = f\"{db}.{schema}.{obj_name}\"\n\n return self\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject","title":"koheesio.spark.snowflake.GrantPrivilegesOnObject","text":"A wrapper on Snowflake GRANT privileges
With this Step, you can grant Snowflake privileges to a set of roles on a table, a view, or an object
See Also https://docs.snowflake.com/en/sql-reference/sql/grant-privilege.html
Parameters:
Name Type Description Default warehouse
str
The name of the warehouse. Alias for sfWarehouse
required user
str
The username. Alias for sfUser
required password
SecretStr
The password. Alias for sfPassword
required role
str
The role name
required object
str
The name of the object to grant privileges on
required type
str
The type of object to grant privileges on, e.g. TABLE, VIEW
required privileges
Union[conlist(str, min_length=1), str]
The Privilege/Permission or list of Privileges/Permissions to grant on the given object.
required roles
Union[conlist(str, min_length=1), str]
The Role or list of Roles to grant the privileges to
required Example GrantPrivilegesOnObject(\n    object=\"MY_TABLE\",\n    type=\"TABLE\",\n    warehouse=\"MY_WH\",\n    user=\"gid.account@nike.com\",\n    password=Secret(\"super-secret-password\"),\n    role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n    permissions=[\"SELECT\", \"INSERT\"],\n).execute()\n
In this example, the APPLICATION.SNOWFLAKE.ADMIN
role will be granted SELECT
and INSERT
privileges on the MY_TABLE
table using the MY_WH
warehouse.
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.object","title":"object class-attribute
instance-attribute
","text":"object: str = Field(default=..., description='The name of the object to grant privileges on')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.privileges","title":"privileges class-attribute
instance-attribute
","text":"privileges: Union[conlist(str, min_length=1), str] = Field(default=..., alias='permissions', description='The Privilege/Permission or list of Privileges/Permissions to grant on the given object. See https://docs.snowflake.com/en/sql-reference/sql/grant-privilege.html')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.roles","title":"roles class-attribute
instance-attribute
","text":"roles: Union[conlist(str, min_length=1), str] = Field(default=..., alias='role', validation_alias='roles', description='The Role or list of Roles to grant the privileges to')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.type","title":"type class-attribute
instance-attribute
","text":"type: str = Field(default=..., description='The type of object to grant privileges on, e.g. TABLE, VIEW')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.Output","title":"Output","text":"Output class for GrantPrivilegesOnObject
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.Output.query","title":"query class-attribute
instance-attribute
","text":"query: conlist(str, min_length=1) = Field(default=..., description='Query that was executed to grant privileges', validate_default=False)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n self.output.query = []\n roles = self.roles\n\n for role in roles:\n query = self.get_query(role)\n self.output.query.append(query)\n RunQuery(**self.get_options(), query=query).execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.get_query","title":"get_query","text":"get_query(role: str)\n
Build the GRANT query
Parameters:
Name Type Description Default role
str
The role name
required Returns:
Name Type Description query
str
The Query that performs the grant
Source code in src/koheesio/spark/snowflake.py
def get_query(self, role: str):\n \"\"\"Build the GRANT query\n\n Parameters\n ----------\n role: str\n The role name\n\n Returns\n -------\n query : str\n The Query that performs the grant\n \"\"\"\n query = f\"GRANT {','.join(self.privileges)} ON {self.type} {self.object} TO ROLE {role}\".upper()\n return query\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.set_roles_privileges","title":"set_roles_privileges","text":"set_roles_privileges(values)\n
Coerce roles and privileges to be lists if they are not already.
Source code in src/koheesio/spark/snowflake.py
@model_validator(mode=\"before\")\ndef set_roles_privileges(cls, values):\n \"\"\"Coerce roles and privileges to be lists if they are not already.\"\"\"\n roles_value = values.get(\"roles\") or values.get(\"role\")\n privileges_value = values.get(\"privileges\")\n\n if not (roles_value and privileges_value):\n raise ValueError(\"You have to specify roles AND privileges when using 'GrantPrivilegesOnObject'.\")\n\n # coerce values to be lists\n values[\"roles\"] = [roles_value] if isinstance(roles_value, str) else roles_value\n values[\"role\"] = values[\"roles\"][0] # hack to keep the validator happy\n values[\"privileges\"] = [privileges_value] if isinstance(privileges_value, str) else privileges_value\n\n return values\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.validate_object_and_object_type","title":"validate_object_and_object_type","text":"validate_object_and_object_type()\n
Validate that the object and type are set.
Source code in src/koheesio/spark/snowflake.py
@model_validator(mode=\"after\")\ndef validate_object_and_object_type(self):\n \"\"\"Validate that the object and type are set.\"\"\"\n object_value = self.object\n if not object_value:\n raise ValueError(\"You must provide an `object`, this should be the name of the object. \")\n\n object_type = self.type\n if not object_type:\n raise ValueError(\n \"You must provide a `type`, e.g. TABLE, VIEW, DATABASE. \"\n \"See https://docs.snowflake.com/en/sql-reference/sql/grant-privilege.html\"\n )\n\n return self\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnTable","title":"koheesio.spark.snowflake.GrantPrivilegesOnTable","text":"Grant Snowflake privileges to a set of roles on a table
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnTable.object","title":"object class-attribute
instance-attribute
","text":"object: str = Field(default=..., alias='table', description='The name of the Table to grant Privileges on. This should be just the name of the table; so without Database and Schema, use sfDatabase/database and sfSchema/schema to set those instead.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnTable.type","title":"type class-attribute
instance-attribute
","text":"type: str = 'TABLE'\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnView","title":"koheesio.spark.snowflake.GrantPrivilegesOnView","text":"Grant Snowflake privileges to a set of roles on a view
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnView.object","title":"object class-attribute
instance-attribute
","text":"object: str = Field(default=..., alias='view', description='The name of the View to grant Privileges on. This should be just the name of the view; so without Database and Schema, use sfDatabase/database and sfSchema/schema to set those instead.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnView.type","title":"type class-attribute
instance-attribute
","text":"type: str = 'VIEW'\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.Query","title":"koheesio.spark.snowflake.Query","text":"Query data from Snowflake and return the result as a DataFrame
Example Query(\n database=\"MY_DB\",\n schema_=\"MY_SCHEMA\",\n warehouse=\"MY_WH\",\n user=\"gid.account@nike.com\",\n password=Secret(\"super-secret-password\"),\n role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n query=\"SELECT * FROM MY_TABLE\",\n).execute().df\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.Query.query","title":"query class-attribute
instance-attribute
","text":"query: str = Field(default=..., description='The query to run')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.Query.get_options","title":"get_options","text":"get_options()\n
add query to options
Source code in src/koheesio/spark/snowflake.py
def get_options(self):\n \"\"\"add query to options\"\"\"\n options = super().get_options()\n options[\"query\"] = self.query\n return options\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.Query.validate_query","title":"validate_query","text":"validate_query(query)\n
Replace escape characters
Source code in src/koheesio/spark/snowflake.py
@field_validator(\"query\")\ndef validate_query(cls, query):\n \"\"\"Replace escape characters\"\"\"\n query = query.replace(\"\\\\n\", \"\\n\").replace(\"\\\\t\", \"\\t\").strip()\n return query\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.RunQuery","title":"koheesio.spark.snowflake.RunQuery","text":"Run a query on Snowflake that does not return a result, e.g. create table statement
This is a wrapper around 'net.snowflake.spark.snowflake.Utils.runQuery' on the JVM
Example RunQuery(\n database=\"MY_DB\",\n schema=\"MY_SCHEMA\",\n warehouse=\"MY_WH\",\n user=\"account\",\n password=\"***\",\n role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n query=\"CREATE TABLE test (col1 string)\",\n).execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.RunQuery.query","title":"query class-attribute
instance-attribute
","text":"query: str = Field(default=..., description='The query to run', alias='sql')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.RunQuery.execute","title":"execute","text":"execute() -> None\n
Source code in src/koheesio/spark/snowflake.py
def execute(self) -> None:\n if not self.query:\n self.log.warning(\"Empty string given as query input, skipping execution\")\n return\n # noinspection PyProtectedMember\n self.spark._jvm.net.snowflake.spark.snowflake.Utils.runQuery(self.get_options(), self.query)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.RunQuery.get_options","title":"get_options","text":"get_options()\n
Source code in src/koheesio/spark/snowflake.py
def get_options(self):\n # Executing the RunQuery without `host` option in Databricks throws:\n # An error occurred while calling z:net.snowflake.spark.snowflake.Utils.runQuery.\n # : java.util.NoSuchElementException: key not found: host\n options = super().get_options()\n options[\"host\"] = options[\"sfURL\"]\n return options\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.RunQuery.validate_query","title":"validate_query","text":"validate_query(query)\n
Replace escape characters
Source code in src/koheesio/spark/snowflake.py
@field_validator(\"query\")\ndef validate_query(cls, query):\n \"\"\"Replace escape characters\"\"\"\n return query.replace(\"\\\\n\", \"\\n\").replace(\"\\\\t\", \"\\t\").strip()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel","title":"koheesio.spark.snowflake.SnowflakeBaseModel","text":"BaseModel for setting up Snowflake Driver options.
Notes - Snowflake is supported natively in Databricks 4.2 and newer: https://docs.snowflake.com/en/user-guide/spark-connector-databricks
- Refer to Snowflake docs for the installation instructions for non-Databricks environments: https://docs.snowflake.com/en/user-guide/spark-connector-install
- Refer to Snowflake docs for connection options: https://docs.snowflake.com/en/user-guide/spark-connector-use#setting-configuration-options-for-the-connector
Parameters:
Name Type Description Default url
str
Hostname for the Snowflake account, e.g. <account>.snowflakecomputing.com. Alias for sfURL
. required user
str
Login name for the Snowflake user. Alias for sfUser
.
required password
SecretStr
Password for the Snowflake user. Alias for sfPassword
.
required database
str
The database to use for the session after connecting. Alias for sfDatabase
.
required sfSchema
str
The schema to use for the session after connecting. Alias for schema
(\"schema\" is a reserved name in Pydantic, so we use sfSchema
as main name instead).
required role
str
The default security role to use for the session after connecting. Alias for sfRole
.
required warehouse
str
The default virtual warehouse to use for the session after connecting. Alias for sfWarehouse
.
required authenticator
Optional[str]
Authenticator for the Snowflake user. Example: \"okta.com\".
None
options
Optional[Dict[str, Any]]
Extra options to pass to the Snowflake connector.
{\"sfCompress\": \"on\", \"continue_on_error\": \"off\"}
format
str
The default snowflake
format can be used natively in Databricks, use net.snowflake.spark.snowflake
in other environments and make sure to install required JARs.
\"snowflake\"
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.authenticator","title":"authenticator class-attribute
instance-attribute
","text":"authenticator: Optional[str] = Field(default=None, description='Authenticator for the Snowflake user', examples=['okta.com'])\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.database","title":"database class-attribute
instance-attribute
","text":"database: str = Field(default=..., alias='sfDatabase', description='The database to use for the session after connecting')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.format","title":"format class-attribute
instance-attribute
","text":"format: str = Field(default='snowflake', description='The default `snowflake` format can be used natively in Databricks, use `net.snowflake.spark.snowflake` in other environments and make sure to install required JARs.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.options","title":"options class-attribute
instance-attribute
","text":"options: Optional[Dict[str, Any]] = Field(default={'sfCompress': 'on', 'continue_on_error': 'off'}, description='Extra options to pass to the Snowflake connector')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.password","title":"password class-attribute
instance-attribute
","text":"password: SecretStr = Field(default=..., alias='sfPassword', description='Password for the Snowflake user')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.role","title":"role class-attribute
instance-attribute
","text":"role: str = Field(default=..., alias='sfRole', description='The default security role to use for the session after connecting')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.sfSchema","title":"sfSchema class-attribute
instance-attribute
","text":"sfSchema: str = Field(default=..., alias='schema', description='The schema to use for the session after connecting')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.url","title":"url class-attribute
instance-attribute
","text":"url: str = Field(default=..., alias='sfURL', description='Hostname for the Snowflake account, e.g. <account>.snowflakecomputing.com', examples=['example.snowflakecomputing.com'])\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.user","title":"user class-attribute
instance-attribute
","text":"user: str = Field(default=..., alias='sfUser', description='Login name for the Snowflake user')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.warehouse","title":"warehouse class-attribute
instance-attribute
","text":"warehouse: str = Field(default=..., alias='sfWarehouse', description='The default virtual warehouse to use for the session after connecting')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.get_options","title":"get_options","text":"get_options()\n
Get the sfOptions as a dictionary.
Source code in src/koheesio/spark/snowflake.py
def get_options(self):\n \"\"\"Get the sfOptions as a dictionary.\"\"\"\n return {\n key: value\n for key, value in {\n \"sfURL\": self.url,\n \"sfUser\": self.user,\n \"sfPassword\": self.password.get_secret_value(),\n \"authenticator\": self.authenticator,\n \"sfDatabase\": self.database,\n \"sfSchema\": self.sfSchema,\n \"sfRole\": self.role,\n \"sfWarehouse\": self.warehouse,\n **self.options,\n }.items()\n if value is not None\n }\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeReader","title":"koheesio.spark.snowflake.SnowflakeReader","text":"Wrapper around JdbcReader for Snowflake.
Example sr = SnowflakeReader(\n url=\"foo.snowflakecomputing.com\",\n user=\"YOUR_USERNAME\",\n password=\"***\",\n database=\"db\",\n schema=\"schema\",\n)\ndf = sr.read()\n
Notes - Snowflake is supported natively in Databricks 4.2 and newer: https://docs.snowflake.com/en/user-guide/spark-connector-databricks
- Refer to Snowflake docs for the installation instructions for non-Databricks environments: https://docs.snowflake.com/en/user-guide/spark-connector-install
- Refer to Snowflake docs for connection options: https://docs.snowflake.com/en/user-guide/spark-connector-use#setting-configuration-options-for-the-connector
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeReader.driver","title":"driver class-attribute
instance-attribute
","text":"driver: Optional[str] = None\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeStep","title":"koheesio.spark.snowflake.SnowflakeStep","text":"Expands the SnowflakeBaseModel so that it can be used as a Step
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeTableStep","title":"koheesio.spark.snowflake.SnowflakeTableStep","text":"Expands the SnowflakeStep, adding a 'table' parameter
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeTableStep.table","title":"table class-attribute
instance-attribute
","text":"table: str = Field(default=..., description='The name of the table')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeTableStep.get_options","title":"get_options","text":"get_options()\n
Source code in src/koheesio/spark/snowflake.py
def get_options(self):\n options = super().get_options()\n options[\"table\"] = self.table\n return options\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeTransformation","title":"koheesio.spark.snowflake.SnowflakeTransformation","text":"Adds Snowflake parameters to the Transformation class
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeWriter","title":"koheesio.spark.snowflake.SnowflakeWriter","text":"Class for writing to Snowflake
See Also - koheesio.steps.writers.Writer
- koheesio.steps.writers.BatchOutputMode
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeWriter.insert_type","title":"insert_type class-attribute
instance-attribute
","text":"insert_type: Optional[BatchOutputMode] = Field(APPEND, alias='mode', description='The insertion type, append or overwrite')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeWriter.table","title":"table class-attribute
instance-attribute
","text":"table: str = Field(default=..., description='Target table name')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeWriter.execute","title":"execute","text":"execute()\n
Write to Snowflake
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n \"\"\"Write to Snowflake\"\"\"\n self.log.debug(f\"writing to {self.table} with mode {self.insert_type}\")\n self.df.write.format(self.format).options(**self.get_options()).option(\"dbtable\", self.table).mode(\n self.insert_type\n ).save()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema","title":"koheesio.spark.snowflake.SyncTableAndDataFrameSchema","text":"Sync the schema's of a Snowflake table and a DataFrame. This will add NULL columns for the columns that are not in both and perform type casts where needed.
The Snowflake table will take priority in case of type conflicts.
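A hedged usage sketch (placeholder connection options and DataFrame), using dry_run first to only report the differences:

result = SyncTableAndDataFrameSchema(
    **sf_options,    # shared Snowflake connection parameters
    table="MY_TABLE",
    df=df,
    dry_run=True,    # only log schema differences, do not alter the table or the DataFrame
).execute()
# with dry_run=False, the aligned DataFrame is available as result.df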
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.df","title":"df class-attribute
instance-attribute
","text":"df: DataFrame = Field(default=..., description='The Spark DataFrame')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.dry_run","title":"dry_run class-attribute
instance-attribute
","text":"dry_run: Optional[bool] = Field(default=False, description='Only show schema differences, do not apply changes')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.table","title":"table class-attribute
instance-attribute
","text":"table: str = Field(default=..., description='The table name')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output","title":"Output","text":"Output class for SyncTableAndDataFrameSchema
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output.new_df_schema","title":"new_df_schema class-attribute
instance-attribute
","text":"new_df_schema: StructType = Field(default=..., description='New DataFrame schema')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output.new_sf_schema","title":"new_sf_schema class-attribute
instance-attribute
","text":"new_sf_schema: StructType = Field(default=..., description='New Snowflake schema')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output.original_df_schema","title":"original_df_schema class-attribute
instance-attribute
","text":"original_df_schema: StructType = Field(default=..., description='Original DataFrame schema')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output.original_sf_schema","title":"original_sf_schema class-attribute
instance-attribute
","text":"original_sf_schema: StructType = Field(default=..., description='Original Snowflake schema')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output.sf_table_altered","title":"sf_table_altered class-attribute
instance-attribute
","text":"sf_table_altered: bool = Field(default=False, description='Flag to indicate whether Snowflake schema has been altered')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n self.log.warning(\"Snowflake table will always take a priority in case of data type conflicts!\")\n\n # spark side\n df_schema = self.df.schema\n self.output.original_df_schema = deepcopy(df_schema) # using deepcopy to avoid storing in place changes\n df_cols = [c.name.lower() for c in df_schema]\n\n # snowflake side\n sf_schema = GetTableSchema(**self.get_options(), table=self.table).execute().table_schema\n self.output.original_sf_schema = sf_schema\n sf_cols = [c.name.lower() for c in sf_schema]\n\n if self.dry_run:\n # Display differences between Spark DataFrame and Snowflake schemas\n # and provide dummy values that are expected as class outputs.\n self.log.warning(f\"Columns to be added to Snowflake table: {set(df_cols) - set(sf_cols)}\")\n self.log.warning(f\"Columns to be added to Spark DataFrame: {set(sf_cols) - set(df_cols)}\")\n\n self.output.new_df_schema = t.StructType()\n self.output.new_sf_schema = t.StructType()\n self.output.df = self.df\n self.output.sf_table_altered = False\n\n else:\n # Add columns to SnowFlake table that exist in DataFrame\n for df_column in df_schema:\n if df_column.name.lower() not in sf_cols:\n AddColumn(\n **self.get_options(),\n table=self.table,\n column=df_column.name,\n type=df_column.dataType,\n ).execute()\n self.output.sf_table_altered = True\n\n if self.output.sf_table_altered:\n sf_schema = GetTableSchema(**self.get_options(), table=self.table).execute().table_schema\n sf_cols = [c.name.lower() for c in sf_schema]\n\n self.output.new_sf_schema = sf_schema\n\n # Add NULL columns to the DataFrame if they exist in SnowFlake but not in the df\n df = self.df\n for sf_col in self.output.original_sf_schema:\n sf_col_name = sf_col.name.lower()\n if sf_col_name not in df_cols:\n sf_col_type = sf_col.dataType\n df = df.withColumn(sf_col_name, f.lit(None).cast(sf_col_type))\n\n # Put DataFrame columns in the same order as the Snowflake table\n df = df.select(*sf_cols)\n\n self.output.df = df\n self.output.new_df_schema = df.schema\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask","title":"koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask","text":"Synchronize a Delta table to a Snowflake table
- Overwrite - only in batch mode
- Append - supports batch and streaming mode
- Merge - only in streaming mode
Example SynchronizeDeltaToSnowflakeTask(\n url=\"acme.snowflakecomputing.com\",\n user=\"admin\",\n role=\"ADMIN\",\n warehouse=\"SF_WAREHOUSE\",\n database=\"SF_DATABASE\",\n schema=\"SF_SCHEMA\",\n source_table=DeltaTableStep(...),\n target_table=\"my_sf_table\",\n key_columns=[\n \"id\",\n ],\n streaming=False,\n).run()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.checkpoint_location","title":"checkpoint_location class-attribute
instance-attribute
","text":"checkpoint_location: Optional[str] = Field(default=None, description='Checkpoint location to use')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.enable_deletion","title":"enable_deletion class-attribute
instance-attribute
","text":"enable_deletion: Optional[bool] = Field(default=False, description='In case of merge synchronisation_mode add deletion statement in merge query.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.key_columns","title":"key_columns class-attribute
instance-attribute
","text":"key_columns: Optional[List[str]] = Field(default_factory=list, description='Key columns on which merge statements will be MERGE statement will be applied.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.non_key_columns","title":"non_key_columns property
","text":"non_key_columns: List[str]\n
Columns of source table that aren't part of the (composite) primary key
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.persist_staging","title":"persist_staging class-attribute
instance-attribute
","text":"persist_staging: Optional[bool] = Field(default=False, description='In case of debugging, set `persist_staging` to True to retain the staging table for inspection after synchronization.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.reader","title":"reader property
","text":"reader\n
DeltaTable reader
Returns: DeltaTableReader that will yield the source delta table\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.schema_tracking_location","title":"schema_tracking_location class-attribute
instance-attribute
","text":"schema_tracking_location: Optional[str] = Field(default=None, description='Schema tracking location to use. Info: https://docs.delta.io/latest/delta-streaming.html#-schema-tracking')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.source_table","title":"source_table class-attribute
instance-attribute
","text":"source_table: DeltaTableStep = Field(default=..., description='Source delta table to synchronize')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.staging_table","title":"staging_table property
","text":"staging_table\n
Intermediate table on snowflake where staging results are stored
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.staging_table_name","title":"staging_table_name class-attribute
instance-attribute
","text":"staging_table_name: Optional[str] = Field(default=None, alias='staging_table', description='Optional snowflake staging name', validate_default=False)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.streaming","title":"streaming class-attribute
instance-attribute
","text":"streaming: Optional[bool] = Field(default=False, description=\"Should synchronisation happen in streaming or in batch mode. Streaming is supported in 'APPEND' and 'MERGE' mode. Batch is supported in 'OVERWRITE' and 'APPEND' mode.\")\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.synchronisation_mode","title":"synchronisation_mode class-attribute
instance-attribute
","text":"synchronisation_mode: BatchOutputMode = Field(default=MERGE, description=\"Determines if synchronisation will 'overwrite' any existing table, 'append' new rows or 'merge' with existing rows.\")\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.target_table","title":"target_table class-attribute
instance-attribute
","text":"target_table: str = Field(default=..., description='Target table in snowflake to synchronize to')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.writer","title":"writer property
","text":"writer: Union[ForEachBatchStreamWriter, SnowflakeWriter]\n
Writer to persist to snowflake
Depending on the configured options, this returns a SnowflakeWriter or a ForEachBatchStreamWriter: - OVERWRITE/APPEND mode yields SnowflakeWriter - MERGE mode yields ForEachBatchStreamWriter
Returns:
Type Description Union[ForEachBatchStreamWriter, SnowflakeWriter]
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.writer_","title":"writer_ class-attribute
instance-attribute
","text":"writer_: Optional[Union[ForEachBatchStreamWriter, SnowflakeWriter]] = None\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.drop_table","title":"drop_table","text":"drop_table(snowflake_table)\n
Drop a given snowflake table
Source code in src/koheesio/spark/snowflake.py
def drop_table(self, snowflake_table):\n \"\"\"Drop a given snowflake table\"\"\"\n self.log.warning(f\"Dropping table {snowflake_table} from snowflake\")\n drop_table_query = f\"\"\"DROP TABLE IF EXISTS {snowflake_table}\"\"\"\n query_executor = RunQuery(**self.get_options(), query=drop_table_query)\n query_executor.execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.execute","title":"execute","text":"execute() -> None\n
Source code in src/koheesio/spark/snowflake.py
def execute(self) -> None:\n # extract\n df = self.extract()\n self.output.source_df = df\n\n # synchronize\n self.output.target_df = df\n self.load(df)\n if not self.persist_staging:\n # If it's a streaming job, await for termination before dropping staging table\n if self.streaming:\n self.writer.await_termination()\n self.drop_table(self.staging_table)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.extract","title":"extract","text":"extract() -> DataFrame\n
Extract source table
Source code in src/koheesio/spark/snowflake.py
def extract(self) -> DataFrame:\n \"\"\"\n Extract source table\n \"\"\"\n if self.synchronisation_mode == BatchOutputMode.MERGE:\n if not self.source_table.is_cdf_active:\n raise RuntimeError(\n f\"Source table {self.source_table.table_name} does not have CDF enabled. \"\n f\"Set TBLPROPERTIES ('delta.enableChangeDataFeed' = true) to enable. \"\n f\"Current properties = {self.source_table_properties}\"\n )\n\n df = self.reader.read()\n self.output.source_df = df\n return df\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.load","title":"load","text":"load(df) -> DataFrame\n
Load source table into snowflake
Source code in src/koheesio/spark/snowflake.py
def load(self, df) -> DataFrame:\n \"\"\"Load source table into snowflake\"\"\"\n if self.synchronisation_mode == BatchOutputMode.MERGE:\n self.log.info(f\"Truncating staging table {self.staging_table}\")\n self.truncate_table(self.staging_table)\n self.writer.write(df)\n self.output.target_df = df\n return df\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.run","title":"run","text":"run()\n
alias of execute
Source code in src/koheesio/spark/snowflake.py
def run(self):\n \"\"\"alias of execute\"\"\"\n return self.execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.truncate_table","title":"truncate_table","text":"truncate_table(snowflake_table)\n
Truncate a given snowflake table
Source code in src/koheesio/spark/snowflake.py
def truncate_table(self, snowflake_table):\n \"\"\"Truncate a given snowflake table\"\"\"\n truncate_query = f\"\"\"TRUNCATE TABLE IF EXISTS {snowflake_table}\"\"\"\n query_executor = RunQuery(\n **self.get_options(),\n query=truncate_query,\n )\n query_executor.execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TableExists","title":"koheesio.spark.snowflake.TableExists","text":"Check if the table exists in Snowflake by using INFORMATION_SCHEMA.
Example k = TableExists(\n url=\"foo.snowflakecomputing.com\",\n user=\"YOUR_USERNAME\",\n password=\"***\",\n database=\"db\",\n schema=\"schema\",\n table=\"table\",\n)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TableExists.Output","title":"Output","text":"Output class for TableExists
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TableExists.Output.exists","title":"exists class-attribute
instance-attribute
","text":"exists: bool = Field(default=..., description='Whether or not the table exists')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TableExists.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n query = (\n dedent(\n # Force upper case, due to case-sensitivity of where clause\n f\"\"\"\n SELECT *\n FROM INFORMATION_SCHEMA.TABLES\n WHERE TABLE_CATALOG = '{self.database}'\n AND TABLE_SCHEMA = '{self.sfSchema}'\n AND TABLE_TYPE = 'BASE TABLE'\n AND upper(TABLE_NAME) = '{self.table.upper()}'\n \"\"\" # nosec B608: hardcoded_sql_expressions\n )\n .upper()\n .strip()\n )\n\n self.log.debug(f\"Query that was executed to check if the table exists:\\n{query}\")\n\n df = Query(**self.get_options(), query=query).read()\n\n exists = df.count() > 0\n self.log.info(f\"Table {self.table} {'exists' if exists else 'does not exist'}\")\n self.output.exists = exists\n
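Continuing the TableExists example above, a small sketch of how the result can be consumed (k is the instance constructed earlier):
k.execute()\nprint(k.output.exists)  # True if a matching table was found in INFORMATION_SCHEMA, False otherwise\n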
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TagSnowflakeQuery","title":"koheesio.spark.snowflake.TagSnowflakeQuery","text":"Provides Snowflake query tag pre-action that can be used to easily find queries through SF history search and further group them for debugging and cost tracking purposes.
Takes in query tag attributes as kwargs and an additional Snowflake options dict that can optionally contain another set of pre-actions to be applied to a query; in that case the existing pre-actions aren't dropped, and the query tag pre-action is added to them.
The passed Snowflake options dictionary is not modified in place; instead, a new dictionary containing the updated pre-actions is returned.
Notes See this article for explanation: https://select.dev/posts/snowflake-query-tags
Arbitrary tags can be applied, such as team, dataset names, business capability, etc.
Example query_tag = AddQueryTag(\n    options={\"preactions\": ...},\n    task_name=\"cleanse_task\",\n    pipeline_name=\"ingestion-pipeline\",\n    etl_date=\"2022-01-01\",\n    pipeline_execution_time=\"2022-01-01T00:00:00\",\n    task_execution_time=\"2022-01-01T01:00:00\",\n    environment=\"dev\",\n    trace_id=\"e0fdec43-a045-46e5-9705-acd4f3f96045\",\n    span_id=\"cb89abea-1c12-471f-8b12-546d2d66f6cb\",\n).execute().options\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TagSnowflakeQuery.options","title":"options class-attribute
instance-attribute
","text":"options: Dict = Field(default_factory=dict, description='Additional Snowflake options, optionally containing additional preactions')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TagSnowflakeQuery.Output","title":"Output","text":"Output class for AddQueryTag
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TagSnowflakeQuery.Output.options","title":"options class-attribute
instance-attribute
","text":"options: Dict = Field(default=..., description='Copy of provided SF options, with added query tag preaction')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TagSnowflakeQuery.execute","title":"execute","text":"execute()\n
Add query tag preaction to Snowflake options
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n \"\"\"Add query tag preaction to Snowflake options\"\"\"\n tag_json = json.dumps(self.extra_params, indent=4, sort_keys=True)\n tag_preaction = f\"ALTER SESSION SET QUERY_TAG = '{tag_json}';\"\n preactions = self.options.get(\"preactions\", \"\")\n preactions = f\"{preactions}\\n{tag_preaction}\".strip()\n updated_options = dict(self.options)\n updated_options[\"preactions\"] = preactions\n self.output.options = updated_options\n
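To illustrate the behaviour of execute() above, a hedged sketch (the pre-action text and tag attributes are illustrative):
step = TagSnowflakeQuery(options={\"preactions\": \"USE ROLE LOADER;\"}, team=\"data-eng\")\nupdated = step.execute().options\n# updated['preactions'] keeps the original pre-action and appends an\n# ALTER SESSION SET QUERY_TAG = '{...}' statement built from the JSON-encoded tag attributes\n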
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.map_spark_type","title":"koheesio.spark.snowflake.map_spark_type","text":"map_spark_type(spark_type: DataType)\n
Translates a Spark DataFrame schema type to the corresponding Snowflake type
Basic Types Snowflake Type StringType STRING NullType STRING BooleanType BOOLEAN Numeric Types Snowflake Type LongType BIGINT IntegerType INT ShortType SMALLINT DoubleType DOUBLE FloatType FLOAT NumericType FLOAT ByteType BINARY Date / Time Types Snowflake Type DateType DATE TimestampType TIMESTAMP Advanced Types Snowflake Type DecimalType DECIMAL MapType VARIANT ArrayType VARIANT StructType VARIANT References - Spark SQL DataTypes: https://spark.apache.org/docs/latest/sql-ref-datatypes.html
- Snowflake DataTypes: https://docs.snowflake.com/en/sql-reference/data-types.html
Parameters:
Name Type Description Default spark_type
DataType
DataType taken out of the StructField
required Returns:
Type Description str
The Snowflake data type
Source code in src/koheesio/spark/snowflake.py
def map_spark_type(spark_type: t.DataType):\n \"\"\"\n Translates Spark DataFrame Schema type to SnowFlake type\n\n | Basic Types | Snowflake Type |\n |-------------------|----------------|\n | StringType | STRING |\n | NullType | STRING |\n | BooleanType | BOOLEAN |\n\n | Numeric Types | Snowflake Type |\n |-------------------|----------------|\n | LongType | BIGINT |\n | IntegerType | INT |\n | ShortType | SMALLINT |\n | DoubleType | DOUBLE |\n | FloatType | FLOAT |\n | NumericType | FLOAT |\n | ByteType | BINARY |\n\n | Date / Time Types | Snowflake Type |\n |-------------------|----------------|\n | DateType | DATE |\n | TimestampType | TIMESTAMP |\n\n | Advanced Types | Snowflake Type |\n |-------------------|----------------|\n | DecimalType | DECIMAL |\n | MapType | VARIANT |\n | ArrayType | VARIANT |\n | StructType | VARIANT |\n\n References\n ----------\n - Spark SQL DataTypes: https://spark.apache.org/docs/latest/sql-ref-datatypes.html\n - Snowflake DataTypes: https://docs.snowflake.com/en/sql-reference/data-types.html\n\n Parameters\n ----------\n spark_type : pyspark.sql.types.DataType\n DataType taken out of the StructField\n\n Returns\n -------\n str\n The Snowflake data type\n \"\"\"\n # StructField means that the entire Field was passed, we need to extract just the dataType before continuing\n if isinstance(spark_type, t.StructField):\n spark_type = spark_type.dataType\n\n # Check if the type is DayTimeIntervalType\n if isinstance(spark_type, t.DayTimeIntervalType):\n warn(\n \"DayTimeIntervalType is being converted to STRING. \"\n \"Consider converting to a more supported date/time/timestamp type in Snowflake.\"\n )\n\n # fmt: off\n # noinspection PyUnresolvedReferences\n data_type_map = {\n # Basic Types\n t.StringType: \"STRING\",\n t.NullType: \"STRING\",\n t.BooleanType: \"BOOLEAN\",\n\n # Numeric Types\n t.LongType: \"BIGINT\",\n t.IntegerType: \"INT\",\n t.ShortType: \"SMALLINT\",\n t.DoubleType: \"DOUBLE\",\n t.FloatType: \"FLOAT\",\n t.NumericType: \"FLOAT\",\n t.ByteType: \"BINARY\",\n t.BinaryType: \"VARBINARY\",\n\n # Date / Time Types\n t.DateType: \"DATE\",\n t.TimestampType: \"TIMESTAMP\",\n t.DayTimeIntervalType: \"STRING\",\n\n # Advanced Types\n t.DecimalType:\n f\"DECIMAL({spark_type.precision},{spark_type.scale})\" # pylint: disable=no-member\n if isinstance(spark_type, t.DecimalType) else \"DECIMAL(38,0)\",\n t.MapType: \"VARIANT\",\n t.ArrayType: \"VARIANT\",\n t.StructType: \"VARIANT\",\n }\n return data_type_map.get(type(spark_type), 'STRING')\n
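A minimal usage sketch based on the mapping above (the return values follow directly from the table; the import path matches this module):
from pyspark.sql import types as t\nfrom koheesio.spark.snowflake import map_spark_type\n\nmap_spark_type(t.StringType())               # 'STRING'\nmap_spark_type(t.DecimalType(16, 2))         # 'DECIMAL(16,2)'\nmap_spark_type(t.ArrayType(t.StringType()))  # 'VARIANT'\n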
"},{"location":"api_reference/spark/utils.html","title":"Utils","text":"Spark Utility functions
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.spark_minor_version","title":"koheesio.spark.utils.spark_minor_version module-attribute
","text":"spark_minor_version: float = get_spark_minor_version()\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype","title":"koheesio.spark.utils.SparkDatatype","text":"Allowed spark datatypes
The following table lists the data types that are supported by Spark SQL.
Data type SQL name ByteType BYTE, TINYINT ShortType SHORT, SMALLINT IntegerType INT, INTEGER LongType LONG, BIGINT FloatType FLOAT, REAL DoubleType DOUBLE DecimalType DECIMAL, DEC, NUMERIC StringType STRING BinaryType BINARY BooleanType BOOLEAN TimestampType TIMESTAMP, TIMESTAMP_LTZ DateType DATE ArrayType ARRAY MapType MAP NullType VOID Not supported yet - TimestampNTZType TIMESTAMP_NTZ
- YearMonthIntervalType INTERVAL YEAR, INTERVAL YEAR TO MONTH, INTERVAL MONTH
- DayTimeIntervalType INTERVAL DAY, INTERVAL DAY TO HOUR, INTERVAL DAY TO MINUTE, INTERVAL DAY TO SECOND, INTERVAL HOUR, INTERVAL HOUR TO MINUTE, INTERVAL HOUR TO SECOND, INTERVAL MINUTE, INTERVAL MINUTE TO SECOND, INTERVAL SECOND
See Also https://spark.apache.org/docs/latest/sql-ref-datatypes.html#supported-data-types
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.ARRAY","title":"ARRAY class-attribute
instance-attribute
","text":"ARRAY = 'array'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.BIGINT","title":"BIGINT class-attribute
instance-attribute
","text":"BIGINT = 'long'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.BINARY","title":"BINARY class-attribute
instance-attribute
","text":"BINARY = 'binary'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.BOOLEAN","title":"BOOLEAN class-attribute
instance-attribute
","text":"BOOLEAN = 'boolean'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.BYTE","title":"BYTE class-attribute
instance-attribute
","text":"BYTE = 'byte'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.DATE","title":"DATE class-attribute
instance-attribute
","text":"DATE = 'date'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.DEC","title":"DEC class-attribute
instance-attribute
","text":"DEC = 'decimal'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.DECIMAL","title":"DECIMAL class-attribute
instance-attribute
","text":"DECIMAL = 'decimal'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.DOUBLE","title":"DOUBLE class-attribute
instance-attribute
","text":"DOUBLE = 'double'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.FLOAT","title":"FLOAT class-attribute
instance-attribute
","text":"FLOAT = 'float'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.INT","title":"INT class-attribute
instance-attribute
","text":"INT = 'integer'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.INTEGER","title":"INTEGER class-attribute
instance-attribute
","text":"INTEGER = 'integer'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.LONG","title":"LONG class-attribute
instance-attribute
","text":"LONG = 'long'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.MAP","title":"MAP class-attribute
instance-attribute
","text":"MAP = 'map'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.NUMERIC","title":"NUMERIC class-attribute
instance-attribute
","text":"NUMERIC = 'decimal'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.REAL","title":"REAL class-attribute
instance-attribute
","text":"REAL = 'float'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.SHORT","title":"SHORT class-attribute
instance-attribute
","text":"SHORT = 'short'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.SMALLINT","title":"SMALLINT class-attribute
instance-attribute
","text":"SMALLINT = 'short'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.STRING","title":"STRING class-attribute
instance-attribute
","text":"STRING = 'string'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.TIMESTAMP","title":"TIMESTAMP class-attribute
instance-attribute
","text":"TIMESTAMP = 'timestamp'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.TIMESTAMP_LTZ","title":"TIMESTAMP_LTZ class-attribute
instance-attribute
","text":"TIMESTAMP_LTZ = 'timestamp'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.TINYINT","title":"TINYINT class-attribute
instance-attribute
","text":"TINYINT = 'byte'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.VOID","title":"VOID class-attribute
instance-attribute
","text":"VOID = 'void'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.spark_type","title":"spark_type property
","text":"spark_type: DataType\n
Returns the spark type for the given enum value
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.from_string","title":"from_string classmethod
","text":"from_string(value: str) -> SparkDatatype\n
Allows for getting the right Enum value by simply passing a string value. This method is not case-sensitive.
Source code in src/koheesio/spark/utils.py
@classmethod\ndef from_string(cls, value: str) -> \"SparkDatatype\":\n \"\"\"Allows for getting the right Enum value by simply passing a string value\n This method is not case-sensitive\n \"\"\"\n return getattr(cls, value.upper())\n
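For illustration, a short sketch of from_string together with the enum values listed above:
from koheesio.spark.utils import SparkDatatype\n\nSparkDatatype.from_string(\"bigint\")  # SparkDatatype.BIGINT\nSparkDatatype.from_string(\"BigInt\")  # same member; the lookup is not case-sensitive\nSparkDatatype.BIGINT.value           # 'long'\n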
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.get_spark_minor_version","title":"koheesio.spark.utils.get_spark_minor_version","text":"get_spark_minor_version() -> float\n
Returns the minor version of the spark instance.
For example, if the spark version is 3.3.2, this function would return 3.3
Source code in src/koheesio/spark/utils.py
def get_spark_minor_version() -> float:\n \"\"\"Returns the minor version of the spark instance.\n\n For example, if the spark version is 3.3.2, this function would return 3.3\n \"\"\"\n return float(\".\".join(spark_version.split(\".\")[:2]))\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.on_databricks","title":"koheesio.spark.utils.on_databricks","text":"on_databricks() -> bool\n
Retrieve if we're running on databricks or elsewhere
Source code in src/koheesio/spark/utils.py
def on_databricks() -> bool:\n \"\"\"Retrieve if we're running on databricks or elsewhere\"\"\"\n dbr_version = os.getenv(\"DATABRICKS_RUNTIME_VERSION\", None)\n return dbr_version is not None and dbr_version != \"\"\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.schema_struct_to_schema_str","title":"koheesio.spark.utils.schema_struct_to_schema_str","text":"schema_struct_to_schema_str(schema: StructType) -> str\n
Converts a StructType to a schema str
Source code in src/koheesio/spark/utils.py
def schema_struct_to_schema_str(schema: StructType) -> str:\n \"\"\"Converts a StructType to a schema str\"\"\"\n if not schema:\n return \"\"\n return \",\\n\".join([f\"{field.name} {field.dataType.typeName().upper()}\" for field in schema.fields])\n
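A small illustration of the conversion above (field names are illustrative):
from pyspark.sql.types import StructType, StructField, LongType, StringType\nfrom koheesio.spark.utils import schema_struct_to_schema_str\n\nschema = StructType([StructField(\"id\", LongType()), StructField(\"name\", StringType())])\nschema_struct_to_schema_str(schema)  # 'id LONG,\\nname STRING'\n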
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.spark_data_type_is_array","title":"koheesio.spark.utils.spark_data_type_is_array","text":"spark_data_type_is_array(data_type: DataType) -> bool\n
Check if the column's dataType is of type ArrayType
Source code in src/koheesio/spark/utils.py
def spark_data_type_is_array(data_type: DataType) -> bool:\n \"\"\"Check if the column's dataType is of type ArrayType\"\"\"\n return isinstance(data_type, ArrayType)\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.spark_data_type_is_numeric","title":"koheesio.spark.utils.spark_data_type_is_numeric","text":"spark_data_type_is_numeric(data_type: DataType) -> bool\n
Check if the column's dataType is of a numeric type (IntegerType, LongType, FloatType, DoubleType or DecimalType)
Source code in src/koheesio/spark/utils.py
def spark_data_type_is_numeric(data_type: DataType) -> bool:\n    \"\"\"Check if the column's dataType is of a numeric type\"\"\"\n    return isinstance(data_type, (IntegerType, LongType, FloatType, DoubleType, DecimalType))\n
"},{"location":"api_reference/spark/readers/index.html","title":"Readers","text":"Readers are a type of Step that read data from a source based on the input parameters and stores the result in self.output.df.
For a comprehensive guide on the usage, examples, and additional features of Reader classes, please refer to the reference/concepts/steps/readers section of the Koheesio documentation.
"},{"location":"api_reference/spark/readers/index.html#koheesio.spark.readers.Reader","title":"koheesio.spark.readers.Reader","text":"Base class for all Readers
A Reader is a Step that reads data from a source based on the input parameters and stores the result in self.output.df (DataFrame).
When implementing a Reader, the execute() method should be implemented. The execute() method should read from the source and store the result in self.output.df.
The Reader class implements a standard read() method that calls the execute() method and returns the result. This method can be used to read data from a Reader without having to call the execute() method directly. The read() method does not need to be implemented in the child class.
Every Reader has a SparkSession available as self.spark. This is the currently active SparkSession.
The Reader class also implements a shorthand for accessing the output Dataframe through the df-property. If the output.df is None, .execute() will be run first.
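A minimal sketch of a custom Reader following the contract described above (the class name and the 'source' it reads from are illustrative):
from koheesio.spark.readers import Reader\n\nclass MyRangeReader(Reader):\n    \"\"\"Illustrative Reader that 'reads' ten rows using the active SparkSession.\"\"\"\n    def execute(self):\n        # read from whichever source -> store the result in self.output.df\n        self.output.df = self.spark.range(10)\n\ndf = MyRangeReader().read()  # read() calls execute() and returns output.df\n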
"},{"location":"api_reference/spark/readers/index.html#koheesio.spark.readers.Reader.df","title":"df property
","text":"df: Optional[DataFrame]\n
Shorthand for accessing self.output.df If the output.df is None, .execute() will be run first
"},{"location":"api_reference/spark/readers/index.html#koheesio.spark.readers.Reader.execute","title":"execute abstractmethod
","text":"execute()\n
Execute on a Reader should handle self.output.df (output) as a minimum. Read from whichever source -> store the result in self.output.df.
Source code in src/koheesio/spark/readers/__init__.py
@abstractmethod\ndef execute(self):\n \"\"\"Execute on a Reader should handle self.output.df (output) as a minimum\n Read from whichever source -> store result in self.output.df\n \"\"\"\n # self.output.df # output dataframe\n ...\n
"},{"location":"api_reference/spark/readers/index.html#koheesio.spark.readers.Reader.read","title":"read","text":"read() -> Optional[DataFrame]\n
Read from a Reader without having to call the execute() method directly
Source code in src/koheesio/spark/readers/__init__.py
def read(self) -> Optional[DataFrame]:\n \"\"\"Read from a Reader without having to call the execute() method directly\"\"\"\n self.execute()\n return self.output.df\n
"},{"location":"api_reference/spark/readers/delta.html","title":"Delta","text":"Read data from a Delta table and return a DataFrame or DataStream
Classes:
Name Description DeltaTableReader
Reads data from a Delta table and returns a DataFrame
DeltaTableStreamReader
Reads data from a Delta table and returns a DataStream
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.STREAMING_ONLY_OPTIONS","title":"koheesio.spark.readers.delta.STREAMING_ONLY_OPTIONS module-attribute
","text":"STREAMING_ONLY_OPTIONS = ['ignore_deletes', 'ignore_changes', 'starting_version', 'starting_timestamp', 'schema_tracking_location']\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.STREAMING_SCHEMA_WARNING","title":"koheesio.spark.readers.delta.STREAMING_SCHEMA_WARNING module-attribute
","text":"STREAMING_SCHEMA_WARNING = '\\nImportant!\\nAlthough you can start the streaming source from a specified version or timestamp, the schema of the streaming source is always the latest schema of the Delta table. You must ensure there is no incompatible schema change to the Delta table after the specified version or timestamp. Otherwise, the streaming source may return incorrect results when reading the data with an incorrect schema.'\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader","title":"koheesio.spark.readers.delta.DeltaTableReader","text":"Reads data from a Delta table and returns a DataFrame Delta Table can be read in batch or streaming mode It also supports reading change data feed (CDF) in both batch mode and streaming mode
Parameters:
Name Type Description Default table
Union[DeltaTableStep, str]
The table to read
required filter_cond
Optional[Union[Column, str]]
Filter condition to apply to the dataframe. Filters can be provided by using Column or string expressions. For example: f.col('state') == 'Ohio'
, state = 'Ohio'
or (col('col1') > 3) & (col('col2') < 9)
required columns
Columns to select from the table. One or many columns can be provided as strings. For example: ['col1', 'col2']
, ['col1']
or 'col1'
required streaming
Optional[bool]
Whether to read the table as a Stream or not
required read_change_feed
bool
readChangeFeed: Change Data Feed (CDF) feature allows Delta tables to track row-level changes between versions of a Delta table. When enabled on a Delta table, the runtime records 'change events' for all the data written into the table. This includes the row data along with metadata indicating whether the specified row was inserted, deleted, or updated. See: https://docs.databricks.com/delta/delta-change-data-feed.html
required starting_version
str
startingVersion: The Delta Lake version to start from. All table changes starting from this version (inclusive) will be read by the streaming source. You can obtain the commit versions from the version column of the DESCRIBE HISTORY command output.
required starting_timestamp
str
startingTimestamp: The timestamp to start from. All table changes committed at or after the timestamp (inclusive) will be read by the streaming source. Either provide a timestamp string (e.g. 2019-01-01T00:00:00.000Z) or a date string (e.g. 2019-01-01)
required ignore_deletes
bool
ignoreDeletes: Ignore transactions that delete data at partition boundaries. Note: Only supported for streaming tables. For more info see https://docs.databricks.com/structured-streaming/delta-lake.html#ignore-updates-and-deletes
required ignore_changes
bool
ignoreChanges: re-process updates if files had to be rewritten in the source table due to a data changing operation such as UPDATE, MERGE INTO, DELETE (within partitions), or OVERWRITE. Unchanged rows may still be emitted, therefore your downstream consumers should be able to handle duplicates. Deletes are not propagated downstream. ignoreChanges subsumes ignoreDeletes. Therefore if you use ignoreChanges, your stream will not be disrupted by either deletions or updates to the source table.
required"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.columns","title":"columns class-attribute
instance-attribute
","text":"columns: Optional[ListOfColumns] = Field(default=None, description=\"Columns to select from the table. One or many columns can be provided as strings. For example: `['col1', 'col2']`, `['col1']` or `'col1'` \")\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.filter_cond","title":"filter_cond class-attribute
instance-attribute
","text":"filter_cond: Optional[Union[Column, str]] = Field(default=None, alias='filterCondition', description=\"Filter condition to apply to the dataframe. Filters can be provided by using Column or string expressions For example: `f.col('state') == 'Ohio'`, `state = 'Ohio'` or `(col('col1') > 3) & (col('col2') < 9)`\")\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.ignore_changes","title":"ignore_changes class-attribute
instance-attribute
","text":"ignore_changes: bool = Field(default=False, alias='ignoreChanges', description='ignoreChanges: re-process updates if files had to be rewritten in the source table due to a data changing operation such as UPDATE, MERGE INTO, DELETE (within partitions), or OVERWRITE. Unchanged rows may still be emitted, therefore your downstream consumers should be able to handle duplicates. Deletes are not propagated downstream. ignoreChanges subsumes ignoreDeletes. Therefore if you use ignoreChanges, your stream will not be disrupted by either deletions or updates to the source table.')\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.ignore_deletes","title":"ignore_deletes class-attribute
instance-attribute
","text":"ignore_deletes: bool = Field(default=False, alias='ignoreDeletes', description='ignoreDeletes: Ignore transactions that delete data at partition boundaries. Note: Only supported for streaming tables. For more info see https://docs.databricks.com/structured-streaming/delta-lake.html#ignore-updates-and-deletes')\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.read_change_feed","title":"read_change_feed class-attribute
instance-attribute
","text":"read_change_feed: bool = Field(default=False, alias='readChangeFeed', description=\"Change Data Feed (CDF) feature allows Delta tables to track row-level changes between versions of a Delta table. When enabled on a Delta table, the runtime records 'change events' for all the data written into the table. This includes the row data along with metadata indicating whether the specified row was inserted, deleted, or updated. See: https://docs.databricks.com/delta/delta-change-data-feed.html\")\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.reader","title":"reader property
","text":"reader: Union[DataStreamReader, DataFrameReader]\n
Return the reader for the DeltaTableReader based on the streaming
attribute
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.schema_tracking_location","title":"schema_tracking_location class-attribute
instance-attribute
","text":"schema_tracking_location: Optional[str] = Field(default=None, alias='schemaTrackingLocation', description='schemaTrackingLocation: Track the location of source schema. Note: Recommend to enable Delta reader version: 3 and writer version: 7 for this option. For more info see https://docs.delta.io/latest/delta-column-mapping.html' + STREAMING_SCHEMA_WARNING)\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.skip_change_commits","title":"skip_change_commits class-attribute
instance-attribute
","text":"skip_change_commits: bool = Field(default=False, alias='skipChangeCommits', description='skipChangeCommits: Skip processing of change commits. Note: Only supported for streaming tables. (not supported in Open Source Delta Implementation). Prefer using skipChangeCommits over ignoreDeletes and ignoreChanges starting DBR12.1 and above. For more info see https://docs.databricks.com/structured-streaming/delta-lake.html#skip-change-commits')\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.starting_timestamp","title":"starting_timestamp class-attribute
instance-attribute
","text":"starting_timestamp: Optional[str] = Field(default=None, alias='startingTimestamp', description='startingTimestamp: The timestamp to start from. All table changes committed at or after the timestamp (inclusive) will be read by the streaming source. Either provide a timestamp string (e.g. 2019-01-01T00:00:00.000Z) or a date string (e.g. 2019-01-01)' + STREAMING_SCHEMA_WARNING)\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.starting_version","title":"starting_version class-attribute
instance-attribute
","text":"starting_version: Optional[str] = Field(default=None, alias='startingVersion', description='startingVersion: The Delta Lake version to start from. All table changes starting from this version (inclusive) will be read by the streaming source. You can obtain the commit versions from the version column of the DESCRIBE HISTORY command output.' + STREAMING_SCHEMA_WARNING)\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.streaming","title":"streaming class-attribute
instance-attribute
","text":"streaming: Optional[bool] = Field(default=False, description='Whether to read the table as a Stream or not')\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.table","title":"table class-attribute
instance-attribute
","text":"table: Union[DeltaTableStep, str] = Field(default=..., description='The table to read')\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.temp_view_name","title":"temp_view_name property
","text":"temp_view_name\n
Get the temporary view name for the dataframe for SQL queries
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.view","title":"view property
","text":"view\n
Create a temporary view of the dataframe for SQL queries
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/readers/delta.py
def execute(self):\n df = self.reader.table(self.table.table_name)\n if self.filter_cond is not None:\n df = df.filter(f.expr(self.filter_cond) if isinstance(self.filter_cond, str) else self.filter_cond)\n if self.columns is not None:\n df = df.select(*self.columns)\n self.output.df = df\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.get_options","title":"get_options","text":"get_options() -> Dict[str, Any]\n
Get the options for the DeltaTableReader based on the streaming
attribute
Source code in src/koheesio/spark/readers/delta.py
def get_options(self) -> Dict[str, Any]:\n \"\"\"Get the options for the DeltaTableReader based on the `streaming` attribute\"\"\"\n options = {\n # Enable Change Data Feed (CDF) feature\n \"readChangeFeed\": self.read_change_feed,\n # Initial position, one of:\n \"startingVersion\": self.starting_version,\n \"startingTimestamp\": self.starting_timestamp,\n }\n\n # Streaming only options\n if self.streaming:\n options = {\n **options,\n # Ignore updates and deletes, one of:\n \"ignoreDeletes\": self.ignore_deletes,\n \"ignoreChanges\": self.ignore_changes,\n \"skipChangeCommits\": self.skip_change_commits,\n \"schemaTrackingLocation\": self.schema_tracking_location,\n }\n # Batch only options\n else:\n pass # there are none... for now :)\n\n def normalize(v: Union[str, bool]):\n \"\"\"normalize values\"\"\"\n # True becomes \"true\", False becomes \"false\"\n v = str(v).lower() if isinstance(v, bool) else v\n return v\n\n # Any options with `value == None` are filtered out\n return {k: normalize(v) for k, v in options.items() if v is not None}\n
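As an illustration of the option handling above, booleans are normalized to lowercase strings and None-valued options are dropped (the table name is illustrative):
reader = DeltaTableReader(table=\"my_schema.my_table\", streaming=True, ignore_deletes=True)\nreader.get_options()\n# {'readChangeFeed': 'false', 'ignoreDeletes': 'true', 'ignoreChanges': 'false', 'skipChangeCommits': 'false'}\n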
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.set_temp_view_name","title":"set_temp_view_name","text":"set_temp_view_name()\n
Set a temporary view name for the dataframe for SQL queries
Source code in src/koheesio/spark/readers/delta.py
@model_validator(mode=\"after\")\ndef set_temp_view_name(self):\n \"\"\"Set a temporary view name for the dataframe for SQL queries\"\"\"\n table_name = self.table.table\n vw_name = get_random_string(prefix=f\"tmp_{table_name}\")\n self.__temp_view_name__ = vw_name\n return self\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableStreamReader","title":"koheesio.spark.readers.delta.DeltaTableStreamReader","text":"Reads data from a Delta table and returns a DataStream
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableStreamReader.streaming","title":"streaming class-attribute
instance-attribute
","text":"streaming: bool = True\n
"},{"location":"api_reference/spark/readers/dummy.html","title":"Dummy","text":"A simple DummyReader that returns a DataFrame with an id-column of the given range
"},{"location":"api_reference/spark/readers/dummy.html#koheesio.spark.readers.dummy.DummyReader","title":"koheesio.spark.readers.dummy.DummyReader","text":"A simple DummyReader that returns a DataFrame with an id-column of the given range
Can be used in place of any Reader without having to read from a real source.
Wraps SparkSession.range(). Output DataFrame will have a single column named \"id\" of type Long and length of the given range.
Parameters:
Name Type Description Default range
int
How large to make the Dataframe
required Example from koheesio.spark.readers.dummy import DummyReader\n\noutput_df = DummyReader(range=100).read()\n
output_df: Output DataFrame will have a single column named \"id\" of type Long
containing 100 rows (0-99).
id 0 1 ... 99"},{"location":"api_reference/spark/readers/dummy.html#koheesio.spark.readers.dummy.DummyReader.range","title":"range class-attribute
instance-attribute
","text":"range: int = Field(default=100, description='How large to make the Dataframe')\n
"},{"location":"api_reference/spark/readers/dummy.html#koheesio.spark.readers.dummy.DummyReader.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/readers/dummy.py
def execute(self):\n self.output.df = self.spark.range(self.range)\n
"},{"location":"api_reference/spark/readers/file_loader.html","title":"File loader","text":"Generic file Readers for different file formats.
Supported file formats: - CSV - Parquet - Avro - JSON - ORC - Text
Examples:
from koheesio.spark.readers import (\n CsvReader,\n ParquetReader,\n AvroReader,\n JsonReader,\n OrcReader,\n)\n\ncsv_reader = CsvReader(path=\"path/to/file.csv\", header=True)\nparquet_reader = ParquetReader(path=\"path/to/file.parquet\")\navro_reader = AvroReader(path=\"path/to/file.avro\")\njson_reader = JsonReader(path=\"path/to/file.json\")\norc_reader = OrcReader(path=\"path/to/file.orc\")\n
For more information about the available options, see Spark's official documentation.
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.AvroReader","title":"koheesio.spark.readers.file_loader.AvroReader","text":"Reads an Avro file.
This class is a convenience class that sets the format
field to FileFormat.avro
.
Extra parameters can be passed to the reader using the extra_params
attribute or as keyword arguments.
Example:
reader = AvroReader(path=\"path/to/file.avro\", mergeSchema=True)\n
Make sure to have the spark-avro
package installed in your environment.
For more information about the available options, see the official documentation.
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.AvroReader.format","title":"format class-attribute
instance-attribute
","text":"format: FileFormat = avro\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.CsvReader","title":"koheesio.spark.readers.file_loader.CsvReader","text":"Reads a CSV file.
This class is a convenience class that sets the format
field to FileFormat.csv
.
Extra parameters can be passed to the reader using the extra_params
attribute or as keyword arguments.
Example:
reader = CsvReader(path=\"path/to/file.csv\", header=True)\n
For more information about the available options, see the official pyspark documentation and read about CSV data source.
Also see the data sources generic options.
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.CsvReader.format","title":"format class-attribute
instance-attribute
","text":"format: FileFormat = csv\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat","title":"koheesio.spark.readers.file_loader.FileFormat","text":"Supported file formats.
This enum represents the supported file formats that can be used with the FileLoader class. The available file formats are: - csv: Comma-separated values format - parquet: Apache Parquet format - avro: Apache Avro format - json: JavaScript Object Notation format - orc: Apache ORC format - text: Plain text format
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.avro","title":"avro class-attribute
instance-attribute
","text":"avro = 'avro'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.csv","title":"csv class-attribute
instance-attribute
","text":"csv = 'csv'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.json","title":"json class-attribute
instance-attribute
","text":"json = 'json'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.orc","title":"orc class-attribute
instance-attribute
","text":"orc = 'orc'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.parquet","title":"parquet class-attribute
instance-attribute
","text":"parquet = 'parquet'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.text","title":"text class-attribute
instance-attribute
","text":"text = 'text'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader","title":"koheesio.spark.readers.file_loader.FileLoader","text":"Generic file reader.
Available file formats:\n- CSV\n- Parquet\n- Avro\n- JSON\n- ORC\n- Text (default)\n\nExtra parameters can be passed to the reader using the `extra_params` attribute or as keyword arguments.\n\nExample:\n```python\nreader = FileLoader(path=\"path/to/textfile.txt\", format=\"text\", header=True, lineSep=\"\n
\") ```
For more information about the available options, see Spark's\n[official pyspark documentation](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.text.html)\nand [read about text data source](https://spark.apache.org/docs/latest/sql-data-sources-text.html).\n\nAlso see the [data sources generic options](https://spark.apache.org/docs/3.5.0/sql-data-sources-generic-options.html).\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader.format","title":"format class-attribute
instance-attribute
","text":"format: FileFormat = Field(default=text, description='File format to read')\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader.path","title":"path class-attribute
instance-attribute
","text":"path: Union[Path, str] = Field(default=..., description='Path to the file to read')\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader.schema_","title":"schema_ class-attribute
instance-attribute
","text":"schema_: Optional[Union[StructType, str]] = Field(default=None, description='Schema to use when reading the file', validate_default=False, alias='schema')\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader.ensure_path_is_str","title":"ensure_path_is_str","text":"ensure_path_is_str(v)\n
Ensure that the path is a string as required by Spark.
Source code in src/koheesio/spark/readers/file_loader.py
@field_validator(\"path\")\ndef ensure_path_is_str(cls, v):\n \"\"\"Ensure that the path is a string as required by Spark.\"\"\"\n if isinstance(v, Path):\n return str(v.absolute().as_posix())\n return v\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader.execute","title":"execute","text":"execute()\n
Reads the file using the specified format, schema, while applying any extra parameters.
Source code in src/koheesio/spark/readers/file_loader.py
def execute(self):\n \"\"\"Reads the file using the specified format, schema, while applying any extra parameters.\"\"\"\n reader = self.spark.read.format(self.format)\n\n if self.schema_:\n reader.schema(self.schema_)\n\n if self.extra_params:\n reader = reader.options(**self.extra_params)\n\n self.output.df = reader.load(self.path)\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.JsonReader","title":"koheesio.spark.readers.file_loader.JsonReader","text":"Reads a JSON file.
This class is a convenience class that sets the format
field to FileFormat.json
.
Extra parameters can be passed to the reader using the extra_params
attribute or as keyword arguments.
Example:
reader = JsonReader(path=\"path/to/file.json\", allowComments=True)\n
For more information about the available options, see the official pyspark documentation and read about JSON data source.
Also see the data sources generic options.
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.JsonReader.format","title":"format class-attribute
instance-attribute
","text":"format: FileFormat = json\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.OrcReader","title":"koheesio.spark.readers.file_loader.OrcReader","text":"Reads an ORC file.
This class is a convenience class that sets the format
field to FileFormat.orc
.
Extra parameters can be passed to the reader using the extra_params
attribute or as keyword arguments.
Example:
reader = OrcReader(path=\"path/to/file.orc\", mergeSchema=True)\n
For more information about the available options, see the official documentation and read about ORC data source.
Also see the data sources generic options.
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.OrcReader.format","title":"format class-attribute
instance-attribute
","text":"format: FileFormat = orc\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.ParquetReader","title":"koheesio.spark.readers.file_loader.ParquetReader","text":"Reads a Parquet file.
This class is a convenience class that sets the format
field to FileFormat.parquet
.
Extra parameters can be passed to the reader using the extra_params
attribute or as keyword arguments.
Example:
reader = ParquetReader(path=\"path/to/file.parquet\", mergeSchema=True)\n
For more information about the available options, see the official pyspark documentation and read about Parquet data source.
Also see the data sources generic options.
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.ParquetReader.format","title":"format class-attribute
instance-attribute
","text":"format: FileFormat = parquet\n
"},{"location":"api_reference/spark/readers/hana.html","title":"Hana","text":"HANA reader.
"},{"location":"api_reference/spark/readers/hana.html#koheesio.spark.readers.hana.HanaReader","title":"koheesio.spark.readers.hana.HanaReader","text":"Wrapper around JdbcReader for SAP HANA
Notes - Refer to JdbcReader for the list of all available parameters.
- Refer to SAP HANA Client Interface Programming Reference docs for the list of all available connection string parameters: https://help.sap.com/docs/SAP_HANA_CLIENT/f1b440ded6144a54ada97ff95dac7adf/109397c2206a4ab2a5386d494f4cf75e.html
Example Note: jars should be added to the Spark session manually. This class does not take care of that.
This example depends on the SAP HANA ngdbc
JAR. e.g. ngdbc-2.5.49.
from koheesio.spark.readers.hana import HanaReader\njdbc_hana = HanaReader(\n url=\"jdbc:sap://<domain_or_ip>:<port>/?<options>\n user=\"YOUR_USERNAME\",\n password=\"***\",\n dbtable=\"schemaname.tablename\"\n)\ndf = jdbc_hana.read()\n
Parameters:
Name Type Description Default url
str
JDBC connection string. Refer to SAP HANA docs for the list of all available connection string parameters. Example: jdbc:sap://:[/?] required user
str
required password
SecretStr
required dbtable
str
Database table name, also include schema name
required options
Optional[Dict[str, Any]]
Extra options to pass to the SAP HANA JDBC driver. Refer to SAP HANA docs for the list of all available connection string parameters. Example: {\"fetchsize\": 2000, \"numPartitions\": 10}
required query
Optional[str]
Query
required format
str
The type of format to load. Defaults to 'jdbc'. Should not be changed.
required driver
str
Driver name. Be aware that the driver jar needs to be passed to the task. Should not be changed.
required"},{"location":"api_reference/spark/readers/hana.html#koheesio.spark.readers.hana.HanaReader.driver","title":"driver class-attribute
instance-attribute
","text":"driver: str = Field(default='com.sap.db.jdbc.Driver', description='Make sure that the necessary JARs are available in the cluster: ngdbc-2-x.x.x.x')\n
"},{"location":"api_reference/spark/readers/hana.html#koheesio.spark.readers.hana.HanaReader.options","title":"options class-attribute
instance-attribute
","text":"options: Optional[Dict[str, Any]] = Field(default={'fetchsize': 2000, 'numPartitions': 10}, description='Extra options to pass to the SAP HANA JDBC driver')\n
"},{"location":"api_reference/spark/readers/jdbc.html","title":"Jdbc","text":"Module for reading data from JDBC sources.
Classes:
Name Description JdbcReader
Reader for JDBC tables.
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader","title":"koheesio.spark.readers.jdbc.JdbcReader","text":"Reader for JDBC tables.
Wrapper around Spark's jdbc read format
Notes - Query has precedence over dbtable. If query and dbtable both are filled in, dbtable will be ignored!
- Extra options to the spark reader can be passed through the
options
input. Refer to Spark documentation for details: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html - Consider using
fetchsize
as one of the options, as it is greatly increases the performance of the reader - Consider using
numPartitions
, partitionColumn
, lowerBound
, upperBound
together with real or synthetic partitioning column as it will improve the reader performance
When implementing a JDBC reader, the get_options()
method should be implemented. The method should return a dict of options required for the specific JDBC driver. The get_options()
method can be overridden in the child class. Additionally, the driver
parameter should be set to the name of the JDBC driver. Be aware that the driver jar needs to be included in the Spark session; this class does not (and can not) take care of that!
Example Note: jars should be added to the Spark session manually. This class does not take care of that.
This example depends on the jar for MS SQL: https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/9.2.1.jre8/mssql-jdbc-9.2.1.jre8.jar
from koheesio.spark.readers.jdbc import JdbcReader\n\njdbc_mssql = JdbcReader(\n driver=\"com.microsoft.sqlserver.jdbc.SQLServerDriver\",\n url=\"jdbc:sqlserver://10.xxx.xxx.xxx:1433;databaseName=YOUR_DATABASE\",\n user=\"YOUR_USERNAME\",\n password=\"***\",\n dbtable=\"schemaname.tablename\",\n options={\"fetchsize\": 100},\n)\ndf = jdbc_mssql.read()\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.dbtable","title":"dbtable class-attribute
instance-attribute
","text":"dbtable: Optional[str] = Field(default=None, description='Database table name, also include schema name')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.driver","title":"driver class-attribute
instance-attribute
","text":"driver: str = Field(default=..., description='Driver name. Be aware that the driver jar needs to be passed to the task')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.format","title":"format class-attribute
instance-attribute
","text":"format: str = Field(default='jdbc', description=\"The type of format to load. Defaults to 'jdbc'.\")\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.options","title":"options class-attribute
instance-attribute
","text":"options: Optional[Dict[str, Any]] = Field(default_factory=dict, description='Extra options to pass to spark reader')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.password","title":"password class-attribute
instance-attribute
","text":"password: SecretStr = Field(default=..., description='Password belonging to the username')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.query","title":"query class-attribute
instance-attribute
","text":"query: Optional[str] = Field(default=None, description='Query')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.url","title":"url class-attribute
instance-attribute
","text":"url: str = Field(default=..., description='URL for the JDBC driver. Note, in some environments you need to use the IP Address instead of the hostname of the server.')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.user","title":"user class-attribute
instance-attribute
","text":"user: str = Field(default=..., description='User to authenticate to the server')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.execute","title":"execute","text":"execute()\n
Wrapper around Spark's jdbc read format
Source code in src/koheesio/spark/readers/jdbc.py
def execute(self):\n \"\"\"Wrapper around Spark's jdbc read format\"\"\"\n\n # Can't have both dbtable and query empty\n if not self.dbtable and not self.query:\n raise ValueError(\"Please do not leave dbtable and query both empty!\")\n\n if self.query and self.dbtable:\n self.log.info(\"Both 'query' and 'dbtable' are filled in, 'dbtable' will be ignored!\")\n\n options = self.get_options()\n\n if pw := self.password:\n options[\"password\"] = pw.get_secret_value()\n\n if query := self.query:\n options[\"query\"] = query\n self.log.info(f\"Executing query: {self.query}\")\n else:\n options[\"dbtable\"] = self.dbtable\n\n self.output.df = self.spark.read.format(self.format).options(**options).load()\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.get_options","title":"get_options","text":"get_options()\n
Dictionary of options required for the specific JDBC driver.
Note: override this method if driver requires custom names, e.g. Snowflake: sfUrl
, sfUser
, etc.
Source code in src/koheesio/spark/readers/jdbc.py
def get_options(self):\n \"\"\"\n Dictionary of options required for the specific JDBC driver.\n\n Note: override this method if driver requires custom names, e.g. Snowflake: `sfUrl`, `sfUser`, etc.\n \"\"\"\n return {\n \"driver\": self.driver,\n \"url\": self.url,\n \"user\": self.user,\n \"password\": self.password,\n **self.options,\n }\n
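To make the note above concrete, a hedged sketch of a subclass that renames options for a driver expecting custom option names (the class name and option keys are illustrative, not an actual Koheesio class):
from koheesio.spark.readers.jdbc import JdbcReader\n\nclass CustomNameJdbcReader(JdbcReader):\n    \"\"\"Illustrative only: a driver that expects 'sfUrl' / 'sfUser' style option names.\"\"\"\n    def get_options(self):\n        return {\n            \"driver\": self.driver,\n            \"sfUrl\": self.url,    # custom key instead of 'url'\n            \"sfUser\": self.user,  # custom key instead of 'user'\n            **self.options,\n        }\n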
"},{"location":"api_reference/spark/readers/kafka.html","title":"Kafka","text":"Module for KafkaReader and KafkaStreamReader.
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader","title":"koheesio.spark.readers.kafka.KafkaReader","text":"Reader for Kafka topics.
Wrapper around Spark's kafka read format. Supports both batch and streaming reads.
Parameters:
Name Type Description Default read_broker
str
Kafka brokers to read from. Should be passed as a single string with multiple brokers passed in a comma separated list
required topic
str
Kafka topic to consume.
required streaming
Optional[bool]
Whether to read the kafka topic as a stream or not.
required params
Optional[Dict[str, str]]
Arbitrary options to be applied when creating NSP Reader. If a user provides values for subscribe
or kafka.bootstrap.servers
, they will be ignored in favor of configuration passed through topic
and read_broker
respectively. Defaults to an empty dictionary.
required Notes - The
read_broker
and topic
parameters are required. - The
streaming
parameter defaults to False
. - The
params
parameter defaults to an empty dictionary. This parameter is also aliased as kafka_options
. - Any extra kafka options can also be passed as key-word arguments; these will be merged with the
params
parameter
Example from koheesio.spark.readers.kafka import KafkaReader\n\nkafka_reader = KafkaReader(\n read_broker=\"kafka-broker-1:9092,kafka-broker-2:9092\",\n topic=\"my-topic\",\n streaming=True,\n # extra kafka options can be passed as key-word arguments\n startingOffsets=\"earliest\",\n)\n
In the example above, the KafkaReader
will read from the my-topic
Kafka topic, using the brokers kafka-broker-1:9092
and kafka-broker-2:9092
. The reader will read the topic as a stream and will start reading from the earliest available offset.
The stream can be started by calling the read
or execute
method on the kafka_reader
object.
Note: The KafkaStreamReader
could be used in the example above to achieve the same result. streaming
would default to True
in that case and could be omitted from the parameters.
See Also - Official Spark Documentation: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.batch_reader","title":"batch_reader property
","text":"batch_reader\n
Returns the Spark read object for batch processing.
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.logged_option_keys","title":"logged_option_keys property
","text":"logged_option_keys\n
Keys that are allowed to be logged for the options.
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.options","title":"options property
","text":"options\n
Merge fixed parameters with arbitrary options provided by user.
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.params","title":"params class-attribute
instance-attribute
","text":"params: Optional[Dict[str, str]] = Field(default_factory=dict, alias='kafka_options', description=\"Arbitrary options to be applied when creating NSP Reader. If a user provides values for 'subscribe' or 'kafka.bootstrap.servers', they will be ignored in favor of configuration passed through 'topic' and 'read_broker' respectively.\")\n
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.read_broker","title":"read_broker class-attribute
instance-attribute
","text":"read_broker: str = Field(..., description='Kafka brokers to read from, should be passed as a single string with multiple brokers passed in a comma separated list')\n
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.reader","title":"reader property
","text":"reader\n
Returns the appropriate reader based on the streaming flag.
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.stream_reader","title":"stream_reader property
","text":"stream_reader\n
Returns the Spark readStream object.
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.streaming","title":"streaming class-attribute
instance-attribute
","text":"streaming: Optional[bool] = Field(default=False, description='Whether to read the kafka topic as a stream or not. Defaults to False.')\n
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.topic","title":"topic class-attribute
instance-attribute
","text":"topic: str = Field(default=..., description='Kafka topic to consume.')\n
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/readers/kafka.py
def execute(self):\n applied_options = {k: v for k, v in self.options.items() if k in self.logged_option_keys}\n self.log.debug(f\"Applying options {applied_options}\")\n\n self.output.df = self.reader.format(\"kafka\").options(**self.options).load()\n
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaStreamReader","title":"koheesio.spark.readers.kafka.KafkaStreamReader","text":"KafkaStreamReader is a KafkaReader that reads data as a stream
This class is identical to KafkaReader, with the streaming
parameter defaulting to True
.
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaStreamReader.streaming","title":"streaming class-attribute
instance-attribute
","text":"streaming: bool = True\n
"},{"location":"api_reference/spark/readers/memory.html","title":"Memory","text":"Create Spark DataFrame directly from the data stored in a Python variable
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.DataFormat","title":"koheesio.spark.readers.memory.DataFormat","text":"Data formats supported by the InMemoryDataReader
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.DataFormat.CSV","title":"CSV class-attribute
instance-attribute
","text":"CSV = 'csv'\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.DataFormat.JSON","title":"JSON class-attribute
instance-attribute
","text":"JSON = 'json'\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader","title":"koheesio.spark.readers.memory.InMemoryDataReader","text":"Directly read data from a Python variable and convert it to a Spark DataFrame.
Read data that is stored in one of the supported formats (see DataFormat) directly from a Python variable and convert it to a Spark DataFrame. Use cases include converting the JSON output of an API into a DataFrame, or reading CSV data received via an API (e.g. the Box API).
The advantage of using this reader is that it allows reading data directly from a Python variable, without the need to store it on disk. This can be useful when the data is small and does not need to be stored permanently.
Parameters:
Name Type Description Default data
Union[str, list, dict, bytes]
Source data
required format
DataFormat
File / data format
required schema_
Optional[StructType]
Schema that will be applied during the creation of Spark DataFrame
None
params
Optional[Dict[str, Any]]
Set of extra parameters that should be passed to the appropriate reader (csv / json). Optionally, the user can pass the parameters that are specific to the reader (e.g. multiLine
for JSON reader) as keyword arguments. These will be merged with the params
parameter.
dict
Example # Read CSV data from a string\ndf1 = InMemoryDataReader(format=DataFormat.CSV, data='foo,bar\\nA,1\\nB,2')\n\n# Read JSON data from a string\ndf2 = InMemoryDataReader(format=DataFormat.JSON, data='{\"foo\": \"A\", \"bar\": 1}')\n\n# Read JSON data from a list of strings\ndf3 = InMemoryDataReader(format=DataFormat.JSON, data=['{\"foo\": \"A\", \"bar\": 1}', '{\"foo\": \"B\", \"bar\": 2}'])\n
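Note that the calls above only configure the reader; as with other Koheesio readers, the DataFrame itself is obtained by calling the read (or execute) method. A minimal sketch:
reader = InMemoryDataReader(format=DataFormat.CSV, data='foo,bar\\nA,1\\nB,2')\ndf = reader.read()  # after execute(), the DataFrame is also available as reader.output.df\n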
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader.data","title":"data class-attribute
instance-attribute
","text":"data: Union[str, list, dict, bytes] = Field(default=..., description='Source data')\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader.format","title":"format class-attribute
instance-attribute
","text":"format: DataFormat = Field(default=..., description='File / data format')\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader.params","title":"params class-attribute
instance-attribute
","text":"params: Optional[Dict[str, Any]] = Field(default_factory=dict, description='[Optional] Set of extra parameters that should be passed to the appropriate reader (csv / json)')\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader.schema_","title":"schema_ class-attribute
instance-attribute
","text":"schema_: Optional[StructType] = Field(default=None, alias='schema', description='[Optional] Schema that will be applied during the creation of Spark DataFrame')\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader.execute","title":"execute","text":"execute()\n
Execute method appropriate to the specific data format
Source code in src/koheesio/spark/readers/memory.py
def execute(self):\n \"\"\"\n Execute method appropriate to the specific data format\n \"\"\"\n _func = getattr(InMemoryDataReader, f\"_{self.format}\")\n _df = partial(_func, self, self._rdd)()\n self.output.df = _df\n
"},{"location":"api_reference/spark/readers/metastore.html","title":"Metastore","text":"Create Spark DataFrame from table in Metastore
"},{"location":"api_reference/spark/readers/metastore.html#koheesio.spark.readers.metastore.MetastoreReader","title":"koheesio.spark.readers.metastore.MetastoreReader","text":"Reader for tables/views from Spark Metastore
Parameters:
Name Type Description Default table
str
Table name in spark metastore
required"},{"location":"api_reference/spark/readers/metastore.html#koheesio.spark.readers.metastore.MetastoreReader.table","title":"table class-attribute
instance-attribute
","text":"table: str = Field(default=..., description='Table name in spark metastore')\n
"},{"location":"api_reference/spark/readers/metastore.html#koheesio.spark.readers.metastore.MetastoreReader.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/readers/metastore.py
def execute(self):\n self.output.df = self.spark.table(self.table)\n
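For illustration, a minimal usage sketch (the table name is a placeholder):
from koheesio.spark.readers.metastore import MetastoreReader\n\n# read a table registered in the metastore into a DataFrame\ndf = MetastoreReader(table=\"my_database.my_table\").read()\n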
"},{"location":"api_reference/spark/readers/rest_api.html","title":"Rest api","text":"This module provides the RestApiReader class for interacting with RESTful APIs.
The RestApiReader class is designed to fetch data from RESTful APIs and store the response in a DataFrame. It supports different transports, e.g. paginated HTTP or async HTTP. The main entry point is the execute
method, which performs the transport.execute() call and provides the data from the API calls.
For more details on how to use this class and its methods, refer to the class docstring.
"},{"location":"api_reference/spark/readers/rest_api.html#koheesio.spark.readers.rest_api.RestApiReader","title":"koheesio.spark.readers.rest_api.RestApiReader","text":"A reader class that executes an API call and stores the response in a DataFrame.
Parameters:
Name Type Description Default transport
Union[InstanceOf[AsyncHttpGetStep], InstanceOf[HttpGetStep]]
The HTTP transport step.
required spark_schema
Union[str, StructType, List[str], Tuple[str, ...], AtomicType]
The pyspark schema of the response.
required Attributes:
Name Type Description transport
Union[InstanceOf[AsyncHttpGetStep], InstanceOf[HttpGetStep]]
The HTTP transport step.
spark_schema
Union[str, StructType, List[str], Tuple[str, ...], AtomicType]
The pyspark schema of the response.
Returns:
Type Description Output
The output of the reader, which includes the DataFrame.
Examples:
Here are some examples of how to use this class:
Example 1: Paginated Transport
import requests\nfrom requests.adapters import HTTPAdapter\nfrom urllib3 import Retry\n\nfrom koheesio.steps.http import PaginatedHtppGetStep\nfrom koheesio.spark.readers.rest_api import RestApiReader\n\n# maximum number of retries for failed requests\nmax_retries = 3\n\nsession = requests.Session()\nretry_logic = Retry(total=max_retries, status_forcelist=[503])\nsession.mount(\"https://\", HTTPAdapter(max_retries=retry_logic))\nsession.mount(\"http://\", HTTPAdapter(max_retries=retry_logic))\n\ntransport = PaginatedHtppGetStep(\n    url=\"https://api.example.com/data?page={page}\",\n    paginate=True,\n    pages=3,\n    session=session,\n)\ntask = RestApiReader(transport=transport, spark_schema=\"id: int, page:int, value: string\")\ntask.execute()\nall_data = [row.asDict() for row in task.output.df.collect()]\n
Example 2: Async Transport
from aiohttp import ClientSession, TCPConnector\nfrom aiohttp_retry import ExponentialRetry\nfrom yarl import URL\n\nfrom koheesio.steps.asyncio.http import AsyncHttpGetStep\nfrom koheesio.spark.readers.rest_api import RestApiReader\n\nsession = ClientSession()\nurls = [URL(\"http://httpbin.org/get\"), URL(\"http://httpbin.org/get\")]\nretry_options = ExponentialRetry()\nconnector = TCPConnector(limit=10)\ntransport = AsyncHttpGetStep(\n client_session=session,\n url=urls,\n retry_options=retry_options,\n connector=connector,\n)\n\ntask = RestApiReader(transport=transport, spark_schema=\"id: int, page:int, value: string\")\ntask.execute()\nall_data = [row.asDict() for row in task.output.df.collect()]\n
"},{"location":"api_reference/spark/readers/rest_api.html#koheesio.spark.readers.rest_api.RestApiReader.spark_schema","title":"spark_schema class-attribute
instance-attribute
","text":"spark_schema: Union[str, StructType, List[str], Tuple[str, ...], AtomicType] = Field(..., description='The pyspark schema of the response')\n
"},{"location":"api_reference/spark/readers/rest_api.html#koheesio.spark.readers.rest_api.RestApiReader.transport","title":"transport class-attribute
instance-attribute
","text":"transport: Union[InstanceOf[AsyncHttpGetStep], InstanceOf[HttpGetStep]] = Field(..., description='HTTP transport step', exclude=True)\n
"},{"location":"api_reference/spark/readers/rest_api.html#koheesio.spark.readers.rest_api.RestApiReader.execute","title":"execute","text":"execute() -> Output\n
Executes the API call and stores the response in a DataFrame.
Returns:
Type Description Output
The output of the reader, which includes the DataFrame.
Source code in src/koheesio/spark/readers/rest_api.py
def execute(self) -> Reader.Output:\n \"\"\"\n Executes the API call and stores the response in a DataFrame.\n\n Returns\n -------\n Reader.Output\n The output of the reader, which includes the DataFrame.\n \"\"\"\n raw_data = self.transport.execute()\n\n if isinstance(raw_data, HttpGetStep.Output):\n data = raw_data.response_json\n elif isinstance(raw_data, AsyncHttpGetStep.Output):\n data = [d for d, _ in raw_data.responses_urls] # type: ignore\n\n if data:\n self.output.df = self.spark.createDataFrame(data=data, schema=self.spark_schema) # type: ignore\n
"},{"location":"api_reference/spark/readers/snowflake.html","title":"Snowflake","text":"Module containing Snowflake reader classes.
This module contains classes for reading data from Snowflake. The classes are used to create a Spark DataFrame from a Snowflake table or a query.
Classes:
Name Description SnowflakeReader
Reader for Snowflake tables.
Query
Reader for Snowflake queries.
DbTableQuery
Reader for Snowflake queries that return a single row.
Notes The classes are defined in the koheesio.steps.integrations.snowflake module; this module simply inherits from the classes defined there.
See Also - koheesio.spark.readers.Reader Base class for all Readers.
- koheesio.steps.integrations.snowflake Module containing Snowflake classes.
More detailed class descriptions can be found in the class docstrings.
"},{"location":"api_reference/spark/readers/spark_sql_reader.html","title":"Spark sql reader","text":"This module contains the SparkSqlReader class which reads the SparkSQL compliant query and returns the dataframe.
"},{"location":"api_reference/spark/readers/spark_sql_reader.html#koheesio.spark.readers.spark_sql_reader.SparkSqlReader","title":"koheesio.spark.readers.spark_sql_reader.SparkSqlReader","text":"SparkSqlReader reads the SparkSQL compliant query and returns the dataframe.
This SQL can originate from a string or a file and may contain placeholders (parameters) for templating. - Placeholders are identified with ${placeholder}. - Placeholders can be passed as explicit params (params) or as implicit params (kwargs).
Example SQL script (example.sql):
SELECT id, id + 1 AS incremented_id, ${dynamic_column} AS extra_column\nFROM ${table_name}\n
Python code:
from koheesio.spark.readers import SparkSqlReader\n\nreader = SparkSqlReader(\n    sql_path=\"example.sql\",\n    # params can also be passed as kwargs\n    dynamic_column=\"name\",\n    table_name=\"my_table\",\n)\nreader.execute()\n
In this example, the SQL script is read from a file and the placeholders are replaced with the given params. The resulting SQL query is:
SELECT id, id + 1 AS incremented_id, name AS extra_column\nFROM my_table\n
The query is then executed and the resulting DataFrame is stored in the output.df
attribute.
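Alternatively, the SQL can be passed as a string via the sql parameter, with the placeholders supplied through params. A minimal sketch based on the parameters documented below:
reader = SparkSqlReader(\n    sql=\"SELECT id, ${dynamic_column} AS extra_column FROM ${table_name}\",\n    params={\"dynamic_column\": \"name\", \"table_name\": \"my_table\"},\n)\nreader.execute()\ndf = reader.output.df\n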
Parameters:
Name Type Description Default sql_path
str or Path
Path to a SQL file
required sql
str
SQL query to execute
required params
dict
Placeholders (parameters) for templating. These are identified with ${placeholder} in the SQL script.
required Notes Any arbitrary kwargs passed to the class will be added to params.
"},{"location":"api_reference/spark/readers/spark_sql_reader.html#koheesio.spark.readers.spark_sql_reader.SparkSqlReader.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/readers/spark_sql_reader.py
def execute(self):\n self.output.df = self.spark.sql(self.query)\n
"},{"location":"api_reference/spark/readers/teradata.html","title":"Teradata","text":"Teradata reader.
"},{"location":"api_reference/spark/readers/teradata.html#koheesio.spark.readers.teradata.TeradataReader","title":"koheesio.spark.readers.teradata.TeradataReader","text":"Wrapper around JdbcReader for Teradata.
Notes - Consider using synthetic partitioning column when using partitioned read:
MOD(HASHBUCKET(HASHROW(<TABLE>.<COLUMN>)), <NUM_PARTITIONS>)
- Relevant jars should be added to the Spark session manually. This class does not take care of that.
See Also - Refer to JdbcReader for the list of all available parameters.
- Refer to Teradata docs for the list of all available connection string parameters: https://teradata-docs.s3.amazonaws.com/doc/connectivity/jdbc/reference/current/jdbcug_chapter_2.html#BABJIHBJ
Example This example depends on the Teradata terajdbc4
JAR. e.g. terajdbc4-17.20.00.15. Keep in mind that older versions of terajdbc4
drivers also require tdgssconfig
JAR.
from koheesio.spark.readers.teradata import TeradataReader\n\ntd = TeradataReader(\n url=\"jdbc:teradata://<domain_or_ip>/logmech=ldap,charset=utf8,database=<db>,type=fastexport, maybenull=on\",\n user=\"YOUR_USERNAME\",\n password=\"***\",\n dbtable=\"schemaname.tablename\",\n)\n
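As a sketch of a partitioned read, the standard Spark JDBC partitioning options can be passed through options; the partition column below is a hypothetical numeric column (or the synthetic expression from the Notes above):
td_partitioned = TeradataReader(\n    url=\"jdbc:teradata://<domain_or_ip>/logmech=ldap,charset=utf8,database=<db>,type=fastexport, maybenull=on\",\n    user=\"YOUR_USERNAME\",\n    password=\"***\",\n    dbtable=\"schemaname.tablename\",\n    options={\n        \"fetchsize\": 2000,\n        \"numPartitions\": 10,\n        \"partitionColumn\": \"id\",  # hypothetical numeric column\n        \"lowerBound\": \"0\",\n        \"upperBound\": \"1000000\",\n    },\n)\n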
Parameters:
Name Type Description Default url
str
JDBC connection string. Refer to Teradata docs for the list of all available connection string parameters. Example: jdbc:teradata://<domain_or_ip>/logmech=ldap,charset=utf8,database=<db>,type=fastexport, maybenull=on
required user
str
Username
required password
SecretStr
Password
required dbtable
str
Database table name, also include schema name
required options
Optional[Dict[str, Any]]
Extra options to pass to the Teradata JDBC driver. Refer to Teradata docs for the list of all available connection string parameters.
{\"fetchsize\": 2000, \"numPartitions\": 10}
query
Optional[str]
Query
None
format
str
The type of format to load. Defaults to 'jdbc'. Should not be changed.
required driver
str
Driver name. Be aware that the driver jar needs to be passed to the task. Should not be changed.
required"},{"location":"api_reference/spark/readers/teradata.html#koheesio.spark.readers.teradata.TeradataReader.driver","title":"driver class-attribute
instance-attribute
","text":"driver: str = Field('com.teradata.jdbc.TeraDriver', description='Make sure that the necessary JARs are available in the cluster: terajdbc4-x.x.x.x')\n
"},{"location":"api_reference/spark/readers/teradata.html#koheesio.spark.readers.teradata.TeradataReader.options","title":"options class-attribute
instance-attribute
","text":"options: Optional[Dict[str, Any]] = Field({'fetchsize': 2000, 'numPartitions': 10}, description='Extra options to pass to the Teradata JDBC driver')\n
"},{"location":"api_reference/spark/readers/databricks/index.html","title":"Databricks","text":""},{"location":"api_reference/spark/readers/databricks/autoloader.html","title":"Autoloader","text":"Read from a location using Databricks' autoloader
Autoloader can ingest JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE file formats.
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader","title":"koheesio.spark.readers.databricks.autoloader.AutoLoader","text":"Read from a location using Databricks' autoloader
Autoloader can ingest JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE file formats.
Notes autoloader
is a Spark Structured Streaming
function!
Although most transformations are compatible with Spark Structured Streaming
, not all of them are. As a result, be mindful with your downstream transformations.
Parameters:
Name Type Description Default format
Union[str, AutoLoaderFormat]
The file format, used in cloudFiles.format
. Autoloader supports JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE file formats.
required location
str
The location where the files are located, used in cloudFiles.location
required schema_location
str
The location for storing inferred schema and supporting schema evolution, used in cloudFiles.schemaLocation
.
required options
Optional[Dict[str, str]]
Extra inputs to provide to the autoloader. For a full list of inputs, see https://docs.databricks.com/ingestion/auto-loader/options.html
{}
Example from koheesio.spark.readers.databricks import AutoLoader, AutoLoaderFormat\n\nresult_df = AutoLoader(\n format=AutoLoaderFormat.JSON,\n location=\"some_s3_path\",\n schema_location=\"other_s3_path\",\n options={\"multiLine\": \"true\"},\n).read()\n
See Also Some other useful documentation:
- autoloader: https://docs.databricks.com/ingestion/auto-loader/index.html
- Spark Structured Streaming: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.format","title":"format class-attribute
instance-attribute
","text":"format: Union[str, AutoLoaderFormat] = Field(default=..., description=__doc__)\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.location","title":"location class-attribute
instance-attribute
","text":"location: str = Field(default=..., description='The location where the files are located, used in `cloudFiles.location`')\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.options","title":"options class-attribute
instance-attribute
","text":"options: Optional[Dict[str, str]] = Field(default_factory=dict, description='Extra inputs to provide to the autoloader. For a full list of inputs, see https://docs.databricks.com/ingestion/auto-loader/options.html')\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.schema_location","title":"schema_location class-attribute
instance-attribute
","text":"schema_location: str = Field(default=..., alias='schemaLocation', description='The location for storing inferred schema and supporting schema evolution, used in `cloudFiles.schemaLocation`.')\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.execute","title":"execute","text":"execute()\n
Reads from the given location with the given options using Autoloader
Source code in src/koheesio/spark/readers/databricks/autoloader.py
def execute(self):\n \"\"\"Reads from the given location with the given options using Autoloader\"\"\"\n self.output.df = self.reader().load(self.location)\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.get_options","title":"get_options","text":"get_options()\n
Get the options for the autoloader
Source code in src/koheesio/spark/readers/databricks/autoloader.py
def get_options(self):\n \"\"\"Get the options for the autoloader\"\"\"\n self.options.update(\n {\n \"cloudFiles.format\": self.format,\n \"cloudFiles.schemaLocation\": self.schema_location,\n }\n )\n return self.options\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.reader","title":"reader","text":"reader()\n
Return the reader for the autoloader
Source code in src/koheesio/spark/readers/databricks/autoloader.py
def reader(self):\n \"\"\"Return the reader for the autoloader\"\"\"\n return self.spark.readStream.format(\"cloudFiles\").options(**self.get_options())\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.validate_format","title":"validate_format","text":"validate_format(format_specified)\n
Validate format
value
Source code in src/koheesio/spark/readers/databricks/autoloader.py
@field_validator(\"format\")\ndef validate_format(cls, format_specified):\n \"\"\"Validate `format` value\"\"\"\n if isinstance(format_specified, str):\n if format_specified.upper() in [f.value.upper() for f in AutoLoaderFormat]:\n format_specified = getattr(AutoLoaderFormat, format_specified.upper())\n return str(format_specified.value)\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat","title":"koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat","text":"The file format, used in cloudFiles.format
Autoloader supports JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE file formats.
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.AVRO","title":"AVRO class-attribute
instance-attribute
","text":"AVRO = 'avro'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.BINARYFILE","title":"BINARYFILE class-attribute
instance-attribute
","text":"BINARYFILE = 'binaryfile'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.CSV","title":"CSV class-attribute
instance-attribute
","text":"CSV = 'csv'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.JSON","title":"JSON class-attribute
instance-attribute
","text":"JSON = 'json'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.ORC","title":"ORC class-attribute
instance-attribute
","text":"ORC = 'orc'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.PARQUET","title":"PARQUET class-attribute
instance-attribute
","text":"PARQUET = 'parquet'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.TEXT","title":"TEXT class-attribute
instance-attribute
","text":"TEXT = 'text'\n
"},{"location":"api_reference/spark/transformations/index.html","title":"Transformations","text":"This module contains the base classes for all transformations.
See class docstrings for more information.
References For a comprehensive guide on the usage, examples, and additional features of Transformation classes, please refer to the reference/concepts/steps/transformations section of the Koheesio documentation.
Classes:
Name Description Transformation
Base class for all transformations
ColumnsTransformation
Extended Transformation class with a preset validator for handling column(s) data
ColumnsTransformationWithTarget
Extended ColumnsTransformation class with an additional target_column
field
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation","title":"koheesio.spark.transformations.ColumnsTransformation","text":"Extended Transformation class with a preset validator for handling column(s) data with a standardized input for a single column or multiple columns.
Concept A ColumnsTransformation is a Transformation with a standardized input for column or columns. The columns
are stored as a list. Either a single string, or a list of strings can be passed to enter the columns
. column
and columns
are aliases to one another - internally the name columns
should be used though.
columns
are stored as a list - either a single string, or a list of strings can be passed to enter the
columns
column
and columns
are aliases to one another - internally the name columns
should be used though.
If more than one column is passed, the behavior of the Class changes this way: - the transformation will be run in a loop against all the given columns
Configuring the ColumnsTransformation The ColumnsTransformation class has a ColumnConfig
class that can be used to configure the behavior of the class. This class has the following fields: - run_for_all_data_type
allows to run the transformation for all columns of a given type.
-
limit_data_type
allows to limit the transformation to a specific data type.
-
data_type_strict_mode
Toggles strict mode for data type validation. Will only work if limit_data_type
is set.
Note that Data types need to be specified as a SparkDatatype enum.
See the docstrings of the ColumnConfig
class for more information. See the SparkDatatype enum for a list of available data types.
Users should not have to interact with the ColumnConfig
class directly.
Parameters:
Name Type Description Default columns
The column (or list of columns) to apply the transformation to. Alias: column
required Example from pyspark.sql import functions as f\nfrom koheesio.steps.transformations import ColumnsTransformation\n\n\nclass AddOne(ColumnsTransformation):\n def execute(self):\n for column in self.get_columns():\n self.output.df = self.df.withColumn(column, f.col(column) + 1)\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.columns","title":"columns class-attribute
instance-attribute
","text":"columns: ListOfColumns = Field(default='', alias='column', description='The column (or list of columns) to apply the transformation to. Alias: column')\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.data_type_strict_mode_is_set","title":"data_type_strict_mode_is_set property
","text":"data_type_strict_mode_is_set: bool\n
Returns True if data_type_strict_mode is set
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.limit_data_type_is_set","title":"limit_data_type_is_set property
","text":"limit_data_type_is_set: bool\n
Returns True if limit_data_type is set
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.run_for_all_is_set","title":"run_for_all_is_set property
","text":"run_for_all_is_set: bool\n
Returns True if the transformation should be run for all columns of a given type
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.ColumnConfig","title":"ColumnConfig","text":"Koheesio ColumnsTransformation specific Config
Parameters:
Name Type Description Default run_for_all_data_type
allows to run the transformation for all columns of a given type. A user can trigger this behavior by either omitting the columns
parameter or by passing a single *
as a column name. In both cases, the run_for_all_data_type
will be used to determine the data type. Value should be be passed as a SparkDatatype enum. (default: [None])
required limit_data_type
allows to limit the transformation to a specific data type. Value should be passed as a SparkDatatype enum. (default: [None])
required data_type_strict_mode
Toggles strict mode for data type validation. Will only work if limit_data_type
is set. - when True, a ValueError will be raised if any column does not adhere to the limit_data_type
- when False, a warning will be thrown and the column will be skipped instead (default: False)
required"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.ColumnConfig.data_type_strict_mode","title":"data_type_strict_mode class-attribute
instance-attribute
","text":"data_type_strict_mode: bool = False\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type: Optional[List[SparkDatatype]] = [None]\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type: Optional[List[SparkDatatype]] = [None]\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.column_type_of_col","title":"column_type_of_col","text":"column_type_of_col(col: Union[str, Column], df: Optional[DataFrame] = None, simple_return_mode: bool = True) -> Union[DataType, str]\n
Returns the dataType of a Column object as a string.
The Column object does not have a type attribute, so we have to ask the DataFrame its schema and find the type based on the column name. We retrieve the name of the column from the Column object by calling toString() from the JVM.
Examples:
input_df: | str_column | int_column | |------------|------------| | hello | 1 | | world | 2 |
# using the AddOne transformation from the example above\nadd_one = AddOne(\n columns=[\"str_column\", \"int_column\"],\n df=input_df,\n)\nadd_one.column_type_of_col(\"str_column\") # returns \"string\"\nadd_one.column_type_of_col(\"int_column\") # returns \"integer\"\n# returns IntegerType\nadd_one.column_type_of_col(\"int_column\", simple_return_mode=False)\n
Parameters:
Name Type Description Default col
Union[str, Column]
The column to check the type of
required df
Optional[DataFrame]
The DataFrame belonging to the column. If not provided, the DataFrame passed to the constructor will be used.
None
simple_return_mode
bool
If True, the return value will be a simple string. If False, the return value will be a SparkDatatype enum.
True
Returns:
Name Type Description datatype
str
The type of the column as a string
Source code in src/koheesio/spark/transformations/__init__.py
def column_type_of_col(\n self, col: Union[str, Column], df: Optional[DataFrame] = None, simple_return_mode: bool = True\n) -> Union[DataType, str]:\n \"\"\"\n Returns the dataType of a Column object as a string.\n\n The Column object does not have a type attribute, so we have to ask the DataFrame its schema and find the type\n based on the column name. We retrieve the name of the column from the Column object by calling toString() from\n the JVM.\n\n Examples\n --------\n __input_df:__\n | str_column | int_column |\n |------------|------------|\n | hello | 1 |\n | world | 2 |\n\n ```python\n # using the AddOne transformation from the example above\n add_one = AddOne(\n columns=[\"str_column\", \"int_column\"],\n df=input_df,\n )\n add_one.column_type_of_col(\"str_column\") # returns \"string\"\n add_one.column_type_of_col(\"int_column\") # returns \"integer\"\n # returns IntegerType\n add_one.column_type_of_col(\"int_column\", simple_return_mode=False)\n ```\n\n Parameters\n ----------\n col: Union[str, Column]\n The column to check the type of\n\n df: Optional[DataFrame]\n The DataFrame belonging to the column. If not provided, the DataFrame passed to the constructor\n will be used.\n\n simple_return_mode: bool\n If True, the return value will be a simple string. If False, the return value will be a SparkDatatype enum.\n\n Returns\n -------\n datatype: str\n The type of the column as a string\n \"\"\"\n df = df or self.df\n if not df:\n raise RuntimeError(\"No valid Dataframe was passed\")\n\n if not isinstance(col, Column):\n col = f.col(col)\n\n # ask the JVM for the name of the column\n # noinspection PyProtectedMember\n col_name = col._jc.toString()\n\n # In order to check the datatype of the column, we have to ask the DataFrame its schema\n df_col = [c for c in df.schema if c.name == col_name][0]\n\n if simple_return_mode:\n return SparkDatatype(df_col.dataType.typeName()).value\n\n return df_col.dataType\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.get_all_columns_of_specific_type","title":"get_all_columns_of_specific_type","text":"get_all_columns_of_specific_type(data_type: Union[str, SparkDatatype]) -> List[str]\n
Get all columns from the dataframe of a given type
A DataFrame needs to be available in order to get the columns. If no DataFrame is available, a ValueError will be raised.
Note: only one data type can be passed to this method. If you want to get columns of multiple data types, you have to call this method multiple times.
Parameters:
Name Type Description Default data_type
Union[str, SparkDatatype]
The data type to get the columns for
required Returns:
Type Description List[str]
A list of column names of the given data type
Source code in src/koheesio/spark/transformations/__init__.py
def get_all_columns_of_specific_type(self, data_type: Union[str, SparkDatatype]) -> List[str]:\n \"\"\"Get all columns from the dataframe of a given type\n\n A DataFrame needs to be available in order to get the columns. If no DataFrame is available, a ValueError will\n be raised.\n\n Note: only one data type can be passed to this method. If you want to get columns of multiple data types, you\n have to call this method multiple times.\n\n Parameters\n ----------\n data_type: Union[str, SparkDatatype]\n The data type to get the columns for\n\n Returns\n -------\n List[str]\n A list of column names of the given data type\n \"\"\"\n if not self.df:\n raise ValueError(\"No dataframe available - cannot get columns\")\n\n expected_data_type = (SparkDatatype.from_string(data_type) if isinstance(data_type, str) else data_type).value\n\n columns_of_given_type: List[str] = [\n col for col in self.df.columns if self.df.schema[col].dataType.typeName() == expected_data_type\n ]\n return columns_of_given_type\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.get_columns","title":"get_columns","text":"get_columns() -> iter\n
Return an iterator of the columns
Source code in src/koheesio/spark/transformations/__init__.py
def get_columns(self) -> iter:\n \"\"\"Return an iterator of the columns\"\"\"\n # If `run_for_all_is_set` is True, we want to run the transformation for all columns of a given type\n if self.run_for_all_is_set:\n columns = []\n for data_type in self.ColumnConfig.run_for_all_data_type:\n columns += self.get_all_columns_of_specific_type(data_type)\n else:\n columns = self.columns\n\n for column in columns:\n if self.is_column_type_correct(column):\n yield column\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.get_limit_data_types","title":"get_limit_data_types","text":"get_limit_data_types()\n
Get the limit_data_type as a list of strings
Source code in src/koheesio/spark/transformations/__init__.py
def get_limit_data_types(self):\n \"\"\"Get the limit_data_type as a list of strings\"\"\"\n return [dt.value for dt in self.ColumnConfig.limit_data_type]\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.is_column_type_correct","title":"is_column_type_correct","text":"is_column_type_correct(column)\n
Check if column type is correct and handle it if not, when limit_data_type is set
Source code in src/koheesio/spark/transformations/__init__.py
def is_column_type_correct(self, column):\n \"\"\"Check if column type is correct and handle it if not, when limit_data_type is set\"\"\"\n if not self.limit_data_type_is_set:\n return True\n\n if self.column_type_of_col(column) in (limit_data_types := self.get_limit_data_types()):\n return True\n\n # Raises a ValueError if the Column object is not of a given type and data_type_strict_mode is set\n if self.data_type_strict_mode_is_set:\n raise ValueError(\n f\"Critical error: {column} is not of type {limit_data_types}. Exception is raised because \"\n f\"`data_type_strict_mode` is set to True for {self.name}.\"\n )\n\n # Otherwise, throws a warning that the Column object is not of a given type\n self.log.warning(f\"Column `{column}` is not of type `{limit_data_types}` and will be skipped.\")\n return False\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.set_columns","title":"set_columns","text":"set_columns(columns_value)\n
Validate columns through the columns configuration provided
Source code in src/koheesio/spark/transformations/__init__.py
@field_validator(\"columns\", mode=\"before\")\ndef set_columns(cls, columns_value):\n \"\"\"Validate columns through the columns configuration provided\"\"\"\n columns = columns_value\n run_for_all_data_type = cls.ColumnConfig.run_for_all_data_type\n\n if run_for_all_data_type and len(columns) == 0:\n columns = [\"*\"]\n\n if columns[0] == \"*\" and not run_for_all_data_type:\n raise ValueError(\"Cannot use '*' as a column name when no run_for_all_data_type is set\")\n\n return columns\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformationWithTarget","title":"koheesio.spark.transformations.ColumnsTransformationWithTarget","text":"Extended ColumnsTransformation class with an additional target_column
field
Using this class makes implementing Transformations significantly easier.
Concept A ColumnsTransformationWithTarget
is a ColumnsTransformation
with an additional target_column
field. This field can be used to store the result of the transformation in a new column.
If the target_column
is not provided, the result will be stored in the source column.
If more than one column is passed, the behavior of the Class changes this way:
- the transformation will be run in a loop against all the given columns
- automatically handles the renaming of the columns when more than one column is passed
- the
target_column
will be used as a suffix. Leaving this blank will result in the original columns being renamed
The func
method should be implemented in the child class. This method should return the transformation that will be applied to the column(s). The execute method (already preset) will use the get_columns_with_target
method to loop over all the columns and apply this function to transform the DataFrame.
Parameters:
Name Type Description Default columns
ListOfColumns
The column (or list of columns) to apply the transformation to. Alias: column. If not provided, the run_for_all_data_type
will be used to determine the data type. If run_for_all_data_type
is not set, the transformation will be run for all columns of a given type.
*
target_column
Optional[str]
The name of the column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this input will be used as a suffix instead.
None
Example Writing your own transformation using the ColumnsTransformationWithTarget
class:
from pyspark.sql import Column\nfrom koheesio.steps.transformations import ColumnsTransformationWithTarget\n\n\nclass AddOneWithTarget(ColumnsTransformationWithTarget):\n def func(self, col: Column):\n return col + 1\n
In the above example, the func
method is implemented to add 1 to the values of a given column.
In order to use this transformation, we can call the transform
method:
from pyspark.sql import SparkSession\n\n# create a DataFrame with 3 rows\ndf = SparkSession.builder.getOrCreate().range(3)\n\noutput_df = AddOneWithTarget(column=\"id\", target_column=\"new_id\").transform(df)\n
The output_df
will now contain the original DataFrame with an additional column called new_id
with the values of id
+ 1.
output_df:
id new_id 0 1 1 2 2 3 Note: The target_column
will be used as a suffix when more than one column is given as source. Leaving this blank will result in the original columns being renamed.
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformationWithTarget.target_column","title":"target_column class-attribute
instance-attribute
","text":"target_column: Optional[str] = Field(default=None, alias='target_suffix', description='The column to store the result in. If not provided, the result will be stored in the sourcecolumn. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix')\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformationWithTarget.execute","title":"execute","text":"execute()\n
Execute on a ColumnsTransformationWithTarget handles self.df (input) and set self.output.df (output) This can be left unchanged, and hence should not be implemented in the child class.
Source code in src/koheesio/spark/transformations/__init__.py
def execute(self):\n \"\"\"Execute on a ColumnsTransformationWithTarget handles self.df (input) and set self.output.df (output)\n This can be left unchanged, and hence should not be implemented in the child class.\n \"\"\"\n df = self.df\n\n for target_column, column in self.get_columns_with_target():\n func = self.func # select the applicable function\n df = df.withColumn(\n target_column,\n func(f.col(column)),\n )\n\n self.output.df = df\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformationWithTarget.func","title":"func abstractmethod
","text":"func(column: Column) -> Column\n
The function that will be run on a single Column of the DataFrame
The func
method should be implemented in the child class. This method should return the transformation that will be applied to the column(s). The execute method (already preset) will use the get_columns_with_target
method to loop over all the columns and apply this function to transform the DataFrame.
Parameters:
Name Type Description Default column
Column
The column to apply the transformation to
required Returns:
Type Description Column
The transformed column
Source code in src/koheesio/spark/transformations/__init__.py
@abstractmethod\ndef func(self, column: Column) -> Column:\n \"\"\"The function that will be run on a single Column of the DataFrame\n\n The `func` method should be implemented in the child class. This method should return the transformation that\n will be applied to the column(s). The execute method (already preset) will use the `get_columns_with_target`\n method to loop over all the columns and apply this function to transform the DataFrame.\n\n Parameters\n ----------\n column: Column\n The column to apply the transformation to\n\n Returns\n -------\n Column\n The transformed column\n \"\"\"\n raise NotImplementedError\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformationWithTarget.get_columns_with_target","title":"get_columns_with_target","text":"get_columns_with_target() -> iter\n
Return an iterator of the columns
Works just like in get_columns from the ColumnsTransformation class except that it handles the target_column
as well.
If more than one column is passed, the behavior of the Class changes this way: - the transformation will be run in a loop against all the given columns - the target_column will be used as a suffix. Leaving this blank will result in the original columns being renamed.
Returns:
Type Description iter
An iterator of tuples containing the target column name and the original column name
Source code in src/koheesio/spark/transformations/__init__.py
def get_columns_with_target(self) -> iter:\n \"\"\"Return an iterator of the columns\n\n Works just like in get_columns from the ColumnsTransformation class except that it handles the `target_column`\n as well.\n\n If more than one column is passed, the behavior of the Class changes this way:\n - the transformation will be run in a loop against all the given columns\n - the target_column will be used as a suffix. Leaving this blank will result in the original columns being\n renamed.\n\n Returns\n -------\n iter\n An iterator of tuples containing the target column name and the original column name\n \"\"\"\n columns = [*self.get_columns()]\n\n for column in columns:\n # ensures that we at least use the original column name\n target_column = self.target_column or column\n\n if len(columns) > 1: # target_column becomes a suffix when more than 1 column is given\n # dict.fromkeys is used to avoid duplicates in the name while maintaining order\n _cols = [column, target_column]\n target_column = \"_\".join(list(dict.fromkeys(_cols)))\n\n yield target_column, column\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.Transformation","title":"koheesio.spark.transformations.Transformation","text":"Base class for all transformations
Concept A Transformation is a Step that takes a DataFrame as input and returns a DataFrame as output. The DataFrame is transformed based on the logic implemented in the execute
method. Any additional parameters that are needed for the transformation can be passed to the constructor.
Parameters:
Name Type Description Default df
The DataFrame to apply the transformation to. If not provided, the DataFrame has to be passed to the transform-method.
required Example from koheesio.steps.transformations import Transformation\nfrom pyspark.sql import functions as f\n\n\nclass AddOne(Transformation):\n def execute(self):\n self.output.df = self.df.withColumn(\"new_column\", f.col(\"old_column\") + 1)\n
In the example above, the execute
method is implemented to add 1 to the values of the old_column
and store the result in a new column called new_column
.
In order to use this transformation, we can call the transform
method:
from pyspark.sql import SparkSession\n\n# create a DataFrame with 3 rows\ndf = SparkSession.builder.getOrCreate().range(3)\n\noutput_df = AddOne().transform(df)\n
The output_df
will now contain the original DataFrame with an additional column called new_column
with the values of old_column
+ 1.
output_df:
id new_column 0 1 1 2 2 3 ... Alternatively, we can pass the DataFrame to the constructor and call the execute
or transform
method without any arguments:
output_df = AddOne(df).transform()\n# or\noutput_df = AddOne(df).execute().output.df\n
Note: that the transform method was not implemented explicitly in the AddOne class. This is because the transform
method is already implemented in the Transformation
class. This means that all classes that inherit from the Transformation class will have the transform
method available. Only the execute method needs to be implemented.
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.Transformation.df","title":"df class-attribute
instance-attribute
","text":"df: Optional[DataFrame] = Field(default=None, description='The Spark DataFrame')\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.Transformation.execute","title":"execute abstractmethod
","text":"execute() -> Output\n
Execute on a Transformation should handle self.df (input) and set self.output.df (output)
This method should be implemented in the child class. The input DataFrame is available as self.df
and the output DataFrame should be stored in self.output.df
.
For example:
def execute(self):\n self.output.df = self.df.withColumn(\"new_column\", f.col(\"old_column\") + 1)\n
The transform method will call this method and return the output DataFrame.
Source code in src/koheesio/spark/transformations/__init__.py
@abstractmethod\ndef execute(self) -> SparkStep.Output:\n \"\"\"Execute on a Transformation should handle self.df (input) and set self.output.df (output)\n\n This method should be implemented in the child class. The input DataFrame is available as `self.df` and the\n output DataFrame should be stored in `self.output.df`.\n\n For example:\n ```python\n def execute(self):\n self.output.df = self.df.withColumn(\"new_column\", f.col(\"old_column\") + 1)\n ```\n\n The transform method will call this method and return the output DataFrame.\n \"\"\"\n # self.df # input dataframe\n # self.output.df # output dataframe\n self.output.df = ... # implement the transformation logic\n raise NotImplementedError\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.Transformation.transform","title":"transform","text":"transform(df: Optional[DataFrame] = None) -> DataFrame\n
Execute the transformation and return the output DataFrame
Note: when creating a child from this, don't implement this transform method. Instead, implement execute!
See Also Transformation.execute
Parameters:
Name Type Description Default df
Optional[DataFrame]
The DataFrame to apply the transformation to. If not provided, the DataFrame passed to the constructor will be used.
None
Returns:
Type Description DataFrame
The transformed DataFrame
Source code in src/koheesio/spark/transformations/__init__.py
def transform(self, df: Optional[DataFrame] = None) -> DataFrame:\n \"\"\"Execute the transformation and return the output DataFrame\n\n Note: when creating a child from this, don't implement this transform method. Instead, implement execute!\n\n See Also\n --------\n `Transformation.execute`\n\n Parameters\n ----------\n df: Optional[DataFrame]\n The DataFrame to apply the transformation to. If not provided, the DataFrame passed to the constructor\n will be used.\n\n Returns\n -------\n DataFrame\n The transformed DataFrame\n \"\"\"\n self.df = df or self.df\n if not self.df:\n raise RuntimeError(\"No valid Dataframe was passed\")\n self.execute()\n return self.output.df\n
"},{"location":"api_reference/spark/transformations/arrays.html","title":"Arrays","text":"A collection of classes for performing various transformations on arrays in PySpark.
These transformations include operations such as removing duplicates, exploding arrays into separate rows, reversing the order of elements, sorting elements, removing certain values, and calculating aggregate statistics like minimum, maximum, sum, mean, and median.
Concept - Every transformation in this module is implemented as a class that inherits from the
ArrayTransformation
class. - The
ArrayTransformation
class is a subclass of ColumnsTransformationWithTarget
- The
ArrayTransformation
class implements the func
method, which is used to define the transformation logic. - The
func
method takes a column
as input and returns a Column
object. - The
Column
object is a PySpark column that can be used to perform transformations on a DataFrame column. - The
ArrayTransformation
limits the data type of the transformation to array by setting the ColumnConfig
class to run_for_all_data_type = [SparkDatatype.ARRAY]
and limit_data_type = [SparkDatatype.ARRAY]
.
See Also - koheesio.spark.transformations Module containing all transformation classes.
- koheesio.spark.transformations.ColumnsTransformationWithTarget Base class for all transformations that operate on columns and have a target column.
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySortAsc","title":"koheesio.spark.transformations.arrays.ArraySortAsc module-attribute
","text":"ArraySortAsc = ArraySort\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayDistinct","title":"koheesio.spark.transformations.arrays.ArrayDistinct","text":"Remove duplicates from array
Example ArrayDistinct(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayDistinct.filter_empty","title":"filter_empty class-attribute
instance-attribute
","text":"filter_empty: bool = Field(default=True, description='Remove null, nan, and empty values from array. Default is True.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayDistinct.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n _fn = F.array_distinct(column)\n\n # noinspection PyUnresolvedReferences\n element_type = self.column_type_of_col(column, None, False).elementType\n is_numeric = spark_data_type_is_numeric(element_type)\n\n if self.filter_empty:\n # Remove null values from array\n if spark_minor_version >= 3.4:\n # Run array_compact if spark version is 3.4 or higher\n # https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.array_compact.html\n # pylint: disable=E0611\n from pyspark.sql.functions import array_compact as _array_compact\n\n _fn = _array_compact(_fn)\n # pylint: enable=E0611\n else:\n # Otherwise, remove null from array using array_except\n _fn = F.array_except(_fn, F.array(F.lit(None)))\n\n # Remove nan or empty values from array (depends on the type of the elements in array)\n if is_numeric:\n # Remove nan from array (float/int/numbers)\n _fn = F.array_except(_fn, F.array(F.lit(float(\"nan\")).cast(element_type)))\n else:\n # Remove empty values from array (string/text)\n _fn = F.array_except(_fn, F.array(F.lit(\"\"), F.lit(\" \")))\n\n return _fn\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMax","title":"koheesio.spark.transformations.arrays.ArrayMax","text":"Return the maximum value in the array
Example ArrayMax(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMax.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n # Call for processing of nan values\n column = super().func(column)\n\n return F.array_max(column)\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMean","title":"koheesio.spark.transformations.arrays.ArrayMean","text":"Return the mean of the values in the array.
Note: Only numeric values are supported for calculating the mean.
Example ArrayMean(column=\"array_column\", target_column=\"average\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMean.func","title":"func","text":"func(column: Column) -> Column\n
Calculate the mean of the values in the array
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n \"\"\"Calculate the mean of the values in the array\"\"\"\n # raise an error if the array contains non-numeric elements\n element_type = self.column_type_of_col(col=column, df=None, simple_return_mode=False).elementType\n\n if not spark_data_type_is_numeric(element_type):\n raise ValueError(\n f\"{column = } contains non-numeric values. The array type is {element_type}. \"\n f\"Only numeric values are supported for calculating a mean.\"\n )\n\n _sum = ArraySum.from_step(self).func(column)\n # Call for processing of nan values\n column = super().func(column)\n _size = F.size(column)\n # return 0 if the size of the array is 0 to avoid division by zero\n return F.when(_size == 0, F.lit(0)).otherwise(_sum / _size)\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMedian","title":"koheesio.spark.transformations.arrays.ArrayMedian","text":"Return the median of the values in the array.
The median is the middle value in a sorted, ascending or descending, list of numbers.
- If the size of the array is even, the median is the average of the two middle numbers.
- If the size of the array is odd, the median is the middle number.
Note: Only numeric values are supported for calculating the median.
Example ArrayMedian(column=\"array_column\", target_column=\"median\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMedian.func","title":"func","text":"func(column: Column) -> Column\n
Calculate the median of the values in the array
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n \"\"\"Calculate the median of the values in the array\"\"\"\n # Call for processing of nan values\n column = super().func(column)\n\n sorted_array = ArraySort.from_step(self).func(column)\n _size: Column = F.size(sorted_array)\n\n # Calculate the middle index. If the size is odd, PySpark discards the fractional part.\n # Use floor function to ensure the result is an integer\n middle: Column = F.floor((_size + 1) / 2).cast(\"int\")\n\n # Define conditions\n is_size_zero: Column = _size == 0\n is_column_null: Column = column.isNull()\n is_size_even: Column = _size % 2 == 0\n\n # Define actions / responses\n # For even-sized arrays, calculate the average of the two middle elements\n average_of_middle_elements = (F.element_at(sorted_array, middle) + F.element_at(sorted_array, middle + 1)) / 2\n # For odd-sized arrays, select the middle element\n middle_element = F.element_at(sorted_array, middle)\n # In case the array is empty, return either None or 0\n none_value = F.lit(None)\n zero_value = F.lit(0)\n\n median = (\n # Check if the size of the array is 0\n F.when(\n is_size_zero,\n # If the size of the array is 0 and the column is null, return None\n # If the size of the array is 0 and the column is not null, return 0\n F.when(is_column_null, none_value).otherwise(zero_value),\n ).otherwise(\n # If the size of the array is not 0, calculate the median\n F.when(is_size_even, average_of_middle_elements).otherwise(middle_element)\n )\n )\n\n return median\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMin","title":"koheesio.spark.transformations.arrays.ArrayMin","text":"Return the minimum value in the array
Example ArrayMin(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMin.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n return F.array_min(column)\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayNullNanProcess","title":"koheesio.spark.transformations.arrays.ArrayNullNanProcess","text":"Process an array by removing NaN and/or NULL values from elements.
Parameters:
Name Type Description Default keep_nan
bool
Whether to keep NaN values in the array. If set to True, the NaN values will be kept in the array.
False
keep_null
bool
Whether to keep NULL values in the array. If set to True, the NULL values will be kept in the array.
False
Returns:
Name Type Description column
Column
The processed column with NaN and/or NULL values removed from elements.
Examples:
>>> input_data = [(1, [1.1, 2.1, 4.1, float(\"nan\")])]\n>>> input_schema = StructType([StructField(\"id\", IntegerType(), True),\n StructField(\"array_float\", ArrayType(FloatType()), True),\n])\n>>> spark = SparkSession.builder.getOrCreate()\n>>> df = spark.createDataFrame(input_data, schema=input_schema)\n>>> transformer = ArrayNullNanProcess(column=\"array_float\", keep_nan=False)\n>>> transformer.transform(df)\n>>> print(transformer.output.df.collect()[0].asDict()[\"array_float\"])\n[1.1, 2.1, 4.1]\n\n>>> input_data = [(1, [1.1, 2.1, 4.1, float(\"nan\")])]\n>>> input_schema = StructType([StructField(\"id\", IntegerType(), True),\n StructField(\"array_float\", ArrayType(FloatType()), True),\n])\n>>> spark = SparkSession.builder.getOrCreate()\n>>> df = spark.createDataFrame(input_data, schema=input_schema)\n>>> transformer = ArrayNullNanProcess(column=\"array_float\", keep_nan=True)\n>>> transformer.transform(df)\n>>> print(transformer.output.df.collect()[0].asDict()[\"array_float\"])\n[1.1, 2.1, 4.1, nan]\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayNullNanProcess.keep_nan","title":"keep_nan class-attribute
instance-attribute
","text":"keep_nan: bool = Field(False, description='Whether to keep nan values in the array. Default is False. If set to True, the nan values will be kept in the array.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayNullNanProcess.keep_null","title":"keep_null class-attribute
instance-attribute
","text":"keep_null: bool = Field(False, description='Whether to keep null values in the array. Default is False. If set to True, the null values will be kept in the array.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayNullNanProcess.func","title":"func","text":"func(column: Column) -> Column\n
Process the given column by removing NaN and/or NULL values from elements.
Parameters: column : Column The column to be processed.
Returns: column : Column The processed column with NaN and/or NULL values removed from elements.
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n \"\"\"\n Process the given column by removing NaN and/or NULL values from elements.\n\n Parameters:\n -----------\n column : Column\n The column to be processed.\n\n Returns:\n --------\n column : Column\n The processed column with NaN and/or NULL values removed from elements.\n \"\"\"\n\n def apply_logic(x: Column):\n if self.keep_nan is False and self.keep_null is False:\n logic = x.isNotNull() & ~F.isnan(x)\n elif self.keep_nan is False:\n logic = ~F.isnan(x)\n elif self.keep_null is False:\n logic = x.isNotNull()\n\n return logic\n\n if self.keep_nan is False or self.keep_null is False:\n column = F.filter(column, apply_logic)\n\n return column\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayRemove","title":"koheesio.spark.transformations.arrays.ArrayRemove","text":"Remove a certain value from the array
Parameters:
Name Type Description Default keep_nan
bool
Whether to keep NaN values in the array. If set to True, the NaN values will be kept in the array.
False
keep_null
bool
Whether to keep NULL values in the array. If set to True, the NULL values will be kept in the array.
False
Example ArrayRemove(column=\"array_column\", value=\"value_to_remove\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayRemove.make_distinct","title":"make_distinct class-attribute
instance-attribute
","text":"make_distinct: bool = Field(default=False, description='Whether to remove duplicates from the array.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayRemove.value","title":"value class-attribute
instance-attribute
","text":"value: Any = Field(default=None, description='The value to remove from the array.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayRemove.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n value = self.value\n\n column = super().func(column)\n\n def filter_logic(x: Column, _val: Any):\n if self.keep_null and self.keep_nan:\n logic = (x != F.lit(_val)) | x.isNull() | F.isnan(x)\n elif self.keep_null:\n logic = (x != F.lit(_val)) | x.isNull()\n elif self.keep_nan:\n logic = (x != F.lit(_val)) | F.isnan(x)\n else:\n logic = x != F.lit(_val)\n\n return logic\n\n # Check if the value is iterable (i.e., a list, tuple, or set)\n if isinstance(value, (list, tuple, set)):\n result = reduce(lambda res, val: F.filter(res, lambda x: filter_logic(x, val)), value, column)\n else:\n # If the value is not iterable, simply remove the value from the array\n result = F.filter(column, lambda x: filter_logic(x, value))\n\n if self.make_distinct:\n result = F.array_distinct(result)\n\n return result\n
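A short sketch showing removal of a single value as well as a list of values (hypothetical numeric data, assuming an active SparkSession named spark):
>>> from koheesio.spark.transformations.arrays import ArrayRemove\n>>> df = spark.createDataFrame([(1, [1.0, 2.0, 2.0, 3.0])], [\"id\", \"values\"])\n>>> ArrayRemove(column=\"values\", value=3.0).transform(df)  # values becomes [1.0, 2.0, 2.0]\n>>> ArrayRemove(column=\"values\", value=[2.0, 3.0], make_distinct=True).transform(df)  # values becomes [1.0]\n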
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayReverse","title":"koheesio.spark.transformations.arrays.ArrayReverse","text":"Reverse the order of elements in the array
Example ArrayReverse(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayReverse.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n return F.reverse(column)\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySort","title":"koheesio.spark.transformations.arrays.ArraySort","text":"Sort the elements in the array
By default, the elements are sorted in ascending order. To sort the elements in descending order, set the reverse
parameter to True.
Example ArraySort(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySort.reverse","title":"reverse class-attribute
instance-attribute
","text":"reverse: bool = Field(default=False, description='Sort the elements in the array in a descending order. Default is False.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySort.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n column = F.array_sort(column)\n if self.reverse:\n # Reverse the order of elements in the array\n column = ArrayReverse.from_step(self).func(column)\n return column\n
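For reference, a minimal sketch of ascending and descending sorting (hypothetical data, assuming an active SparkSession named spark; without a target_column the result replaces the source column):
>>> from koheesio.spark.transformations.arrays import ArraySort\n>>> df = spark.createDataFrame([(1, [3, 1, 2])], [\"id\", \"values\"])\n>>> ArraySort(column=\"values\").transform(df)                 # values becomes [1, 2, 3]\n>>> ArraySort(column=\"values\", reverse=True).transform(df)   # values becomes [3, 2, 1]\n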
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySortDesc","title":"koheesio.spark.transformations.arrays.ArraySortDesc","text":"Sort the elements in the array in descending order
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySortDesc.reverse","title":"reverse class-attribute
instance-attribute
","text":"reverse: bool = True\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySum","title":"koheesio.spark.transformations.arrays.ArraySum","text":"Return the sum of the values in the array
Parameters:
Name Type Description Default keep_nan
bool
Whether to keep NaN values in the array. If set to True, the NaN values will be kept in the array.
False
keep_null
bool
Whether to keep NULL values in the array. If set to True, the NULL values will be kept in the array.
False
Example ArraySum(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySum.func","title":"func","text":"func(column: Column) -> Column\n
Using the aggregate
function to sum the values in the array
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n \"\"\"Using the `aggregate` function to sum the values in the array\"\"\"\n # raise an error if the array contains non-numeric elements\n element_type = self.column_type_of_col(column, None, False).elementType\n if not spark_data_type_is_numeric(element_type):\n raise ValueError(\n f\"{column = } contains non-numeric values. The array type is {element_type}. \"\n f\"Only numeric values are supported for summing.\"\n )\n\n # remove na values from array.\n column = super().func(column)\n\n # Using the `aggregate` function to sum the values in the array by providing the initial value as 0.0 and the\n # lambda function to add the elements together. Pyspark will automatically infer the type of the initial value\n # making 0.0 valid for both integer and float types.\n initial_value = F.lit(0.0)\n return F.aggregate(column, initial_value, lambda accumulator, x: accumulator + x)\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayTransformation","title":"koheesio.spark.transformations.arrays.ArrayTransformation","text":"Base class for array transformations
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayTransformation.ColumnConfig","title":"ColumnConfig","text":"Set the data type of the Transformation to array
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayTransformation.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [ARRAY]\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayTransformation.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [ARRAY]\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayTransformation.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n raise NotImplementedError(\"This is an abstract class\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.Explode","title":"koheesio.spark.transformations.arrays.Explode","text":"Explode the array into separate rows
Example Explode(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.Explode.distinct","title":"distinct class-attribute
instance-attribute
","text":"distinct: bool = Field(False, description='Remove duplicates from the exploded array. Default is False.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.Explode.preserve_nulls","title":"preserve_nulls class-attribute
instance-attribute
","text":"preserve_nulls: bool = Field(True, description='Preserve rows with null values in the exploded array by using explode_outer instead of explode.Default is True.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.Explode.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n if self.distinct:\n column = ArrayDistinct.from_step(self).func(column)\n return F.explode_outer(column) if self.preserve_nulls else F.explode(column)\n
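A minimal sketch of the default behaviour (hypothetical data, assuming an active SparkSession named spark):
>>> from koheesio.spark.transformations.arrays import Explode\n>>> df = spark.createDataFrame([(1, [1, 2]), (2, None)], [\"id\", \"values\"])\n>>> output_df = Explode(column=\"values\").transform(df)\n>>> # each element becomes its own row; id 2 is kept as a null row because preserve_nulls defaults to True (explode_outer)\n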
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ExplodeDistinct","title":"koheesio.spark.transformations.arrays.ExplodeDistinct","text":"Explode the array into separate rows while removing duplicates and empty values
Example ExplodeDistinct(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ExplodeDistinct.distinct","title":"distinct class-attribute
instance-attribute
","text":"distinct: bool = True\n
"},{"location":"api_reference/spark/transformations/camel_to_snake.html","title":"Camel to snake","text":"Class for converting DataFrame column names from camel case to snake case.
"},{"location":"api_reference/spark/transformations/camel_to_snake.html#koheesio.spark.transformations.camel_to_snake.camel_to_snake_re","title":"koheesio.spark.transformations.camel_to_snake.camel_to_snake_re module-attribute
","text":"camel_to_snake_re = compile('([a-z0-9])([A-Z])')\n
"},{"location":"api_reference/spark/transformations/camel_to_snake.html#koheesio.spark.transformations.camel_to_snake.CamelToSnakeTransformation","title":"koheesio.spark.transformations.camel_to_snake.CamelToSnakeTransformation","text":"Converts column names from camel case to snake cases
Parameters:
Name Type Description Default columns
Optional[ListOfColumns]
The column or columns to convert. If no columns are specified, all columns will be converted. A list of columns or a single column can be specified. For example: [\"column1\", \"column2\"]
or \"column1\"
None
Example input_df:
camelCaseColumn snake_case_column ... ... output_df = CamelToSnakeTransformation(column=\"camelCaseColumn\").transform(input_df)\n
output_df:
camel_case_column snake_case_column ... ... In this example, the column camelCaseColumn
is converted to camel_case_column
.
Note: the data in the columns is not changed, only the column names.
"},{"location":"api_reference/spark/transformations/camel_to_snake.html#koheesio.spark.transformations.camel_to_snake.CamelToSnakeTransformation.columns","title":"columns class-attribute
instance-attribute
","text":"columns: Optional[ListOfColumns] = Field(default='', alias='column', description=\"The column or columns to convert. If no columns are specified, all columns will be converted. A list of columns or a single column can be specified. For example: `['column1', 'column2']` or `'column1'` \")\n
"},{"location":"api_reference/spark/transformations/camel_to_snake.html#koheesio.spark.transformations.camel_to_snake.CamelToSnakeTransformation.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/transformations/camel_to_snake.py
def execute(self):\n _df = self.df\n\n # Prepare columns input:\n columns = self.df.columns if self.columns == [\"*\"] else self.columns\n\n for column in columns:\n _df = _df.withColumnRenamed(column, convert_camel_to_snake(column))\n\n self.output.df = _df\n
"},{"location":"api_reference/spark/transformations/camel_to_snake.html#koheesio.spark.transformations.camel_to_snake.convert_camel_to_snake","title":"koheesio.spark.transformations.camel_to_snake.convert_camel_to_snake","text":"convert_camel_to_snake(name: str)\n
Converts a string from camelCase to snake_case.
Parameters: name : str The string to be converted.
Returns: str The converted string in snake_case.
Source code in src/koheesio/spark/transformations/camel_to_snake.py
def convert_camel_to_snake(name: str):\n \"\"\"\n Converts a string from camelCase to snake_case.\n\n Parameters:\n ----------\n name : str\n The string to be converted.\n\n Returns:\n --------\n str\n The converted string in snake_case.\n \"\"\"\n return camel_to_snake_re.sub(r\"\\1_\\2\", name).lower()\n
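For example, the regex inserts an underscore between a lowercase letter or digit and the uppercase letter that follows it, and then lower-cases the result (illustrative inputs):
>>> convert_camel_to_snake(\"camelCaseColumn\")\n'camel_case_column'\n>>> convert_camel_to_snake(\"address2Line\")\n'address2_line'\n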
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html","title":"Cast to datatype","text":"Transformations to cast a column or set of columns to a given datatype.
Each one of these has been vetted to throw warnings when wrong datatypes are passed (so that no job or pipeline errors out).
Furthermore, detailed tests have been added to ensure that types are actually compatible as prescribed.
Concept - One can use the CastToDatatype class directly, or use one of the more specific subclasses.
- Each subclass is compatible with 'run_for_all' - meaning that if no column or columns are specified, it will be run for all compatible data types.
- Compatible Data types are specified in the ColumnConfig class of each subclass, and are documented in the docstring of each subclass.
See class docstrings for more information
Note Dates, Arrays and Maps are not supported by this module.
- for dates, use the koheesio.spark.transformations.date_time module
- for arrays, use the koheesio.spark.transformations.arrays module
Classes:
Name Description CastToDatatype:
Cast a column or set of columns to a given datatype
CastToByte
Cast to Byte (a.k.a. tinyint)
CastToShort
Cast to Short (a.k.a. smallint)
CastToInteger
Cast to Integer (a.k.a. int)
CastToLong
Cast to Long (a.k.a. bigint)
CastToFloat
Cast to Float (a.k.a. real)
CastToDouble
Cast to Double
CastToDecimal
Cast to Decimal (a.k.a. decimal, numeric, dec, BigDecimal)
CastToString
Cast to String
CastToBinary
Cast to Binary (a.k.a. byte array)
CastToBoolean
Cast to Boolean
CastToTimestamp
Cast to Timestamp
Note The following parameters are common to all classes in this module:
Parameters:
Name Type Description Default columns
ListOfColumns
Name of the source column(s). Alias: column
required target_column
str
Name of the target column or alias if more than one column is specified. Alias: target_alias
required datatype
str or SparkDatatype
Datatype to cast to. Choose from SparkDatatype Enum (only needs to be specified in CastToDatatype, all other classes have a fixed datatype)
required"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBinary","title":"koheesio.spark.transformations.cast_to_datatype.CastToBinary","text":"Cast to Binary (a.k.a. byte array)
Unsupported datatypes: Following casts are not supported
will raise an error in Spark:
- float
- double
- decimal
- boolean
- timestamp
- date
- array<...>
- map<...,...>
Supported datatypes: Following casts are supported:
- byte
- short
- integer
- long
- string
Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBinary.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = BINARY\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBinary.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToBinary class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBinary.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBinary.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, STRING]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBoolean","title":"koheesio.spark.transformations.cast_to_datatype.CastToBoolean","text":"Cast to Boolean
Unsupported datatypes: Following casts are not supported
- binary
- array<...>
- map<...,...>
Supported datatypes: Following casts are supported:
- byte
- short
- integer
- long
- float
- double
- decimal
- timestamp
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string converts to null
- date converts to null
- void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBoolean.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = BOOLEAN\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBoolean.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToBoolean class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBoolean.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBoolean.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, TIMESTAMP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToByte","title":"koheesio.spark.transformations.cast_to_datatype.CastToByte","text":"Cast to Byte (a.k.a. tinyint)
Represents 1-byte signed integer numbers. The range of numbers is from -128 to 127.
Unsupported datatypes: Following casts are not supported and will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes: Following casts are supported:
- boolean
- timestamp
- decimal
- double
- float
- long
- integer
- short
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string converts to null
- timestamp range of values too small for timestamp to have any meaning
- date converts to null
- void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToByte.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = BYTE\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToByte.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToByte class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToByte.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, STRING, TIMESTAMP, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToByte.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [SHORT, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, BOOLEAN]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDatatype","title":"koheesio.spark.transformations.cast_to_datatype.CastToDatatype","text":"Cast a column or set of columns to a given datatype
Wrapper around pyspark.sql.Column.cast
Concept This class acts as the base class for all the other CastTo* classes. It is compatible with 'run_for_all' - meaning that if no column or columns are specified, it will be run for all compatible data types.
Example input_df:
c1 c2 1 2 3 4 output_df = CastToDatatype(\n column=\"c1\",\n datatype=\"string\",\n target_alias=\"c1\",\n).transform(input_df)\n
output_df:
c1 c2 \"1\" 2 \"3\" 4 In the example above, the column c1
is cast to a string datatype. The column c2
is not affected.
Parameters:
Name Type Description Default columns
ListOfColumns
Name of the source column(s). Alias: column
required datatype
str or SparkDatatype
Datatype to cast to. Choose from SparkDatatype Enum
required target_column
str
Name of the target column or alias if more than one column is specified. Alias: target_alias
required"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDatatype.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = Field(default=..., description='Datatype. Choose from SparkDatatype Enum')\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDatatype.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/cast_to_datatype.py
def func(self, column: Column) -> Column:\n # This is to let the IDE explicitly know that the datatype is not a string, but a `SparkDatatype` Enum\n datatype: SparkDatatype = self.datatype\n return column.cast(datatype.spark_type())\n
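A compact, runnable version of the example above (assuming an active SparkSession named spark):
>>> from koheesio.spark.transformations.cast_to_datatype import CastToDatatype\n>>> input_df = spark.createDataFrame([(1, 2), (3, 4)], [\"c1\", \"c2\"])\n>>> output_df = CastToDatatype(column=\"c1\", datatype=\"string\", target_alias=\"c1\").transform(input_df)\n>>> output_df.printSchema()  # c1 is now a string column; c2 is unaffected\n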
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDatatype.validate_datatype","title":"validate_datatype","text":"validate_datatype(datatype_value) -> SparkDatatype\n
Validate the datatype.
Source code in src/koheesio/spark/transformations/cast_to_datatype.py
@field_validator(\"datatype\")\ndef validate_datatype(cls, datatype_value) -> SparkDatatype:\n \"\"\"Validate the datatype.\"\"\"\n # handle string input\n try:\n if isinstance(datatype_value, str):\n datatype_value: SparkDatatype = SparkDatatype.from_string(datatype_value)\n return datatype_value\n\n # and let SparkDatatype handle the rest\n datatype_value: SparkDatatype = SparkDatatype.from_string(datatype_value.value)\n\n except AttributeError as e:\n raise AttributeError(f\"Invalid datatype: {datatype_value}\") from e\n\n return datatype_value\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal","title":"koheesio.spark.transformations.cast_to_datatype.CastToDecimal","text":"Cast to Decimal (a.k.a. decimal, numeric, dec, BigDecimal)
Represents arbitrary-precision signed decimal numbers. Backed internally by java.math.BigDecimal
. A BigDecimal consists of an arbitrary precision integer unscaled value and a 32-bit integer scale.
The DecimalType must have a fixed precision (the maximum total number of digits) and scale (the number of digits to the right of the decimal point). For example, (5, 2) can support values from -999.99 to 999.99.
The precision can be up to 38; the scale must be less than or equal to the precision.
Spark sets the default precision and scale to (10, 0) when creating a DecimalType. However, when inferring schema from decimal.Decimal objects, it will be DecimalType(38, 18).
For compatibility reasons, Koheesio sets the default precision and scale to (38, 18) when creating a DecimalType.
Unsupported datatypes: Following casts are not supported and will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes: Following casts are supported:
- byte
- short
- integer
- long
- float
- double
- boolean
- timestamp
- date
- string
- void
- decimal spark will convert existing decimals to null if the precision and scale doesn't fit the data
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string converts to null
- date converts to null
- void skipped by default
Parameters:
Name Type Description Default columns
ListOfColumns
Name of the source column(s). Alias: column
*
target_column
str
Name of the target column or alias if more than one column is specified. Alias: target_alias
required precision
conint(gt=0, le=38)
the maximum (i.e. total) number of digits (default: 38). Must be > 0.
38
scale
conint(ge=0, le=18)
the number of digits on right side of dot. (default: 18). Must be >= 0.
18
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = DECIMAL\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.precision","title":"precision class-attribute
instance-attribute
","text":"precision: conint(gt=0, le=38) = Field(default=38, description='The maximum total number of digits (precision) of the decimal. Must be > 0. Default is 38')\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.scale","title":"scale class-attribute
instance-attribute
","text":"scale: conint(ge=0, le=18) = Field(default=18, description='The number of digits to the right of the decimal point (scale). Must be >= 0. Default is 18')\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToDecimal class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, BOOLEAN, TIMESTAMP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/cast_to_datatype.py
def func(self, column: Column) -> Column:\n return column.cast(self.datatype.spark_type(precision=self.precision, scale=self.scale))\n
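A brief sketch showing an explicit precision and scale (hypothetical data, assuming an active SparkSession named spark):
>>> from koheesio.spark.transformations.cast_to_datatype import CastToDecimal\n>>> df = spark.createDataFrame([(1, 123.456)], [\"id\", \"amount\"])\n>>> output_df = CastToDecimal(column=\"amount\", target_column=\"amount_dec\", precision=10, scale=2).transform(df)\n>>> output_df.printSchema()  # amount_dec is decimal(10,2)\n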
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.validate_scale_and_precisions","title":"validate_scale_and_precisions","text":"validate_scale_and_precisions()\n
Validate the precision and scale values.
Source code in src/koheesio/spark/transformations/cast_to_datatype.py
@model_validator(mode=\"after\")\ndef validate_scale_and_precisions(self):\n \"\"\"Validate the precision and scale values.\"\"\"\n precision_value = self.precision\n scale_value = self.scale\n\n if scale_value == precision_value:\n self.log.warning(\"scale and precision are equal, this will result in a null value\")\n if scale_value > precision_value:\n raise ValueError(\"scale must be < precision\")\n\n return self\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDouble","title":"koheesio.spark.transformations.cast_to_datatype.CastToDouble","text":"Cast to Double
Represents 8-byte double-precision floating point numbers. The range of numbers is from -1.7976931348623157E308 to 1.7976931348623157E308.
Unsupported datatypes: Following casts are not supported and will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes: Following casts are supported:
- byte
- short
- integer
- long
- float
- decimal
- boolean
- timestamp
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string converts to null
- date converts to null
- void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDouble.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = DOUBLE\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDouble.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToDouble class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDouble.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDouble.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, FLOAT, DECIMAL, BOOLEAN, TIMESTAMP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToFloat","title":"koheesio.spark.transformations.cast_to_datatype.CastToFloat","text":"Cast to Float (a.k.a. real)
Represents 4-byte single-precision floating point numbers. The range of numbers is from -3.402823E38 to 3.402823E38.
Unsupported datatypes: Following casts are not supported and will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes: Following casts are supported:
- byte
- short
- integer
- long
- double
- decimal
- boolean
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- timestamp precision is lost (use CastToDouble instead)
- string converts to null
- date converts to null
- void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToFloat.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = FLOAT\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToFloat.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToFloat class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToFloat.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToFloat.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, DOUBLE, DECIMAL, BOOLEAN]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToInteger","title":"koheesio.spark.transformations.cast_to_datatype.CastToInteger","text":"Cast to Integer (a.k.a. int)
Represents 4-byte signed integer numbers. The range of numbers is from -2147483648 to 2147483647.
Unsupported datatypes: Following casts are not supported and will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes: Following casts are supported:
- byte
- short
- long
- float
- double
- decimal
- boolean
- timestamp
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string converts to null
- date converts to null
- void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToInteger.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = INTEGER\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToInteger.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToInteger class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToInteger.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToInteger.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, LONG, FLOAT, DOUBLE, DECIMAL, BOOLEAN, TIMESTAMP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToLong","title":"koheesio.spark.transformations.cast_to_datatype.CastToLong","text":"Cast to Long (a.k.a. bigint)
Represents 8-byte signed integer numbers. The range of numbers is from -9223372036854775808 to 9223372036854775807.
Unsupported datatypes: Following casts are not supported and will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes: Following casts are supported:
- byte
- short
- long
- float
- double
- decimal
- boolean
- timestamp
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string converts to null
- date converts to null
- void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToLong.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = LONG\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToLong.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToLong class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToLong.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToLong.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, INTEGER, FLOAT, DOUBLE, DECIMAL, BOOLEAN, TIMESTAMP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToShort","title":"koheesio.spark.transformations.cast_to_datatype.CastToShort","text":"Cast to Short (a.k.a. smallint)
Represents 2-byte signed integer numbers. The range of numbers is from -32768 to 32767.
Unsupported datatypes: Following casts are not supported and will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes: Following casts are supported:
- byte
- integer
- long
- float
- double
- decimal
- string
- boolean
- timestamp
- date
- void
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string converts to null
- timestamp range of values too small for timestamp to have any meaning
- date converts to null
- void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToShort.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = SHORT\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToShort.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToShort class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToShort.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, STRING, TIMESTAMP, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToShort.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, BOOLEAN]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToString","title":"koheesio.spark.transformations.cast_to_datatype.CastToString","text":"Cast to String
Supported datatypes: Following casts are supported:
- byte
- short
- integer
- long
- float
- double
- decimal
- binary
- boolean
- timestamp
- date
- array
- map
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToString.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = STRING\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToString.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToString class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToString.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToString.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, BINARY, BOOLEAN, TIMESTAMP, DATE, ARRAY, MAP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToTimestamp","title":"koheesio.spark.transformations.cast_to_datatype.CastToTimestamp","text":"Cast to Timestamp
A numeric timestamp is the number of seconds since January 1, 1970 00:00:00.000000000 UTC. Not advised for small integers, as the range of values is too small for a timestamp to have any meaning.
For more fine-grained control over the timestamp format, use the date_time
module. This allows for parsing strings to timestamps and vice versa.
See Also - koheesio.spark.transformations.date_time
- https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#timestamp-pattern
Unsupported datatypes: Following casts are not supported
- binary
- array<...>
- map<...,...>
Supported datatypes: Following casts are supported:
- integer
- long
- float
- double
- decimal
- date
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- boolean: range of values too small for timestamp to have any meaning
- byte: range of values too small for timestamp to have any meaning
- string: converts to null in most cases, use
date_time
module instead - short: range of values too small for timestamp to have any meaning
- void: skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToTimestamp.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = TIMESTAMP\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToTimestamp.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToTimestamp class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToTimestamp.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, BOOLEAN, BYTE, SHORT, STRING, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToTimestamp.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, DATE]\n
"},{"location":"api_reference/spark/transformations/drop_column.html","title":"Drop column","text":"This module defines the DropColumn class, a subclass of ColumnsTransformation.
"},{"location":"api_reference/spark/transformations/drop_column.html#koheesio.spark.transformations.drop_column.DropColumn","title":"koheesio.spark.transformations.drop_column.DropColumn","text":"Drop one or more columns
The DropColumn class is used to drop one or more columns from a PySpark DataFrame. It wraps the pyspark.DataFrame.drop
function and can handle either a single string or a list of strings as input.
If a specified column does not exist in the DataFrame, no error or warning is thrown, and all existing columns will remain.
Expected behavior - When the
column
does not exist, all columns will remain (no error or warning is thrown) - Either a single string, or a list of strings can be specified
Example df:
product amount country Banana lemon orange 1000 USA Carrots Blueberries 1500 USA Beans 1600 USA output_df = DropColumn(column=\"product\").transform(df)\n
output_df:
amount country 1000 USA 1500 USA 1600 USA In this example, the product
column is dropped from the DataFrame df
.
"},{"location":"api_reference/spark/transformations/drop_column.html#koheesio.spark.transformations.drop_column.DropColumn.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/transformations/drop_column.py
def execute(self):\n self.log.info(f\"{self.column=}\")\n self.output.df = self.df.drop(*self.columns)\n
"},{"location":"api_reference/spark/transformations/dummy.html","title":"Dummy","text":"Dummy transformation for testing purposes.
"},{"location":"api_reference/spark/transformations/dummy.html#koheesio.spark.transformations.dummy.DummyTransformation","title":"koheesio.spark.transformations.dummy.DummyTransformation","text":"Dummy transformation for testing purposes.
This transformation adds a new column hello
to the DataFrame with the value world
.
It is intended for testing purposes or for use in examples or reference documentation.
Example input_df:
id 1 output_df = DummyTransformation().transform(input_df)\n
output_df:
id hello 1 world In this example, the hello
column is added to the DataFrame input_df
.
"},{"location":"api_reference/spark/transformations/dummy.html#koheesio.spark.transformations.dummy.DummyTransformation.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/transformations/dummy.py
def execute(self):\n self.output.df = self.df.withColumn(\"hello\", lit(\"world\"))\n
"},{"location":"api_reference/spark/transformations/get_item.html","title":"Get item","text":"Transformation to wrap around the pyspark getItem function
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem","title":"koheesio.spark.transformations.get_item.GetItem","text":"Get item from list or map (dictionary)
Wrapper around pyspark.sql.functions.getItem
GetItem
is strict about the data type of the column. If the column is not a list or a map, an error will be raised.
Note Only MapType and ArrayType are supported.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to get the item from. Alias: column
required target_column
Optional[str]
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
None
key
Union[int, str]
The key (or index) to get from the list or map. If the column is a list (ArrayType), this should be an integer. If the column is a dict (MapType), this should be a string. Alias: index
required Example"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem--example-with-list-arraytype","title":"Example with list (ArrayType)","text":"By specifying an integer for the parameter \"key\", getItem knows to get the element at index n of a list (index starts at 0).
input_df:
id content 1 [1, 2, 3] 2 [4, 5] 3 [6] 4 [] output_df = GetItem(\n column=\"content\",\n index=1, # get the second element of the list\n target_column=\"item\",\n).transform(input_df)\n
output_df:
id content item 1 [1, 2, 3] 2 2 [4, 5] 5 3 [6] null 4 [] null"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem--example-with-a-dict-maptype","title":"Example with a dict (MapType)","text":"input_df:
id content 1 {key1 -> value1} 2 {key1 -> value2} 3 {key2 -> hello} 4 {key2 -> world} output_df = GetItem(\n column=\"content\",\n key=\"key2\",\n target_column=\"item\",\n).transform(input_df)\n
As we request the key to be \"key2\", the first 2 rows will be null, because they do not have \"key2\". output_df:
id content item 1 {key1 -> value1} null 2 {key1 -> value2} null 3 {key2 -> hello} hello 4 {key2 -> world} world"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.key","title":"key class-attribute
instance-attribute
","text":"key: Union[int, str] = Field(default=..., alias='index', description='The key (or index) to get from the list or map. If the column is a list (ArrayType), this should be an integer. If the column is a dict (MapType), this should be a string. Alias: index')\n
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.ColumnConfig","title":"ColumnConfig","text":"Limit the data types to ArrayType and MapType.
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.ColumnConfig.data_type_strict_mode","title":"data_type_strict_mode class-attribute
instance-attribute
","text":"data_type_strict_mode = True\n
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = run_for_all_data_type\n
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [ARRAY, MAP]\n
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/get_item.py
def func(self, column: Column) -> Column:\n return get_item(column, self.key)\n
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.get_item","title":"koheesio.spark.transformations.get_item.get_item","text":"get_item(column: Column, key: Union[str, int])\n
Wrapper around pyspark.sql.functions.getItem
Parameters:
Name Type Description Default column
Column
The column to get the item from
required key
Union[str, int]
The key (or index) to get from the list or map. If the column is a list (ArrayType), this should be an integer. If the column is a dict (MapType), this should be a string.
required Returns:
Type Description Column
The column with the item
Source code in src/koheesio/spark/transformations/get_item.py
def get_item(column: Column, key: Union[str, int]):\n \"\"\"\n Wrapper around pyspark.sql.functions.getItem\n\n Parameters\n ----------\n column : Column\n The column to get the item from\n key : Union[str, int]\n The key (or index) to get from the list or map. If the column is a list (ArrayType), this should be an integer.\n If the column is a dict (MapType), this should be a string.\n\n Returns\n -------\n Column\n The column with the item\n \"\"\"\n return column.getItem(key)\n
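For example (illustrative data, assuming an active SparkSession named spark):
>>> from pyspark.sql import functions as F\n>>> from koheesio.spark.transformations.get_item import get_item\n>>> df = spark.createDataFrame([(1, [\"a\", \"b\", \"c\"])], [\"id\", \"letters\"])\n>>> df.select(get_item(F.col(\"letters\"), 1).alias(\"second\")).show()  # \"b\" (index starts at 0)\n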
"},{"location":"api_reference/spark/transformations/hash.html","title":"Hash","text":"Module for hashing data using SHA-2 family of hash functions
See the docstring of the Sha2Hash class for more information.
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.HASH_ALGORITHM","title":"koheesio.spark.transformations.hash.HASH_ALGORITHM module-attribute
","text":"HASH_ALGORITHM = Literal[224, 256, 384, 512]\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.STRING","title":"koheesio.spark.transformations.hash.STRING module-attribute
","text":"STRING = STRING\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.Sha2Hash","title":"koheesio.spark.transformations.hash.Sha2Hash","text":"hash the value of 1 or more columns using SHA-2 family of hash functions
Mild wrapper around pyspark.sql.functions.sha2
- https://spark.apache.org/docs/3.3.2/api/python/reference/api/pyspark.sql.functions.sha2.html
Returns the hex string result of SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512).
Note This function allows concatenating the values of multiple columns together prior to hashing.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to hash. Alias: column
required delimiter
Optional[str]
Optional separator for the string that will eventually be hashed. Defaults to '|'
|
num_bits
Optional[HASH_ALGORITHM]
Algorithm to use for sha2 hash. Defaults to 256. Should be one of 224, 256, 384, 512
256
target_column
str
The generated hash will be written to the column name specified here
required"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.Sha2Hash.delimiter","title":"delimiter class-attribute
instance-attribute
","text":"delimiter: Optional[str] = Field(default='|', description=\"Optional separator for the string that will eventually be hashed. Defaults to '|'\")\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.Sha2Hash.num_bits","title":"num_bits class-attribute
instance-attribute
","text":"num_bits: Optional[HASH_ALGORITHM] = Field(default=256, description='Algorithm to use for sha2 hash. Defaults to 256. Should be one of 224, 256, 384, 512')\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.Sha2Hash.target_column","title":"target_column class-attribute
instance-attribute
","text":"target_column: str = Field(default=..., description='The generated hash will be written to the column name specified here')\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.Sha2Hash.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/transformations/hash.py
def execute(self):\n columns = list(self.get_columns())\n self.output.df = (\n self.df.withColumn(\n self.target_column, sha2_hash(columns=columns, delimiter=self.delimiter, num_bits=self.num_bits)\n )\n if columns\n else self.df\n )\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.sha2_hash","title":"koheesio.spark.transformations.hash.sha2_hash","text":"sha2_hash(columns: List[str], delimiter: Optional[str] = '|', num_bits: Optional[HASH_ALGORITHM] = 256)\n
hash the value of 1 or more columns using SHA-2 family of hash functions
Mild wrapper around pyspark.sql.functions.sha2
- https://spark.apache.org/docs/3.3.2/api/python/reference/api/pyspark.sql.functions.sha2.html
Returns the hex string result of SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512). This function allows concatenating the values of multiple columns together prior to hashing.
If a null is passed, the result will also be null.
Parameters:
Name Type Description Default columns
List[str]
The columns to hash
required delimiter
Optional[str]
Optional separator for the string that will eventually be hashed. Defaults to '|'
|
num_bits
Optional[HASH_ALGORITHM]
Algorithm to use for sha2 hash. Defaults to 256. Should be one of 224, 256, 384, 512
256
Source code in src/koheesio/spark/transformations/hash.py
def sha2_hash(columns: List[str], delimiter: Optional[str] = \"|\", num_bits: Optional[HASH_ALGORITHM] = 256):\n \"\"\"\n hash the value of 1 or more columns using SHA-2 family of hash functions\n\n Mild wrapper around pyspark.sql.functions.sha2\n\n - https://spark.apache.org/docs/3.3.2/api/python/reference/api/pyspark.sql.functions.sha2.html\n\n Returns the hex string result of SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512).\n This function allows concatenating the values of multiple columns together prior to hashing.\n\n If a null is passed, the result will also be null.\n\n Parameters\n ----------\n columns : List[str]\n The columns to hash\n delimiter : Optional[str], optional, default=|\n Optional separator for the string that will eventually be hashed. Defaults to '|'\n num_bits : Optional[HASH_ALGORITHM], optional, default=256\n Algorithm to use for sha2 hash. Defaults to 256. Should be one of 224, 256, 384, 512\n \"\"\"\n # make sure all columns are of type pyspark.sql.Column and cast to string\n _columns = []\n for c in columns:\n if isinstance(c, str):\n c: Column = col(c)\n _columns.append(c.cast(STRING.spark_type()))\n\n # concatenate columns if more than 1 column is provided\n if len(_columns) > 1:\n column = concat_ws(delimiter, *_columns)\n else:\n column = _columns[0]\n\n return sha2(column, num_bits)\n
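As a usage sketch only (the column names and input_df are hypothetical), the Sha2Hash transformation built on top of this function could be applied like this: from koheesio.spark.transformations.hash import Sha2Hash\n\n# concatenate first_name and last_name with '|' and store the SHA-256 hex digest in name_hash\noutput_df = Sha2Hash(\n    columns=[\"first_name\", \"last_name\"],\n    delimiter=\"|\",\n    num_bits=256,\n    target_column=\"name_hash\",\n).transform(input_df)\n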
"},{"location":"api_reference/spark/transformations/lookup.html","title":"Lookup","text":"Lookup transformation for joining two dataframes together
Classes:
Name Description JoinMapping
TargetColumn
JoinType
JoinHint
DataframeLookup
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup","title":"koheesio.spark.transformations.lookup.DataframeLookup","text":"Lookup transformation for joining two dataframes together
Parameters:
Name Type Description Default df
DataFrame
The left Spark DataFrame
required other
DataFrame
The right Spark DataFrame
required on
List[JoinMapping] | JoinMapping
List of join mappings. If only one mapping is passed, it can be passed as a single object.
required targets
List[TargetColumn] | TargetColumn
List of target columns. If only one target is passed, it can be passed as a single object.
required how
JoinType
What type of join to perform. Defaults to left. See JoinType for more information.
required hint
JoinHint
What type of join hint to use. Defaults to None. See JoinHint for more information.
required Example from pyspark.sql import SparkSession\nfrom koheesio.spark.transformations.lookup import (\n DataframeLookup,\n JoinMapping,\n TargetColumn,\n JoinType,\n)\n\nspark = SparkSession.builder.getOrCreate()\n\n# create the dataframes\nleft_df = spark.createDataFrame([(1, \"A\"), (2, \"B\")], [\"id\", \"value\"])\nright_df = spark.createDataFrame([(1, \"A\"), (3, \"C\")], [\"id\", \"value\"])\n\n# perform the lookup\nlookup = DataframeLookup(\n df=left_df,\n other=right_df,\n on=JoinMapping(source_column=\"id\", joined_column=\"id\"),\n targets=TargetColumn(target_column=\"value\", target_column_alias=\"right_value\"),\n how=JoinType.LEFT,\n)\n\noutput_df = lookup.transform()\n
output_df:
id value right_value 1 A A 2 B null In this example, the left_df
and right_df
dataframes are joined together using the id
column. The value
column from the right_df
is aliased as right_value
in the output dataframe.
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.df","title":"df class-attribute
instance-attribute
","text":"df: DataFrame = Field(default=None, description='The left Spark DataFrame')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.hint","title":"hint class-attribute
instance-attribute
","text":"hint: Optional[JoinHint] = Field(default=None, description='What type of join hint to use. Defaults to None. ' + __doc__)\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.how","title":"how class-attribute
instance-attribute
","text":"how: Optional[JoinType] = Field(default=LEFT, description='What type of join to perform. Defaults to left. ' + __doc__)\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.on","title":"on class-attribute
instance-attribute
","text":"on: Union[List[JoinMapping], JoinMapping] = Field(default=..., alias='join_mapping', description='List of join mappings. If only one mapping is passed, it can be passed as a single object.')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.other","title":"other class-attribute
instance-attribute
","text":"other: DataFrame = Field(default=None, description='The right Spark DataFrame')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.targets","title":"targets class-attribute
instance-attribute
","text":"targets: Union[List[TargetColumn], TargetColumn] = Field(default=..., alias='target_columns', description='List of target columns. If only one target is passed, it can be passed as a single object.')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.Output","title":"Output","text":"Output for the lookup transformation
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.Output.left_df","title":"left_df class-attribute
instance-attribute
","text":"left_df: DataFrame = Field(default=..., description='The left Spark DataFrame')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.Output.right_df","title":"right_df class-attribute
instance-attribute
","text":"right_df: DataFrame = Field(default=..., description='The right Spark DataFrame')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.execute","title":"execute","text":"execute() -> Output\n
Execute the lookup transformation
Source code in src/koheesio/spark/transformations/lookup.py
def execute(self) -> Output:\n \"\"\"Execute the lookup transformation\"\"\"\n # prepare the right dataframe\n prepared_right_df = self.get_right_df().select(\n *[join_mapping.column for join_mapping in self.on],\n *[target.column for target in self.targets],\n )\n if self.hint:\n prepared_right_df = prepared_right_df.hint(self.hint)\n\n # generate the output\n self.output.left_df = self.df\n self.output.right_df = prepared_right_df\n self.output.df = self.df.join(\n prepared_right_df,\n on=[join_mapping.source_column for join_mapping in self.on],\n how=self.how,\n )\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.get_right_df","title":"get_right_df","text":"get_right_df() -> DataFrame\n
Get the right side dataframe
Source code in src/koheesio/spark/transformations/lookup.py
def get_right_df(self) -> DataFrame:\n \"\"\"Get the right side dataframe\"\"\"\n return self.other\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.set_list","title":"set_list","text":"set_list(value)\n
Ensure that we can pass either a single object, or a list of objects
Source code in src/koheesio/spark/transformations/lookup.py
@field_validator(\"on\", \"targets\")\ndef set_list(cls, value):\n \"\"\"Ensure that we can pass either a single object, or a list of objects\"\"\"\n return [value] if not isinstance(value, list) else value\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinHint","title":"koheesio.spark.transformations.lookup.JoinHint","text":"Supported join hints
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinHint.BROADCAST","title":"BROADCAST class-attribute
instance-attribute
","text":"BROADCAST = 'broadcast'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinHint.MERGE","title":"MERGE class-attribute
instance-attribute
","text":"MERGE = 'merge'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinMapping","title":"koheesio.spark.transformations.lookup.JoinMapping","text":"Mapping for joining two dataframes together
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinMapping.column","title":"column property
","text":"column: Column\n
Get the join mapping as a pyspark.sql.Column object
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinMapping.other_column","title":"other_column instance-attribute
","text":"other_column: str\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinMapping.source_column","title":"source_column instance-attribute
","text":"source_column: str\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType","title":"koheesio.spark.transformations.lookup.JoinType","text":"Supported join types
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.ANTI","title":"ANTI class-attribute
instance-attribute
","text":"ANTI = 'anti'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.CROSS","title":"CROSS class-attribute
instance-attribute
","text":"CROSS = 'cross'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.FULL","title":"FULL class-attribute
instance-attribute
","text":"FULL = 'full'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.INNER","title":"INNER class-attribute
instance-attribute
","text":"INNER = 'inner'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.LEFT","title":"LEFT class-attribute
instance-attribute
","text":"LEFT = 'left'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.RIGHT","title":"RIGHT class-attribute
instance-attribute
","text":"RIGHT = 'right'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.SEMI","title":"SEMI class-attribute
instance-attribute
","text":"SEMI = 'semi'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.TargetColumn","title":"koheesio.spark.transformations.lookup.TargetColumn","text":"Target column for the joined dataframe
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.TargetColumn.column","title":"column property
","text":"column: Column\n
Get the target column as a pyspark.sql.Column object
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.TargetColumn.target_column","title":"target_column instance-attribute
","text":"target_column: str\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.TargetColumn.target_column_alias","title":"target_column_alias instance-attribute
","text":"target_column_alias: str\n
"},{"location":"api_reference/spark/transformations/repartition.html","title":"Repartition","text":"Repartition Transformation
"},{"location":"api_reference/spark/transformations/repartition.html#koheesio.spark.transformations.repartition.Repartition","title":"koheesio.spark.transformations.repartition.Repartition","text":"Wrapper around DataFrame.repartition
With repartition, the number of partitions can be given as an optional value. If this is not provided, a default value is used. The default number of partitions is defined by the spark config 'spark.sql.shuffle.partitions', for which the default value is 200; the resulting number of partitions will never exceed the number of rows in the DataFrame (whichever value is lower).
If columns are omitted, the entire DataFrame is repartitioned without considering the particular values in the columns.
Parameters:
Name Type Description Default column
Optional[Union[str, List[str]]]
Name of the source column(s). If omitted, the entire DataFrame is repartitioned without considering the particular values in the columns. Alias: columns
None
num_partitions
Optional[int]
The number of partitions to repartition to. If omitted, the default number of partitions is used as defined by the spark config 'spark.sql.shuffle.partitions'.
None
Example Repartition(column=[\"c1\", \"c2\"], num_partitions=3) # results in 3 partitions\nRepartition(column=\"c1\", num_partitions=2) # results in 2 partitions\nRepartition(column=[\"c1\", \"c2\"]) # results in <= 200 partitions\nRepartition(num_partitions=5) # results in 5 partitions\n
"},{"location":"api_reference/spark/transformations/repartition.html#koheesio.spark.transformations.repartition.Repartition.columns","title":"columns class-attribute
instance-attribute
","text":"columns: Optional[ListOfColumns] = Field(default='', alias='column', description='Name of the source column(s)')\n
"},{"location":"api_reference/spark/transformations/repartition.html#koheesio.spark.transformations.repartition.Repartition.numPartitions","title":"numPartitions class-attribute
instance-attribute
","text":"numPartitions: Optional[int] = Field(default=None, alias='num_partitions', description=\"The number of partitions to repartition to. If omitted, the default number of partitions is used as defined by the spark config 'spark.sql.shuffle.partitions'.\")\n
"},{"location":"api_reference/spark/transformations/repartition.html#koheesio.spark.transformations.repartition.Repartition.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/transformations/repartition.py
def execute(self):\n # Prepare columns input:\n columns = self.df.columns if self.columns == [\"*\"] else self.columns\n # Prepare repartition input:\n # num_partitions comes first, but if it is not provided it should not be included as None.\n repartition_inputs = [i for i in [self.numPartitions, *columns] if i]\n self.output.df = self.df.repartition(*repartition_inputs)\n
"},{"location":"api_reference/spark/transformations/replace.html","title":"Replace","text":"Transformation to replace a particular value in a column with another one
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace","title":"koheesio.spark.transformations.replace.Replace","text":"Replace a particular value in a column with another one
Can handle empty strings (\"\") as well as NULL / None values.
Unsupported datatypes: Following casts are not supported
will raise an error in Spark:
- binary
- boolean
- array<...>
- map<...,...>
Supported datatypes: The following casts are supported:
- byte
- short
- integer
- long
- float
- double
- decimal
- timestamp
- date
- string
- void skipped by default
Any supported non-string datatype will be cast to string before the replacement is done.
Example input_df:
id string 1 hello 2 world 3 output_df = Replace(\n column=\"string\",\n from_value=\"hello\",\n to_value=\"programmer\",\n).transform(input_df)\n
output_df:
id string 1 programmer 2 world 3 In this example, the value \"hello\" in the column \"string\" is replaced with \"programmer\".
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.from_value","title":"from_value class-attribute
instance-attribute
","text":"from_value: Optional[str] = Field(default=None, alias='from', description=\"The original value that needs to be replaced. If no value is given, all 'null' values will be replaced with the to_value\")\n
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.to_value","title":"to_value class-attribute
instance-attribute
","text":"to_value: str = Field(default=..., alias='to', description='The new value to replace this with')\n
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.ColumnConfig","title":"ColumnConfig","text":"Column type configurations for the column to be replaced
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, VOID]\n
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, STRING, TIMESTAMP, DATE]\n
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/replace.py
def func(self, column: Column) -> Column:\n return replace(column=column, from_value=self.from_value, to_value=self.to_value)\n
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.replace","title":"koheesio.spark.transformations.replace.replace","text":"replace(column: Union[Column, str], to_value: str, from_value: Optional[str] = None)\n
Function to replace a particular value in a column with another one
Source code in src/koheesio/spark/transformations/replace.py
def replace(column: Union[Column, str], to_value: str, from_value: Optional[str] = None):\n \"\"\"Function to replace a particular value in a column with another one\"\"\"\n # make sure we have a Column object\n if isinstance(column, str):\n column = col(column)\n\n if not from_value:\n condition = column.isNull()\n else:\n condition = column == from_value\n\n return when(condition, lit(to_value)).otherwise(column)\n
"},{"location":"api_reference/spark/transformations/row_number_dedup.html","title":"Row number dedup","text":"This module contains the RowNumberDedup class, which performs a row_number deduplication operation on a DataFrame.
See the docstring of the RowNumberDedup class for more information.
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup","title":"koheesio.spark.transformations.row_number_dedup.RowNumberDedup","text":"A class used to perform a row_number deduplication operation on a DataFrame.
This class is a specialized transformation that extends the ColumnsTransformation class. It sorts the DataFrame based on the provided sort columns and assigns a row_number to each row. It then filters the DataFrame to keep only the top-row_number row for each group of duplicates. The row_number of each row can be stored in a specified target column or a default column named \"meta_row_number_column\". The class also provides an option to preserve meta columns (like the row_numberk column) in the output DataFrame.
Attributes:
Name Type Description columns
list
List of columns to apply the transformation to. If a single '*' is passed as a column name or if the columns parameter is omitted, the transformation will be applied to all columns of the data types specified in run_for_all_data_type
of the ColumnConfig. (inherited from ColumnsTransformation)
sort_columns
list
List of columns that the DataFrame will be sorted by.
target_column
(str, optional)
Column where the row_number of each row will be stored.
preserve_meta
(bool, optional)
Flag that determines whether the meta columns should be kept in the output DataFrame.
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.preserve_meta","title":"preserve_meta class-attribute
instance-attribute
","text":"preserve_meta: bool = Field(default=False, description=\"If true, meta columns are kept in output dataframe. Defaults to 'False'\")\n
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.sort_columns","title":"sort_columns class-attribute
instance-attribute
","text":"sort_columns: conlist(Union[str, Column], min_length=0) = Field(default_factory=list, alias='sort_column', description='List of orderBy columns. If only one column is passed, it can be passed as a single object.')\n
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.target_column","title":"target_column class-attribute
instance-attribute
","text":"target_column: Optional[Union[str, Column]] = Field(default='meta_row_number_column', alias='target_suffix', description='The column to store the result in. If not provided, the result will be stored in the sourcecolumn. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix')\n
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.window_spec","title":"window_spec property
","text":"window_spec: WindowSpec\n
Builds a WindowSpec object based on the columns defined in the configuration.
The WindowSpec object is used to define a window frame over which functions are applied in Spark. This method partitions the data by the columns returned by the get_columns
method and then orders the partitions by the columns specified in sort_columns
.
Notes The order of the columns in the WindowSpec object is preserved. If a column is passed as a string, it is converted to a Column object with DESC ordering.
Returns:
Type Description WindowSpec
A WindowSpec object that can be used to define a window frame in Spark.
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.execute","title":"execute","text":"execute() -> Output\n
Performs the row_number deduplication operation on the DataFrame.
This method sorts the DataFrame based on the provided sort columns, assigns a row_number to each row, and then filters the DataFrame to keep only the top-row_number row for each group of duplicates. The row_number of each row is stored in the target column. If preserve_meta is False, the method also drops the target column from the DataFrame.
Source code in src/koheesio/spark/transformations/row_number_dedup.py
def execute(self) -> RowNumberDedup.Output:\n \"\"\"\n Performs the row_number deduplication operation on the DataFrame.\n\n This method sorts the DataFrame based on the provided sort columns, assigns a row_number to each row,\n and then filters the DataFrame to keep only the top-row_number row for each group of duplicates.\n The row_number of each row is stored in the target column. If preserve_meta is False,\n the method also drops the target column from the DataFrame.\n \"\"\"\n df = self.df\n window_spec = self.window_spec\n\n # if target_column is a string, convert it to a Column object\n if isinstance((target_column := self.target_column), str):\n target_column = col(target_column)\n\n # dedup the dataframe based on the window spec\n df = df.withColumn(self.target_column, row_number().over(window_spec)).filter(target_column == 1).select(\"*\")\n\n if not self.preserve_meta:\n df = df.drop(target_column)\n\n self.output.df = df\n
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.set_sort_columns","title":"set_sort_columns","text":"set_sort_columns(columns_value)\n
Validates and optimizes the sort_columns parameter.
This method ensures that sort_columns is a list (or single object) of unique strings or Column objects. It removes any empty strings or None values from the list and deduplicates the columns.
Parameters:
Name Type Description Default columns_value
Union[str, Column, List[Union[str, Column]]]
The value of the sort_columns parameter.
required Returns:
Type Description List[Union[str, Column]]
The optimized and deduplicated list of sort columns.
Source code in src/koheesio/spark/transformations/row_number_dedup.py
@field_validator(\"sort_columns\", mode=\"before\")\ndef set_sort_columns(cls, columns_value):\n \"\"\"\n Validates and optimizes the sort_columns parameter.\n\n This method ensures that sort_columns is a list (or single object) of unique strings or Column objects.\n It removes any empty strings or None values from the list and deduplicates the columns.\n\n Parameters\n ----------\n columns_value : Union[str, Column, List[Union[str, Column]]]\n The value of the sort_columns parameter.\n\n Returns\n -------\n List[Union[str, Column]]\n The optimized and deduplicated list of sort columns.\n \"\"\"\n # Convert single string or Column object to a list\n columns = [columns_value] if isinstance(columns_value, (str, Column)) else [*columns_value]\n\n # Remove empty strings, None, etc.\n columns = [c for c in columns if (isinstance(c, Column) and c is not None) or (isinstance(c, str) and c)]\n\n dedup_columns = []\n seen = set()\n\n # Deduplicate the columns while preserving the order\n for column in columns:\n if str(column) not in seen:\n dedup_columns.append(column)\n seen.add(str(column))\n\n return dedup_columns\n
"},{"location":"api_reference/spark/transformations/sql_transform.html","title":"Sql transform","text":"SQL Transform module
SQL Transform module provides an easy interface to transform a dataframe using SQL. This SQL can originate from a string or a file and may contain placeholders for templating.
"},{"location":"api_reference/spark/transformations/sql_transform.html#koheesio.spark.transformations.sql_transform.SqlTransform","title":"koheesio.spark.transformations.sql_transform.SqlTransform","text":"SQL Transform module provides an easy interface to transform a dataframe using SQL.
This SQL can originate from a string or a file and may contain placeholder (parameters) for templating.
- Placeholders are identified with
${placeholder}
. - Placeholders can be passed as explicit params (params) or as implicit params (kwargs).
Example sql script:
SELECT id, id + 1 AS incremented_id, ${dynamic_column} AS extra_column\nFROM ${table_name}\n
"},{"location":"api_reference/spark/transformations/sql_transform.html#koheesio.spark.transformations.sql_transform.SqlTransform.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/transformations/sql_transform.py
def execute(self):\n table_name = get_random_string(prefix=\"sql_transform\")\n self.params = {**self.params, \"table_name\": table_name}\n\n df = self.df\n df.createOrReplaceTempView(table_name)\n query = self.query\n\n self.output.df = self.spark.sql(query)\n
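As a sketch only: ${table_name} is filled in by execute() with the temporary view it creates, while other placeholders can be passed as keyword arguments. The sql parameter name and the column expression below are assumptions, not part of this page: from koheesio.spark.transformations.sql_transform import SqlTransform\n\n# hypothetical example; placeholders are resolved before the query runs against the temp view\noutput_df = SqlTransform(\n    sql=\"SELECT id, ${dynamic_column} AS extra_column FROM ${table_name}\",\n    dynamic_column=\"id + 1\",\n).transform(input_df)\n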
"},{"location":"api_reference/spark/transformations/transform.html","title":"Transform","text":"Transform module
Transform aims to provide an easy interface for calling transformations on a Spark DataFrame, where the transformation is a function that accepts a DataFrame (df) and any number of keyword args.
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform","title":"koheesio.spark.transformations.transform.Transform","text":"Transform(func: Callable, params: Dict = None, df: DataFrame = None, **kwargs)\n
Transform aims to provide an easy interface for calling transformations on a Spark DataFrame, where the transformation is a function that accepts a DataFrame (df) and any number of keyword args.
The implementation is inspired by and based upon: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.transform.html
Parameters:
Name Type Description Default func
Callable
The function to be called on the DataFrame.
required params
Dict
The keyword arguments to be passed to the function. Defaults to None. Alternatively, keyword arguments can be passed directly as keyword arguments - they will be merged with the params
dictionary.
None
Example Source code in src/koheesio/spark/transformations/transform.py
def __init__(self, func: Callable, params: Dict = None, df: DataFrame = None, **kwargs):\n params = {**(params or {}), **kwargs}\n super().__init__(func=func, params=params, df=df)\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform--a-function-compatible-with-transform","title":"a function compatible with Transform:","text":"def some_func(df, a: str, b: str):\n return df.withColumn(a, f.lit(b))\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform--verbose-style-input-in-transform","title":"verbose style input in Transform","text":"Transform(func=some_func, params={\"a\": \"foo\", \"b\": \"bar\"})\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform--shortened-style-notation-easier-to-read","title":"shortened style notation (easier to read)","text":"Transform(some_func, a=\"foo\", b=\"bar\")\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform--when-too-much-input-is-given-transform-will-ignore-extra-input","title":"when too much input is given, Transform will ignore extra input","text":"Transform(\n some_func,\n a=\"foo\",\n # ignored input\n c=\"baz\",\n title=42,\n author=\"Adams\",\n # order of params input should not matter\n b=\"bar\",\n)\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform--using-the-from_func-classmethod","title":"using the from_func classmethod","text":"SomeFunc = Transform.from_func(some_func, a=\"foo\")\nsome_func = SomeFunc(b=\"bar\")\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform.func","title":"func class-attribute
instance-attribute
","text":"func: Callable = Field(default=None, description='The function to be called on the DataFrame.')\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform.execute","title":"execute","text":"execute()\n
Call the function on the DataFrame with the given keyword arguments.
Source code in src/koheesio/spark/transformations/transform.py
def execute(self):\n \"\"\"Call the function on the DataFrame with the given keyword arguments.\"\"\"\n func, kwargs = get_args_for_func(self.func, self.params)\n self.output.df = self.df.transform(func=func, **kwargs)\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform.from_func","title":"from_func classmethod
","text":"from_func(func: Callable, **kwargs) -> Callable[..., Transform]\n
Create a Transform class from a function. Useful for creating a new class with a different name.
This method uses the functools.partial
function to create a new class with the given function and keyword arguments. This way you can pre-define some of the keyword arguments for the function that might be needed for the specific use case.
Example CustomTransform = Transform.from_func(some_func, a=\"foo\")\nsome_func = CustomTransform(b=\"bar\")\n
In this example, CustomTransform
is a Transform class with the function some_func
and the keyword argument a
set to \"foo\". When calling some_func(b=\"bar\")
, the function some_func
will be called with the keyword arguments a=\"foo\"
and b=\"bar\"
.
Source code in src/koheesio/spark/transformations/transform.py
@classmethod\ndef from_func(cls, func: Callable, **kwargs) -> Callable[..., Transform]:\n \"\"\"Create a Transform class from a function. Useful for creating a new class with a different name.\n\n This method uses the `functools.partial` function to create a new class with the given function and keyword\n arguments. This way you can pre-define some of the keyword arguments for the function that might be needed for\n the specific use case.\n\n Example\n -------\n ```python\n CustomTransform = Transform.from_func(some_func, a=\"foo\")\n some_func = CustomTransform(b=\"bar\")\n ```\n\n In this example, `CustomTransform` is a Transform class with the function `some_func` and the keyword argument\n `a` set to \"foo\". When calling `some_func(b=\"bar\")`, the function `some_func` will be called with the keyword\n arguments `a=\"foo\"` and `b=\"bar\"`.\n \"\"\"\n return partial(cls, func=func, **kwargs)\n
"},{"location":"api_reference/spark/transformations/uuid5.html","title":"Uuid5","text":"Ability to generate UUID5 using native pyspark (no udf)
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5","title":"koheesio.spark.transformations.uuid5.HashUUID5","text":"Generate a UUID with the UUID5 algorithm
Spark does not provide inbuilt API to generate version 5 UUID, hence we have to use a custom implementation to provide this capability.
Prerequisites: this function has no side effects. But be aware that in most cases, the expectation is that your data is clean (e.g. trimmed of leading and trailing spaces)
Concept UUID5 is based on the SHA-1 hash of a namespace identifier (which is a UUID) and a name (which is a string). https://docs.python.org/3/library/uuid.html#uuid.uuid5
Based on https://github.com/MrPowers/quinn/pull/96 with the difference that since Spark 3.0.0 an OVERLAY function from ANSI SQL 2016 is available which saves coding space and string allocation(s) in place of CONCAT + SUBSTRING.
For more info on OVERLAY, see: https://docs.databricks.com/sql/language-manual/functions/overlay.html
Example Input is a DataFrame with two columns:
id string 1 hello 2 world 3 Input parameters:
- source_columns = [\"id\", \"string\"]
- target_column = \"uuid5\"
Result:
id string uuid5 1 hello f3e99bbd-85ae-5dc3-bf6e-cd0022a0ebe6 2 world b48e880f-c289-5c94-b51f-b9d21f9616c0 3 2193a99d-222e-5a0c-a7d6-48fbe78d2708 In code:
HashUUID5(source_columns=[\"id\", \"string\"], target_column=\"uuid5\").transform(input_df)\n
In this example, the id
and string
columns are concatenated and hashed using the UUID5 algorithm. The result is stored in the uuid5
column.
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.delimiter","title":"delimiter class-attribute
instance-attribute
","text":"delimiter: Optional[str] = Field(default='|', description='Separator for the string that will eventually be hashed')\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.description","title":"description class-attribute
instance-attribute
","text":"description: str = 'Generate a UUID with the UUID5 algorithm'\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.extra_string","title":"extra_string class-attribute
instance-attribute
","text":"extra_string: Optional[str] = Field(default='', description='In case of collisions, one can pass an extra string to hash on.')\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.namespace","title":"namespace class-attribute
instance-attribute
","text":"namespace: Optional[Union[str, UUID]] = Field(default='', description='Namespace DNS')\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.source_columns","title":"source_columns class-attribute
instance-attribute
","text":"source_columns: ListOfColumns = Field(default=..., description=\"List of columns that should be hashed. Should contain the name of at least 1 column. A list of columns or a single column can be specified. For example: `['column1', 'column2']` or `'column1'`\")\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.target_column","title":"target_column class-attribute
instance-attribute
","text":"target_column: str = Field(default=..., description='The generated UUID will be written to the column name specified here')\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.execute","title":"execute","text":"execute() -> None\n
Source code in src/koheesio/spark/transformations/uuid5.py
def execute(self) -> None:\n ns = f.lit(uuid5_namespace(self.namespace).bytes)\n self.log.info(f\"UUID5 namespace '{ns}' derived from '{self.namespace}'\")\n cols_to_hash = f.concat_ws(self.delimiter, *self.source_columns)\n cols_to_hash = f.concat(f.lit(self.extra_string), cols_to_hash)\n cols_to_hash = f.encode(cols_to_hash, \"utf-8\")\n cols_to_hash = f.concat(ns, cols_to_hash)\n source_columns_sha1 = f.sha1(cols_to_hash)\n variant_part = f.substring(source_columns_sha1, 17, 4)\n variant_part = f.conv(variant_part, 16, 2)\n variant_part = f.lpad(variant_part, 16, \"0\")\n variant_part = f.overlay(variant_part, f.lit(\"10\"), 1, 2) # RFC 4122 variant.\n variant_part = f.lower(f.conv(variant_part, 2, 16))\n target_col_uuid = f.concat_ws(\n \"-\",\n f.substring(source_columns_sha1, 1, 8),\n f.substring(source_columns_sha1, 9, 4),\n f.concat(f.lit(\"5\"), f.substring(source_columns_sha1, 14, 3)), # Set version.\n variant_part,\n f.substring(source_columns_sha1, 21, 12),\n )\n # Applying the transformation to the input df, storing the result in the column specified in `target_column`.\n self.output.df = self.df.withColumn(self.target_column, target_col_uuid)\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.hash_uuid5","title":"koheesio.spark.transformations.uuid5.hash_uuid5","text":"hash_uuid5(input_value: str, namespace: Optional[Union[str, UUID]] = '', extra_string: Optional[str] = '')\n
pure python implementation of HashUUID5
See: https://docs.python.org/3/library/uuid.html#uuid.uuid5
Parameters:
Name Type Description Default input_value
str
value that will be hashed
required namespace
Optional[str | UUID]
namespace DNS
''
extra_string
Optional[str]
optional extra string that will be prepended to the input_value
''
Returns:
Type Description str
uuid.UUID (uuid5) cast to string
Source code in src/koheesio/spark/transformations/uuid5.py
def hash_uuid5(\n input_value: str,\n namespace: Optional[Union[str, uuid.UUID]] = \"\",\n extra_string: Optional[str] = \"\",\n):\n \"\"\"pure python implementation of HashUUID5\n\n See: https://docs.python.org/3/library/uuid.html#uuid.uuid5\n\n Parameters\n ----------\n input_value : str\n value that will be hashed\n namespace : Optional[str | uuid.UUID]\n namespace DNS\n extra_string : Optional[str]\n optional extra string that will be prepended to the input_value\n\n Returns\n -------\n str\n uuid.UUID (uuid5) cast to string\n \"\"\"\n if not isinstance(namespace, uuid.UUID):\n hashed_namespace = uuid5_namespace(namespace)\n else:\n hashed_namespace = namespace\n return str(uuid.uuid5(hashed_namespace, (extra_string + input_value)))\n
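For illustration, calling the helper directly (the input value and namespace strings are arbitrary): from koheesio.spark.transformations.uuid5 import hash_uuid5\n\n# returns a deterministic UUID5 string for the given input\nuuid_str = hash_uuid5(input_value=\"hello|world\", namespace=\"my-namespace\")\n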
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.uuid5_namespace","title":"koheesio.spark.transformations.uuid5.uuid5_namespace","text":"uuid5_namespace(ns: Optional[Union[str, UUID]]) -> UUID\n
Helper function used to provide a UUID5 hashed namespace based on the passed str
Parameters:
Name Type Description Default ns
Optional[Union[str, UUID]]
A str, an empty string (or None), or an existing UUID can be passed
required Returns:
Type Description UUID
UUID5 hashed namespace
Source code in src/koheesio/spark/transformations/uuid5.py
def uuid5_namespace(ns: Optional[Union[str, uuid.UUID]]) -> uuid.UUID:\n \"\"\"Helper function used to provide a UUID5 hashed namespace based on the passed str\n\n Parameters\n ----------\n ns : Optional[Union[str, uuid.UUID]]\n A str, an empty string (or None), or an existing UUID can be passed\n\n Returns\n -------\n uuid.UUID\n UUID5 hashed namespace\n \"\"\"\n # if we already have a UUID, we just return it\n if isinstance(ns, uuid.UUID):\n return ns\n\n # if ns is empty or none, we simply return the default NAMESPACE_DNS\n if not ns:\n ns = uuid.NAMESPACE_DNS\n return ns\n\n # else we hash the string against the NAMESPACE_DNS\n ns = uuid.uuid5(uuid.NAMESPACE_DNS, ns)\n return ns\n
"},{"location":"api_reference/spark/transformations/date_time/index.html","title":"Date time","text":"Module that holds the transformations that can be used for date and time related operations.
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone","title":"koheesio.spark.transformations.date_time.ChangeTimeZone","text":"Allows for the value of a column to be changed from one timezone to another
Adding useful metadata When add_target_timezone
is enabled (default), an additional column is created documenting which timezone a field has been converted to. Additionally, the suffix added to this column can be customized (default value is _timezone
).
Example Input:
target_column = \"some_column_name\"\ntarget_timezone = \"EST\"\nadd_target_timezone = True # default value\ntimezone_column_suffix = \"_timezone\" # default value\n
Output:
column name = \"some_column_name_timezone\" # notice the suffix\ncolumn value = \"EST\"\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.add_target_timezone","title":"add_target_timezone class-attribute
instance-attribute
","text":"add_target_timezone: bool = Field(default=True, description='Toggles whether the target timezone is added as a column. True by default.')\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.from_timezone","title":"from_timezone class-attribute
instance-attribute
","text":"from_timezone: str = Field(default=..., alias='source_timezone', description='Timezone of the source_column value. Timezone fields are validated against the `TZ database name` column in this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones')\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.target_timezone_column_suffix","title":"target_timezone_column_suffix class-attribute
instance-attribute
","text":"target_timezone_column_suffix: Optional[str] = Field(default='_timezone', alias='suffix', description=\"Allows to customize the suffix that is added to the target_timezone column. Defaults to '_timezone'. Note: this will be ignored if 'add_target_timezone' is set to False\")\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.to_timezone","title":"to_timezone class-attribute
instance-attribute
","text":"to_timezone: str = Field(default=..., alias='target_timezone', description='Target timezone. Timezone fields are validated against the `TZ database name` column in this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones')\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/transformations/date_time/__init__.py
def execute(self):\n df = self.df\n\n for target_column, column in self.get_columns_with_target():\n func = self.func # select the applicable function\n df = df.withColumn(\n target_column,\n func(f.col(column)),\n )\n\n # document which timezone a field has been converted to\n if self.add_target_timezone:\n df = df.withColumn(f\"{target_column}{self.target_timezone_column_suffix}\", f.lit(self.to_timezone))\n\n self.output.df = df\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/date_time/__init__.py
def func(self, column: Column) -> Column:\n return change_timezone(column=column, source_timezone=self.from_timezone, target_timezone=self.to_timezone)\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.validate_no_duplicate_timezones","title":"validate_no_duplicate_timezones","text":"validate_no_duplicate_timezones(values)\n
Validate that source and target timezone are not the same
Source code in src/koheesio/spark/transformations/date_time/__init__.py
@model_validator(mode=\"before\")\ndef validate_no_duplicate_timezones(cls, values):\n    \"\"\"Validate that source and target timezone are not the same\"\"\"\n    from_timezone_value = values.get(\"from_timezone\")\n    to_timezone_value = values.get(\"to_timezone\")\n\n    if from_timezone_value == to_timezone_value:\n        raise ValueError(\"Timezone conversions from and to the same timezones are not valid.\")\n\n    return values\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.validate_timezone","title":"validate_timezone","text":"validate_timezone(timezone_value)\n
Validate that the timezone is a valid timezone.
Source code in src/koheesio/spark/transformations/date_time/__init__.py
@field_validator(\"from_timezone\", \"to_timezone\")\ndef validate_timezone(cls, timezone_value):\n \"\"\"Validate that the timezone is a valid timezone.\"\"\"\n if timezone_value not in all_timezones_set:\n raise ValueError(\n \"Not a valid timezone. Refer to the `TZ database name` column here: \"\n \"https://en.wikipedia.org/wiki/List_of_tz_database_time_zones\"\n )\n return timezone_value\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.DateFormat","title":"koheesio.spark.transformations.date_time.DateFormat","text":"wrapper around pyspark.sql.functions.date_format
See Also - https://spark.apache.org/docs/3.3.2/api/python/reference/pyspark.sql/api/pyspark.sql.functions.date_format.html
- https://spark.apache.org/docs/3.3.2/sql-ref-datetime-pattern.html
Concept This Transformation allows to convert a date/timestamp/string to a value of string in the format specified by the date format given.
A pattern could be for instance dd.MM.yyyy
and could return a string like \u201818.03.1993\u2019. All pattern letters of datetime pattern can be used, see: https://spark.apache.org/docs/3.3.2/sql-ref-datetime-pattern.html
How to use If more than one column is passed, the behavior of the Class changes this way
- the transformation will be run in a loop against all the given columns
- the target_column will be used as a suffix. Leaving this blank will result in the original columns being renamed.
Example source_column value: datetime.date(2020, 1, 1)\ntarget: \"yyyyMMdd HH:mm\"\noutput: \"20200101 00:00\"\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.DateFormat.format","title":"format class-attribute
instance-attribute
","text":"format: str = Field(..., description='The format for the resulting string. See https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html')\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.DateFormat.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/date_time/__init__.py
def func(self, column: Column) -> Column:\n return date_format(column, self.format)\n
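For illustration (the column names and input_df are hypothetical), DateFormat could be used as follows: from koheesio.spark.transformations.date_time import DateFormat\n\n# render created_at as a 'yyyyMMdd HH:mm' string in created_at_formatted\noutput_df = DateFormat(\n    column=\"created_at\",\n    target_column=\"created_at_formatted\",\n    format=\"yyyyMMdd HH:mm\",\n).transform(input_df)\n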
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ToTimestamp","title":"koheesio.spark.transformations.date_time.ToTimestamp","text":"wrapper around pyspark.sql.functions.to_timestamp
Converts a Column (or set of Columns) into pyspark.sql.types.TimestampType
using the specified format. Specify formats according to datetime pattern https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
_.
Functionally equivalent to col.cast(\"timestamp\").
See Also Related Koheesio classes:
- koheesio.spark.transformations.ColumnsTransformation : Base class for ColumnsTransformation. Defines column / columns field + recursive logic
- koheesio.spark.transformations.ColumnsTransformationWithTarget : Defines target_column / target_suffix field
pyspark.sql.functions:
- datetime pattern : https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
Example"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ToTimestamp--basic-usage-example","title":"Basic usage example:","text":"input_df:
t \"1997-02-28 10:30:00\" t
is a string
tts = ToTimestamp(\n # since the source column is the same as the target in this example, 't' will be overwritten\n column=\"t\",\n target_column=\"t\",\n format=\"yyyy-MM-dd HH:mm:ss\",\n)\noutput_df = tts.transform(input_df)\n
output_df:
t datetime.datetime(1997, 2, 28, 10, 30) Now t
is a timestamp
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ToTimestamp--multiple-columns-at-once","title":"Multiple columns at once:","text":"input_df:
t1 t2 \"1997-02-28 10:30:00\" \"2007-03-31 11:40:10\" t1
and t2
are strings
tts = ToTimestamp(\n columns=[\"t1\", \"t2\"],\n # 'target_suffix' is synonymous with 'target_column'\n target_suffix=\"new\",\n format=\"yyyy-MM-dd HH:mm:ss\",\n)\noutput_df = tts.transform(input_df).select(\"t1_new\", \"t2_new\")\n
output_df:
t1_new t2_new datetime.datetime(1997, 2, 28, 10, 30) datetime.datetime(2007, 3, 31, 11, 40) Now t1_new
and t2_new
are both timestamps
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ToTimestamp.format","title":"format class-attribute
instance-attribute
","text":"format: str = Field(default=..., description='The date format for of the timestamp field. See https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html')\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ToTimestamp.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/date_time/__init__.py
def func(self, column: Column) -> Column:\n # convert string to timestamp\n converted_col = to_timestamp(column, self.format)\n return when(column.isNull(), lit(None)).otherwise(converted_col)\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.change_timezone","title":"koheesio.spark.transformations.date_time.change_timezone","text":"change_timezone(column: Union[str, Column], source_timezone: str, target_timezone: str)\n
Helper function to change from one timezone to another
wrapper around pyspark.sql.functions.from_utc_timestamp
and to_utc_timestamp
Parameters:
Name Type Description Default column
Union[str, Column]
The column to change the timezone of
required source_timezone
str
The timezone of the source_column value. Timezone fields are validated against the TZ database name
column in this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones
required target_timezone
str
The target timezone. Timezone fields are validated against the TZ database name
column in this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones
required Source code in src/koheesio/spark/transformations/date_time/__init__.py
def change_timezone(column: Union[str, Column], source_timezone: str, target_timezone: str):\n \"\"\"Helper function to change from one timezone to another\n\n wrapper around `pyspark.sql.functions.from_utc_timestamp` and `to_utc_timestamp`\n\n Parameters\n ----------\n column : Union[str, Column]\n The column to change the timezone of\n source_timezone : str\n The timezone of the source_column value. Timezone fields are validated against the `TZ database name` column in\n this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones\n target_timezone : str\n The target timezone. Timezone fields are validated against the `TZ database name` column in this list:\n https://en.wikipedia.org/wiki/List_of_tz_database_time_zones\n\n \"\"\"\n column = col(column) if isinstance(column, str) else column\n return from_utc_timestamp((to_utc_timestamp(column, source_timezone)), target_timezone)\n
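A one-line sketch of the helper itself (the ts_utc and ts_local column names are hypothetical): from koheesio.spark.transformations.date_time import change_timezone\n\n# shift ts_utc from UTC to America/New_York and store the result as ts_local\ndf = df.withColumn(\"ts_local\", change_timezone(\"ts_utc\", \"UTC\", \"America/New_York\"))\n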
"},{"location":"api_reference/spark/transformations/date_time/interval.html","title":"Interval","text":"This module provides a DateTimeColumn
class that extends the Column
class from PySpark. It allows for adding or subtracting an interval value from a datetime column.
This can be used to reflect a change in a given date / time column in a more human-readable way.
Please refer to the Spark SQL documentation for a list of valid interval values: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#interval-literal
Background The aim is to easily add an 'interval' value to, or subtract one from, a datetime column. An interval value is a string that represents a time interval. For example, '1 day', '1 month', '5 years', '1 minute 30 seconds', '10 milliseconds', etc. These can be used to reflect a change in a given date / time column in a more human-readable way.
Typically, this can be done using the date_add()
and date_sub()
functions in Spark SQL. However, these functions only support adding or subtracting a single unit of time measured in days. Using an interval gives us much more flexibility; however, Spark SQL does not provide a function to add or subtract an interval value from a datetime column through the python API directly, so we have to use the expr()
function to express the operation in SQL directly.
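As a sketch of the point above (not part of this module), the same interval arithmetic can be written against the SQL API directly with expr(); the column name 'my_column' is assumed: from pyspark.sql.functions import expr\n\n# add an interval to a timestamp column using Spark SQL's try_add, mirroring what adjust_time() builds internally\ndf = df.withColumn(\"one_day_later\", expr(\"try_add(my_column, interval '1 day')\"))\n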
This module provides a DateTimeColumn
class that extends the Column
class from PySpark. It allows for adding or subtracting an interval value from a datetime column using the +
and -
operators.
Additionally, this module provides two transformation classes that can be used as a transformation step in a pipeline:
DateTimeAddInterval
: adds an interval value to a datetime column DateTimeSubtractInterval
: subtracts an interval value from a datetime column
These classes are subclasses of ColumnsTransformationWithTarget
and hence can be used to perform transformations on multiple columns at once.
The above transformations both use the provided adjust_time()
function to perform the actual transformation.
See also: Related Koheesio classes:
- koheesio.spark.transformations.ColumnsTransformation : Base class for ColumnsTransformation. Defines column / columns field + recursive logic
- koheesio.spark.transformations.ColumnsTransformationWithTarget : Defines target_column / target_suffix field
pyspark.sql.functions:
- https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#interval-literal
- https://spark.apache.org/docs/latest/api/sql/index.html
- https://spark.apache.org/docs/latest/api/sql/#try_add
- https://spark.apache.org/docs/latest/api/sql/#try_subtract
Classes:
Name Description DateTimeColumn
A datetime column that can be adjusted by adding or subtracting an interval value using the +
and -
operators.
DateTimeAddInterval
A transformation that adds an interval value to a datetime column. This class is a subclass of ColumnsTransformationWithTarget
and hence can be used as a transformation step in a pipeline. See ColumnsTransformationWithTarget
for more information.
DateTimeSubtractInterval
A transformation that subtracts an interval value from a datetime column. This class is a subclass of ColumnsTransformationWithTarget
and hence can be used as a transformation step in a pipeline. See ColumnsTransformationWithTarget
for more information.
Note the DateTimeAddInterval
and DateTimeSubtractInterval
classes are very similar. The only difference is that one adds an interval value to a datetime column, while the other subtracts an interval value from a datetime column.
Functions:
Name Description dt_column
Converts a column to a DateTimeColumn
. This function aims to be a drop-in replacement for pyspark.sql.functions.col
that returns a DateTimeColumn
instead of a Column
.
adjust_time
Adjusts a datetime column by adding or subtracting an interval value.
validate_interval
Validates a given interval string.
Example"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval--various-ways-to-create-and-interact-with-datetimecolumn","title":"Various ways to create and interact with DateTimeColumn
:","text":" - Create a
DateTimeColumn
from a string: dt_column(\"my_column\")
- Create a
DateTimeColumn
from a Column
: dt_column(df.my_column)
- Use the
+
and -
operators to add or subtract an interval value from a DateTimeColumn
: dt_column(\"my_column\") + \"1 day\"
dt_column(\"my_column\") - \"1 month\"
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval--functional-examples-using-adjust_time","title":"Functional examples using adjust_time()
:","text":" - Add 1 day to a column:
adjust_time(\"my_column\", operation=\"add\", interval=\"1 day\")
- Subtract 1 month from a column:
adjust_time(\"my_column\", operation=\"subtract\", interval=\"1 month\")
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval--as-a-transformation-step","title":"As a transformation step:","text":"from koheesio.spark.transformations.date_time.interval import (\n DateTimeAddInterval,\n)\n\ninput_df = spark.createDataFrame([(1, \"2022-01-01 00:00:00\")], [\"id\", \"my_column\"])\n\n# add 1 day to my_column and store the result in a new column called 'one_day_later'\noutput_df = DateTimeAddInterval(column=\"my_column\", target_column=\"one_day_later\", interval=\"1 day\").transform(input_df)\n
output_df: id my_column one_day_later 1 2022-01-01 00:00:00 2022-01-02 00:00:00 DateTimeSubtractInterval
works in a similar way, but subtracts an interval value from a datetime column.
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.Operations","title":"koheesio.spark.transformations.date_time.interval.Operations module-attribute
","text":"Operations = Literal['add', 'subtract']\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval","title":"koheesio.spark.transformations.date_time.interval.DateTimeAddInterval","text":"A transformation that adds or subtracts a specified interval from a datetime column.
See also: pyspark.sql.functions:
- https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#interval-literal
- https://spark.apache.org/docs/latest/api/sql/index.html#interval
Parameters:
Name Type Description Default interval
str
The interval to add to the datetime column.
required operation
Operations
The operation to perform. Must be either 'add' or 'subtract'.
add
Example"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval--add-1-day-to-a-column","title":"add 1 day to a column","text":"DateTimeAddInterval(\n column=\"my_column\",\n interval=\"1 day\",\n).transform(df)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval--subtract-1-month-from-my_column-and-store-the-result-in-a-new-column-called-one_month_earlier","title":"subtract 1 month from my_column
and store the result in a new column called one_month_earlier
","text":"DateTimeSubtractInterval(\n column=\"my_column\",\n target_column=\"one_month_earlier\",\n interval=\"1 month\",\n)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval.interval","title":"interval class-attribute
instance-attribute
","text":"interval: str = Field(default=..., description='The interval to add to the datetime column.', examples=['1 day', '5 years', '3 months'])\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval.operation","title":"operation class-attribute
instance-attribute
","text":"operation: Operations = Field(default='add', description=\"The operation to perform. Must be either 'add' or 'subtract'.\")\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval.validate_interval","title":"validate_interval class-attribute
instance-attribute
","text":"validate_interval = field_validator('interval')(validate_interval)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/date_time/interval.py
def func(self, column: Column):\n return adjust_time(column, operation=self.operation, interval=self.interval)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeColumn","title":"koheesio.spark.transformations.date_time.interval.DateTimeColumn","text":"A datetime column that can be adjusted by adding or subtracting an interval value using the +
and -
operators.
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeColumn.from_column","title":"from_column classmethod
","text":"from_column(column: Column)\n
Create a DateTimeColumn from an existing Column
Source code in src/koheesio/spark/transformations/date_time/interval.py
@classmethod\ndef from_column(cls, column: Column):\n \"\"\"Create a DateTimeColumn from an existing Column\"\"\"\n return cls(column._jc)\n
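A brief usage sketch (assuming a DataFrame df with a timestamp column 'my_column'); dt_column is the usual way to obtain a DateTimeColumn and calls from_column under the hood, and the result of the + operator is assumed to be usable like any other Column: from koheesio.spark.transformations.date_time.interval import dt_column\n\n# add one day to 'my_column' using the overloaded + operator\ndf = df.withColumn(\"one_day_later\", dt_column(\"my_column\") + \"1 day\")\n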
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeSubtractInterval","title":"koheesio.spark.transformations.date_time.interval.DateTimeSubtractInterval","text":"Subtracts a specified interval from a datetime column.
Works in the same way as DateTimeAddInterval
, but subtracts the specified interval from the datetime column. See DateTimeAddInterval
for more information.
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeSubtractInterval.operation","title":"operation class-attribute
instance-attribute
","text":"operation: Operations = Field(default='subtract', description=\"The operation to perform. Must be either 'add' or 'subtract'.\")\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.adjust_time","title":"koheesio.spark.transformations.date_time.interval.adjust_time","text":"adjust_time(column: Column, operation: Operations, interval: str) -> Column\n
Adjusts a datetime column by adding or subtracting an interval value.
This can be used to reflect a change in a given date / time column in a more human-readable way.
See also Please refer to the Spark SQL documentation for a list of valid interval values: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#interval-literal
Example Parameters:
Name Type Description Default column
Column
The datetime column to adjust.
required operation
Operations
The operation to perform. Must be either 'add' or 'subtract'.
required interval
str
The value to add or subtract. Must be a valid interval string.
required Returns:
Type Description Column
The adjusted datetime column.
Source code in src/koheesio/spark/transformations/date_time/interval.py
def adjust_time(column: Column, operation: Operations, interval: str) -> Column:\n \"\"\"\n Adjusts a datetime column by adding or subtracting an interval value.\n\n This can be used to reflect a change in a given date / time column in a more human-readable way.\n\n\n See also\n --------\n Please refer to the Spark SQL documentation for a list of valid interval values:\n https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#interval-literal\n\n ### pyspark.sql.functions:\n\n * https://spark.apache.org/docs/latest/api/sql/index.html#interval\n * https://spark.apache.org/docs/latest/api/sql/#try_add\n * https://spark.apache.org/docs/latest/api/sql/#try_subtract\n\n Example\n --------\n ### add 1 day to a column\n ```python\n adjust_time(\"my_column\", operation=\"add\", interval=\"1 day\")\n ```\n\n ### subtract 1 month from a column\n ```python\n adjust_time(\"my_column\", operation=\"subtract\", interval=\"1 month\")\n ```\n\n ### or, a much more complicated example\n\n In this example, we add 5 days, 3 hours, 7 minutes, 30 seconds, and 1 millisecond to a column called `my_column`.\n ```python\n adjust_time(\n \"my_column\",\n operation=\"add\",\n interval=\"5 days 3 hours 7 minutes 30 seconds 1 millisecond\",\n )\n ```\n\n Parameters\n ----------\n column : Column\n The datetime column to adjust.\n operation : Operations\n The operation to perform. Must be either 'add' or 'subtract'.\n interval : str\n The value to add or subtract. Must be a valid interval string.\n\n Returns\n -------\n Column\n The adjusted datetime column.\n \"\"\"\n\n # check that value is a valid interval\n interval = validate_interval(interval)\n\n column_name = column._jc.toString()\n\n # determine the operation to perform\n try:\n operation = {\n \"add\": \"try_add\",\n \"subtract\": \"try_subtract\",\n }[operation]\n except KeyError as e:\n raise ValueError(f\"Operation '{operation}' is not valid. Must be either 'add' or 'subtract'.\") from e\n\n # perform the operation\n _expression = f\"{operation}({column_name}, interval '{interval}')\"\n column = expr(_expression)\n\n return column\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.adjust_time--pysparksqlfunctions","title":"pyspark.sql.functions:","text":" - https://spark.apache.org/docs/latest/api/sql/index.html#interval
- https://spark.apache.org/docs/latest/api/sql/#try_add
- https://spark.apache.org/docs/latest/api/sql/#try_subtract
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.adjust_time--add-1-day-to-a-column","title":"add 1 day to a column","text":"adjust_time(\"my_column\", operation=\"add\", interval=\"1 day\")\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.adjust_time--subtract-1-month-from-a-column","title":"subtract 1 month from a column","text":"adjust_time(\"my_column\", operation=\"subtract\", interval=\"1 month\")\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.adjust_time--or-a-much-more-complicated-example","title":"or, a much more complicated example","text":"In this example, we add 5 days, 3 hours, 7 minutes, 30 seconds, and 1 millisecond to a column called my_column
.
adjust_time(\n \"my_column\",\n operation=\"add\",\n interval=\"5 days 3 hours 7 minutes 30 seconds 1 millisecond\",\n)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.dt_column","title":"koheesio.spark.transformations.date_time.interval.dt_column","text":"dt_column(column: Union[str, Column]) -> DateTimeColumn\n
Convert a column to a DateTimeColumn
Aims to be a drop-in replacement for pyspark.sql.functions.col
that returns a DateTimeColumn instead of a Column.
Example Parameters:
Name Type Description Default column
Union[str, Column]
The column (or name of the column) to convert to a DateTimeColumn
required Source code in src/koheesio/spark/transformations/date_time/interval.py
def dt_column(column: Union[str, Column]) -> DateTimeColumn:\n \"\"\"Convert a column to a DateTimeColumn\n\n Aims to be a drop-in replacement for `pyspark.sql.functions.col` that returns a DateTimeColumn instead of a Column.\n\n Example\n --------\n ### create a DateTimeColumn from a string\n ```python\n dt_column(\"my_column\")\n ```\n\n ### create a DateTimeColumn from a Column\n ```python\n dt_column(df.my_column)\n ```\n\n Parameters\n ----------\n column : Union[str, Column]\n The column (or name of the column) to convert to a DateTimeColumn\n \"\"\"\n if isinstance(column, str):\n column = col(column)\n elif not isinstance(column, Column):\n raise TypeError(f\"Expected column to be of type str or Column, got {type(column)} instead.\")\n return DateTimeColumn.from_column(column)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.dt_column--create-a-datetimecolumn-from-a-string","title":"create a DateTimeColumn from a string","text":"dt_column(\"my_column\")\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.dt_column--create-a-datetimecolumn-from-a-column","title":"create a DateTimeColumn from a Column","text":"dt_column(df.my_column)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.validate_interval","title":"koheesio.spark.transformations.date_time.interval.validate_interval","text":"validate_interval(interval: str)\n
Validate an interval string
Parameters:
Name Type Description Default interval
str
The interval string to validate
required Raises:
Type Description ValueError
If the interval string is invalid
Source code in src/koheesio/spark/transformations/date_time/interval.py
def validate_interval(interval: str):\n \"\"\"Validate an interval string\n\n Parameters\n ----------\n interval : str\n The interval string to validate\n\n Raises\n ------\n ValueError\n If the interval string is invalid\n \"\"\"\n try:\n expr(f\"interval '{interval}'\")\n except ParseException as e:\n raise ValueError(f\"Value '{interval}' is not a valid interval.\") from e\n return interval\n
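A small usage sketch (assuming an active SparkSession, since the check is delegated to expr()); the invalid unit below is made up for illustration: from koheesio.spark.transformations.date_time.interval import validate_interval\n\nvalidate_interval(\"5 days 3 hours\")  # returns the string unchanged\ntry:\n    validate_interval(\"1 fortnight\")  # made-up unit, expected to raise\nexcept ValueError as err:\n    print(err)\n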
"},{"location":"api_reference/spark/transformations/strings/index.html","title":"Strings","text":"Adds a number of Transformations that are intended to be used with StringType column input. Some will work with other types however, but will output StringType or an array of StringType.
These Transformations take full advantage of Koheesio's ColumnsTransformationWithTarget class, allowing a user to apply column transformations to multiple columns at once. See the class docstrings for more information.
The following Transformations are included:
change_case:
Lower
Converts a string column to lower case. Upper
Converts a string column to upper case. TitleCase
or InitCap
Converts a string column to title case, where each word starts with a capital letter.
concat:
Concat
Concatenates multiple input columns together into a single column, optionally using the given separator.
pad:
Pad
Pads the values of source_column
with the character
up until it reaches length
of characters LPad
Pad with a character on the left side of the string. RPad
Pad with a character on the right side of the string.
regexp:
RegexpExtract
Extract a specific group matched by a Java regexp from the specified string column. RegexpReplace
Searches for the given regexp and replaces all instances with what is in 'replacement'.
replace:
Replace
Replace all instances of a string in a column with another string.
split:
SplitAll
Splits the contents of a column on basis of a split_pattern. SplitAtFirstMatch
Like SplitAll, but only splits the string once. You can specify whether you want the first or second part.
substring:
Substring
Extracts a substring from a string column starting at the given position.
trim:
Trim
Trim whitespace from the beginning and/or end of a string. LTrim
Trim whitespace from the beginning of a string. RTrim
Trim whitespace from the end of a string.
"},{"location":"api_reference/spark/transformations/strings/change_case.html","title":"Change case","text":"Convert the case of a string column to upper case, lower case, or title case
Classes:
Name Description `Lower`
Converts a string column to lower case.
`Upper`
Converts a string column to upper case.
`TitleCase` or `InitCap`
Converts a string column to title case, where each word starts with a capital letter.
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.InitCap","title":"koheesio.spark.transformations.strings.change_case.InitCap module-attribute
","text":"InitCap = TitleCase\n
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.LowerCase","title":"koheesio.spark.transformations.strings.change_case.LowerCase","text":"This function makes the contents of a column lower case.
Wraps the pyspark.sql.functions.lower
function.
Warnings If the type of the column is not string, LowerCase
will not be run. A Warning will be thrown indicating this.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The name of the column or columns to convert to lower case. Alias: column. Lower case will be applied to all columns in the list. Column is required to be of string type.
required target_column
The name of the column to store the result in. If None, the result will be stored in the same column as the input.
required Example input_df:
product amount country Banana lemon orange 1000 USA Carrots Blueberries 1500 USA Beans 1600 USA output_df = LowerCase(column=\"product\", target_column=\"product_lower\").transform(df)\n
output_df:
product amount country product_lower Banana lemon orange 1000 USA banana lemon orange Carrots Blueberries 1500 USA carrots blueberries Beans 1600 USA beans In this example, the column product
is converted to product_lower
and the contents of this column are converted to lower case.
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.LowerCase.ColumnConfig","title":"ColumnConfig","text":"Limit data type to string
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.LowerCase.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [STRING]\n
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.LowerCase.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [STRING]\n
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.LowerCase.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/change_case.py
def func(self, column: Column):\n return lower(column)\n
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.TitleCase","title":"koheesio.spark.transformations.strings.change_case.TitleCase","text":"This function makes the contents of a column title case. This means that every word starts with an upper case.
Wraps the pyspark.sql.functions.initcap
function.
Warnings If the type of the column is not string, TitleCase will not be run. A Warning will be thrown indicating this.
Parameters:
Name Type Description Default columns
The name of the column or columns to convert to title case. Alias: column. Title case will be applied to all columns in the list. Column is required to be of string type.
required target_column
The name of the column to store the result in. If None, the result will be stored in the same column as the input.
required Example input_df:
product amount country Banana lemon orange 1000 USA Carrots blueberries 1500 USA Beans 1600 USA output_df = TitleCase(column=\"product\", target_column=\"product_title\").transform(df)\n
output_df:
product amount country product_title Banana lemon orange 1000 USA Banana Lemon Orange Carrots blueberries 1500 USA Carrots Blueberries Beans 1600 USA Beans In this example, the column product
is converted to product_title
and the contents of this column are converted to title case (each word now starts with an upper case).
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.TitleCase.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/change_case.py
def func(self, column: Column):\n return initcap(column)\n
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.UpperCase","title":"koheesio.spark.transformations.strings.change_case.UpperCase","text":"This function makes the contents of a column upper case.
Wraps the pyspark.sql.functions.upper
function.
Warnings If the type of the column is not string, UpperCase
will not be run. A Warning will be thrown indicating this.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The name of the column or columns to convert to upper case. Alias: column. Upper case will be applied to all columns in the list. Column is required to be of string type.
required target_column
The name of the column to store the result in. If None, the result will be stored in the same column as the input.
required Examples:
input_df:
product amount country Banana lemon orange 1000 USA Carrots Blueberries 1500 USA Beans 1600 USA output_df = UpperCase(column=\"product\", target_column=\"product_upper\").transform(df)\n
output_df:
product amount country product_upper Banana lemon orange 1000 USA BANANA LEMON ORANGE Carrots Blueberries 1500 USA CARROTS BLUEBERRIES Beans 1600 USA BEANS In this example, the column product
is converted to product_upper
and the contents of this column are converted to upper case.
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.UpperCase.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/change_case.py
def func(self, column: Column):\n return upper(column)\n
"},{"location":"api_reference/spark/transformations/strings/concat.html","title":"Concat","text":"Concatenates multiple input columns together into a single column, optionally using a given separator.
"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat","title":"koheesio.spark.transformations.strings.concat.Concat","text":"This is a wrapper around PySpark concat() and concat_ws() functions
Concatenates multiple input columns together into a single column, optionally using a given separator. The function works with strings, date/timestamps, binary, and compatible array columns.
Concept When working with arrays, the function will return the result of the concatenation of the elements in the array.
- If a spacer is used, the resulting output will be a string with the elements of the array separated by the spacer.
- If no spacer is used, the resulting output will be an array with the elements of the array concatenated together.
When working with date/timestamps, the function will return the result of the concatenation of the elements in the array. The timestamp is converted to a string using the default format of yyyy-MM-dd HH:mm:ss
.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to concatenate. Alias: column. If at least one of the values is None or null, the resulting string will also be None/null (except for when using arrays). Columns can be of any type, but must ideally be of the same type. Different types can be used, but the function will convert them to string values first.
required target_column
Optional[str]
Target column name. When this is left blank, a name will be generated by concatenating the names of the source columns with an '_'.
None
spacer
Optional[str]
Optional spacer / separator symbol. Defaults to None. When left blank, no spacer is used
None
Example"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat--example-using-a-string-column-and-a-timestamp-column","title":"Example using a string column and a timestamp column","text":"input_df:
column_a column_b text 1997-02-28 10:30:00 output_df = Concat(\n columns=[\"column_a\", \"column_b\"],\n target_column=\"concatenated_column\",\n spacer=\"--\",\n).transform(input_df)\n
output_df:
column_a column_b concatenated_column text 1997-02-28 10:30:00 text--1997-02-28 10:30:00 In the example above, the resulting column is a string column.
If we had left out the spacer, the resulting column would have had the value of text1997-02-28 10:30:00
(a string). Note that the timestamp is converted to a string using the default format of yyyy-MM-dd HH:mm:ss
.
"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat--example-using-two-array-columns","title":"Example using two array columns","text":"input_df:
array_col_1 array_col_2 [text1, text2] [text3, null] output_df = Concat(\n    columns=[\"array_col_1\", \"array_col_2\"],\n    target_column=\"concatenated_column\",\n    spacer=\"--\",\n).transform(input_df)\n
output_df:
array_col_1 array_col_2 concatenated_column [text1, text2] [text3, null] \"text1--text2--text3\" Note that the null value in the second array is ignored. If we had left out the spacer, the resulting column would have been an array with the values of [\"text1\", \"text2\", \"text3\"]
.
Array columns can only be concatenated with another array column. If you want to concatenate an array column with a none-array value, you will have to convert said column to an array first.
"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat.spacer","title":"spacer class-attribute
instance-attribute
","text":"spacer: Optional[str] = Field(default=None, description='Optional spacer / separator symbol. Defaults to None. When left blank, no spacer is used', alias='sep')\n
"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat.target_column","title":"target_column class-attribute
instance-attribute
","text":"target_column: Optional[str] = Field(default=None, description=\"Target column name. When this is left blank, a name will be generated by concatenating the names of the source columns with an '_'.\")\n
"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat.execute","title":"execute","text":"execute() -> DataFrame\n
Source code in src/koheesio/spark/transformations/strings/concat.py
def execute(self) -> DataFrame:\n columns = [col(s) for s in self.get_columns()]\n self.output.df = self.df.withColumn(\n self.target_column, concat_ws(self.spacer, *columns) if self.spacer else concat(*columns)\n )\n
"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat.get_target_column","title":"get_target_column","text":"get_target_column(target_column_value, values)\n
Get the target column name if it is not provided.
If not provided, a name will be generated by concatenating the names of the source columns with an '_'.
Source code in src/koheesio/spark/transformations/strings/concat.py
@field_validator(\"target_column\")\ndef get_target_column(cls, target_column_value, values):\n \"\"\"Get the target column name if it is not provided.\n\n If not provided, a name will be generated by concatenating the names of the source columns with an '_'.\"\"\"\n if not target_column_value:\n columns_value: List = values[\"columns\"]\n columns = list(dict.fromkeys(columns_value)) # dict.fromkeys is used to dedup while maintaining order\n return \"_\".join(columns)\n\n return target_column_value\n
"},{"location":"api_reference/spark/transformations/strings/pad.html","title":"Pad","text":"Pad the values of a column with a character up until it reaches a certain length.
Classes:
Name Description Pad
Pads the values of source_column
with the character
up until it reaches length
of characters
LPad
Pad with a character on the left side of the string.
RPad
Pad with a character on the right side of the string.
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.LPad","title":"koheesio.spark.transformations.strings.pad.LPad module-attribute
","text":"LPad = Pad\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.pad_directions","title":"koheesio.spark.transformations.strings.pad.pad_directions module-attribute
","text":"pad_directions = Literal['left', 'right']\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.Pad","title":"koheesio.spark.transformations.strings.pad.Pad","text":"Pads the values of source_column
with the character
up until it reaches length
of characters The direction
param can be changed to apply either a left or a right pad. Defaults to left pad.
Wraps the lpad
and rpad
functions from PySpark.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to pad. Alias: column
required target_column
Optional[str]
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
None
character
constr(min_length=1)
The character to use for padding
required length
PositiveInt
Positive integer to indicate the intended length
required direction
Optional[pad_directions]
On which side to add the characters. Either \"left\" or \"right\". Defaults to \"left\"
left
Example input_df:
column hello world output_df = Pad(\n column=\"column\",\n target_column=\"padded_column\",\n character=\"*\",\n length=10,\n direction=\"right\",\n).transform(input_df)\n
output_df:
column padded_column hello hello***** world world***** Note: in the example above, we could have used the RPad class instead of Pad with direction=\"right\" to achieve the same result.
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.Pad.character","title":"character class-attribute
instance-attribute
","text":"character: constr(min_length=1) = Field(default=..., description='The character to use for padding')\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.Pad.direction","title":"direction class-attribute
instance-attribute
","text":"direction: Optional[pad_directions] = Field(default='left', description='On which side to add the characters . Either \"left\" or \"right\". Defaults to \"left\"')\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.Pad.length","title":"length class-attribute
instance-attribute
","text":"length: PositiveInt = Field(default=..., description='Positive integer to indicate the intended length')\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.Pad.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/pad.py
def func(self, column: Column):\n func = lpad if self.direction == \"left\" else rpad\n return func(column, self.length, self.character)\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.RPad","title":"koheesio.spark.transformations.strings.pad.RPad","text":"Pad with a character on the right side of the string.
See Pad class docstring for more information.
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.RPad.direction","title":"direction class-attribute
instance-attribute
","text":"direction: Optional[pad_directions] = 'right'\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html","title":"Regexp","text":"String transformations using regular expressions.
This module contains transformations that use regular expressions to transform strings.
Classes:
Name Description RegexpExtract
Extract a specific group matched by a Java regexp from the specified string column.
RegexpReplace
Searches for the given regexp and replaces all instances with what is in 'replacement'.
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract","title":"koheesio.spark.transformations.strings.regexp.RegexpExtract","text":"Extract a specific group matched by a Java regexp from the specified string column. If the regexp did not match, or the specified group did not match, an empty string is returned.
A wrapper around the pyspark regexp_extract function
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to extract from. Alias: column
required target_column
Optional[str]
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
None
regexp
str
The Java regular expression to extract
required index
Optional[int]
When there are more groups in the match, you can indicate which one you want. 0 means the whole match. 1 and above are groups within that match.
0
Example"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract--extracting-the-year-and-week-number-from-a-string","title":"Extracting the year and week number from a string","text":"Let's say we have a column containing the year and week in a format like Y## W#
and we would like to extract the week numbers.
input_df:
YWK 2020 W1 2021 WK2 output_df = RegexpExtract(\n column=\"YWK\",\n target_column=\"week_number\",\n regexp=\"Y([0-9]+) ?WK?([0-9]+)\",\n index=2, # remember that this is 1-indexed! So 2 will get the week number in this example.\n).transform(input_df)\n
output_df:
YWK week_number 2020 W1 1 2021 WK2 2"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract--using-the-same-example-but-extracting-the-year-instead","title":"Using the same example, but extracting the year instead","text":"If you want to extract the year, you can use index=1.
output_df = RegexpExtract(\n column=\"YWK\",\n target_column=\"year\",\n regexp=\"Y([0-9]+) ?WK?([0-9]+)\",\n index=1, # remember that this is 1-indexed! So 1 will get the year in this example.\n).transform(input_df)\n
output_df:
YWK year 2020 W1 2020 2021 WK2 2021"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract.index","title":"index class-attribute
instance-attribute
","text":"index: Optional[int] = Field(default=0, description='When there are more groups in the match, you can indicate which one you want. 0 means the whole match. 1 and above are groups within that match.')\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract.regexp","title":"regexp class-attribute
instance-attribute
","text":"regexp: str = Field(default=..., description='The Java regular expression to extract')\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/regexp.py
def func(self, column: Column):\n return regexp_extract(column, self.regexp, self.index)\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpReplace","title":"koheesio.spark.transformations.strings.regexp.RegexpReplace","text":"Searches for the given regexp and replaces all instances with what is in 'replacement'.
A wrapper around the pyspark regexp_replace function
Parameters:
Name Type Description Default columns
The column (or list of columns) to replace in. Alias: column
required target_column
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
required regexp
The regular expression to replace
required replacement
String to replace matched pattern with.
required Examples:
input_df: | content | |------------| | hello world|
Let's say you want to replace 'hello'.
output_df = RegexpReplace(\n column=\"content\",\n target_column=\"replaced\",\n regexp=\"hello\",\n replacement=\"gutentag\",\n).transform(input_df)\n
output_df: | content | replaced | |------------|---------------| | hello world| gutentag world|
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpReplace.regexp","title":"regexp class-attribute
instance-attribute
","text":"regexp: str = Field(default=..., description='The regular expression to replace')\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpReplace.replacement","title":"replacement class-attribute
instance-attribute
","text":"replacement: str = Field(default=..., description='String to replace matched pattern with.')\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpReplace.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/regexp.py
def func(self, column: Column):\n return regexp_replace(column, self.regexp, self.replacement)\n
"},{"location":"api_reference/spark/transformations/strings/replace.html","title":"Replace","text":"String replacements without using regular expressions.
"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace","title":"koheesio.spark.transformations.strings.replace.Replace","text":"Replace all instances of a string in a column with another string.
This transformation uses PySpark when().otherwise() functions.
Notes - If original_value is not set, the transformation will replace all null values with new_value
- If original_value is set, the transformation will replace all values matching original_value with new_value
- Numeric values are supported, but will be cast to string in the process
- Replace is meant for simple string replacements. If more advanced replacements are needed, use the
RegexpReplace
transformation instead.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to replace values in. Alias: column
required target_column
Optional[str]
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
None
original_value
Optional[str]
The original value that needs to be replaced. Alias: from
None
new_value
str
The new value to replace this with. Alias: to
required Examples:
input_df:
column hello world None"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace--replace-all-null-values-with-a-new-value","title":"Replace all null values with a new value","text":"output_df = Replace(\n column=\"column\",\n target_column=\"replaced_column\",\n original_value=None, # This is the default value, so it can be omitted\n new_value=\"programmer\",\n).transform(input_df)\n
output_df:
column replaced_column hello hello world world None programmer"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace--replace-all-instances-of-a-string-in-a-column-with-another-string","title":"Replace all instances of a string in a column with another string","text":"output_df = Replace(\n column=\"column\",\n target_column=\"replaced_column\",\n original_value=\"world\",\n new_value=\"programmer\",\n).transform(input_df)\n
output_df:
column replaced_column hello hello world programmer None None"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace.new_value","title":"new_value class-attribute
instance-attribute
","text":"new_value: str = Field(default=..., alias='to', description='The new value to replace this with')\n
"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace.original_value","title":"original_value class-attribute
instance-attribute
","text":"original_value: Optional[str] = Field(default=None, alias='from', description='The original value that needs to be replaced')\n
"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace.cast_values_to_str","title":"cast_values_to_str","text":"cast_values_to_str(value)\n
Cast values to string if they are not None
Source code in src/koheesio/spark/transformations/strings/replace.py
@field_validator(\"original_value\", \"new_value\", mode=\"before\")\ndef cast_values_to_str(cls, value):\n \"\"\"Cast values to string if they are not None\"\"\"\n if value:\n return str(value)\n
"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/replace.py
def func(self, column: Column):\n when_statement = (\n when(column.isNull(), lit(self.new_value))\n if not self.original_value\n else when(\n column == self.original_value,\n lit(self.new_value),\n )\n )\n return when_statement.otherwise(column)\n
"},{"location":"api_reference/spark/transformations/strings/split.html","title":"Split","text":"Splits the contents of a column on basis of a split_pattern
Classes:
Name Description SplitAll
Splits the contents of a column on basis of a split_pattern.
SplitAtFirstMatch
Like SplitAll, but only splits the string once. You can specify whether you want the first or second part.
"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAll","title":"koheesio.spark.transformations.strings.split.SplitAll","text":"This function splits the contents of a column on basis of a split_pattern.
It splits at al the locations the pattern is found. The new column will be of ArrayType.
Wraps the pyspark.sql.functions.split function.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to split. Alias: column
required target_column
Optional[str]
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
None
split_pattern
str
This is the pattern that will be used to split the column contents.
required Example"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAll--splitting-with-a-space-as-a-pattern","title":"Splitting with a space as a pattern:","text":"input_df:
product amount country Banana lemon orange 1000 USA Carrots Blueberries 1500 USA Beans 1600 USA output_df = SplitAll(column=\"product\", target_column=\"split\", split_pattern=\" \").transform(input_df)\n
output_df:
product amount country split Banana lemon orange 1000 USA [\"Banana\", \"lemon\" \"orange\"] Carrots Blueberries 1500 USA [\"Carrots\", \"Blueberries\"] Beans 1600 USA [\"Beans\"]"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAll.split_pattern","title":"split_pattern class-attribute
instance-attribute
","text":"split_pattern: str = Field(default=..., description='The pattern to split the column contents.')\n
"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAll.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/split.py
def func(self, column: Column):\n return split(column, pattern=self.split_pattern)\n
"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAtFirstMatch","title":"koheesio.spark.transformations.strings.split.SplitAtFirstMatch","text":"Like SplitAll, but only splits the string once. You can specify whether you want the first or second part..
Note - SplitAtFirstMatch splits the string only once. To specify whether you want the first part or the second you can toggle the parameter retrieve_first_part.
- The new column will be of StringType.
- If you want to split a column more than once, you should call this function multiple times.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to split. Alias: column
required target_column
Optional[str]
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
None
split_pattern
str
This is the pattern that will be used to split the column contents.
required retrieve_first_part
Optional[bool]
Takes the first part of the split when true, the second part when False. Other parts are ignored. Defaults to True.
True
Example"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAtFirstMatch--splitting-with-a-space-as-a-pattern","title":"Splitting with a space as a pattern:","text":"input_df:
product amount country Banana lemon orange 1000 USA Carrots Blueberries 1500 USA Beans 1600 USA output_df = SplitColumn(column=\"product\", target_column=\"split_first\", split_pattern=\"an\").transform(input_df)\n
output_df:
product amount country split_first Banana lemon orange 1000 USA B Carrots Blueberries 1500 USA Carrots Blueberries Beans 1600 USA Be"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAtFirstMatch.retrieve_first_part","title":"retrieve_first_part class-attribute
instance-attribute
","text":"retrieve_first_part: Optional[bool] = Field(default=True, description='Takes the first part of the split when true, the second part when False. Other parts are ignored.')\n
"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAtFirstMatch.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/split.py
def func(self, column: Column):\n split_func = split(column, pattern=self.split_pattern)\n\n # first part\n if self.retrieve_first_part:\n return split_func.getItem(0)\n\n # or, second part\n return coalesce(split_func.getItem(1), lit(\"\"))\n
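To retrieve the second part instead of the first (a sketch reusing the example above), set retrieve_first_part to False: output_df = SplitAtFirstMatch(\n    column=\"product\",\n    target_column=\"split_second\",\n    split_pattern=\"an\",\n    retrieve_first_part=False,\n).transform(input_df)\n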
"},{"location":"api_reference/spark/transformations/strings/substring.html","title":"Substring","text":"Extracts a substring from a string column starting at the given position.
"},{"location":"api_reference/spark/transformations/strings/substring.html#koheesio.spark.transformations.strings.substring.Substring","title":"koheesio.spark.transformations.strings.substring.Substring","text":"Extracts a substring from a string column starting at the given position.
This is a wrapper around PySpark substring() function
Notes - Numeric columns will be cast to string
- start is 1-indexed, not 0-indexed!
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to substring. Alias: column
required target_column
Optional[str]
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
None
start
PositiveInt
Positive int. Defines where to begin the substring from. The first character of the field has index 1!
required length
Optional[int]
Optional. If not provided, the substring will go until end of string.
-1
Example"},{"location":"api_reference/spark/transformations/strings/substring.html#koheesio.spark.transformations.strings.substring.Substring--extract-a-substring-from-a-string-column-starting-at-the-given-position","title":"Extract a substring from a string column starting at the given position.","text":"input_df:
column skyscraper output_df = Substring(\n column=\"column\",\n target_column=\"substring_column\",\n start=3, # 1-indexed! So this will start at the 3rd character\n length=4,\n).transform(input_df)\n
output_df:
column substring_column skyscraper yscr"},{"location":"api_reference/spark/transformations/strings/substring.html#koheesio.spark.transformations.strings.substring.Substring.length","title":"length class-attribute
instance-attribute
","text":"length: Optional[int] = Field(default=-1, description='The target length for the string. use -1 to perform until end')\n
"},{"location":"api_reference/spark/transformations/strings/substring.html#koheesio.spark.transformations.strings.substring.Substring.start","title":"start class-attribute
instance-attribute
","text":"start: PositiveInt = Field(default=..., description='The starting position')\n
"},{"location":"api_reference/spark/transformations/strings/substring.html#koheesio.spark.transformations.strings.substring.Substring.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/substring.py
def func(self, column: Column):\n return when(column.isNull(), None).otherwise(substring(column, self.start, self.length)).cast(StringType())\n
"},{"location":"api_reference/spark/transformations/strings/trim.html","title":"Trim","text":"Trim whitespace from the beginning and/or end of a string.
Classes:
Name Description - `Trim`
Trim whitespace from the beginning and/or end of a string.
- `LTrim`
Trim whitespace from the beginning of a string.
- `RTrim`
Trim whitespace from the end of a string.
See class docstrings for more information.
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.trim_type","title":"koheesio.spark.transformations.strings.trim.trim_type module-attribute
","text":"trim_type = Literal['left', 'right', 'left-right']\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.LTrim","title":"koheesio.spark.transformations.strings.trim.LTrim","text":"Trim whitespace from the beginning of a string. Alias: LeftTrim
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.LTrim.direction","title":"direction class-attribute
instance-attribute
","text":"direction: trim_type = 'left'\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.RTrim","title":"koheesio.spark.transformations.strings.trim.RTrim","text":"Trim whitespace from the end of a string. Alias: RightTrim
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.RTrim.direction","title":"direction class-attribute
instance-attribute
","text":"direction: trim_type = 'right'\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim","title":"koheesio.spark.transformations.strings.trim.Trim","text":"Trim whitespace from the beginning and/or end of a string.
This is a wrapper around PySpark ltrim() and rtrim() functions
The direction
parameter can be changed to apply either a left or a right trim. Defaults to left AND right trim.
Note: If the type of the column is not string, Trim will not be run. A Warning will be thrown indicating this
Parameters:
Name Type Description Default columns
The column (or list of columns) to trim. Alias: column If no columns are provided, all string columns will be trimmed.
required target_column
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
required direction
On which side to remove the spaces. Either \"left\", \"right\" or \"left-right\". Defaults to \"left-right\"
required Examples:
input_df: | column | |-----------| | \" hello \" |
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim--trim-whitespace-from-the-beginning-of-a-string","title":"Trim whitespace from the beginning of a string","text":"output_df = Trim(column=\"column\", target_column=\"trimmed_column\", direction=\"left\").transform(input_df)\n
output_df: | column | trimmed_column | |-----------|----------------| | \" hello \" | \"hello \" |
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim--trim-whitespace-from-both-sides-of-a-string","title":"Trim whitespace from both sides of a string","text":"output_df = Trim(\n column=\"column\",\n target_column=\"trimmed_column\",\n direction=\"left-right\", # default value\n).transform(input_df)\n
output_df: | column | trimmed_column | |-----------|----------------| | \" hello \" | \"hello\" |
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim--trim-whitespace-from-the-end-of-a-string","title":"Trim whitespace from the end of a string","text":"output_df = Trim(column=\"column\", target_column=\"trimmed_column\", direction=\"right\").transform(input_df)\n
output_df: | column | trimmed_column | |-----------|----------------| | \" hello \" | \" hello\" |
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.columns","title":"columns class-attribute
instance-attribute
","text":"columns: ListOfColumns = Field(default='*', alias='column', description='The column (or list of columns) to trim. Alias: column. If no columns are provided, all stringcolumns will be trimmed.')\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.direction","title":"direction class-attribute
instance-attribute
","text":"direction: trim_type = Field(default='left-right', description=\"On which side to remove the spaces. Either 'left', 'right' or 'left-right'\")\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.ColumnConfig","title":"ColumnConfig","text":"Limit data types to string only.
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [STRING]\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [STRING]\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/trim.py
def func(self, column: Column):\n if self.direction == \"left\":\n return f.ltrim(column)\n\n if self.direction == \"right\":\n return f.rtrim(column)\n\n # both (left-right)\n return f.rtrim(f.ltrim(column))\n
"},{"location":"api_reference/spark/writers/index.html","title":"Writers","text":"The Writer class is used to write the DataFrame to a target.
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode","title":"koheesio.spark.writers.BatchOutputMode","text":"For Batch:
- append: Append the contents of the DataFrame to the output table, default option in Koheesio.
- overwrite: overwrite the existing data.
- ignore: ignore the operation (i.e. no-op).
- error or errorifexists: throw an exception at runtime.
- merge: update matching data in the table and insert rows that do not exist.
- merge_all: update matching data in the table and insert rows that do not exist.
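Example (a minimal sketch, assuming BatchOutputMode behaves like a standard string-valued Enum, as the class attributes below suggest): resolving a batch output mode from a plain string, e.g. one read from configuration.
from koheesio.spark.writers import BatchOutputMode\n\n# look the mode up by its string value (assumption: regular Enum lookup by value)\nmode = BatchOutputMode(\"overwrite\")\nassert mode == BatchOutputMode.OVERWRITE\n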
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.APPEND","title":"APPEND class-attribute
instance-attribute
","text":"APPEND = 'append'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.ERROR","title":"ERROR class-attribute
instance-attribute
","text":"ERROR = 'error'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.ERRORIFEXISTS","title":"ERRORIFEXISTS class-attribute
instance-attribute
","text":"ERRORIFEXISTS = 'error'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.IGNORE","title":"IGNORE class-attribute
instance-attribute
","text":"IGNORE = 'ignore'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.MERGE","title":"MERGE class-attribute
instance-attribute
","text":"MERGE = 'merge'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.MERGEALL","title":"MERGEALL class-attribute
instance-attribute
","text":"MERGEALL = 'merge_all'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.MERGE_ALL","title":"MERGE_ALL class-attribute
instance-attribute
","text":"MERGE_ALL = 'merge_all'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.OVERWRITE","title":"OVERWRITE class-attribute
instance-attribute
","text":"OVERWRITE = 'overwrite'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.StreamingOutputMode","title":"koheesio.spark.writers.StreamingOutputMode","text":"For Streaming:
- append: only the new rows in the streaming DataFrame will be written to the sink.
- complete: all the rows in the streaming DataFrame/Dataset will be written to the sink every time there are some updates.
- update: only the rows that were updated in the streaming DataFrame/Dataset will be written to the sink every time there are some updates. If the query doesn't contain aggregations, it will be equivalent to append mode.
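Example (a minimal sketch; the DeltaTableStreamWriter arguments such as the table name and checkpoint location are illustrative assumptions): selecting a streaming output mode and passing it to a stream writer through the outputMode alias.
from koheesio.spark.writers import StreamingOutputMode\nfrom koheesio.spark.writers.delta import DeltaTableStreamWriter\n\n# outputMode is the alias of the StreamWriter.output_mode field (see StreamWriter further below)\nwriter = DeltaTableStreamWriter(\n    table=\"my_table\",                      # illustrative table name\n    checkpointLocation=\"/tmp/checkpoint\",  # illustrative checkpoint location\n    outputMode=StreamingOutputMode.COMPLETE,\n)\n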
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.StreamingOutputMode.APPEND","title":"APPEND class-attribute
instance-attribute
","text":"APPEND = 'append'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.StreamingOutputMode.COMPLETE","title":"COMPLETE class-attribute
instance-attribute
","text":"COMPLETE = 'complete'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.StreamingOutputMode.UPDATE","title":"UPDATE class-attribute
instance-attribute
","text":"UPDATE = 'update'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer","title":"koheesio.spark.writers.Writer","text":"The Writer class is used to write the DataFrame to a target.
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer.df","title":"df class-attribute
instance-attribute
","text":"df: Optional[DataFrame] = Field(default=None, description='The Spark DataFrame')\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer.format","title":"format class-attribute
instance-attribute
","text":"format: str = Field(default='delta', description='The format of the output')\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer.streaming","title":"streaming property
","text":"streaming: bool\n
Check if the DataFrame is a streaming DataFrame or not.
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer.execute","title":"execute abstractmethod
","text":"execute()\n
Execute on a Writer should handle writing of the self.df (input) as a minimum
Source code in src/koheesio/spark/writers/__init__.py
@abstractmethod\ndef execute(self):\n \"\"\"Execute on a Writer should handle writing of the self.df (input) as a minimum\"\"\"\n # self.df # input dataframe\n ...\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer.write","title":"write","text":"write(df: Optional[DataFrame] = None) -> Output\n
Write the DataFrame to the output using execute() and return the output.
If no DataFrame is passed, the self.df will be used. If no self.df is set, a RuntimeError will be thrown.
Source code in src/koheesio/spark/writers/__init__.py
def write(self, df: Optional[DataFrame] = None) -> SparkStep.Output:\n \"\"\"Write the DataFrame to the output using execute() and return the output.\n\n If no DataFrame is passed, the self.df will be used.\n If no self.df is set, a RuntimeError will be thrown.\n \"\"\"\n self.df = df or self.df\n if not self.df:\n raise RuntimeError(\"No valid Dataframe was passed\")\n self.execute()\n return self.output\n
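Example (a minimal sketch; ConsoleWriter is a hypothetical subclass and df is assumed to be an existing Spark DataFrame): the Writer contract in its smallest form - implement execute(), then call write().
from koheesio.spark.writers import Writer\n\nclass ConsoleWriter(Writer):\n    \"\"\"Hypothetical writer that only shows the DataFrame instead of persisting it.\"\"\"\n\n    def execute(self):\n        # write() guarantees self.df is set before execute() runs\n        self.df.show()\n\nConsoleWriter().write(df)  # raises RuntimeError when no DataFrame is available\n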
"},{"location":"api_reference/spark/writers/buffer.html","title":"Buffer","text":"This module contains classes for writing data to a buffer before writing to the final destination.
The BufferWriter
class is a base class for writers that write to a buffer first. It provides methods for writing, reading, and resetting the buffer, as well as checking if the buffer is compressed and compressing the buffer.
The PandasCsvBufferWriter
class is a subclass of BufferWriter
that writes a Spark DataFrame to CSV file(s) using Pandas. It is not meant to be used for writing huge amounts of data, but rather for writing smaller amounts of data to more arbitrary file systems (e.g., SFTP).
The PandasJsonBufferWriter
class is a subclass of BufferWriter
that writes a Spark DataFrame to JSON file(s) using Pandas. It is not meant to be used for writing huge amounts of data, but rather for writing smaller amounts of data to more arbitrary file systems (e.g., SFTP).
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter","title":"koheesio.spark.writers.buffer.BufferWriter","text":"Base class for writers that write to a buffer first, before writing to the final destination.
execute()
method should implement how the incoming DataFrame is written to the buffer object (e.g. BytesIO) in the output.
The default implementation uses a SpooledTemporaryFile
as the buffer. This is a file-like object that starts off stored in memory and automatically rolls over to a temporary file on disk if it exceeds a certain size. A SpooledTemporaryFile
behaves similar to BytesIO
, but with the added benefit of being able to handle larger amounts of data.
This approach provides a balance between speed and memory usage, allowing for fast in-memory operations for smaller amounts of data while still being able to handle larger amounts of data that would not otherwise fit in memory.
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output","title":"Output","text":"Output class for BufferWriter
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.buffer","title":"buffer class-attribute
instance-attribute
","text":"buffer: InstanceOf[SpooledTemporaryFile] = Field(default_factory=partial(SpooledTemporaryFile, mode='w+b', max_size=0), exclude=True)\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.compress","title":"compress","text":"compress()\n
Compress the file_buffer in place using GZIP
Source code in src/koheesio/spark/writers/buffer.py
def compress(self):\n \"\"\"Compress the file_buffer in place using GZIP\"\"\"\n # check if the buffer is already compressed\n if self.is_compressed():\n self.logger.warn(\"Buffer is already compressed. Nothing to compress...\")\n return self\n\n # compress the file_buffer\n file_buffer = self.buffer\n compressed = gzip.compress(file_buffer.read())\n\n # write the compressed content back to the buffer\n self.reset_buffer()\n self.buffer.write(compressed)\n\n return self # to allow for chaining\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.is_compressed","title":"is_compressed","text":"is_compressed()\n
Check if the buffer is compressed.
Source code in src/koheesio/spark/writers/buffer.py
def is_compressed(self):\n \"\"\"Check if the buffer is compressed.\"\"\"\n self.rewind_buffer()\n magic_number_present = self.buffer.read(2) == b\"\\x1f\\x8b\"\n self.rewind_buffer()\n return magic_number_present\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.read","title":"read","text":"read()\n
Read the buffer
Source code in src/koheesio/spark/writers/buffer.py
def read(self):\n \"\"\"Read the buffer\"\"\"\n self.rewind_buffer()\n data = self.buffer.read()\n self.rewind_buffer()\n return data\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.reset_buffer","title":"reset_buffer","text":"reset_buffer()\n
Reset the buffer
Source code in src/koheesio/spark/writers/buffer.py
def reset_buffer(self):\n \"\"\"Reset the buffer\"\"\"\n self.buffer.truncate(0)\n self.rewind_buffer()\n return self\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.rewind_buffer","title":"rewind_buffer","text":"rewind_buffer()\n
Rewind the buffer
Source code in src/koheesio/spark/writers/buffer.py
def rewind_buffer(self):\n \"\"\"Rewind the buffer\"\"\"\n self.buffer.seek(0)\n return self\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.write","title":"write","text":"write(df=None) -> Output\n
Write the DataFrame to the buffer
Source code in src/koheesio/spark/writers/buffer.py
def write(self, df=None) -> Output:\n \"\"\"Write the DataFrame to the buffer\"\"\"\n self.df = df or self.df\n if not self.df:\n raise RuntimeError(\"No valid Dataframe was passed\")\n self.output.reset_buffer()\n self.execute()\n return self.output\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter","title":"koheesio.spark.writers.buffer.PandasCsvBufferWriter","text":"Write a Spark DataFrame to CSV file(s) using Pandas.
Takes inspiration from https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
See also: https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option
Note This class is not meant to be used for writing huge amounts of data. It is meant to be used for writing smaller amounts of data to more arbitrary file systems (e.g. SFTP).
Pyspark vs Pandas The following table shows the mapping between Pyspark, Pandas, and Koheesio properties. Note that the default values are mostly the same as Pyspark's DataFrameWriter
implementation, with some exceptions (see below).
This class implements the most commonly used properties. If a property is not explicitly implemented, it can be accessed through params
.
PySpark Property Default PySpark Pandas Property Default Pandas Koheesio Property Default Koheesio Notes maxRecordsPerFile ... chunksize None max_records_per_file ... Spark property name: spark.sql.files.maxRecordsPerFile sep , sep , sep , lineSep \\n
line_terminator os.linesep lineSep (alias=line_terminator) \\n N/A ... index True index False Determines whether row labels (index) are included in the output header False header True header True quote \" quotechar \" quote (alias=quotechar) \" quoteAll False doublequote True quoteAll (alias=doublequote) False escape \\
escapechar None escapechar (alias=escape) \\ escapeQuotes True N/A N/A N/A ... Not available in Pandas ignoreLeadingWhiteSpace True N/A N/A N/A ... Not available in Pandas ignoreTrailingWhiteSpace True N/A N/A N/A ... Not available in Pandas charToEscapeQuoteEscaping escape or \u0000
N/A N/A N/A ... Not available in Pandas dateFormat yyyy-MM-dd
N/A N/A N/A ... Pandas implements Timestamp, not Date timestampFormat yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]
date_format N/A timestampFormat (alias=date_format) yyyy-MM-dd'T'HHss.SSS Follows PySpark defaults timestampNTZFormat yyyy-MM-dd'T'HH:mm:ss[.SSS]
N/A N/A N/A ... Pandas implements Timestamp, see above compression None compression infer compression None encoding utf-8 encoding utf-8 N/A ... Not explicitly implemented nullValue \"\" na_rep \"\" N/A \"\" Not explicitly implemented emptyValue \"\" na_rep \"\" N/A \"\" Not explicitly implemented N/A ... float_format N/A N/A ... Not explicitly implemented N/A ... decimal N/A N/A ... Not explicitly implemented N/A ... index_label None N/A ... Not explicitly implemented N/A ... columns N/A N/A ... Not explicitly implemented N/A ... mode N/A N/A ... Not explicitly implemented N/A ... quoting N/A N/A ... Not explicitly implemented N/A ... errors N/A N/A ... Not explicitly implemented N/A ... storage_options N/A N/A ... Not explicitly implemented differences with Pyspark: - dateFormat -> Pandas implements Timestamp, not just Date. Hence, Koheesio sets the default to the python equivalent of PySpark's default.
- compression -> Spark does not compress by default, hence Koheesio does not compress by default. Compression can be provided though.
Parameters:
Name Type Description Default header
bool
Whether to write the names of columns as the first line. In Pandas a list of strings can be given assumed to be aliases for the column names - this is not supported in this class. Instead, the column names as used in the dataframe are used as the header. Koheesio sets this default to True to match Pandas' default.
True
sep
str
Field delimiter for the output file. Default is ','.
,
quote
str
String of length 1. Character used to quote fields. In PySpark, this is called 'quote', in Pandas this is called 'quotechar'. Default is '\"'.
\"
quoteAll
bool
A flag indicating whether all values should always be enclosed in quotes in a field. Koheesio sets the default (False) to only escape values containing a quote character - this is Pyspark's default behavior. In Pandas, this is called 'doublequote'. Default is False.
False
escape
str
String of length 1. Character used to escape sep and quotechar when appropriate. Koheesio sets this default to \\
to match Pyspark's default behavior. In Pandas, this field is called 'escapechar', and defaults to None. Default is '\\'.
\\
timestampFormat
str
Sets the string that indicates a date format for datetime objects. Koheesio sets this default to a close equivalent of Pyspark's default (excluding timezone information). In Pandas, this field is called 'date_format'. Note: Pandas does not support Java Timestamps, only Python Timestamps. The Pyspark default is yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]
which mimics the iso8601 format (datetime.isoformat()
). Default is '%Y-%m-%dT%H:%M:%S.%f'.
yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]
lineSep
str, optional, default=
String of length 1. Defines the character used as line separator that should be used for writing. Default is os.linesep.
required compression
Optional[Literal['infer', 'gzip', 'bz2', 'zip', 'xz', 'zstd', 'tar']]
A string representing the compression to use for on-the-fly compression of the output data. Note: Pandas sets this default to 'infer', Koheesio sets this default to 'None' leaving the data uncompressed by default. Can be set to one of 'infer', 'gzip', 'bz2', 'zip', 'xz', 'zstd', or 'tar'. See Pandas documentation for more details.
None
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.compression","title":"compression class-attribute
instance-attribute
","text":"compression: Optional[CompressionOptions] = Field(default=None, description=\"A string representing the compression to use for on-the-fly compression of the output data.Note: Pandas sets this default to 'infer', Koheesio sets this default to 'None' leaving the data uncompressed by default. Can be set to one of 'infer', 'gzip', 'bz2', 'zip', 'xz', 'zstd', or 'tar'. See Pandas documentation for more details.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.escape","title":"escape class-attribute
instance-attribute
","text":"escape: constr(max_length=1) = Field(default='\\\\', description=\"String of length 1. Character used to escape sep and quotechar when appropriate. Koheesio sets this default to `\\\\` to match Pyspark's default behavior. In Pandas, this is called 'escapechar', and defaults to None.\", alias='escapechar')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.header","title":"header class-attribute
instance-attribute
","text":"header: bool = Field(default=True, description=\"Whether to write the names of columns as the first line. In Pandas a list of strings can be given assumed to be aliases for the column names - this is not supported in this class. Instead, the column names as used in the dataframe are used as the header. Koheesio sets this default to True to match Pandas' default.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.index","title":"index class-attribute
instance-attribute
","text":"index: bool = Field(default=False, description='Toggles whether to write row names (index). Default False in Koheesio - pandas default is True.')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.lineSep","title":"lineSep class-attribute
instance-attribute
","text":"lineSep: Optional[constr(max_length=1)] = Field(default=linesep, description='String of length 1. Defines the character used as line separator that should be used for writing.', alias='line_terminator')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.quote","title":"quote class-attribute
instance-attribute
","text":"quote: constr(max_length=1) = Field(default='\"', description=\"String of length 1. Character used to quote fields. In PySpark, this is called 'quote', in Pandas this is called 'quotechar'.\", alias='quotechar')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.quoteAll","title":"quoteAll class-attribute
instance-attribute
","text":"quoteAll: bool = Field(default=False, description=\"A flag indicating whether all values should always be enclosed in quotes in a field. Koheesio set the default (False) to only escape values containing a quote character - this is Pyspark's default behavior. In Pandas, this is called 'doublequote'.\", alias='doublequote')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.sep","title":"sep class-attribute
instance-attribute
","text":"sep: constr(max_length=1) = Field(default=',', description='Field delimiter for the output file')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.timestampFormat","title":"timestampFormat class-attribute
instance-attribute
","text":"timestampFormat: str = Field(default='%Y-%m-%dT%H:%M:%S.%f', description=\"Sets the string that indicates a date format for datetime objects. Koheesio sets this default to a close equivalent of Pyspark's default (excluding timezone information). In Pandas, this field is called 'date_format'. Note: Pandas does not support Java Timestamps, only Python Timestamps. The Pyspark default is `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]` which mimics the iso8601 format (`datetime.isoformat()`).\", alias='date_format')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.Output","title":"Output","text":"Output class for PandasCsvBufferWriter
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.Output.pandas_df","title":"pandas_df class-attribute
instance-attribute
","text":"pandas_df: Optional[DataFrame] = Field(None, description='The Pandas DataFrame that was written')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.execute","title":"execute","text":"execute()\n
Write the DataFrame to the buffer using Pandas to_csv() method. Compression is handled by pandas to_csv() method.
Source code in src/koheesio/spark/writers/buffer.py
def execute(self):\n \"\"\"Write the DataFrame to the buffer using Pandas to_csv() method.\n Compression is handled by pandas to_csv() method.\n \"\"\"\n # convert the Spark DataFrame to a Pandas DataFrame\n self.output.pandas_df = self.df.toPandas()\n\n # create csv file in memory\n file_buffer = self.output.buffer\n self.output.pandas_df.to_csv(file_buffer, **self.get_options(options_type=\"spark\"))\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.get_options","title":"get_options","text":"get_options(options_type: str = 'csv')\n
Returns the options to pass to Pandas' to_csv() method.
Source code in src/koheesio/spark/writers/buffer.py
def get_options(self, options_type: str = \"csv\"):\n \"\"\"Returns the options to pass to Pandas' to_csv() method.\"\"\"\n try:\n import pandas as _pd\n\n # Get the pandas version as a tuple of integers\n pandas_version = tuple(int(i) for i in _pd.__version__.split(\".\"))\n except ImportError:\n raise ImportError(\"Pandas is required to use this writer\")\n\n # Use line_separator for pandas 2.0.0 and later\n line_sep_option_naming = \"line_separator\" if pandas_version >= (2, 0, 0) else \"line_terminator\"\n\n csv_options = {\n \"header\": self.header,\n \"sep\": self.sep,\n \"quotechar\": self.quote,\n \"doublequote\": self.quoteAll,\n \"escapechar\": self.escape,\n \"na_rep\": self.emptyValue or self.nullValue,\n line_sep_option_naming: self.lineSep,\n \"index\": self.index,\n \"date_format\": self.timestampFormat,\n \"compression\": self.compression,\n **self.params,\n }\n\n if options_type == \"spark\":\n csv_options[\"lineterminator\"] = csv_options.pop(line_sep_option_naming)\n elif options_type == \"kohesio_pandas_buffer_writer\":\n csv_options[\"line_terminator\"] = csv_options.pop(line_sep_option_naming)\n\n return csv_options\n
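Example (a minimal sketch; df is assumed to be an existing Spark DataFrame and the option values are illustrative): writing a small DataFrame to an in-memory CSV buffer.
from koheesio.spark.writers.buffer import PandasCsvBufferWriter\n\nwriter = PandasCsvBufferWriter(\n    df=df,\n    sep=\";\",          # illustrative: semicolon-separated output\n    compression=None,  # keep the output uncompressed (the Koheesio default)\n)\noutput = writer.write()\ncsv_content = output.read()  # the CSV content held by the buffer\n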
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter","title":"koheesio.spark.writers.buffer.PandasJsonBufferWriter","text":"Write a Spark DataFrame to JSON file(s) using Pandas.
Takes inspiration from https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html
Note This class is not meant to be used for writing huge amounts of data. It is meant to be used for writing smaller amounts of data to more arbitrary file systems (e.g. SFTP).
Parameters:
Name Type Description Default orient
Format of the resulting JSON string. Default is 'records'.
required lines
Format output as one JSON object per line. Only used when orient='records'. Default is True. - If true, the output will be formatted as one JSON object per line. - If false, the output will be written as a single JSON object. Note: this value is only used when orient='records' and will be ignored otherwise.
required date_format
Type of date conversion. Default is 'iso'. See Date and Timestamp Formats
for a detailed description and more information.
required double_precision
Number of decimal places for encoding floating point values. Default is 10.
required force_ascii
Force encoded string to be ASCII. Default is True.
required compression
A string representing the compression to use for on-the-fly compression of the output data. Koheesio sets this default to 'None', leaving the data uncompressed. Can be set to 'gzip' optionally. Other compression options are currently not supported by Koheesio for JSON output.
required"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.columns","title":"columns class-attribute
instance-attribute
","text":"columns: Optional[list[str]] = Field(default=None, description='The columns to write. If None, all columns will be written.')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.compression","title":"compression class-attribute
instance-attribute
","text":"compression: Optional[Literal['gzip']] = Field(default=None, description=\"A string representing the compression to use for on-the-fly compression of the output data.Koheesio sets this default to 'None' leaving the data uncompressed by default. Can be set to 'gzip' optionally.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.date_format","title":"date_format class-attribute
instance-attribute
","text":"date_format: Literal['iso', 'epoch'] = Field(default='iso', description=\"Type of date conversion. Default is 'iso'.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.double_precision","title":"double_precision class-attribute
instance-attribute
","text":"double_precision: int = Field(default=10, description='Number of decimal places for encoding floating point values. Default is 10.')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.force_ascii","title":"force_ascii class-attribute
instance-attribute
","text":"force_ascii: bool = Field(default=True, description='Force encoded string to be ASCII. Default is True.')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.lines","title":"lines class-attribute
instance-attribute
","text":"lines: bool = Field(default=True, description=\"Format output as one JSON object per line. Only used when orient='records'. Default is True.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.orient","title":"orient class-attribute
instance-attribute
","text":"orient: Literal['split', 'records', 'index', 'columns', 'values', 'table'] = Field(default='records', description=\"Format of the resulting JSON string. Default is 'records'.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.Output","title":"Output","text":"Output class for PandasJsonBufferWriter
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.Output.pandas_df","title":"pandas_df class-attribute
instance-attribute
","text":"pandas_df: Optional[DataFrame] = Field(None, description='The Pandas DataFrame that was written')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.execute","title":"execute","text":"execute()\n
Write the DataFrame to the buffer using Pandas to_json() method.
Source code in src/koheesio/spark/writers/buffer.py
def execute(self):\n \"\"\"Write the DataFrame to the buffer using Pandas to_json() method.\"\"\"\n df = self.df\n if self.columns:\n df = df[self.columns]\n\n # convert the Spark DataFrame to a Pandas DataFrame\n self.output.pandas_df = df.toPandas()\n\n # create json file in memory\n file_buffer = self.output.buffer\n self.output.pandas_df.to_json(file_buffer, **self.get_options())\n\n # compress the buffer if compression is set\n if self.compression:\n self.output.compress()\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.get_options","title":"get_options","text":"get_options()\n
Returns the options to pass to Pandas' to_json() method.
Source code in src/koheesio/spark/writers/buffer.py
def get_options(self):\n \"\"\"Returns the options to pass to Pandas' to_json() method.\"\"\"\n json_options = {\n \"orient\": self.orient,\n \"date_format\": self.date_format,\n \"double_precision\": self.double_precision,\n \"force_ascii\": self.force_ascii,\n \"lines\": self.lines,\n **self.params,\n }\n\n # ignore the 'lines' parameter if orient is not 'records'\n if self.orient != \"records\":\n del json_options[\"lines\"]\n\n return json_options\n
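Example (a minimal sketch; df is assumed to be an existing Spark DataFrame): writing a small DataFrame as newline-delimited JSON into the buffer.
from koheesio.spark.writers.buffer import PandasJsonBufferWriter\n\nwriter = PandasJsonBufferWriter(\n    df=df,\n    orient=\"records\",  # one JSON object per row\n    lines=True,        # newline-delimited output; only honoured with orient='records'\n)\noutput = writer.write()\njson_content = output.read()  # the JSON content held by the buffer\n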
"},{"location":"api_reference/spark/writers/dummy.html","title":"Dummy","text":"Module for the DummyWriter class.
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter","title":"koheesio.spark.writers.dummy.DummyWriter","text":"A simple DummyWriter that performs the equivalent of a df.show() on the given DataFrame and returns the first row of data as a dict.
This Writer does not actually write anything to a source/destination, but is useful for debugging or testing purposes.
Parameters:
Name Type Description Default n
PositiveInt
Number of rows to show.
20
truncate
bool | PositiveInt
If set to True
, truncate strings longer than 20 chars by default. If set to a number greater than one, truncates long strings to length truncate
and align cells right.
True
vertical
bool
If set to True
, print output rows vertically (one line per column value).
False
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.n","title":"n class-attribute
instance-attribute
","text":"n: PositiveInt = Field(default=20, description='Number of rows to show.', gt=0)\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.truncate","title":"truncate class-attribute
instance-attribute
","text":"truncate: Union[bool, PositiveInt] = Field(default=True, description='If set to ``True``, truncate strings longer than 20 chars by default.If set to a number greater than one, truncates long strings to length ``truncate`` and align cells right.')\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.vertical","title":"vertical class-attribute
instance-attribute
","text":"vertical: bool = Field(default=False, description='If set to ``True``, print output rows vertically (one line per column value).')\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.Output","title":"Output","text":"DummyWriter output
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.Output.df_content","title":"df_content class-attribute
instance-attribute
","text":"df_content: str = Field(default=..., description='The content of the DataFrame as a string')\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.Output.head","title":"head class-attribute
instance-attribute
","text":"head: Dict[str, Any] = Field(default=..., description='The first row of the DataFrame as a dict')\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.execute","title":"execute","text":"execute() -> Output\n
Execute the DummyWriter
Source code in src/koheesio/spark/writers/dummy.py
def execute(self) -> Output:\n \"\"\"Execute the DummyWriter\"\"\"\n df: DataFrame = self.df\n\n # noinspection PyProtectedMember\n df_content = df._jdf.showString(self.n, self.truncate, self.vertical)\n\n # logs the equivalent of doing df.show()\n self.log.info(f\"content of df that was passed to DummyWriter:\\n{df_content}\")\n\n self.output.head = self.df.head().asDict()\n self.output.df_content = df_content\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.int_truncate","title":"int_truncate","text":"int_truncate(truncate_value) -> int\n
Truncate is either a bool or an int.
Parameters: truncate_value : int | bool, optional, default=True If int, specifies the maximum length of the string. If bool and True, defaults to a maximum length of 20 characters.
Returns: int The maximum length of the string.
Source code in src/koheesio/spark/writers/dummy.py
@field_validator(\"truncate\")\ndef int_truncate(cls, truncate_value) -> int:\n \"\"\"\n Truncate is either a bool or an int.\n\n Parameters:\n -----------\n truncate_value : int | bool, optional, default=True\n If int, specifies the maximum length of the string.\n If bool and True, defaults to a maximum length of 20 characters.\n\n Returns:\n --------\n int\n The maximum length of the string.\n\n \"\"\"\n # Same logic as what is inside DataFrame.show()\n if isinstance(truncate_value, bool) and truncate_value is True:\n return 20 # default is 20 chars\n return int(truncate_value) # otherwise 0, or whatever the user specified\n
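Example (a minimal sketch; df is assumed to be an existing Spark DataFrame): using DummyWriter to inspect a DataFrame while debugging a pipeline.
from koheesio.spark.writers.dummy import DummyWriter\n\noutput = DummyWriter(n=5, truncate=False).write(df)\nprint(output.head)        # first row of df as a dict\nprint(output.df_content)  # the df.show()-style string that was logged\n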
"},{"location":"api_reference/spark/writers/kafka.html","title":"Kafka","text":"Kafka writer to write batch or streaming data into kafka topics
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter","title":"koheesio.spark.writers.kafka.KafkaWriter","text":"Kafka writer to write batch or streaming data into kafka topics
All kafka specific options can be provided as additional init params
Parameters:
Name Type Description Default broker
str
broker url of the kafka cluster
required topic
str
full topic name to write the data to
required trigger
Optional[Union[Trigger, str, Dict]]
Indicates optionally how to stream the data into kafka, continuous or batch
required checkpoint_location
str
In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs.
required Example KafkaWriter(\n write_broker=\"broker.com:9500\",\n topic=\"test-topic\",\n trigger=Trigger(continuous=True)\n includeHeaders: \"true\",\n key.serializer: \"org.apache.kafka.common.serialization.StringSerializer\",\n value.serializer: \"org.apache.kafka.common.serialization.StringSerializer\",\n kafka.group.id: \"test-group\",\n checkpoint_location: \"s3://bucket/test-topic\"\n)\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.batch_writer","title":"batch_writer property
","text":"batch_writer: DataFrameWriter\n
returns a batch writer
Returns:
Type Description DataFrameWriter
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.broker","title":"broker class-attribute
instance-attribute
","text":"broker: str = Field(default=..., description='Kafka brokers to write to')\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.checkpoint_location","title":"checkpoint_location class-attribute
instance-attribute
","text":"checkpoint_location: str = Field(default=..., alias='checkpointLocation', description='In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs. You can configure a query with a checkpoint location, and the query will save all the progress information and the running aggregates to the checkpoint location. This checkpoint location has to be a path in an HDFS compatible file system that is accessible by Spark (e.g. Databricks Unity Catalog External Location)')\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.format","title":"format class-attribute
instance-attribute
","text":"format: str = 'kafka'\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.logged_option_keys","title":"logged_option_keys property
","text":"logged_option_keys\n
keys to be logged
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.options","title":"options property
","text":"options\n
retrieve the kafka options incl topic and broker.
Returns:
Type Description dict
Dict being the combination of kafka options + topic + broker
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.stream_writer","title":"stream_writer property
","text":"stream_writer: DataStreamWriter\n
returns a stream writer
Returns:
Type Description DataStreamWriter
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.streaming_query","title":"streaming_query property
","text":"streaming_query: Optional[Union[str, StreamingQuery]]\n
return the streaming query
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.topic","title":"topic class-attribute
instance-attribute
","text":"topic: str = Field(default=..., description='Kafka topic to write to')\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.trigger","title":"trigger class-attribute
instance-attribute
","text":"trigger: Optional[Union[Trigger, str, Dict]] = Field(Trigger(available_now=True), description='Set the trigger for the stream query. If not set data is processed in batch')\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.writer","title":"writer property
","text":"writer: Union[DataStreamWriter, DataFrameWriter]\n
function to get the writer of proper type according to whether the data to written is a stream or not This function will also set the trigger property in case of a datastream.
Returns:
Type Description Union[DataStreamWriter, DataFrameWriter]
In case of streaming data -> DataStreamWriter, else -> DataFrameWriter
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.Output","title":"Output","text":"Output of the KafkaWriter
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.Output.streaming_query","title":"streaming_query class-attribute
instance-attribute
","text":"streaming_query: Optional[Union[str, StreamingQuery]] = Field(default=None, description='Query ID of the stream query')\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.execute","title":"execute","text":"execute()\n
Effectively write the data from the dataframe (streaming of batch) to kafka topic.
Returns:
Type Description Output
streaming_query function can be used to gain insights on running write.
Source code in src/koheesio/spark/writers/kafka.py
def execute(self):\n \"\"\"Effectively write the data from the dataframe (streaming of batch) to kafka topic.\n\n Returns\n -------\n KafkaWriter.Output\n streaming_query function can be used to gain insights on running write.\n \"\"\"\n applied_options = {k: v for k, v in self.options.items() if k in self.logged_option_keys}\n self.log.debug(f\"Applying options {applied_options}\")\n\n self._validate_dataframe()\n\n _writer = self.writer.format(self.format).options(**self.options)\n self.output.streaming_query = _writer.start() if self.streaming else _writer.save()\n
"},{"location":"api_reference/spark/writers/snowflake.html","title":"Snowflake","text":"This module contains the SnowflakeWriter class, which is used to write data to Snowflake.
"},{"location":"api_reference/spark/writers/stream.html","title":"Stream","text":"Module that holds some classes and functions to be able to write to a stream
Classes:
Name Description Trigger
class to set the trigger for a stream query
StreamWriter
abstract class for stream writers
ForEachBatchStreamWriter
class to run a writer for each batch
Functions:
Name Description writer_to_foreachbatch
function to be used as batch_function for StreamWriter (sub)classes
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.ForEachBatchStreamWriter","title":"koheesio.spark.writers.stream.ForEachBatchStreamWriter","text":"Runnable ForEachBatchWriter
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.ForEachBatchStreamWriter.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/writers/stream.py
def execute(self):\n self.streaming_query = self.writer.start()\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter","title":"koheesio.spark.writers.stream.StreamWriter","text":"ABC Stream Writer
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.batch_function","title":"batch_function class-attribute
instance-attribute
","text":"batch_function: Optional[Callable] = Field(default=None, description='allows you to run custom batch functions for each micro batch', alias='batch_function_for_each_df')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.checkpoint_location","title":"checkpoint_location class-attribute
instance-attribute
","text":"checkpoint_location: str = Field(default=..., alias='checkpointLocation', description='In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs. You can configure a query with a checkpoint location, and the query will save all the progress information and the running aggregates to the checkpoint location. This checkpoint location has to be a path in an HDFS compatible file system that is accessible by Spark (e.g. Databricks Unity Catalog External Location)')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.output_mode","title":"output_mode class-attribute
instance-attribute
","text":"output_mode: StreamingOutputMode = Field(default=APPEND, alias='outputMode', description=__doc__)\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.stream_writer","title":"stream_writer property
","text":"stream_writer: DataStreamWriter\n
Returns the stream writer for the given DataFrame and settings
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.streaming_query","title":"streaming_query class-attribute
instance-attribute
","text":"streaming_query: Optional[Union[str, StreamingQuery]] = Field(default=None, description='Query ID of the stream query')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.trigger","title":"trigger class-attribute
instance-attribute
","text":"trigger: Optional[Union[Trigger, str, Dict]] = Field(default=Trigger(available_now=True), description='Set the trigger for the stream query. If this is not set it process data as batch')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.writer","title":"writer property
","text":"writer\n
Returns the stream writer since we don't have a batch mode for streams
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.await_termination","title":"await_termination","text":"await_termination(timeout: Optional[int] = None)\n
Await termination of the stream query
Source code in src/koheesio/spark/writers/stream.py
def await_termination(self, timeout: Optional[int] = None):\n \"\"\"Await termination of the stream query\"\"\"\n self.streaming_query.awaitTermination(timeout=timeout)\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.execute","title":"execute abstractmethod
","text":"execute()\n
Source code in src/koheesio/spark/writers/stream.py
@abstractmethod\ndef execute(self):\n raise NotImplementedError\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger","title":"koheesio.spark.writers.stream.Trigger","text":"Trigger types for a stream query.
Only one trigger can be set!
Example - processingTime='5 seconds'
- continuous='5 seconds'
- availableNow=True
- once=True
See Also - https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.available_now","title":"available_now class-attribute
instance-attribute
","text":"available_now: Optional[bool] = Field(default=None, alias='availableNow', description='if set to True, set a trigger that processes all available data in multiple batches then terminates the query.')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.continuous","title":"continuous class-attribute
instance-attribute
","text":"continuous: Optional[str] = Field(default=None, description=\"a time interval as a string, e.g. '5 seconds', '1 minute'.Set a trigger that runs a continuous query with a given checkpoint interval.\")\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(validate_default=False, extra='forbid')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.once","title":"once class-attribute
instance-attribute
","text":"once: Optional[bool] = Field(default=None, deprecated=True, description='if set to True, set a trigger that processes only one batch of data in a streaming query then terminates the query. use `available_now` instead of `once`.')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.processing_time","title":"processing_time class-attribute
instance-attribute
","text":"processing_time: Optional[str] = Field(default=None, alias='processingTime', description=\"a processing time interval as a string, e.g. '5 seconds', '1 minute'.Set a trigger that runs a microbatch query periodically based on the processing time.\")\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.triggers","title":"triggers property
","text":"triggers\n
Returns a list of tuples with the value for each trigger
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.value","title":"value property
","text":"value: Dict[str, str]\n
Returns the trigger value as a dictionary
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.execute","title":"execute","text":"execute()\n
Returns the trigger value as a dictionary This method can be skipped, as the value can be accessed directly from the value
property
Source code in src/koheesio/spark/writers/stream.py
def execute(self):\n \"\"\"Returns the trigger value as a dictionary\n This method can be skipped, as the value can be accessed directly from the `value` property\n \"\"\"\n self.log.warning(\"Trigger.execute is deprecated. Use Trigger.value directly instead\")\n self.output.value = self.value\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.from_any","title":"from_any classmethod
","text":"from_any(value)\n
Dynamically creates a Trigger class based on either another Trigger class, a passed string value, or a dictionary
This way Trigger.from_any can be used as part of a validator, without needing to worry about supported types
Source code in src/koheesio/spark/writers/stream.py
@classmethod\ndef from_any(cls, value):\n \"\"\"Dynamically creates a Trigger class based on either another Trigger class, a passed string value, or a\n dictionary\n\n This way Trigger.from_any can be used as part of a validator, without needing to worry about supported types\n \"\"\"\n if isinstance(value, Trigger):\n return value\n\n if isinstance(value, str):\n return cls.from_string(value)\n\n if isinstance(value, dict):\n return cls.from_dict(value)\n\n raise RuntimeError(f\"Unable to create Trigger based on the given value: {value}\")\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.from_dict","title":"from_dict classmethod
","text":"from_dict(_dict)\n
Creates a Trigger class based on a dictionary
Source code in src/koheesio/spark/writers/stream.py
@classmethod\ndef from_dict(cls, _dict):\n \"\"\"Creates a Trigger class based on a dictionary\"\"\"\n return cls(**_dict)\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.from_string","title":"from_string classmethod
","text":"from_string(trigger: str)\n
Creates a Trigger class based on a string
Example Source code in src/koheesio/spark/writers/stream.py
@classmethod\ndef from_string(cls, trigger: str):\n \"\"\"Creates a Trigger class based on a string\n\n Example\n -------\n ### happy flow\n\n * processingTime='5 seconds'\n * processing_time=\"5 hours\"\n * processingTime=4 minutes\n * once=True\n * once=true\n * available_now=true\n * continuous='3 hours'\n * once=TrUe\n * once=TRUE\n\n ### unhappy flow\n valid values, but should fail the validation check of the class\n\n * availableNow=False\n * continuous=True\n * once=false\n \"\"\"\n import re\n\n trigger_from_string = re.compile(r\"(?P<triggerType>\\w+)=[\\'\\\"]?(?P<value>.+)[\\'\\\"]?\")\n _match = trigger_from_string.match(trigger)\n\n if _match is None:\n raise ValueError(\n f\"Cannot parse value for Trigger: '{trigger}'. \\n\"\n f\"Valid types are {', '.join(cls._all_triggers_with_alias())}\"\n )\n\n trigger_type, value = _match.groups()\n\n # strip the value of any quotes\n value = value.strip(\"'\").strip('\"')\n\n # making value a boolean when given\n value = convert_str_to_bool(value)\n\n return cls.from_dict({trigger_type: value})\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.from_string--happy-flow","title":"happy flow","text":" - processingTime='5 seconds'
- processing_time=\"5 hours\"
- processingTime=4 minutes
- once=True
- once=true
- available_now=true
- continuous='3 hours'
- once=TrUe
- once=TRUE
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.from_string--unhappy-flow","title":"unhappy flow","text":"valid values, but should fail the validation check of the class
- availableNow=False
- continuous=True
- once=false
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.validate_available_now","title":"validate_available_now","text":"validate_available_now(available_now)\n
Validate the available_now trigger value
Source code in src/koheesio/spark/writers/stream.py
@field_validator(\"available_now\", mode=\"before\")\ndef validate_available_now(cls, available_now):\n \"\"\"Validate the available_now trigger value\"\"\"\n # making value a boolean when given\n available_now = convert_str_to_bool(available_now)\n\n # adapted from `pyspark.sql.streaming.readwriter.DataStreamWriter.trigger`\n if available_now is not True:\n raise ValueError(f\"Value for availableNow must be True. Got:{available_now}\")\n return available_now\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.validate_continuous","title":"validate_continuous","text":"validate_continuous(continuous)\n
Validate the continuous trigger value
Source code in src/koheesio/spark/writers/stream.py
@field_validator(\"continuous\", mode=\"before\")\ndef validate_continuous(cls, continuous):\n \"\"\"Validate the continuous trigger value\"\"\"\n # adapted from `pyspark.sql.streaming.readwriter.DataStreamWriter.trigger` except that the if statement is not\n # split in two parts\n if not isinstance(continuous, str):\n raise ValueError(f\"Value for continuous must be a string. Got: {continuous}\")\n\n if len(continuous.strip()) == 0:\n raise ValueError(f\"Value for continuous must be a non empty string. Got: {continuous}\")\n return continuous\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.validate_once","title":"validate_once","text":"validate_once(once)\n
Validate the once trigger value
Source code in src/koheesio/spark/writers/stream.py
@field_validator(\"once\", mode=\"before\")\ndef validate_once(cls, once):\n \"\"\"Validate the once trigger value\"\"\"\n # making value a boolean when given\n once = convert_str_to_bool(once)\n\n # adapted from `pyspark.sql.streaming.readwriter.DataStreamWriter.trigger`\n if once is not True:\n raise ValueError(f\"Value for once must be True. Got: {once}\")\n return once\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.validate_processing_time","title":"validate_processing_time","text":"validate_processing_time(processing_time)\n
Validate the processing time trigger value
Source code in src/koheesio/spark/writers/stream.py
@field_validator(\"processing_time\", mode=\"before\")\ndef validate_processing_time(cls, processing_time):\n \"\"\"Validate the processing time trigger value\"\"\"\n # adapted from `pyspark.sql.streaming.readwriter.DataStreamWriter.trigger`\n if not isinstance(processing_time, str):\n raise ValueError(f\"Value for processing_time must be a string. Got: {processing_time}\")\n\n if len(processing_time.strip()) == 0:\n raise ValueError(f\"Value for processingTime must be a non empty string. Got: {processing_time}\")\n return processing_time\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.validate_triggers","title":"validate_triggers","text":"validate_triggers(triggers: Dict)\n
Validate the trigger value
Source code in src/koheesio/spark/writers/stream.py
@model_validator(mode=\"before\")\ndef validate_triggers(cls, triggers: Dict):\n \"\"\"Validate the trigger value\"\"\"\n params = [*triggers.values()]\n\n # adapted from `pyspark.sql.streaming.readwriter.DataStreamWriter.trigger`; modified to work with pydantic v2\n if not triggers:\n raise ValueError(\"No trigger provided\")\n if len(params) > 1:\n raise ValueError(\"Multiple triggers not allowed.\")\n\n return triggers\n
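Taken together, the validators above enforce that exactly one trigger is set and that its value is well formed. A short illustrative sketch (constructor field names taken from the validators; the exact error types are an assumption):
```python
from koheesio.spark.writers.stream import Trigger

Trigger(processing_time="5 seconds")   # OK: exactly one trigger provided
Trigger(available_now="true")          # OK: the string is coerced to the boolean True

# Rejected by validate_triggers: "Multiple triggers not allowed."
# Trigger(once=True, continuous="1 hour")
```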
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.writer_to_foreachbatch","title":"koheesio.spark.writers.stream.writer_to_foreachbatch","text":"writer_to_foreachbatch(writer: Writer)\n
Call writer.execute
on each batch
To be passed as batch_function for StreamWriter (sub)classes.
Example Source code in src/koheesio/spark/writers/stream.py
def writer_to_foreachbatch(writer: Writer):\n \"\"\"Call `writer.execute` on each batch\n\n To be passed as batch_function for StreamWriter (sub)classes.\n\n Example\n -------\n ### Writing to a Delta table and a Snowflake table\n ```python\n DeltaTableStreamWriter(\n table=\"my_table\",\n checkpointLocation=\"my_checkpointlocation\",\n batch_function=writer_to_foreachbatch(\n SnowflakeWriter(\n **sfOptions,\n table=\"snowflake_table\",\n insert_type=SnowflakeWriter.InsertType.APPEND,\n )\n ),\n )\n ```\n \"\"\"\n\n def inner(df, batch_id: int):\n \"\"\"Inner method\n\n As per the Spark documentation:\n In every micro-batch, the provided function will be called in every micro-batch with (i) the output rows as a\n DataFrame and (ii) the batch identifier. The batchId can be used deduplicate and transactionally write the\n output (that is, the provided Dataset) to external systems. The output DataFrame is guaranteed to exactly\n same for the same batchId (assuming all operations are deterministic in the query).\n \"\"\"\n writer.log.debug(f\"Running batch function for batch {batch_id}\")\n writer.write(df)\n\n return inner\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.writer_to_foreachbatch--writing-to-a-delta-table-and-a-snowflake-table","title":"Writing to a Delta table and a Snowflake table","text":"DeltaTableStreamWriter(\n table=\"my_table\",\n checkpointLocation=\"my_checkpointlocation\",\n batch_function=writer_to_foreachbatch(\n SnowflakeWriter(\n **sfOptions,\n table=\"snowflake_table\",\n insert_type=SnowflakeWriter.InsertType.APPEND,\n )\n ),\n)\n
"},{"location":"api_reference/spark/writers/delta/index.html","title":"Delta","text":"This module is the entry point for the koheesio.spark.writers.delta package.
It imports and exposes the DeltaTableWriter and DeltaTableStreamWriter classes for external use.
Classes: DeltaTableWriter: Class to write data in batch mode to a Delta table. DeltaTableStreamWriter: Class to write data in streaming mode to a Delta table.
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode","title":"koheesio.spark.writers.delta.BatchOutputMode","text":"For Batch:
- append: Append the contents of the DataFrame to the output table, default option in Koheesio.
- overwrite: overwrite the existing data.
- ignore: ignore the operation (i.e. no-op).
- error or errorifexists: throw an exception at runtime.
- merge: update matching data in the table and insert rows that do not exist; the merge behavior can be customized via a merge_builder (a DeltaMergeBuilder or a list of merge clauses) in output_mode_params.
- merge_all: update matching data in the table and insert rows that do not exist, driven by the merge_cond, update_cond and insert_cond entries in output_mode_params.
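A brief sketch of selecting one of these modes when configuring a writer (the table name is a placeholder):
```python
from koheesio.spark.writers.delta import BatchOutputMode, DeltaTableWriter

# append is the Koheesio default; overwrite replaces the existing data
DeltaTableWriter(table="my_table", output_mode=BatchOutputMode.APPEND)
DeltaTableWriter(table="my_table", output_mode=BatchOutputMode.OVERWRITE)
```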
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.APPEND","title":"APPEND class-attribute
instance-attribute
","text":"APPEND = 'append'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.ERROR","title":"ERROR class-attribute
instance-attribute
","text":"ERROR = 'error'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.ERRORIFEXISTS","title":"ERRORIFEXISTS class-attribute
instance-attribute
","text":"ERRORIFEXISTS = 'error'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.IGNORE","title":"IGNORE class-attribute
instance-attribute
","text":"IGNORE = 'ignore'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.MERGE","title":"MERGE class-attribute
instance-attribute
","text":"MERGE = 'merge'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.MERGEALL","title":"MERGEALL class-attribute
instance-attribute
","text":"MERGEALL = 'merge_all'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.MERGE_ALL","title":"MERGE_ALL class-attribute
instance-attribute
","text":"MERGE_ALL = 'merge_all'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.OVERWRITE","title":"OVERWRITE class-attribute
instance-attribute
","text":"OVERWRITE = 'overwrite'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter","title":"koheesio.spark.writers.delta.DeltaTableStreamWriter","text":"Delta table stream writer
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter.Options","title":"Options","text":"Options for DeltaTableStreamWriter
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter.Options.allow_population_by_field_name","title":"allow_population_by_field_name class-attribute
instance-attribute
","text":"allow_population_by_field_name: bool = Field(default=True, description=' To do convert to Field and pass as .options(**config)')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter.Options.maxBytesPerTrigger","title":"maxBytesPerTrigger class-attribute
instance-attribute
","text":"maxBytesPerTrigger: Optional[str] = Field(default=None, description='How much data to be processed per trigger. The default is 1GB')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter.Options.maxFilesPerTrigger","title":"maxFilesPerTrigger class-attribute
instance-attribute
","text":"maxFilesPerTrigger: int = Field(default == 1000, description='The maximum number of new files to be considered in every trigger (default: 1000).')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/writers/delta/stream.py
def execute(self):\n if self.batch_function:\n self.streaming_query = self.writer.start()\n else:\n self.streaming_query = self.writer.toTable(tableName=self.table.table_name)\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter","title":"koheesio.spark.writers.delta.DeltaTableWriter","text":"Delta table Writer for both batch and streaming dataframes.
Example Parameters:
Name Type Description Default table
Union[DeltaTableStep, str]
The table to write to
required output_mode
Optional[Union[str, BatchOutputMode, StreamingOutputMode]]
The output mode to use. Default is BatchOutputMode.APPEND. For streaming, use StreamingOutputMode.
required params
Optional[dict]
Additional parameters to use for specific mode
required"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter--example-for-mergeall","title":"Example for MERGEALL
","text":"DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.MERGEALL,\n output_mode_params={\n \"merge_cond\": \"target.id=source.id\",\n \"update_cond\": \"target.col1_val>=source.col1_val\",\n \"insert_cond\": \"source.col_bk IS NOT NULL\",\n \"target_alias\": \"target\", # <------ DEFAULT, can be changed by providing custom value\n \"source_alias\": \"source\", # <------ DEFAULT, can be changed by providing custom value\n },\n)\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter--example-for-merge","title":"Example for MERGE
","text":"DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.MERGE,\n output_mode_params={\n 'merge_builder': (\n DeltaTable\n .forName(sparkSession=spark, tableOrViewName=<target_table_name>)\n .alias(target_alias)\n .merge(source=df, condition=merge_cond)\n .whenMatchedUpdateAll(condition=update_cond)\n .whenNotMatchedInsertAll(condition=insert_cond)\n )\n }\n )\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter--example-for-merge_1","title":"Example for MERGE
","text":"in case the table isn't created yet, first run will execute an APPEND operation
DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.MERGE,\n output_mode_params={\n \"merge_builder\": [\n {\n \"clause\": \"whenMatchedUpdate\",\n \"set\": {\"value\": \"source.value\"},\n \"condition\": \"<update_condition>\",\n },\n {\n \"clause\": \"whenNotMatchedInsert\",\n \"values\": {\"id\": \"source.id\", \"value\": \"source.value\"},\n \"condition\": \"<insert_condition>\",\n },\n ],\n \"merge_cond\": \"<merge_condition>\",\n },\n)\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter--example-for-append","title":"Example for APPEND","text":"dataframe writer options can be passed as keyword arguments
DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.APPEND,\n partitionOverwriteMode=\"dynamic\",\n mergeSchema=\"false\",\n)\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.format","title":"format class-attribute
instance-attribute
","text":"format: str = 'delta'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.output_mode","title":"output_mode class-attribute
instance-attribute
","text":"output_mode: Optional[Union[BatchOutputMode, StreamingOutputMode]] = Field(default=APPEND, alias='outputMode', description=f'{__doc__}\n{__doc__}')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.params","title":"params class-attribute
instance-attribute
","text":"params: Optional[dict] = Field(default_factory=dict, alias='output_mode_params', description='Additional parameters to use for specific mode')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.partition_by","title":"partition_by class-attribute
instance-attribute
","text":"partition_by: Optional[List[str]] = Field(default=None, alias='partitionBy', description='The list of fields to partition the Delta table on')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.table","title":"table class-attribute
instance-attribute
","text":"table: Union[DeltaTableStep, str] = Field(default=..., description='The table to write to')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.writer","title":"writer property
","text":"writer: Union[DeltaMergeBuilder, DataFrameWriter]\n
Specify DeltaTableWriter
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/writers/delta/batch.py
def execute(self):\n _writer = self.writer\n\n if self.table.create_if_not_exists and not self.table.exists:\n _writer = _writer.options(**self.table.default_create_properties)\n\n if isinstance(_writer, DeltaMergeBuilder):\n _writer.execute()\n else:\n if options := self.params:\n # should we add options only if mode is not merge?\n _writer = _writer.options(**options)\n _writer.saveAsTable(self.table.table_name)\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.get_output_mode","title":"get_output_mode classmethod
","text":"get_output_mode(choice: str, options: Set[Type]) -> Union[BatchOutputMode, StreamingOutputMode]\n
Retrieve an OutputMode by validating choice
against a set of option OutputModes.
Currently supported output modes can be found in:
- BatchOutputMode
- StreamingOutputMode
Source code in src/koheesio/spark/writers/delta/batch.py
@classmethod\ndef get_output_mode(cls, choice: str, options: Set[Type]) -> Union[BatchOutputMode, StreamingOutputMode]:\n \"\"\"Retrieve an OutputMode by validating `choice` against a set of option OutputModes.\n\n Currently supported output modes can be found in:\n\n - BatchOutputMode\n - StreamingOutputMode\n \"\"\"\n for enum_type in options:\n if choice.upper() in [om.value.upper() for om in enum_type]:\n return getattr(enum_type, choice.upper())\n raise AttributeError(\n f\"\"\"\n Invalid outputMode specified '{choice}'. Allowed values are:\n Batch Mode - {BatchOutputMode.__doc__}\n Streaming Mode - {StreamingOutputMode.__doc__}\n \"\"\"\n )\n
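A usage sketch of the classmethod above; the caller decides which enums to validate against:
```python
from koheesio.spark.writers.delta import BatchOutputMode, DeltaTableWriter

mode = DeltaTableWriter.get_output_mode("merge_all", options={BatchOutputMode})
assert mode == BatchOutputMode.MERGE_ALL

# An unsupported choice raises an AttributeError listing the allowed values:
# DeltaTableWriter.get_output_mode("upsert", options={BatchOutputMode})
```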
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter","title":"koheesio.spark.writers.delta.SCD2DeltaTableWriter","text":"A class used to write Slowly Changing Dimension (SCD) Type 2 data to a Delta table.
Attributes:
Name Type Description table
InstanceOf[DeltaTableStep]
The table to merge to.
merge_key
str
The key used for merging data.
include_columns
List[str]
Columns to be merged. Will be selected from DataFrame. Default is all columns.
exclude_columns
List[str]
Columns to be excluded from DataFrame.
scd2_columns
List[str]
List of attributes for SCD2 type (track changes).
scd2_timestamp_col
Optional[Column]
Timestamp column for SCD2 type (track changes). Defaults to current_timestamp.
scd1_columns
List[str]
List of attributes for SCD1 type (just update).
meta_scd2_struct_col_name
str
SCD2 struct name.
meta_scd2_effective_time_col_name
str
Effective col name.
meta_scd2_is_current_col_name
str
Current col name.
meta_scd2_end_time_col_name
str
End time col name.
target_auto_generated_columns
List[str]
Auto generated columns from target Delta table. Will be used to exclude from merge logic.
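A minimal, illustrative sketch of configuring this writer. The table, merge key and column names are placeholders, the DeltaTableStep import path is an assumption, and the final write call follows the generic Writer convention:
```python
from koheesio.spark.delta import DeltaTableStep  # import path assumed
from koheesio.spark.writers.delta import SCD2DeltaTableWriter

writer = SCD2DeltaTableWriter(
    table=DeltaTableStep(table="customer_dim"),  # target Delta table (placeholder name)
    merge_key="customer_id",                     # business key used for matching
    scd2_columns=["address", "segment"],         # tracked with history (SCD2)
    scd1_columns=["email"],                      # updated in place (SCD1)
)
writer.write(df)  # df: source DataFrame holding the latest snapshot (placeholder)
```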
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.exclude_columns","title":"exclude_columns class-attribute
instance-attribute
","text":"exclude_columns: List[str] = Field(default_factory=list, description='Columns to be excluded from DataFrame')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.include_columns","title":"include_columns class-attribute
instance-attribute
","text":"include_columns: List[str] = Field(default_factory=list, description='Columns to be merged. Will be selected from DataFrame.Default is all columns')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.merge_key","title":"merge_key class-attribute
instance-attribute
","text":"merge_key: str = Field(..., description='Merge key')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.meta_scd2_effective_time_col_name","title":"meta_scd2_effective_time_col_name class-attribute
instance-attribute
","text":"meta_scd2_effective_time_col_name: str = Field(default='effective_time', description='Effective col name')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.meta_scd2_end_time_col_name","title":"meta_scd2_end_time_col_name class-attribute
instance-attribute
","text":"meta_scd2_end_time_col_name: str = Field(default='end_time', description='End time col name')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.meta_scd2_is_current_col_name","title":"meta_scd2_is_current_col_name class-attribute
instance-attribute
","text":"meta_scd2_is_current_col_name: str = Field(default='is_current', description='Current col name')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.meta_scd2_struct_col_name","title":"meta_scd2_struct_col_name class-attribute
instance-attribute
","text":"meta_scd2_struct_col_name: str = Field(default='_scd2', description='SCD2 struct name')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.scd1_columns","title":"scd1_columns class-attribute
instance-attribute
","text":"scd1_columns: List[str] = Field(default_factory=list, description='List of attributes for scd1 type (just update)')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.scd2_columns","title":"scd2_columns class-attribute
instance-attribute
","text":"scd2_columns: List[str] = Field(default_factory=list, description='List of attributes for scd2 type (track changes)')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.scd2_timestamp_col","title":"scd2_timestamp_col class-attribute
instance-attribute
","text":"scd2_timestamp_col: Optional[Column] = Field(default=None, description='Timestamp column for SCD2 type (track changes). Default to current_timestamp')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.table","title":"table class-attribute
instance-attribute
","text":"table: InstanceOf[DeltaTableStep] = Field(..., description='The table to merge to')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.target_auto_generated_columns","title":"target_auto_generated_columns class-attribute
instance-attribute
","text":"target_auto_generated_columns: List[str] = Field(default_factory=list, description='Auto generated columns from target Delta table. Will be used to exclude from merge logic')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.execute","title":"execute","text":"execute() -> None\n
Execute the SCD Type 2 operation.
This method executes the SCD Type 2 operation on the DataFrame. It validates the existing Delta table, prepares the merge conditions, stages the data, and then performs the merge operation.
Raises:
Type Description TypeError
If the scd2_timestamp_col is not of date or timestamp type. If the source DataFrame is missing any of the required merge columns.
Source code in src/koheesio/spark/writers/delta/scd.py
def execute(self) -> None:\n \"\"\"\n Execute the SCD Type 2 operation.\n\n This method executes the SCD Type 2 operation on the DataFrame.\n It validates the existing Delta table, prepares the merge conditions, stages the data,\n and then performs the merge operation.\n\n Raises\n ------\n TypeError\n If the scd2_timestamp_col is not of date or timestamp type.\n If the source DataFrame is missing any of the required merge columns.\n\n \"\"\"\n self.df: DataFrame\n self.spark: SparkSession\n delta_table = DeltaTable.forName(sparkSession=self.spark, tableOrViewName=self.table.table_name)\n src_alias, cross_alias, dest_alias = \"src\", \"cross\", \"tgt\"\n\n # Prepare required merge columns\n required_merge_columns = [self.merge_key]\n\n if self.scd2_columns:\n required_merge_columns += self.scd2_columns\n\n if self.scd1_columns:\n required_merge_columns += self.scd1_columns\n\n if not all(c in self.df.columns for c in required_merge_columns):\n missing_columns = [c for c in required_merge_columns if c not in self.df.columns]\n raise TypeError(f\"The source DataFrame is missing the columns: {missing_columns!r}\")\n\n # Check that required columns are present in the source DataFrame\n if self.scd2_timestamp_col is not None:\n timestamp_col_type = self.df.select(self.scd2_timestamp_col).schema.fields[0].dataType\n\n if not isinstance(timestamp_col_type, (DateType, TimestampType)):\n raise TypeError(\n f\"The scd2_timestamp_col '{self.scd2_timestamp_col}' must be of date \"\n f\"or timestamp type.Current type is {timestamp_col_type}\"\n )\n\n # Prepare columns to process\n include_columns = self.include_columns if self.include_columns else self.df.columns\n exclude_columns = self.exclude_columns\n columns_to_process = [c for c in include_columns if c not in exclude_columns]\n\n # Constructing column names for SCD2 attributes\n meta_scd2_is_current_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_is_current_col_name}\"\n meta_scd2_effective_time_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_effective_time_col_name}\"\n meta_scd2_end_time_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_end_time_col_name}\"\n\n # Constructing system merge action logic\n system_merge_action = f\"CASE WHEN tgt.{self.merge_key} is NULL THEN 'I' \"\n\n if updates_attrs_scd2 := self._prepare_attr_clause(\n attrs=self.scd2_columns, src_alias=src_alias, dest_alias=dest_alias\n ):\n system_merge_action += f\" WHEN {updates_attrs_scd2} THEN 'UC' \"\n\n if updates_attrs_scd1 := self._prepare_attr_clause(\n attrs=self.scd1_columns, src_alias=src_alias, dest_alias=dest_alias\n ):\n system_merge_action += f\" WHEN {updates_attrs_scd1} THEN 'U' \"\n\n system_merge_action += \" ELSE NULL END\"\n\n # Prepare the staged DataFrame\n staged = (\n self.df.withColumn(\n \"__meta_scd2_timestamp\",\n self._scd2_timestamp(scd2_timestamp_col=self.scd2_timestamp_col, spark=self.spark),\n )\n .transform(\n func=self._prepare_staging,\n delta_table=delta_table,\n merge_action_logic=F.expr(system_merge_action),\n meta_scd2_is_current_col=meta_scd2_is_current_col,\n columns_to_process=columns_to_process,\n src_alias=src_alias,\n dest_alias=dest_alias,\n cross_alias=cross_alias,\n )\n .transform(\n func=self._preserve_existing_target_values,\n meta_scd2_struct_col_name=self.meta_scd2_struct_col_name,\n target_auto_generated_columns=self.target_auto_generated_columns,\n src_alias=src_alias,\n cross_alias=cross_alias,\n dest_alias=dest_alias,\n logger=self.log,\n )\n .withColumn(\"__meta_scd2_end_time\", 
self._scd2_end_time(meta_scd2_end_time_col=meta_scd2_end_time_col))\n .withColumn(\"__meta_scd2_is_current\", self._scd2_is_current())\n .withColumn(\n \"__meta_scd2_effective_time\",\n self._scd2_effective_time(meta_scd2_effective_time_col=meta_scd2_effective_time_col),\n )\n .transform(\n func=self._add_scd2_columns,\n meta_scd2_struct_col_name=self.meta_scd2_struct_col_name,\n meta_scd2_effective_time_col_name=self.meta_scd2_effective_time_col_name,\n meta_scd2_end_time_col_name=self.meta_scd2_end_time_col_name,\n meta_scd2_is_current_col_name=self.meta_scd2_is_current_col_name,\n )\n )\n\n self._prepare_merge_builder(\n delta_table=delta_table,\n dest_alias=dest_alias,\n staged=staged,\n merge_key=self.merge_key,\n columns_to_process=columns_to_process,\n meta_scd2_effective_time_col=meta_scd2_effective_time_col,\n ).execute()\n
"},{"location":"api_reference/spark/writers/delta/batch.html","title":"Batch","text":"This module defines the DeltaTableWriter class, which is used to write both batch and streaming dataframes to Delta tables.
DeltaTableWriter supports two output modes: MERGEALL
and MERGE
.
- The
MERGEALL
mode merges all incoming data with existing data in the table based on certain conditions. - The
MERGE
mode allows for more custom merging behavior using the DeltaMergeBuilder class from the delta.tables
library.
The output_mode_params
dictionary is used to specify conditions for merging, updating, and inserting data. The target_alias
and source_alias
keys are used to specify the aliases for the target and source dataframes in the merge conditions.
Classes:
Name Description DeltaTableWriter
A class for writing data to Delta tables.
DeltaTableStreamWriter
A class for writing streaming data to Delta tables.
Example DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.MERGEALL,\n output_mode_params={\n \"merge_cond\": \"target.id=source.id\",\n \"update_cond\": \"target.col1_val>=source.col1_val\",\n \"insert_cond\": \"source.col_bk IS NOT NULL\",\n },\n)\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter","title":"koheesio.spark.writers.delta.batch.DeltaTableWriter","text":"Delta table Writer for both batch and streaming dataframes.
Example Parameters:
Name Type Description Default table
Union[DeltaTableStep, str]
The table to write to
required output_mode
Optional[Union[str, BatchOutputMode, StreamingOutputMode]]
The output mode to use. Default is BatchOutputMode.APPEND. For streaming, use StreamingOutputMode.
required params
Optional[dict]
Additional parameters to use for specific mode
required"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter--example-for-mergeall","title":"Example for MERGEALL
","text":"DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.MERGEALL,\n output_mode_params={\n \"merge_cond\": \"target.id=source.id\",\n \"update_cond\": \"target.col1_val>=source.col1_val\",\n \"insert_cond\": \"source.col_bk IS NOT NULL\",\n \"target_alias\": \"target\", # <------ DEFAULT, can be changed by providing custom value\n \"source_alias\": \"source\", # <------ DEFAULT, can be changed by providing custom value\n },\n)\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter--example-for-merge","title":"Example for MERGE
","text":"DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.MERGE,\n output_mode_params={\n 'merge_builder': (\n DeltaTable\n .forName(sparkSession=spark, tableOrViewName=<target_table_name>)\n .alias(target_alias)\n .merge(source=df, condition=merge_cond)\n .whenMatchedUpdateAll(condition=update_cond)\n .whenNotMatchedInsertAll(condition=insert_cond)\n )\n }\n )\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter--example-for-merge_1","title":"Example for MERGE
","text":"in case the table isn't created yet, first run will execute an APPEND operation
DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.MERGE,\n output_mode_params={\n \"merge_builder\": [\n {\n \"clause\": \"whenMatchedUpdate\",\n \"set\": {\"value\": \"source.value\"},\n \"condition\": \"<update_condition>\",\n },\n {\n \"clause\": \"whenNotMatchedInsert\",\n \"values\": {\"id\": \"source.id\", \"value\": \"source.value\"},\n \"condition\": \"<insert_condition>\",\n },\n ],\n \"merge_cond\": \"<merge_condition>\",\n },\n)\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter--example-for-append","title":"Example for APPEND","text":"dataframe writer options can be passed as keyword arguments
DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.APPEND,\n partitionOverwriteMode=\"dynamic\",\n mergeSchema=\"false\",\n)\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.format","title":"format class-attribute
instance-attribute
","text":"format: str = 'delta'\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.output_mode","title":"output_mode class-attribute
instance-attribute
","text":"output_mode: Optional[Union[BatchOutputMode, StreamingOutputMode]] = Field(default=APPEND, alias='outputMode', description=f'{__doc__}\n{__doc__}')\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.params","title":"params class-attribute
instance-attribute
","text":"params: Optional[dict] = Field(default_factory=dict, alias='output_mode_params', description='Additional parameters to use for specific mode')\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.partition_by","title":"partition_by class-attribute
instance-attribute
","text":"partition_by: Optional[List[str]] = Field(default=None, alias='partitionBy', description='The list of fields to partition the Delta table on')\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.table","title":"table class-attribute
instance-attribute
","text":"table: Union[DeltaTableStep, str] = Field(default=..., description='The table to write to')\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.writer","title":"writer property
","text":"writer: Union[DeltaMergeBuilder, DataFrameWriter]\n
Specify DeltaTableWriter
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/writers/delta/batch.py
def execute(self):\n _writer = self.writer\n\n if self.table.create_if_not_exists and not self.table.exists:\n _writer = _writer.options(**self.table.default_create_properties)\n\n if isinstance(_writer, DeltaMergeBuilder):\n _writer.execute()\n else:\n if options := self.params:\n # should we add options only if mode is not merge?\n _writer = _writer.options(**options)\n _writer.saveAsTable(self.table.table_name)\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.get_output_mode","title":"get_output_mode classmethod
","text":"get_output_mode(choice: str, options: Set[Type]) -> Union[BatchOutputMode, StreamingOutputMode]\n
Retrieve an OutputMode by validating choice
against a set of option OutputModes.
Currently supported output modes can be found in:
- BatchOutputMode
- StreamingOutputMode
Source code in src/koheesio/spark/writers/delta/batch.py
@classmethod\ndef get_output_mode(cls, choice: str, options: Set[Type]) -> Union[BatchOutputMode, StreamingOutputMode]:\n \"\"\"Retrieve an OutputMode by validating `choice` against a set of option OutputModes.\n\n Currently supported output modes can be found in:\n\n - BatchOutputMode\n - StreamingOutputMode\n \"\"\"\n for enum_type in options:\n if choice.upper() in [om.value.upper() for om in enum_type]:\n return getattr(enum_type, choice.upper())\n raise AttributeError(\n f\"\"\"\n Invalid outputMode specified '{choice}'. Allowed values are:\n Batch Mode - {BatchOutputMode.__doc__}\n Streaming Mode - {StreamingOutputMode.__doc__}\n \"\"\"\n )\n
"},{"location":"api_reference/spark/writers/delta/scd.html","title":"Scd","text":"This module defines writers to write Slowly Changing Dimension (SCD) Type 2 data to a Delta table.
Slowly Changing Dimension (SCD) is a technique used in data warehousing to handle changes to dimension data over time. SCD Type 2 is one of the most common types of SCD, where historical changes are tracked by creating new records for each change.
Koheesio is a powerful data processing framework that provides advanced capabilities for working with Delta tables in Apache Spark. It offers a convenient and efficient way to handle SCD Type 2 operations on Delta tables.
To learn more about Slowly Changing Dimension and SCD Type 2, you can refer to the following resources: - Slowly Changing Dimension (SCD) - Wikipedia
By using Koheesio, you can benefit from its efficient merge logic, support for SCD Type 2 and SCD Type 1 attributes, and seamless integration with Delta tables in Spark.
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter","title":"koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter","text":"A class used to write Slowly Changing Dimension (SCD) Type 2 data to a Delta table.
Attributes:
Name Type Description table
InstanceOf[DeltaTableStep]
The table to merge to.
merge_key
str
The key used for merging data.
include_columns
List[str]
Columns to be merged. Will be selected from DataFrame. Default is all columns.
exclude_columns
List[str]
Columns to be excluded from DataFrame.
scd2_columns
List[str]
List of attributes for SCD2 type (track changes).
scd2_timestamp_col
Optional[Column]
Timestamp column for SCD2 type (track changes). Default to current_timestamp.
scd1_columns
List[str]
List of attributes for SCD1 type (just update).
meta_scd2_struct_col_name
str
SCD2 struct name.
meta_scd2_effective_time_col_name
str
Effective col name.
meta_scd2_is_current_col_name
str
Current col name.
meta_scd2_end_time_col_name
str
End time col name.
target_auto_generated_columns
List[str]
Auto generated columns from target Delta table. Will be used to exclude from merge logic.
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.exclude_columns","title":"exclude_columns class-attribute
instance-attribute
","text":"exclude_columns: List[str] = Field(default_factory=list, description='Columns to be excluded from DataFrame')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.include_columns","title":"include_columns class-attribute
instance-attribute
","text":"include_columns: List[str] = Field(default_factory=list, description='Columns to be merged. Will be selected from DataFrame.Default is all columns')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.merge_key","title":"merge_key class-attribute
instance-attribute
","text":"merge_key: str = Field(..., description='Merge key')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.meta_scd2_effective_time_col_name","title":"meta_scd2_effective_time_col_name class-attribute
instance-attribute
","text":"meta_scd2_effective_time_col_name: str = Field(default='effective_time', description='Effective col name')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.meta_scd2_end_time_col_name","title":"meta_scd2_end_time_col_name class-attribute
instance-attribute
","text":"meta_scd2_end_time_col_name: str = Field(default='end_time', description='End time col name')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.meta_scd2_is_current_col_name","title":"meta_scd2_is_current_col_name class-attribute
instance-attribute
","text":"meta_scd2_is_current_col_name: str = Field(default='is_current', description='Current col name')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.meta_scd2_struct_col_name","title":"meta_scd2_struct_col_name class-attribute
instance-attribute
","text":"meta_scd2_struct_col_name: str = Field(default='_scd2', description='SCD2 struct name')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.scd1_columns","title":"scd1_columns class-attribute
instance-attribute
","text":"scd1_columns: List[str] = Field(default_factory=list, description='List of attributes for scd1 type (just update)')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.scd2_columns","title":"scd2_columns class-attribute
instance-attribute
","text":"scd2_columns: List[str] = Field(default_factory=list, description='List of attributes for scd2 type (track changes)')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.scd2_timestamp_col","title":"scd2_timestamp_col class-attribute
instance-attribute
","text":"scd2_timestamp_col: Optional[Column] = Field(default=None, description='Timestamp column for SCD2 type (track changes). Default to current_timestamp')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.table","title":"table class-attribute
instance-attribute
","text":"table: InstanceOf[DeltaTableStep] = Field(..., description='The table to merge to')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.target_auto_generated_columns","title":"target_auto_generated_columns class-attribute
instance-attribute
","text":"target_auto_generated_columns: List[str] = Field(default_factory=list, description='Auto generated columns from target Delta table. Will be used to exclude from merge logic')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.execute","title":"execute","text":"execute() -> None\n
Execute the SCD Type 2 operation.
This method executes the SCD Type 2 operation on the DataFrame. It validates the existing Delta table, prepares the merge conditions, stages the data, and then performs the merge operation.
Raises:
Type Description TypeError
If the scd2_timestamp_col is not of date or timestamp type. If the source DataFrame is missing any of the required merge columns.
Source code in src/koheesio/spark/writers/delta/scd.py
def execute(self) -> None:\n \"\"\"\n Execute the SCD Type 2 operation.\n\n This method executes the SCD Type 2 operation on the DataFrame.\n It validates the existing Delta table, prepares the merge conditions, stages the data,\n and then performs the merge operation.\n\n Raises\n ------\n TypeError\n If the scd2_timestamp_col is not of date or timestamp type.\n If the source DataFrame is missing any of the required merge columns.\n\n \"\"\"\n self.df: DataFrame\n self.spark: SparkSession\n delta_table = DeltaTable.forName(sparkSession=self.spark, tableOrViewName=self.table.table_name)\n src_alias, cross_alias, dest_alias = \"src\", \"cross\", \"tgt\"\n\n # Prepare required merge columns\n required_merge_columns = [self.merge_key]\n\n if self.scd2_columns:\n required_merge_columns += self.scd2_columns\n\n if self.scd1_columns:\n required_merge_columns += self.scd1_columns\n\n if not all(c in self.df.columns for c in required_merge_columns):\n missing_columns = [c for c in required_merge_columns if c not in self.df.columns]\n raise TypeError(f\"The source DataFrame is missing the columns: {missing_columns!r}\")\n\n # Check that required columns are present in the source DataFrame\n if self.scd2_timestamp_col is not None:\n timestamp_col_type = self.df.select(self.scd2_timestamp_col).schema.fields[0].dataType\n\n if not isinstance(timestamp_col_type, (DateType, TimestampType)):\n raise TypeError(\n f\"The scd2_timestamp_col '{self.scd2_timestamp_col}' must be of date \"\n f\"or timestamp type.Current type is {timestamp_col_type}\"\n )\n\n # Prepare columns to process\n include_columns = self.include_columns if self.include_columns else self.df.columns\n exclude_columns = self.exclude_columns\n columns_to_process = [c for c in include_columns if c not in exclude_columns]\n\n # Constructing column names for SCD2 attributes\n meta_scd2_is_current_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_is_current_col_name}\"\n meta_scd2_effective_time_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_effective_time_col_name}\"\n meta_scd2_end_time_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_end_time_col_name}\"\n\n # Constructing system merge action logic\n system_merge_action = f\"CASE WHEN tgt.{self.merge_key} is NULL THEN 'I' \"\n\n if updates_attrs_scd2 := self._prepare_attr_clause(\n attrs=self.scd2_columns, src_alias=src_alias, dest_alias=dest_alias\n ):\n system_merge_action += f\" WHEN {updates_attrs_scd2} THEN 'UC' \"\n\n if updates_attrs_scd1 := self._prepare_attr_clause(\n attrs=self.scd1_columns, src_alias=src_alias, dest_alias=dest_alias\n ):\n system_merge_action += f\" WHEN {updates_attrs_scd1} THEN 'U' \"\n\n system_merge_action += \" ELSE NULL END\"\n\n # Prepare the staged DataFrame\n staged = (\n self.df.withColumn(\n \"__meta_scd2_timestamp\",\n self._scd2_timestamp(scd2_timestamp_col=self.scd2_timestamp_col, spark=self.spark),\n )\n .transform(\n func=self._prepare_staging,\n delta_table=delta_table,\n merge_action_logic=F.expr(system_merge_action),\n meta_scd2_is_current_col=meta_scd2_is_current_col,\n columns_to_process=columns_to_process,\n src_alias=src_alias,\n dest_alias=dest_alias,\n cross_alias=cross_alias,\n )\n .transform(\n func=self._preserve_existing_target_values,\n meta_scd2_struct_col_name=self.meta_scd2_struct_col_name,\n target_auto_generated_columns=self.target_auto_generated_columns,\n src_alias=src_alias,\n cross_alias=cross_alias,\n dest_alias=dest_alias,\n logger=self.log,\n )\n .withColumn(\"__meta_scd2_end_time\", 
self._scd2_end_time(meta_scd2_end_time_col=meta_scd2_end_time_col))\n .withColumn(\"__meta_scd2_is_current\", self._scd2_is_current())\n .withColumn(\n \"__meta_scd2_effective_time\",\n self._scd2_effective_time(meta_scd2_effective_time_col=meta_scd2_effective_time_col),\n )\n .transform(\n func=self._add_scd2_columns,\n meta_scd2_struct_col_name=self.meta_scd2_struct_col_name,\n meta_scd2_effective_time_col_name=self.meta_scd2_effective_time_col_name,\n meta_scd2_end_time_col_name=self.meta_scd2_end_time_col_name,\n meta_scd2_is_current_col_name=self.meta_scd2_is_current_col_name,\n )\n )\n\n self._prepare_merge_builder(\n delta_table=delta_table,\n dest_alias=dest_alias,\n staged=staged,\n merge_key=self.merge_key,\n columns_to_process=columns_to_process,\n meta_scd2_effective_time_col=meta_scd2_effective_time_col,\n ).execute()\n
"},{"location":"api_reference/spark/writers/delta/stream.html","title":"Stream","text":"This module defines the DeltaTableStreamWriter class, which is used to write streaming dataframes to Delta tables.
"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter","title":"koheesio.spark.writers.delta.stream.DeltaTableStreamWriter","text":"Delta table stream writer
"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter.Options","title":"Options","text":"Options for DeltaTableStreamWriter
"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter.Options.allow_population_by_field_name","title":"allow_population_by_field_name class-attribute
instance-attribute
","text":"allow_population_by_field_name: bool = Field(default=True, description=' To do convert to Field and pass as .options(**config)')\n
"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter.Options.maxBytesPerTrigger","title":"maxBytesPerTrigger class-attribute
instance-attribute
","text":"maxBytesPerTrigger: Optional[str] = Field(default=None, description='How much data to be processed per trigger. The default is 1GB')\n
"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter.Options.maxFilesPerTrigger","title":"maxFilesPerTrigger class-attribute
instance-attribute
","text":"maxFilesPerTrigger: int = Field(default == 1000, description='The maximum number of new files to be considered in every trigger (default: 1000).')\n
"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/writers/delta/stream.py
def execute(self):\n if self.batch_function:\n self.streaming_query = self.writer.start()\n else:\n self.streaming_query = self.writer.toTable(tableName=self.table.table_name)\n
"},{"location":"api_reference/spark/writers/delta/utils.html","title":"Utils","text":"This module provides utility functions while working with delta framework.
"},{"location":"api_reference/spark/writers/delta/utils.html#koheesio.spark.writers.delta.utils.log_clauses","title":"koheesio.spark.writers.delta.utils.log_clauses","text":"log_clauses(clauses: JavaObject, source_alias: str, target_alias: str) -> Optional[str]\n
Prepare log message for clauses of DeltaMergePlan statement.
Parameters:
Name Type Description Default clauses
JavaObject
The clauses of the DeltaMergePlan statement.
required source_alias
str
The source alias.
required target_alias
str
The target alias.
required Returns:
Type Description Optional[str]
The log message if there are clauses, otherwise None.
Notes This function prepares a log message for the clauses of a DeltaMergePlan statement. It iterates over the clauses, processes the conditions, and constructs the log message based on the clause type and columns.
If the condition is a value, it replaces the source and target aliases in the condition string. If the condition is None, it sets the condition_clause to \"No conditions required\".
The log message includes the clauses type, the clause type, the columns, and the condition.
Source code in src/koheesio/spark/writers/delta/utils.py
def log_clauses(clauses: JavaObject, source_alias: str, target_alias: str) -> Optional[str]:\n \"\"\"\n Prepare log message for clauses of DeltaMergePlan statement.\n\n Parameters\n ----------\n clauses : JavaObject\n The clauses of the DeltaMergePlan statement.\n source_alias : str\n The source alias.\n target_alias : str\n The target alias.\n\n Returns\n -------\n Optional[str]\n The log message if there are clauses, otherwise None.\n\n Notes\n -----\n This function prepares a log message for the clauses of a DeltaMergePlan statement. It iterates over the clauses,\n processes the conditions, and constructs the log message based on the clause type and columns.\n\n If the condition is a value, it replaces the source and target aliases in the condition string. If the condition is\n None, it sets the condition_clause to \"No conditions required\".\n\n The log message includes the clauses type, the clause type, the columns, and the condition.\n \"\"\"\n log_message = None\n\n if not clauses.isEmpty():\n clauses_type = clauses.last().nodeName().replace(\"DeltaMergeInto\", \"\")\n _processed_clauses = {}\n\n for i in range(0, clauses.length()):\n clause = clauses.apply(i)\n condition = clause.condition()\n\n if \"value\" in dir(condition):\n condition_clause = (\n condition.value()\n .toString()\n .replace(f\"'{source_alias}\", source_alias)\n .replace(f\"'{target_alias}\", target_alias)\n )\n elif condition.toString() == \"None\":\n condition_clause = \"No conditions required\"\n\n clause_type: str = clause.clauseType().capitalize()\n columns = \"ALL\" if clause_type == \"Delete\" else clause.actions().toList().apply(0).toString()\n\n if clause_type.lower() not in _processed_clauses:\n _processed_clauses[clause_type.lower()] = []\n\n log_message = (\n f\"{clauses_type} will perform action:{clause_type} columns ({columns}) if `{condition_clause}`\"\n )\n\n return log_message\n
"},{"location":"api_reference/sso/index.html","title":"Sso","text":""},{"location":"api_reference/sso/okta.html","title":"Okta","text":"This module contains Okta integration steps.
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.LoggerOktaTokenFilter","title":"koheesio.sso.okta.LoggerOktaTokenFilter","text":"LoggerOktaTokenFilter(okta_object: OktaAccessToken, name: str = 'OktaToken')\n
Filter which hides token value from log.
Source code in src/koheesio/sso/okta.py
def __init__(self, okta_object: OktaAccessToken, name: str = \"OktaToken\"):\n self.__okta_object = okta_object\n super().__init__(name=name)\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.LoggerOktaTokenFilter.filter","title":"filter","text":"filter(record)\n
Source code in src/koheesio/sso/okta.py
def filter(self, record):\n # noinspection PyUnresolvedReferences\n if token := self.__okta_object.output.token:\n token_value = token.get_secret_value()\n record.msg = record.msg.replace(token_value, \"<SECRET_TOKEN>\")\n\n return True\n
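OktaAccessToken attaches this filter to its own logger (see its __init__ further down); a sketch, assuming you also want to scrub the token from an application logger of your own:
```python
import logging

from koheesio.sso.okta import LoggerOktaTokenFilter, OktaAccessToken

okta = OktaAccessToken(url="https://org.okta.com", client_id="client", client_secret="secret")

app_logger = logging.getLogger("my_app")
app_logger.addFilter(LoggerOktaTokenFilter(okta_object=okta, name="OktaToken"))
# any record containing the token value is now rewritten to "<SECRET_TOKEN>"
```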
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.Okta","title":"koheesio.sso.okta.Okta","text":"Base Okta class
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.Okta.client_id","title":"client_id class-attribute
instance-attribute
","text":"client_id: str = Field(default=..., alias='okta_id', description='Okta account ID')\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.Okta.client_secret","title":"client_secret class-attribute
instance-attribute
","text":"client_secret: SecretStr = Field(default=..., alias='okta_secret', description='Okta account secret', repr=False)\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.Okta.data","title":"data class-attribute
instance-attribute
","text":"data: Optional[Union[Dict[str, str], str]] = Field(default={'grant_type': 'client_credentials'}, description='Data to be sent along with the token request')\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.OktaAccessToken","title":"koheesio.sso.okta.OktaAccessToken","text":"OktaAccessToken(**kwargs)\n
Get Okta authorization token
Example:
token = (\n OktaAccessToken(\n url=\"https://org.okta.com\",\n client_id=\"client\",\n client_secret=SecretStr(\"secret\"),\n params={\n \"p1\": \"foo\",\n \"p2\": \"bar\",\n },\n )\n .execute()\n .token\n)\n
Source code in src/koheesio/sso/okta.py
def __init__(self, **kwargs):\n _logger = LoggingFactory.get_logger(name=self.__class__.__name__, inherit_from_koheesio=True)\n logger_filter = LoggerOktaTokenFilter(okta_object=self)\n _logger.addFilter(logger_filter)\n super().__init__(**kwargs)\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.OktaAccessToken.Output","title":"Output","text":"Output class for OktaAccessToken.
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.OktaAccessToken.Output.token","title":"token class-attribute
instance-attribute
","text":"token: Optional[SecretStr] = Field(default=None, description='Okta authentication token')\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.OktaAccessToken.execute","title":"execute","text":"execute()\n
Execute an HTTP Post call to Okta service and retrieve the access token.
Source code in src/koheesio/sso/okta.py
def execute(self):\n \"\"\"\n Execute an HTTP Post call to Okta service and retrieve the access token.\n \"\"\"\n HttpPostStep.execute(self)\n\n # noinspection PyUnresolvedReferences\n status_code = self.output.status_code\n # noinspection PyUnresolvedReferences\n raw_payload = self.output.raw_payload\n\n if status_code != 200:\n raise HTTPError(f\"Request failed with '{status_code}' code. Payload: {raw_payload}\")\n\n # noinspection PyUnresolvedReferences\n json_payload = self.output.json_payload\n\n if token := json_payload.get(\"access_token\"):\n self.output.token = SecretStr(token)\n else:\n raise ValueError(f\"No 'access_token' found in the Okta response: {json_payload}\")\n
"},{"location":"api_reference/steps/index.html","title":"Steps","text":"Steps Module
This module contains the definition of the Step
class, which serves as the base class for custom units of logic that can be executed. It also includes the StepOutput
class, which defines the output data model for a Step
.
The Step
class is designed to be subclassed for creating new steps in a data pipeline. Each subclass should implement the execute
method, specifying the expected inputs and outputs.
This module also exports the SparkStep
class for steps that interact with Spark
Classes: - Step: Base class for a custom unit of logic that can be executed.
- StepOutput: Defines the output data model for a
Step
.
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step","title":"koheesio.steps.Step","text":"Base class for a step
A custom unit of logic that can be executed.
The Step class is designed to be subclassed. To create a new step, one would subclass Step and implement the def execute(self)
method, specifying the expected inputs and outputs.
Note: since the Step class is meta-classed, the execute method is wrapped with the do_execute
function, making it always return the Step's output. Hence, an explicit return is not needed when implementing execute.
Methods and Attributes The Step class has several attributes and methods.
Background A Step is an atomic operation and serves as the building block of data pipelines built with the framework. Tasks typically consist of a series of Steps.
A step can be seen as an operation on a set of inputs, that returns a set of outputs. This however does not imply that steps are stateless (e.g. data writes)!
The diagram serves to illustrate the concept of a Step:
\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 1 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 1 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u2502 \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 2 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Step \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 2 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u2502 \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 3 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 3 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n
Steps are built on top of Pydantic, which is a data validation and settings management using python type annotations. This allows for the automatic validation of the inputs and outputs of a Step.
- Step inherits from BaseModel, which is a Pydantic class used to define data models. This allows Step to automatically validate data against the defined fields and their types.
- Step is metaclassed by StepMetaClass, which is a custom metaclass that wraps the
execute
method of the Step class with the _execute_wrapper
function. This ensures that the execute
method always returns the output of the Step along with providing logging and validation of the output. - Step has an
Output
class, which is a subclass of StepOutput
. This class is used to validate the output of the Step. The Output
class is defined as an inner class of the Step class. The Output
class can be accessed through the Step.Output
attribute. - The
Output
class can be extended to add additional fields to the output of the Step.
Examples:
class MyStep(Step):\n a: str # input\n\n class Output(StepOutput): # output\n b: str\n\n def execute(self) -> MyStep.Output:\n self.output.b = f\"{self.a}-some-suffix\"\n
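And a short usage sketch of the step defined above (output values are illustrative):
```python
step = MyStep(a="foo")
step.execute()                 # returns MyStep.Output thanks to the metaclass wrapping
print(step.output.b)           # "foo-some-suffix"
print(step.to_yaml())          # YAML dump of the step
```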
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step--input","title":"INPUT","text":"The following fields are available by default on the Step class: - name
: Name of the Step. If not set, the name of the class will be used. - description
: Description of the Step. If not set, the docstring of the class will be used. If the docstring contains multiple lines, only the first line will be used.
When subclassing a Step, any additional pydantic field will be treated as input
to the Step. See also the explanation on the .execute()
method below.
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step--output","title":"OUTPUT","text":"Every Step has an Output
class, which is a subclass of StepOutput
. This class is used to validate the output of the Step. The Output
class is defined as an inner class of the Step class. The Output
class can be accessed through the Step.Output
attribute. The Output
class can be extended to add additional fields to the output of the Step. See also the explanation on the .execute()
method.
Output
: A nested class representing the output of the Step; it is used to validate the output and is based on the StepOutput class. output
: Allows you to interact with the Output of the Step lazily (see above and StepOutput)
When subclassing a Step, any additional pydantic field added to the nested Output
class will be treated as output
of the Step. See also the description of StepOutput
for more information.
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step--methods","title":"Methods:","text":" execute
: Abstract method to implement for new steps. - The Inputs of the step can be accessed using
self.input_name
. - The output of the step can be accessed using
self.output.output_name
.
run
: Alias to .execute() method. You can use this to run the step, but execute is preferred. to_yaml
: YAML dump the step get_description
: Get the description of the Step
When subclassing a Step, execute
is the only method that needs to be implemented. Any additional method added to the class will be treated as a method of the Step.
Note: since the Step class is meta-classed, the execute method is automatically wrapped with the do_execute
function making it always return a StepOutput. See also the explanation on the do_execute
function.
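To make this concrete, here is a hedged, self-contained sketch (an illustration only, not taken verbatim from the library documentation) that defines and runs a Step, showing where inputs, outputs, and the default name and description come from:
from koheesio.steps import Step, StepOutput\n\nclass MyStep(Step):\n    \"\"\"Adds a suffix to the input.\"\"\"  # the first docstring line becomes the default description\n\n    a: str  # input\n\n    class Output(StepOutput):  # output\n        b: str\n\n    def execute(self):\n        self.output.b = f\"{self.a}-some-suffix\"\n\nstep = MyStep(a=\"foo\")\noutput = step.execute()      # the wrapped execute always returns the (validated) Output\nprint(step.name)             # MyStep -- defaults to the class name\nprint(step.description)      # Adds a suffix to the input.\nprint(step.a)                # foo -- inputs are accessible as attributes\nprint(step.output.b)         # foo-some-suffix\nprint(output.b)              # foo-some-suffix\n
Calling run() instead of execute() would behave the same here, as run is simply an alias.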
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step--class-methods","title":"class methods:","text":" from_step
: Returns a new Step instance based on the data of another Step instance. for example: MyStep.from_step(other_step, a=\"foo\")
get_description
: Get the description of the Step
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step--dunder-methods","title":"dunder methods:","text":" __getattr__
: Allows input to be accessed through self.input_name
__repr__
and __str__
: String representation of a step
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.output","title":"output property
writable
","text":"output: Output\n
Interact with the output of the Step
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.Output","title":"Output","text":"Output class for Step
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.execute","title":"execute abstractmethod
","text":"execute()\n
Abstract method to implement for new steps.
The Inputs of the step can be accessed, using self.input_name
Note: since the Step class is meta-classed, the execute method is wrapped with the do_execute
function making it always return the Steps output
Source code in src/koheesio/steps/__init__.py
@abstractmethod\ndef execute(self):\n \"\"\"Abstract method to implement for new steps.\n\n The Inputs of the step can be accessed, using `self.input_name`\n\n Note: since the Step class is meta-classed, the execute method is wrapped with the `do_execute` function making\n it always return the Steps output\n \"\"\"\n raise NotImplementedError\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.from_step","title":"from_step classmethod
","text":"from_step(step: Step, **kwargs)\n
Returns a new Step instance based on the data of another Step or BaseModel instance
Source code in src/koheesio/steps/__init__.py
@classmethod\ndef from_step(cls, step: Step, **kwargs):\n \"\"\"Returns a new Step instance based on the data of another Step or BaseModel instance\"\"\"\n return cls.from_basemodel(step, **kwargs)\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.repr_json","title":"repr_json","text":"repr_json(simple=False) -> str\n
dump the step to json, meant for representation
Note: use to_json if you want to dump the step to json for serialization. This method is meant for representation purposes only!
Examples:
>>> step = MyStep(a=\"foo\")\n>>> print(step.repr_json())\n{\"input\": {\"a\": \"foo\"}}\n
Parameters:
Name Type Description Default simple
When toggled to True, a briefer output will be produced. This is friendlier for logging purposes
False
Returns:
Type Description str
A string, which is valid json
Source code in src/koheesio/steps/__init__.py
def repr_json(self, simple=False) -> str:\n \"\"\"dump the step to json, meant for representation\n\n Note: use to_json if you want to dump the step to json for serialization\n This method is meant for representation purposes only!\n\n Examples\n --------\n ```python\n >>> step = MyStep(a=\"foo\")\n >>> print(step.repr_json())\n {\"input\": {\"a\": \"foo\"}}\n ```\n\n Parameters\n ----------\n simple: bool\n When toggled to True, a briefer output will be produced. This is friendlier for logging purposes\n\n Returns\n -------\n str\n A string, which is valid json\n \"\"\"\n model_dump_options = dict(warnings=\"none\", exclude_unset=True)\n\n _result = {}\n\n # extract input\n _input = self.model_dump(**model_dump_options)\n\n # remove name and description from input and add to result if simple is not set\n name = _input.pop(\"name\", None)\n description = _input.pop(\"description\", None)\n if not simple:\n if name:\n _result[\"name\"] = name\n if description:\n _result[\"description\"] = description\n else:\n model_dump_options[\"exclude\"] = {\"name\", \"description\"}\n\n # extract output\n _output = self.output.model_dump(**model_dump_options)\n\n # add output to result\n if _output:\n _result[\"output\"] = _output\n\n # add input to result\n _result[\"input\"] = _input\n\n class MyEncoder(json.JSONEncoder):\n \"\"\"Custom JSON Encoder to handle non-serializable types\"\"\"\n\n def default(self, o: Any) -> Any:\n try:\n return super().default(o)\n except TypeError:\n return o.__class__.__name__\n\n # Use MyEncoder when converting the dictionary to a JSON string\n json_str = json.dumps(_result, cls=MyEncoder)\n\n return json_str\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.repr_yaml","title":"repr_yaml","text":"repr_yaml(simple=False) -> str\n
dump the step to yaml, meant for representation
Note: use to_yaml if you want to dump the step to yaml for serialization. This method is meant for representation purposes only!
Examples:
>>> step = MyStep(a=\"foo\")\n>>> print(step.repr_yaml())\ninput:\n a: foo\n
Parameters:
Name Type Description Default simple
When toggled to True, a briefer output will be produced. This is friendlier for logging purposes
False
Returns:
Type Description str
A string, which is valid yaml
Source code in src/koheesio/steps/__init__.py
def repr_yaml(self, simple=False) -> str:\n \"\"\"dump the step to yaml, meant for representation\n\n Note: use to_yaml if you want to dump the step to yaml for serialization\n This method is meant for representation purposes only!\n\n Examples\n --------\n ```python\n >>> step = MyStep(a=\"foo\")\n >>> print(step.repr_yaml())\n input:\n a: foo\n ```\n\n Parameters\n ----------\n simple: bool\n When toggled to True, a briefer output will be produced. This is friendlier for logging purposes\n\n Returns\n -------\n str\n A string, which is valid yaml\n \"\"\"\n json_str = self.repr_json(simple=simple)\n\n # Parse the JSON string back into a dictionary\n _result = json.loads(json_str)\n\n return yaml.dump(_result)\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.run","title":"run","text":"run()\n
Alias to .execute()
Source code in src/koheesio/steps/__init__.py
def run(self):\n \"\"\"Alias to .execute()\"\"\"\n return self.execute()\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.StepMetaClass","title":"koheesio.steps.StepMetaClass","text":"StepMetaClass has to be set up as a Metaclass extending ModelMetaclass to allow Pydantic to be unaffected while allowing for the execute method to be auto-decorated with do_execute
"},{"location":"api_reference/steps/index.html#koheesio.steps.StepOutput","title":"koheesio.steps.StepOutput","text":"Class for the StepOutput model
Usage Setting up your own StepOutput class is done like this:
class YourOwnOutput(StepOutput):\n a: str\n b: int\n
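A hedged usage sketch of the class above (assuming plain pydantic keyword construction; the values are placeholders):
out = YourOwnOutput(a=\"foo\", b=42)  # hedged sketch: assumes regular pydantic keyword construction\nout.validate_output()                # wraps BaseModel validation, see validate_output below\nprint(out.a, out.b)                  # foo 42\n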
"},{"location":"api_reference/steps/index.html#koheesio.steps.StepOutput.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(validate_default=False, defer_build=True)\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.StepOutput.validate_output","title":"validate_output","text":"validate_output() -> StepOutput\n
Validate the output of the Step
Essentially, this method is a wrapper around the validate method of the BaseModel class
Source code in src/koheesio/steps/__init__.py
def validate_output(self) -> StepOutput:\n \"\"\"Validate the output of the Step\n\n Essentially, this method is a wrapper around the validate method of the BaseModel class\n \"\"\"\n validated_model = self.validate()\n return StepOutput.from_basemodel(validated_model)\n
"},{"location":"api_reference/steps/dummy.html","title":"Dummy","text":"Dummy step for testing purposes.
This module contains a dummy step for testing purposes. It is used to test the Koheesio framework or to provide a simple example of how to create a new step.
Example s = DummyStep(a=\"a\", b=2)\ns.execute()\n
In this case, s.output
will be equivalent to the following dictionary: {\"a\": \"a\", \"b\": 2, \"c\": \"aa\"}\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyOutput","title":"koheesio.steps.dummy.DummyOutput","text":"Dummy output for testing purposes.
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyOutput.a","title":"a instance-attribute
","text":"a: str\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyOutput.b","title":"b instance-attribute
","text":"b: int\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep","title":"koheesio.steps.dummy.DummyStep","text":"Dummy step for testing purposes.
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep.a","title":"a instance-attribute
","text":"a: str\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep.b","title":"b instance-attribute
","text":"b: int\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep.Output","title":"Output","text":"Dummy output for testing purposes.
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep.Output.c","title":"c instance-attribute
","text":"c: str\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep.execute","title":"execute","text":"execute()\n
Dummy execute for testing purposes.
Source code in src/koheesio/steps/dummy.py
def execute(self):\n \"\"\"Dummy execute for testing purposes.\"\"\"\n self.output.a = self.a\n self.output.b = self.b\n self.output.c = self.a * self.b\n
"},{"location":"api_reference/steps/http.html","title":"Http","text":"This module contains a few simple HTTP Steps that can be used to perform API Calls to HTTP endpoints
Example from koheesio.steps.http import HttpGetStep\n\nresponse = HttpGetStep(url=\"https://google.com\").execute().json_payload\n
In the above example, the response
variable will contain the JSON response from the HTTP request.
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpDeleteStep","title":"koheesio.steps.http.HttpDeleteStep","text":"send DELETE requests
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpDeleteStep.method","title":"method class-attribute
instance-attribute
","text":"method: HttpMethod = DELETE\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpGetStep","title":"koheesio.steps.http.HttpGetStep","text":"send GET requests
Example response = HttpGetStep(url=\"https://google.com\").execute().json_payload\n
In the above example, the response
variable will contain the JSON response from the HTTP request."},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpGetStep.method","title":"method class-attribute
instance-attribute
","text":"method: HttpMethod = GET\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod","title":"koheesio.steps.http.HttpMethod","text":"Enumeration of allowed http methods
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod.DELETE","title":"DELETE class-attribute
instance-attribute
","text":"DELETE = 'delete'\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod.GET","title":"GET class-attribute
instance-attribute
","text":"GET = 'get'\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod.POST","title":"POST class-attribute
instance-attribute
","text":"POST = 'post'\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod.PUT","title":"PUT class-attribute
instance-attribute
","text":"PUT = 'put'\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod.from_string","title":"from_string classmethod
","text":"from_string(value: str)\n
Allows for getting the right Method Enum by simply passing a string value This method is not case-sensitive
Source code in src/koheesio/steps/http.py
@classmethod\ndef from_string(cls, value: str):\n \"\"\"Allows for getting the right Method Enum by simply passing a string value\n This method is not case-sensitive\n \"\"\"\n return getattr(cls, value.upper())\n
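For illustration, the lookup is case-insensitive, so the following calls (a small sketch, not from the source docs) resolve to the same member:
from koheesio.steps.http import HttpMethod\n\nassert HttpMethod.from_string(\"get\") == HttpMethod.GET\nassert HttpMethod.from_string(\"GET\") == HttpMethod.GET\n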
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpPostStep","title":"koheesio.steps.http.HttpPostStep","text":"send POST requests
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpPostStep.method","title":"method class-attribute
instance-attribute
","text":"method: HttpMethod = POST\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpPutStep","title":"koheesio.steps.http.HttpPutStep","text":"send PUT requests
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpPutStep.method","title":"method class-attribute
instance-attribute
","text":"method: HttpMethod = PUT\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep","title":"koheesio.steps.http.HttpStep","text":"Can be used to perform API Calls to HTTP endpoints
Understanding Retries This class includes a built-in retry mechanism for handling temporary issues, such as network errors or server downtime, that might cause the HTTP request to fail. The retry mechanism is controlled by three parameters: max_retries
, initial_delay
, and backoff
.
-
max_retries
determines the number of retries after the initial request. For example, if max_retries
is set to 4, the request will be attempted a total of 5 times (1 initial attempt + 4 retries). If max_retries
is set to 0, no retries will be attempted, and the request will be tried only once.
-
initial_delay
sets the waiting period before the first retry. If initial_delay
is set to 3, the delay before the first retry will be 3 seconds. Changing the initial_delay
value directly affects the amount of delay before each retry.
-
backoff
controls the rate at which the delay increases for each subsequent retry. If backoff
is set to 2 (the default), the delay will double with each retry. If backoff
is set to 1, the delay between retries will remain constant. Changing the backoff
value affects how quickly the delay increases.
Given the default values of max_retries=3
, initial_delay=2
, and backoff=2
, the delays between retries would be 2 seconds, 4 seconds, and 8 seconds, respectively. This results in a total delay of 14 seconds before all retries are exhausted.
For example, if you set initial_delay=3
and backoff=2
, the delays before the retries would be 3 seconds
, 6 seconds
, and 12 seconds
. If you set initial_delay=2
and backoff=3
, the delays before the retries would be 2 seconds
, 6 seconds
, and 18 seconds
. If you set initial_delay=2
and backoff=1
, the delays before the retries would be 2 seconds
, 2 seconds
, and 2 seconds
.
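To make the arithmetic concrete, here is a small standalone sketch (plain Python, not part of the library) that reproduces the delay schedules described above:
def retry_delays(max_retries: int = 3, initial_delay: int = 2, backoff: int = 2) -> list:\n    # the delay before retry i is initial_delay * (backoff ** i)\n    return [initial_delay * backoff**i for i in range(max_retries)]\n\nprint(retry_delays())                  # [2, 4, 8]  -> 14 seconds in total\nprint(retry_delays(initial_delay=3))   # [3, 6, 12]\nprint(retry_delays(backoff=1))         # [2, 2, 2]\n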
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.data","title":"data class-attribute
instance-attribute
","text":"data: Optional[Union[Dict[str, str], str]] = Field(default_factory=dict, description='[Optional] Data to be sent along with the request', alias='body')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.headers","title":"headers class-attribute
instance-attribute
","text":"headers: Optional[Dict[str, Union[str, SecretStr]]] = Field(default_factory=dict, description='Request headers', alias='header')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.method","title":"method class-attribute
instance-attribute
","text":"method: Union[str, HttpMethod] = Field(default=GET, description=\"What type of Http call to perform. One of 'get', 'post', 'put', 'delete'. Defaults to 'get'.\")\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.params","title":"params class-attribute
instance-attribute
","text":"params: Optional[Dict[str, Any]] = Field(default_factory=dict, description='[Optional] Set of extra parameters that should be passed to HTTP request')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.session","title":"session class-attribute
instance-attribute
","text":"session: Session = Field(default_factory=Session, description='Requests session object to be used for making HTTP requests', exclude=True, repr=False)\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.timeout","title":"timeout class-attribute
instance-attribute
","text":"timeout: Optional[int] = Field(default=3, description='[Optional] Request timeout')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.url","title":"url class-attribute
instance-attribute
","text":"url: str = Field(default=..., description='API endpoint URL', alias='uri')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output","title":"Output","text":"Output class for HttpStep
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output.json_payload","title":"json_payload property
","text":"json_payload\n
Alias for response_json
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output.raw_payload","title":"raw_payload class-attribute
instance-attribute
","text":"raw_payload: Optional[str] = Field(default=None, alias='response_text', description='The raw response for the request')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output.response_json","title":"response_json class-attribute
instance-attribute
","text":"response_json: Optional[Union[Dict, List]] = Field(default=None, alias='json_payload', description='The JSON response for the request')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output.response_raw","title":"response_raw class-attribute
instance-attribute
","text":"response_raw: Optional[Response] = Field(default=None, alias='response', description='The raw requests.Response object returned by the appropriate requests.request() call')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output.status_code","title":"status_code class-attribute
instance-attribute
","text":"status_code: Optional[int] = Field(default=None, description='The status return code of the request')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.decode_sensitive_headers","title":"decode_sensitive_headers","text":"decode_sensitive_headers(headers)\n
Authorization headers are being converted into SecretStr under the hood to avoid dumping any sensitive content into logs by the encode_sensitive_headers
method.
However, when calling the get_headers
method, the SecretStr should be converted back to string, otherwise sensitive info would have looked like '**********'.
This method decodes values of the headers
dictionary that are of type SecretStr into plain text.
Source code in src/koheesio/steps/http.py
@field_serializer(\"headers\", when_used=\"json\")\ndef decode_sensitive_headers(self, headers):\n \"\"\"\n Authorization headers are being converted into SecretStr under the hood to avoid dumping any\n sensitive content into logs by the `encode_sensitive_headers` method.\n\n However, when calling the `get_headers` method, the SecretStr should be converted back to\n string, otherwise sensitive info would have looked like '**********'.\n\n This method decodes values of the `headers` dictionary that are of type SecretStr into plain text.\n \"\"\"\n for k, v in headers.items():\n headers[k] = v.get_secret_value() if isinstance(v, SecretStr) else v\n return headers\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.delete","title":"delete","text":"delete() -> Response\n
Execute an HTTP DELETE call
Source code in src/koheesio/steps/http.py
def delete(self) -> requests.Response:\n \"\"\"Execute an HTTP DELETE call\"\"\"\n self.method = HttpMethod.DELETE\n return self.request()\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.encode_sensitive_headers","title":"encode_sensitive_headers","text":"encode_sensitive_headers(headers)\n
Encode potentially sensitive data into pydantic.SecretStr class to prevent them being displayed as plain text in logs.
Source code in src/koheesio/steps/http.py
@field_validator(\"headers\", mode=\"before\")\ndef encode_sensitive_headers(cls, headers):\n \"\"\"\n Encode potentially sensitive data into pydantic.SecretStr class to prevent them\n being displayed as plain text in logs.\n \"\"\"\n if auth := headers.get(\"Authorization\"):\n headers[\"Authorization\"] = auth if isinstance(auth, SecretStr) else SecretStr(auth)\n return headers\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.execute","title":"execute","text":"execute() -> Output\n
Executes the HTTP request.
This method simply calls self.request()
, which includes the retry logic. If self.request()
raises an exception, it will be propagated to the caller of this method.
Raises:
Type Description (RequestException, HTTPError)
The last exception that was caught if self.request()
fails after self.max_retries
attempts.
Source code in src/koheesio/steps/http.py
def execute(self) -> Output:\n \"\"\"\n Executes the HTTP request.\n\n This method simply calls `self.request()`, which includes the retry logic. If `self.request()` raises an\n exception, it will be propagated to the caller of this method.\n\n Raises\n ------\n requests.RequestException, requests.HTTPError\n The last exception that was caught if `self.request()` fails after `self.max_retries` attempts.\n \"\"\"\n self.request()\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.get","title":"get","text":"get() -> Response\n
Execute an HTTP GET call
Source code in src/koheesio/steps/http.py
def get(self) -> requests.Response:\n \"\"\"Execute an HTTP GET call\"\"\"\n self.method = HttpMethod.GET\n return self.request()\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.get_headers","title":"get_headers","text":"get_headers()\n
Dump headers into JSON without SecretStr masking.
Source code in src/koheesio/steps/http.py
def get_headers(self):\n \"\"\"\n Dump headers into JSON without SecretStr masking.\n \"\"\"\n return json.loads(self.model_dump_json()).get(\"headers\")\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.get_options","title":"get_options","text":"get_options()\n
options to be passed to requests.request()
Source code in src/koheesio/steps/http.py
def get_options(self):\n \"\"\"options to be passed to requests.request()\"\"\"\n return {\n \"url\": self.url,\n \"headers\": self.get_headers(),\n \"data\": self.data,\n \"timeout\": self.timeout,\n **self.params, # type: ignore\n }\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.get_proper_http_method_from_str_value","title":"get_proper_http_method_from_str_value","text":"get_proper_http_method_from_str_value(method_value)\n
Converts string value to HttpMethod enum value
Source code in src/koheesio/steps/http.py
@field_validator(\"method\")\ndef get_proper_http_method_from_str_value(cls, method_value):\n \"\"\"Converts string value to HttpMethod enum value\"\"\"\n if isinstance(method_value, str):\n try:\n method_value = HttpMethod.from_string(method_value)\n except AttributeError as e:\n raise AttributeError(\n \"Only values from HttpMethod class are allowed! \"\n f\"Provided value: '{method_value}', allowed values: {', '.join(HttpMethod.__members__.keys())}\"\n ) from e\n\n return method_value\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.post","title":"post","text":"post() -> Response\n
Execute an HTTP POST call
Source code in src/koheesio/steps/http.py
def post(self) -> requests.Response:\n \"\"\"Execute an HTTP POST call\"\"\"\n self.method = HttpMethod.POST\n return self.request()\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.put","title":"put","text":"put() -> Response\n
Execute an HTTP PUT call
Source code in src/koheesio/steps/http.py
def put(self) -> requests.Response:\n \"\"\"Execute an HTTP PUT call\"\"\"\n self.method = HttpMethod.PUT\n return self.request()\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.request","title":"request","text":"request(method: Optional[HttpMethod] = None) -> Response\n
Executes the HTTP request with retry logic.
Actual http_method execution is abstracted into this method to avoid unnecessary code duplication. It also allows logging, output setting, and validation to be handled centrally.
This method will try to execute requests.request
up to self.max_retries
times. If self.request()
raises an exception, it logs a warning message and the error message, then waits for self.initial_delay * (self.backoff ** i)
seconds before retrying. The delay increases exponentially after each failed attempt due to the self.backoff ** i
term.
If self.request()
still fails after self.max_retries
attempts, it logs an error message and re-raises the last exception that was caught.
This is a good way to handle temporary issues that might cause self.request()
to fail, such as network errors or server downtime. The exponential backoff ensures that you're not constantly bombarding a server with requests if it's struggling to respond.
Parameters:
Name Type Description Default method
HttpMethod
Optional parameter that allows calls to different HTTP methods and bypassing class level method
parameter.
None
Raises:
Type Description (RequestException, HTTPError)
The last exception that was caught if requests.request()
fails after self.max_retries
attempts.
Source code in src/koheesio/steps/http.py
def request(self, method: Optional[HttpMethod] = None) -> requests.Response:\n \"\"\"\n Executes the HTTP request with retry logic.\n\n Actual http_method execution is abstracted into this method.\n This is to avoid unnecessary code duplication. Allows to centrally log, set outputs, and validated.\n\n This method will try to execute `requests.request` up to `self.max_retries` times. If `self.request()` raises\n an exception, it logs a warning message and the error message, then waits for\n `self.initial_delay * (self.backoff ** i)` seconds before retrying. The delay increases exponentially\n after each failed attempt due to the `self.backoff ** i` term.\n\n If `self.request()` still fails after `self.max_retries` attempts, it logs an error message and re-raises the\n last exception that was caught.\n\n This is a good way to handle temporary issues that might cause `self.request()` to fail, such as network errors\n or server downtime. The exponential backoff ensures that you're not constantly bombarding a server with\n requests if it's struggling to respond.\n\n Parameters\n ----------\n method : HttpMethod\n Optional parameter that allows calls to different HTTP methods and bypassing class level `method`\n parameter.\n\n Raises\n ------\n requests.RequestException, requests.HTTPError\n The last exception that was caught if `requests.request()` fails after `self.max_retries` attempts.\n \"\"\"\n _method = (method or self.method).value.upper()\n options = self.get_options()\n\n self.log.debug(f\"Making {_method} request to {options['url']} with headers {options['headers']}\")\n\n response = self.session.request(method=_method, **options)\n response.raise_for_status()\n\n self.log.debug(f\"Received response with status code {response.status_code} and body {response.text}\")\n self.set_outputs(response)\n\n return response\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.set_outputs","title":"set_outputs","text":"set_outputs(response)\n
Types of response output
Source code in src/koheesio/steps/http.py
def set_outputs(self, response):\n \"\"\"\n Types of response output\n \"\"\"\n self.output.response_raw = response\n self.output.raw_payload = response.text\n self.output.status_code = response.status_code\n\n # Only decode non empty payloads to avoid triggering decoding error unnecessarily.\n if self.output.raw_payload:\n try:\n self.output.response_json = response.json()\n\n except json.decoder.JSONDecodeError as e:\n self.log.info(f\"An error occurred while processing the JSON payload. Error message:\\n{e.msg}\")\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep","title":"koheesio.steps.http.PaginatedHtppGetStep","text":"Represents a paginated HTTP GET step.
Parameters:
Name Type Description Default paginate
bool
Whether to paginate the API response. Defaults to False.
required pages
int
Number of pages to paginate. Defaults to 1.
required offset
int
Offset for paginated API calls. Offset determines the starting page. Defaults to 1.
required limit
int
Limit for paginated API calls. Defaults to 100.
required"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.limit","title":"limit class-attribute
instance-attribute
","text":"limit: Optional[int] = Field(default=100, description='Limit for paginated API calls. The url should (optionally) contain a named limit parameter, for example: api.example.com/data?limit={limit}')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.offset","title":"offset class-attribute
instance-attribute
","text":"offset: Optional[int] = Field(default=1, description=\"Offset for paginated API calls. Offset determines the starting page. Defaults to 1. The url can (optionally) contain a named 'offset' parameter, for example: api.example.com/data?offset={offset}\")\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.pages","title":"pages class-attribute
instance-attribute
","text":"pages: Optional[int] = Field(default=1, description='Number of pages to paginate. Defaults to 1')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.paginate","title":"paginate class-attribute
instance-attribute
","text":"paginate: Optional[bool] = Field(default=False, description=\"Whether to paginate the API response. Defaults to False. When set to True, the API response will be paginated. The url should contain a named 'page' parameter for example: api.example.com/data?page={page}\")\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.execute","title":"execute","text":"execute() -> Output\n
Executes the HTTP GET request and handles pagination.
Returns:
Type Description Output
The output of the HTTP GET request.
Source code in src/koheesio/steps/http.py
def execute(self) -> HttpGetStep.Output:\n \"\"\"\n Executes the HTTP GET request and handles pagination.\n\n Returns\n -------\n HttpGetStep.Output\n The output of the HTTP GET request.\n \"\"\"\n # Set up pagination parameters\n offset, pages = (self.offset, self.pages + 1) if self.paginate else (1, 1) # type: ignore\n data = []\n _basic_url = self.url\n\n for page in range(offset, pages):\n if self.paginate:\n self.log.info(f\"Fetching page {page} of {pages - 1}\")\n\n self.url = self._url(basic_url=_basic_url, page=page)\n self.request()\n\n if isinstance(self.output.response_json, list):\n data += self.output.response_json\n else:\n data.append(self.output.response_json)\n\n self.url = _basic_url\n self.output.response_json = data\n self.output.response_raw = None\n self.output.raw_payload = None\n self.output.status_code = None\n
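A hedged usage sketch (the endpoint is a placeholder; note the named page parameter in the URL, as described above):
from koheesio.steps.http import PaginatedHtppGetStep\n\nstep = PaginatedHtppGetStep(\n    url=\"https://api.example.com/data?page={page}\",  # placeholder endpoint with a named 'page' parameter\n    paginate=True,\n    pages=3,\n)\noutput = step.execute()\nprint(output.response_json)  # combined results of pages 1 through 3\n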
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.get_options","title":"get_options","text":"get_options()\n
Returns the options to be passed to the requests.request() function.
Returns:
Type Description dict
The options.
Source code in src/koheesio/steps/http.py
def get_options(self):\n \"\"\"\n Returns the options to be passed to the requests.request() function.\n\n Returns\n -------\n dict\n The options.\n \"\"\"\n options = {\n \"url\": self.url,\n \"headers\": self.get_headers(),\n \"data\": self.data,\n \"timeout\": self.timeout,\n **self._adjust_params(), # type: ignore\n }\n\n return options\n
"},{"location":"community/approach-documentation.html","title":"Approach documentation","text":"","tags":["doctype/explanation"]},{"location":"community/approach-documentation.html#scope","title":"Scope","text":"","tags":["doctype/explanation"]},{"location":"community/approach-documentation.html#the-system","title":"The System","text":"We will be adopting \"The Documentation System\".
From documentation.divio.com:
There is a secret that needs to be understood in order to write good software documentation: there isn\u2019t one thing called documentation, there are four.
They are: tutorials, how-to guides, technical reference and explanation. They represent four different purposes or functions, and require four different approaches to their creation. Understanding the implications of this will help improve most documentation - often immensely.
About the system The documentation system outlined here is a simple, comprehensive and nearly universally-applicable scheme. It is proven in practice across a wide variety of fields and applications.
There are some very simple principles that govern documentation that are very rarely if ever spelled out. They seem to be a secret, though they shouldn\u2019t be.
If you can put these principles into practice, it will make your documentation better and your project, product or team more successful - that\u2019s a promise.
The system is widely adopted for large and small, open and proprietary documentation projects.
Video Presentation on YouTube:
","tags":["doctype/explanation"]},{"location":"community/contribute.html","title":"Contribute","text":""},{"location":"community/contribute.html#how-to-contribute","title":"How to contribute","text":"There are a few guidelines that we need contributors to follow so that we are able to process requests as efficiently as possible. If you have any questions or concerns please feel free to contact us at opensource@nike.com.
"},{"location":"community/contribute.html#getting-started","title":"Getting Started","text":" - Review our Code of Conduct
- Make sure you have a GitHub account
- Submit a ticket for your issue, assuming one does not already exist.
- Clearly describe the issue including steps to reproduce when it is a bug.
- Make sure you fill in the earliest version that you know has the issue.
- Fork the repository on GitHub
"},{"location":"community/contribute.html#making-changes","title":"Making Changes","text":" - Create a feature branch off of
main
before you start your work. - Please avoid working directly on the
main
branch.
- Setup the required package manager hatch
- Setup the dev environment see below
- Make commits of logical units.
- You may be asked to squash unnecessary commits down to logical units.
- Check for unnecessary whitespace with
git diff --check
before committing. - Write meaningful, descriptive commit messages.
- Please follow existing code conventions when working on a file
- Make sure to check the standards on the code, see below
- Make sure to test the code before you push changes see below
"},{"location":"community/contribute.html#submitting-changes","title":"\ud83e\udd1d Submitting Changes","text":" - Push your changes to a topic branch in your fork of the repository.
- Submit a pull request to the repository in the Nike-Inc organization.
- After feedback has been given we expect responses within two weeks. After two weeks we may close the pull request if it isn't showing any activity.
- Bug fixes or features that lack appropriate tests may not be considered for merge.
- Changes that lower test coverage may not be considered for merge.
"},{"location":"community/contribute.html#make-commands","title":"\ud83d\udd28 Make commands","text":"We use make
for managing different steps of setup and maintenance in the project. You can install make by following the instructions here
For a full list of available make commands, you can run:
make help\n
"},{"location":"community/contribute.html#package-manager","title":"\ud83d\udce6 Package manager","text":"We use hatch
as our package manager.
Note: Please DO NOT use pip or conda to install the dependencies. Instead, use hatch.
To install hatch, run the following command:
make init\n
or,
make hatch-install\n
This will install hatch using brew if you are on a Mac.
If you are on a different OS, you can follow the instructions here
"},{"location":"community/contribute.html#dev-environment-setup","title":"\ud83d\udccc Dev Environment Setup","text":"To ensure our standards, make sure to install the required packages.
make dev\n
This will install all the required packages for development in the project under the .venv
directory. Use this virtual environment to run the code and tests during local development.
"},{"location":"community/contribute.html#linting-and-standards","title":"\ud83e\uddf9 Linting and Standards","text":"We use ruff
, pylint
, isort
, black
and mypy
to maintain standards in the codebase.
Run the following two commands to check the codebase for any issues:
make check\n
This will run all the checks including pylint and mypy. make fmt\n
This will format the codebase using black, isort, and ruff. Make sure that the linters and formatters do not report any errors or warnings before submitting a pull request.
"},{"location":"community/contribute.html#testing","title":"\ud83e\uddea Testing","text":"We use pytest
to test our code.
You can run the tests by running one of the following commands:
make cov # to run the tests and check the coverage\nmake all-tests # to run all the tests\nmake spark-tests # to run the spark tests\nmake non-spark-tests # to run the non-spark tests\n
Make sure that all tests pass and that you have adequate coverage before submitting a pull request.
"},{"location":"community/contribute.html#additional-resources","title":"Additional Resources","text":" - General GitHub documentation
- GitHub pull request documentation
- Nike's Code of Conduct
- Nike's Individual Contributor License Agreement
- Nike OSS
"},{"location":"includes/glossary.html","title":"Glossary","text":""},{"location":"includes/glossary.html#pydantic","title":"Pydantic","text":"Pydantic is a Python library for data validation and settings management using Python type annotations. It allows Koheesio to bring in strong typing and a high level of type safety. Essentially, it allows Koheesio to consider configurations of a pipeline (i.e. the settings used inside Steps, Tasks, etc.) as data that can be validated and structured.
"},{"location":"includes/glossary.html#pyspark","title":"PySpark","text":"PySpark is a Python library for Apache Spark, a powerful open-source data processing engine. It allows Koheesio to handle large-scale data processing tasks efficiently.
"},{"location":"misc/info.html","title":"Info","text":"{{ macros_info() }}
"},{"location":"reference/concepts/concepts.html","title":"Concepts","text":"The framework architecture is built from a set of core components. Each of the implementations that the framework provides out of the box, can be swapped out for custom implementations as long as they match the API.
The core components are the following:
Note: click on the 'Concept' to take you to the corresponding module. The module documentation will have greater detail on the specifics of the implementation
"},{"location":"reference/concepts/concepts.html#step","title":"Step","text":"A custom unit of logic that can be executed. A Step is an atomic operation and serves as the building block of data pipelines built with the framework. A step can be seen as an operation on a set of inputs, and returns a set of outputs. This does not imply that steps are stateless (e.g. data writes)! This concept is visualized in the figure below.
\nflowchart LR\n\n%% Should render like this\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 1 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 1 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n%% \u2502 \u2502\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 2 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Step \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 2 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n%% \u2502 \u2502\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 3 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 3 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n\n%% is for increasing the box size without having to mess with CSS settings\nStep[\"\n \n \n \nStep\n \n \n \n\"]\n\nI1[\"Input 1\"] ---> Step\nI2[\"Input 2\"] ---> Step\nI3[\"Input 3\"] ---> Step\n\nStep ---> O1[\"Output 1\"]\nStep ---> O2[\"Output 2\"]\nStep ---> O3[\"Output 3\"]\n
Step is the core abstraction of the framework. Meaning, that it is the core building block of the framework and is used to define all the operations that can be executed.
Please see the Step documentation for more details.
"},{"location":"reference/concepts/concepts.html#task","title":"Task","text":"The unit of work of one execution of the framework.
An execution usually consists of an Extract - Transform - Load
approach of one data object. Tasks typically consist of a series of Steps.
Please see the Task documentation for more details.
"},{"location":"reference/concepts/concepts.html#context","title":"Context","text":"The Context is used to configure the environment where a Task or Step runs.
It is often based on configuration files and can be used to adapt behaviour of a Task or Step based on the environment it runs in.
Please see the Context documentation for more details.
"},{"location":"reference/concepts/concepts.html#logger","title":"logger","text":"A logger object to log messages with different levels.
Please see the Logging documentation for more details.
The interactions between the base concepts of the model is visible in the below diagram:
---\ntitle: Koheesio Class Diagram\n---\nclassDiagram\n Step .. Task\n Step .. Transformation\n Step .. Reader\n Step .. Writer\n\n class Context\n\n class LoggingFactory\n\n class Task{\n <<abstract>>\n + List~Step~ steps\n ...\n + execute() Output\n }\n\n class Step{\n <<abstract>>\n ...\n Output: ...\n + execute() Output\n }\n\n class Transformation{\n <<abstract>>\n + df: DataFrame\n ...\n Output:\n + df: DataFrame\n + transform(df: DataFrame) DataFrame\n }\n\n class Reader{\n <<abstract>>\n ...\n Output:\n + df: DataFrame\n + read() DataFrame\n }\n\n class Writer{\n <<abstract>>\n + df: DataFrame\n ...\n + write(df: DataFrame)\n }
"},{"location":"reference/concepts/context.html","title":"Context in Koheesio","text":"In the Koheesio framework, the Context
class plays a pivotal role. It serves as a flexible and powerful tool for managing configuration data and shared variables across tasks and steps in your application.
Context
behaves much like a Python dictionary, but with additional features that enhance its usability and flexibility. It allows you to store and retrieve values, including complex Python objects, with ease. You can access these values using dictionary-like methods or as class attributes, providing a simple and intuitive interface.
Moreover, Context
supports nested keys and recursive merging of contexts, making it a versatile tool for managing complex configurations. It also provides serialization and deserialization capabilities, allowing you to easily save and load configurations in JSON, YAML, or TOML formats.
Whether you're setting up the environment for a Task or Step, or managing variables shared across multiple tasks, Context
provides a robust and efficient solution.
This document will guide you through its key features and show you how to leverage its capabilities in your Koheesio applications.
"},{"location":"reference/concepts/context.html#api-reference","title":"API Reference","text":"See API Reference for a detailed description of the Context
class and its methods.
"},{"location":"reference/concepts/context.html#key-features","title":"Key Features","text":" -
Accessing Values: Context
simplifies accessing configuration values. You can access them using dictionary-like methods or as class attributes. This allows for a more intuitive interaction with the Context
object. For example:
context = Context({\"bronze_table\": \"catalog.schema.table_name\"})\nprint(context.bronze_table) # Outputs: catalog.schema.table_name\n
-
Nested Keys: Context
supports nested keys, allowing you to access and add nested keys in a straightforward way. This is useful when dealing with complex configurations that require a hierarchical structure. For example:
context = Context({\"bronze\": {\"table\": \"catalog.schema.table_name\"}})\nprint(context.bronze.table) # Outputs: catalog.schema.table_name\n
-
Merging Contexts: You can merge two Contexts
together, with the incoming Context
having priority. Recursive merging is also supported. This is particularly useful when you want to update a Context
with new data without losing the existing values. For example:
context1 = Context({\"bronze_table\": \"catalog.schema.table_name\"})\ncontext2 = Context({\"silver_table\": \"catalog.schema.table_name\"})\ncontext1.merge(context2)\nprint(context1.silver_table) # Outputs: catalog.schema.table_name\n
-
Adding Keys: You can add keys to a Context by using the add
method. This allows you to dynamically update the Context
as needed. For example:
context.add(\"silver_table\", \"catalog.schema.table_name\")\n
-
Checking Key Existence: You can check if a key exists in a Context by using the contains
method. This is useful when you want to ensure a key is present before attempting to access its value. For example:
context.contains(\"silver_table\") # Returns: True\n
-
Getting Key-Value Pair: You can get a key-value pair from a Context by using the get_item
method. This can be useful when you want to extract a specific piece of data from the Context
. For example:
context.get_item(\"silver_table\") # Returns: {\"silver_table\": \"catalog.schema.table_name\"}\n
-
Converting to Dictionary: You can convert a Context to a dictionary by using the to_dict
method. This can be useful when you need to interact with code that expects a standard Python dictionary. For example:
context_dict = context.to_dict()\n
-
Creating from Dictionary: You can create a Context from a dictionary by using the from_dict
method. This allows you to easily convert existing data structures into a Context
. For example:
context = Context.from_dict({\"bronze_table\": \"catalog.schema.table_name\"})\n
"},{"location":"reference/concepts/context.html#advantages-over-a-dictionary","title":"Advantages over a Dictionary","text":"While a dictionary can be used to store configuration values, Context
provides several advantages:
-
Support for nested keys: Unlike a standard Python dictionary, Context
allows you to access nested keys as if they were attributes. This makes it easier to work with complex, hierarchical data.
-
Recursive merging of two Contexts
: Context
allows you to merge two Contexts
together, with the incoming Context
having priority. This is useful when you want to update a Context
with new data without losing the existing values.
-
Accessing keys as if they were class attributes: This provides a more intuitive way to interact with the Context
, as you can use dot notation to access values.
-
Code completion in IDEs: Because you can access keys as if they were attributes, IDEs can provide code completion for Context
keys. This can make your coding process more efficient and less error-prone.
-
Easy creation from a YAML, JSON, or TOML file: Context
provides methods to easily load data from YAML or JSON files, making it a great tool for managing configuration data.
"},{"location":"reference/concepts/context.html#data-formats-and-serialization","title":"Data Formats and Serialization","text":"Context
leverages JSON, YAML, and TOML for serialization and deserialization. These formats are widely used in the industry and provide a balance between readability and ease of use.
-
JSON: A lightweight data-interchange format that is easy for humans to read and write and easy for machines to parse and generate. It's widely used for APIs and web-based applications.
-
YAML: A human-friendly data serialization standard often used for configuration files. It's more readable than JSON and supports complex data structures.
-
TOML: A minimal configuration file format that's easy to read due to its clear and simple syntax. It's often used for configuration files in Python applications.
"},{"location":"reference/concepts/context.html#examples","title":"Examples","text":"In this section, we provide a variety of examples to demonstrate the capabilities of the Context
class in Koheesio.
"},{"location":"reference/concepts/context.html#basic-operations","title":"Basic Operations","text":"Here are some basic operations you can perform with Context
. These operations form the foundation of how you interact with a Context
object:
# Create a Context\ncontext = Context({\"bronze_table\": \"catalog.schema.table_name\"})\n\n# Access a value\nvalue = context.bronze_table\n\n# Add a key\ncontext.add(\"silver_table\", \"catalog.schema.table_name\")\n\n# Merge two Contexts\ncontext.merge(Context({\"silver_table\": \"catalog.schema.table_name\"}))\n
"},{"location":"reference/concepts/context.html#serialization-and-deserialization","title":"Serialization and Deserialization","text":"Context
supports serialization and deserialization to and from JSON, YAML, and TOML formats. This allows you to easily save and load Context
data:
# Load context from a JSON file\ncontext = Context.from_json(\"path/to/context.json\")\n\n# Save context to a JSON file\ncontext.to_json(\"path/to/context.json\")\n\n# Load context from a YAML file\ncontext = Context.from_yaml(\"path/to/context.yaml\")\n\n# Save context to a YAML file\ncontext.to_yaml(\"path/to/context.yaml\")\n\n# Load context from a TOML file\ncontext = Context.from_toml(\"path/to/context.toml\")\n\n# Save context to a TOML file\ncontext.to_toml(\"path/to/context.toml\")\n
"},{"location":"reference/concepts/context.html#nested-keys","title":"Nested Keys","text":"Context
supports nested keys, allowing you to create hierarchical configurations. This is useful when dealing with complex data structures:
# Create a Context with nested keys\ncontext = Context({\n \"database\": {\n \"bronze_table\": \"catalog.schema.bronze_table\",\n \"silver_table\": \"catalog.schema.silver_table\",\n \"gold_table\": \"catalog.schema.gold_table\"\n }\n})\n\n# Access a nested key\nprint(context.database.bronze_table) # Outputs: catalog.schema.bronze_table\n
"},{"location":"reference/concepts/context.html#recursive-merging","title":"Recursive Merging","text":"Context
also supports recursive merging, allowing you to merge two Contexts
together at all levels of their hierarchy. This is particularly useful when you want to update a Context
with new data without losing the existing values:
# Create two Contexts with nested keys\ncontext1 = Context({\n \"database\": {\n \"bronze_table\": \"catalog.schema.bronze_table\",\n \"silver_table\": \"catalog.schema.silver_table\"\n }\n})\n\ncontext2 = Context({\n \"database\": {\n \"silver_table\": \"catalog.schema.new_silver_table\",\n \"gold_table\": \"catalog.schema.gold_table\"\n }\n})\n\n# Merge the two Contexts\ncontext1.merge(context2)\n\n# Print the merged Context\nprint(context1.to_dict()) \n# Outputs: \n# {\n# \"database\": {\n# \"bronze_table\": \"catalog.schema.bronze_table\",\n# \"silver_table\": \"catalog.schema.new_silver_table\",\n# \"gold_table\": \"catalog.schema.gold_table\"\n# }\n# }\n
"},{"location":"reference/concepts/context.html#jsonpickle-and-complex-python-objects","title":"Jsonpickle and Complex Python Objects","text":"The Context
class in Koheesio also uses jsonpickle
for serialization and deserialization of complex Python objects to and from JSON. This allows you to convert complex Python objects, including custom classes, into a format that can be easily stored and transferred.
Here's an example of how this works:
# Import necessary modules\nfrom koheesio.context import Context\n\n# Initialize SnowflakeReader and store in a Context\nsnowflake_reader = SnowflakeReader(...) # fill in with necessary arguments\ncontext = Context({\"snowflake_reader\": snowflake_reader})\n\n# Serialize the Context to a JSON string\njson_str = context.to_json()\n\n# Print the serialized Context\nprint(json_str)\n\n# Deserialize the JSON string back into a Context\ndeserialized_context = Context.from_json(json_str)\n\n# Access the deserialized SnowflakeReader\ndeserialized_snowflake_reader = deserialized_context.snowflake_reader\n\n# Now you can use the deserialized SnowflakeReader as you would the original\n
This feature is particularly useful when you need to save the state of your application, transfer it over a network, or store it in a database. When you're ready to use the stored data, you can easily convert it back into the original Python objects.
However, there are a few things to keep in mind:
-
The classes you're serializing must be importable (i.e., they must be in the Python path) when you're deserializing the JSON. jsonpickle
needs to be able to import the class to reconstruct the object. This holds true for most Koheesio classes, as they are designed to be importable and reconstructible.
-
Not all Python objects can be serialized. For example, objects that hold a reference to a file or a network connection can't be serialized because their state can't be easily captured in a static file.
-
As mentioned in the code comments, jsonpickle
is not secure against malicious data. You should only deserialize data that you trust.
So, while the Context
class provides a powerful tool for handling complex Python objects, it's important to be aware of these limitations.
"},{"location":"reference/concepts/context.html#conclusion","title":"Conclusion","text":"In this document, we've covered the key features of the Context
class in the Koheesio framework, including its ability to handle complex Python objects, support for nested keys and recursive merging, and its serialization and deserialization capabilities.
Whether you're setting up the environment for a Task or Step, or managing variables shared across multiple tasks, Context
provides a robust and efficient solution.
"},{"location":"reference/concepts/context.html#further-reading","title":"Further Reading","text":"For more information, you can refer to the following resources:
- Python jsonpickle Documentation
- Python JSON Documentation
- Python YAML Documentation
- Python TOML Documentation
Refer to the API documentation for more details on the Context
class and its methods.
"},{"location":"reference/concepts/logger.html","title":"Python Logger Code Instructions","text":"Here you can find instructions on how to use the Koheesio Logging Factory.
"},{"location":"reference/concepts/logger.html#logging-factory","title":"Logging Factory","text":"The LoggingFactory
class is a factory for creating and configuring loggers. To use it, follow these steps:
-
Import the necessary modules:
from koheesio.logger import LoggingFactory\n
-
Initialize logging factory for koheesio modules:
factory = LoggingFactory(name=\"replace_koheesio_parent_name\", env=\"local\", logger_id=\"your_run_id\")\n# Or use default \nfactory = LoggingFactory()\n# Or just specify log level for koheesio modules\nfactory = LoggingFactory(level=\"DEBUG\")\n
-
Create a logger by calling the create_logger
method of the LoggingFactory
class; you can optionally inherit from the koheesio logger:
logger = LoggingFactory.get_logger(name=factory.LOGGER_NAME)\n# Or for koheesio modules\nlogger = LoggingFactory.get_logger(name=factory.LOGGER_NAME, inherit_from_koheesio=True)\n
-
You can now use the logger
object to log messages:
logger.debug(\"Debug message\")\nlogger.info(\"Info message\")\nlogger.warning(\"Warning message\")\nlogger.error(\"Error message\")\nlogger.critical(\"Critical message\")\n
-
(Optional) You can add additional handlers to the logger by calling the add_handlers
method of the LoggingFactory
class:
handlers = [\n (\"your_handler_module.YourHandlerClass\", {\"level\": \"INFO\"}),\n # Add more handlers if needed\n]\nfactory.add_handlers(handlers)\n
-
(Optional) You can create child loggers based on the parent logger by calling the get_logger
method of the LoggingFactory
class:
child_logger = factory.get_logger(name=\"your_child_logger_name\")\n
-
(Optional) Get an independent logger without inheritance
If you need an independent logger without inheriting from the LoggingFactory
logger, you can use the get_logger
method:
your_logger = factory.get_logger(name=\"your_logger_name\", inherit=False)\n
By setting inherit
to False
, you will obtain a logger that is not tied to the LoggingFactory
logger hierarchy; only the message format will be the same, and you can change that as well. This allows you to have an independent logger with its own configuration. You can use the your_logger
object to log messages:
your_logger.debug(\"Debug message\")\nyour_logger.info(\"Info message\")\nyour_logger.warning(\"Warning message\")\nyour_logger.error(\"Error message\")\nyour_logger.critical(\"Critical message\")\n
-
(Optional) You can use Masked types to mask secrets/tokens/passwords in output. The Masked types are special types provided by the koheesio library to handle sensitive data that should not be logged or printed in plain text. They wrap sensitive data and override its string representation to prevent accidental exposure. Here are some examples of how to use Masked types:
import logging\nfrom koheesio.logger import MaskedString, MaskedInt, MaskedFloat, MaskedDict\n\n# Set up logging\nlogger = logging.getLogger(__name__)\nlogging.basicConfig(level=logging.INFO)\n\n# Using MaskedString\nmasked_string = MaskedString(\"my secret string\")\nlogger.info(masked_string) # This will not log the actual string\n\n# Using MaskedInt\nmasked_int = MaskedInt(12345)\nlogger.info(masked_int) # This will not log the actual integer\n\n# Using MaskedFloat\nmasked_float = MaskedFloat(3.14159)\nlogger.info(masked_float) # This will not log the actual float\n\n# Using MaskedDict\nmasked_dict = MaskedDict({\"key\": \"value\"})\nlogger.info(masked_dict) # This will not log the actual dictionary\n
Please make sure to replace \"your_logger_name\", \"your_run_id\", \"your_handler_module.YourHandlerClass\", \"your_child_logger_name\", and other placeholders with your own values according to your application's requirements.
By following these steps, you can obtain an independent logger without inheriting from the LoggingFactory
logger. This allows you to customize the logger configuration and use it separately in your code.
Note: Ensure that you have imported the necessary modules, instantiated the LoggingFactory
class, and customized the logger name and other parameters according to your application's requirements.
"},{"location":"reference/concepts/logger.html#example","title":"Example","text":"import logging\n\n# Step 2: Instantiate the LoggingFactory class\nfactory = LoggingFactory(env=\"local\")\n\n# Step 3: Create an independent logger with a custom log level\nyour_logger = factory.get_logger(\"your_logger\", inherit_from_koheesio=False)\nyour_logger.setLevel(logging.DEBUG)\n\n# Step 4: Create a logger using the create_logger method from LoggingFactory with a different log level\nfactory_logger = LoggingFactory(level=\"WARNING\").get_logger(name=factory.LOGGER_NAME)\n\n# Step 5: Create a child logger with a debug level\nchild_logger = factory.get_logger(name=\"child\")\nchild_logger.setLevel(logging.DEBUG)\n\nchild2_logger = factory.get_logger(name=\"child2\")\nchild2_logger.setLevel(logging.INFO)\n\n# Step 6: Log messages at different levels for both loggers\nyour_logger.debug(\"Debug message\") # This message will be displayed\nyour_logger.info(\"Info message\") # This message will be displayed\nyour_logger.warning(\"Warning message\") # This message will be displayed\nyour_logger.error(\"Error message\") # This message will be displayed\nyour_logger.critical(\"Critical message\") # This message will be displayed\n\nfactory_logger.debug(\"Debug message\") # This message will not be displayed\nfactory_logger.info(\"Info message\") # This message will not be displayed\nfactory_logger.warning(\"Warning message\") # This message will be displayed\nfactory_logger.error(\"Error message\") # This message will be displayed\nfactory_logger.critical(\"Critical message\") # This message will be displayed\n\nchild_logger.debug(\"Debug message\") # This message will be displayed\nchild_logger.info(\"Info message\") # This message will be displayed\nchild_logger.warning(\"Warning message\") # This message will be displayed\nchild_logger.error(\"Error message\") # This message will be displayed\nchild_logger.critical(\"Critical message\") # This message will be displayed\n\nchild2_logger.debug(\"Debug message\") # This message will be displayed\nchild2_logger.info(\"Info message\") # This message will be displayed\nchild2_logger.warning(\"Warning message\") # This message will be displayed\nchild2_logger.error(\"Error message\") # This message will be displayed\nchild2_logger.critical(\"Critical message\") # This message will be displayed\n
Output:
[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [DEBUG] [your_logger] {__init__.py:<module>:118} - Debug message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [INFO] [your_logger] {__init__.py:<module>:119} - Info message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [WARNING] [your_logger] {__init__.py:<module>:120} - Warning message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [ERROR] [your_logger] {__init__.py:<module>:121} - Error message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [CRITICAL] [your_logger] {__init__.py:<module>:122} - Critical message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [WARNING] [koheesio] {__init__.py:<module>:126} - Warning message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [ERROR] [koheesio] {__init__.py:<module>:127} - Error message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [CRITICAL] [koheesio] {__init__.py:<module>:128} - Critical message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [DEBUG] [koheesio.child] {__init__.py:<module>:130} - Debug message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [INFO] [koheesio.child] {__init__.py:<module>:131} - Info message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [WARNING] [koheesio.child] {__init__.py:<module>:132} - Warning message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [ERROR] [koheesio.child] {__init__.py:<module>:133} - Error message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [CRITICAL] [koheesio.child] {__init__.py:<module>:134} - Critical message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [INFO] [koheesio.child2] {__init__.py:<module>:137} - Info message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [WARNING] [koheesio.child2] {__init__.py:<module>:138} - Warning message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [ERROR] [koheesio.child2] {__init__.py:<module>:139} - Error message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [CRITICAL] [koheesio.child2] {__init__.py:<module>:140} - Critical message\n
"},{"location":"reference/concepts/logger.html#loggeridfilter-class","title":"LoggerIDFilter Class","text":"The LoggerIDFilter
class is a filter that injects run_id
information into the log. To use it, follow these steps:
-
Import the necessary modules:
import logging\nfrom koheesio.logger import LoggerIDFilter  # assumed to be exposed by koheesio.logger, like LoggingFactory\n
-
Create an instance of the LoggerIDFilter
class:
logger_filter = LoggerIDFilter()\n
-
Set the LOGGER_ID
attribute of the LoggerIDFilter
class to the desired run ID:
LoggerIDFilter.LOGGER_ID = \"your_run_id\"\n
-
Add the logger_filter
to your logger or handler:
logger = logging.getLogger(\"your_logger_name\")\nlogger.addFilter(logger_filter)\n
"},{"location":"reference/concepts/logger.html#loggingfactory-set-up-optional","title":"LoggingFactory Set Up (Optional)","text":" -
Import the LoggingFactory
class in your application code.
-
Set the value for the LOGGER_FILTER
variable:
- If you want to assign a specific
logging.Filter
instance, replace None
with your desired filter instance. -
If you want to keep the default value of None
, leave it unchanged.
-
Set the value for the LOGGER_LEVEL
variable:
- If you want to use the value from the
\"KOHEESIO_LOGGING_LEVEL\"
environment variable, leave the code as is. -
If you want to use a different environment variable or a specific default value, modify the code accordingly.
-
Set the value for the LOGGER_ENV
variable:
-
Replace \"local\"
with your desired environment name.
-
Set the value for the LOGGER_FORMAT
variable:
- If you want to customize the log message format, modify the value within the double quotes.
-
The format should follow the desired log message format pattern.
-
Set the value for the LOGGER_FORMATTER
variable:
- If you want to assign a specific
Formatter
instance, replace Formatter(LOGGER_FORMAT)
with your desired formatter instance. -
If you want to keep the default formatter with the defined log message format, leave it unchanged.
-
Set the value for the CONSOLE_HANDLER
variable:
- If you want to assign a specific
logging.Handler
instance, replace None
with your desired handler instance. - If you want to keep the default value of
None
, leave it unchanged.
-
Set the value for the ENV
variable:
- Replace
None
with your desired environment value if applicable. - If you don't need to set this variable, leave it as
None
.
-
Save the changes to the file.
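Pulling the settings above together, a minimal sketch of such a set-up is shown below. It assumes the listed variables are class-level attributes on LoggingFactory; check the API reference for where these variables live in your version:
import logging\nimport os\nfrom logging import Formatter\n\nfrom koheesio.logger import LoggingFactory\n\n# Assumption: the variables described in the steps above are class-level attributes of LoggingFactory\nLoggingFactory.LOGGER_LEVEL = os.environ.get(\"KOHEESIO_LOGGING_LEVEL\", \"INFO\")\nLoggingFactory.LOGGER_ENV = \"dev\"\nLoggingFactory.LOGGER_FORMAT = \"[%(asctime)s] [%(levelname)s] [%(name)s] - %(message)s\"\nLoggingFactory.LOGGER_FORMATTER = Formatter(LoggingFactory.LOGGER_FORMAT)\nLoggingFactory.CONSOLE_HANDLER = logging.StreamHandler()\n\nfactory = LoggingFactory(name=\"my_app\", env=LoggingFactory.LOGGER_ENV)\nlogger = LoggingFactory.get_logger(name=factory.LOGGER_NAME)\n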
"},{"location":"reference/concepts/step.html","title":"Steps in Koheesio","text":"In the Koheesio framework, the Step
class and its derivatives play a crucial role. They serve as the building blocks for creating data pipelines, allowing you to define custom units of logic that can be executed. This document will guide you through its key features and show you how to leverage its capabilities in your Koheesio applications.
Several types of Steps are available in Koheesio, including Reader
, Transformation
, Writer
, and Task
.
"},{"location":"reference/concepts/step.html#what-is-a-step","title":"What is a Step?","text":"A Step
is an atomic operation serving as the building block of data pipelines built with the Koheesio framework. Tasks typically consist of a series of Steps.
A step can be seen as an operation on a set of inputs that returns a set of outputs. This does not imply that steps are stateless (data writes, for example, have side effects)! This concept is visualized in the figure below.
\nflowchart LR\n\n%% Should render like this\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 1 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 1 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n%% \u2502 \u2502\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 2 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Step \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 2 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n%% \u2502 \u2502\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 3 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 3 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n\n%% is for increasing the box size without having to mess with CSS settings\nStep[\"\n \n \n \nStep\n \n \n \n\"]\n\nI1[\"Input 1\"] ---> Step\nI2[\"Input 2\"] ---> Step\nI3[\"Input 3\"] ---> Step\n\nStep ---> O1[\"Output 1\"]\nStep ---> O2[\"Output 2\"]\nStep ---> O3[\"Output 3\"]\n
"},{"location":"reference/concepts/step.html#how-to-read-a-step","title":"How to Read a Step?","text":"A Step
in Koheesio is a class that represents a unit of work in a data pipeline. It's similar to a Python built-in data class, but with additional features for execution, validation, and logging.
When you look at a Step
, you'll typically see the following components:
-
Class Definition: The Step
is defined as a class that inherits from the base Step
class in Koheesio. For example, class MyStep(Step):
.
-
Input Fields: These are defined as class attributes with type annotations, similar to attributes in a Python data class. These fields represent the inputs to the Step
. For example, a: str
defines an input field a
of type str
. Additionally, you will often see these fields defined using Pydantic's Field
class, which allows for more detailed validation and documentation as well as default values and aliasing.
-
Output Fields: These are defined in a nested class called Output
that inherits from StepOutput
. This class represents the output of the Step
. For example, class Output(StepOutput): b: str
defines an output field b
of type str
.
-
Execute Method: This is a method that you need to implement when you create a new Step
. It contains the logic of the Step
and is where you use the input fields and populate the output fields. For example, def execute(self): self.output.b = f\"{self.a}-some-suffix\"
.
Here's an example of a Step
:
class MyStep(Step):\n    a: str  # input\n\n    class Output(StepOutput):  # output\n        b: str\n\n    def execute(self) -> \"MyStep.Output\":\n        self.output.b = f\"{self.a}-some-suffix\"\n
In this Step
, a
is an input field of type str
, b
is an output field of type str
, and the execute
method appends -some-suffix
to the input a
and assigns it to the output b
.
When you see a Step
, you can think of it as a function where the class attributes are the inputs, the Output
class defines the outputs, and the execute
method is the function body. The main difference is that a Step
also includes automatic validation of inputs and outputs (thanks to Pydantic), logging, and error handling.
"},{"location":"reference/concepts/step.html#understanding-inheritance-in-steps","title":"Understanding Inheritance in Steps","text":"Inheritance is a core concept in object-oriented programming where a class (child or subclass) inherits properties and methods from another class (parent or superclass). In the context of Koheesio, when you create a new Step
, you're creating a subclass that inherits from the base Step
class.
When a new Step is defined (like class MyStep(Step):
), it inherits all the properties and methods from the Step
class. This includes the execute
method, which is then overridden to provide the specific functionality for that Step.
Here's a simple breakdown:
-
Parent Class (Superclass): This is the Step
class in Koheesio. It provides the basic structure and functionalities of a Step, including input and output validation, logging, and error handling.
-
Child Class (Subclass): This is the new Step you define, like MyStep
. It inherits all the properties and methods from the Step
class and can add or override them as needed.
-
Inheritance: This is the process where MyStep
inherits the properties and methods from the Step
class. In Python, this is done by mentioning the parent class in parentheses when defining the child class, like class MyStep(Step):
.
-
Overriding: This is when you provide a new implementation of a method in the child class that is already defined in the parent class. In the case of Steps, you override the execute
method to define the specific logic of your Step.
Understanding inheritance is key to understanding how Steps work in Koheesio. It allows you to leverage the functionalities provided by the Step
class and focus on implementing the specific logic of your Step.
"},{"location":"reference/concepts/step.html#benefits-of-using-steps-in-data-pipelines","title":"Benefits of Using Steps in Data Pipelines","text":"The concept of a Step
is beneficial when creating Data Pipelines or Data Products for several reasons:
-
Modularity: Each Step
represents a self-contained unit of work, which makes the pipeline modular. This makes it easier to understand, test, and maintain the pipeline. If a problem arises, you can pinpoint which step is causing the issue.
-
Reusability: Steps can be reused across different pipelines. Once a Step
is defined, it can be used in any number of pipelines. This promotes code reuse and consistency across projects.
-
Readability: Steps make the pipeline code more readable. Each Step
has a clear input, output, and execution logic, which makes it easier to understand what each part of the pipeline is doing.
-
Validation: Steps automatically validate their inputs and outputs. This ensures that the data flowing into and out of each step is of the expected type and format, which can help catch errors early.
-
Logging: Steps automatically log the start and end of their execution, along with the input and output data. This can be very useful for debugging and understanding the flow of data through the pipeline.
-
Error Handling: Steps provide built-in error handling. If an error occurs during the execution of a step, it is caught, logged, and then re-raised. This provides a clear indication of where the error occurred.
-
Scalability: Steps can be easily parallelized or distributed, which is crucial for processing large datasets. This is especially true for steps that are designed to work with distributed computing frameworks like Apache Spark.
By using the concept of a Step
, you can create data pipelines that are modular, reusable, readable, and robust, while also being easier to debug and scale.
"},{"location":"reference/concepts/step.html#compared-to-a-regular-pydantic-basemodel","title":"Compared to a regular Pydantic Basemodel","text":"A Step
in Koheesio, while built on top of Pydantic's BaseModel
, provides additional features specifically designed for creating data pipelines. Here are some key differences:
-
Execution Method: A Step
includes an execute
method that needs to be implemented. This method contains the logic of the step and is automatically decorated with functionalities such as logging and output validation.
-
Input and Output Validation: A Step
uses Pydantic models to define and validate its inputs and outputs. This ensures that the data flowing into and out of the step is of the expected type and format.
-
Automatic Logging: A Step
automatically logs the start and end of its execution, along with the input and output data. This is done through the do_execute
decorator applied to the execute
method.
-
Error Handling: A Step
provides built-in error handling. If an error occurs during the execution of the step, it is caught, logged, and then re-raised. This should help in debugging and understanding the flow of data.
-
Serialization: A Step
can be serialized to a YAML string using the to_yaml
method. This can be useful for saving and loading steps.
-
Lazy Mode Support: The StepOutput
class in a Step
supports lazy mode, which allows validation of the items stored in the class to be called at will instead of being forced to run it upfront.
In contrast, a regular Pydantic BaseModel
is a simple data validation model that doesn't include these additional features. It's used for data parsing and validation, but doesn't include methods for execution, automatic logging, error handling, or serialization to YAML.
"},{"location":"reference/concepts/step.html#key-features-of-a-step","title":"Key Features of a Step","text":""},{"location":"reference/concepts/step.html#defining-a-step","title":"Defining a Step","text":"To define a new step, you subclass the Step
class and implement the execute
method. The inputs of the step can be accessed using self.input_name
. The output of the step can be accessed using self.output.output_name
. For example:
class MyStep(Step):\n input1: str = Field(...)\n input2: int = Field(...)\n\n class Output(StepOutput):\n output1: str = Field(...)\n\n def execute(self):\n # Your logic here\n self.output.output1 = \"result\"\n
"},{"location":"reference/concepts/step.html#running-a-step","title":"Running a Step","text":"To run a step, you can call the execute
method. You can also use the run
method, which is an alias to execute
. For example:
step = MyStep(input1=\"value1\", input2=2)\nstep.execute()\n
"},{"location":"reference/concepts/step.html#accessing-step-output","title":"Accessing Step Output","text":"The output of a step can be accessed using self.output.output_name
. For example:
step = MyStep(input1=\"value1\", input2=2)\nstep.execute()\nprint(step.output.output1) # Outputs: \"result\"\n
"},{"location":"reference/concepts/step.html#serializing-a-step","title":"Serializing a Step","text":"You can serialize a step to a YAML string using the to_yaml
method. For example:
step = MyStep(input1=\"value1\", input2=2)\nyaml_str = step.to_yaml()\n
"},{"location":"reference/concepts/step.html#getting-step-description","title":"Getting Step Description","text":"You can get the description of a step using the get_description
method. For example:
step = MyStep(input1=\"value1\", input2=2)\ndescription = step.get_description()\n
"},{"location":"reference/concepts/step.html#defining-a-step-with-multiple-inputs-and-outputs","title":"Defining a Step with Multiple Inputs and Outputs","text":"Here's an example of how to define a new step with multiple inputs and outputs:
class MyStep(Step):\n input1: str = Field(...)\n input2: int = Field(...)\n input3: int = Field(...)\n\n class Output(StepOutput):\n output1: str = Field(...)\n output2: int = Field(...)\n\n def execute(self):\n # Your logic here\n self.output.output1 = \"result\"\n self.output.output2 = self.input2 + self.input3\n
"},{"location":"reference/concepts/step.html#running-a-step-with-multiple-inputs","title":"Running a Step with Multiple Inputs","text":"To run a step with multiple inputs, you can do the following:
step = MyStep(input1=\"value1\", input2=2, input3=3)\nstep.execute()\n
"},{"location":"reference/concepts/step.html#accessing-multiple-step-outputs","title":"Accessing Multiple Step Outputs","text":"The outputs of a step can be accessed using self.output.output_name
. For example:
step = MyStep(input1=\"value1\", input2=2, input3=3)\nstep.execute()\nprint(step.output.output1) # Outputs: \"result\"\nprint(step.output.output2) # Outputs: 5\n
"},{"location":"reference/concepts/step.html#special-features","title":"Special Features","text":""},{"location":"reference/concepts/step.html#the-execute-method","title":"The Execute method","text":"The execute
method in the Step
class is automatically decorated with the StepMetaClass._execute_wrapper
function due to the metaclass StepMetaClass
. This provides several advantages:
-
Automatic Output Validation: The decorator ensures that the output of the execute
method is always a StepOutput
instance. This means that the output is automatically validated against the defined output model, ensuring data integrity and consistency.
-
Logging: The decorator provides automatic logging at the start and end of the execute
method. This includes logging the input and output of the step, which can be useful for debugging and understanding the flow of data.
-
Error Handling: If an error occurs during the execution of the Step
, the decorator catches the exception and logs an error message before re-raising the exception. This provides a clear indication of where the error occurred.
-
Simplifies Step Implementation: Since the decorator handles output validation, logging, and error handling, the user can focus on implementing the logic of the execute
method without worrying about these aspects.
-
Consistency: By automatically decorating the execute
method, the library ensures that these features are consistently applied across all steps, regardless of who implements them or how they are used. This makes the behavior of steps predictable and consistent.
-
Prevents Double Wrapping: The decorator checks if the function is already wrapped with StepMetaClass._execute_wrapper
and prevents double wrapping. This ensures that the decorator doesn't interfere with itself if execute
is overridden in subclasses.
Notice that you never have to explicitly return anything from the execute
method. The StepMetaClass._execute_wrapper
decorator takes care of that for you.
Below are example implementations of a custom metaclass that can be used to override the default behavior of the StepMetaClass._execute_wrapper
:
class MyMetaClass(StepMetaClass):\n    @classmethod\n    def _log_end_message(cls, step: Step, skip_logging: bool = False, *args, **kwargs):\n        print(\"It's me from custom meta class\")\n        super()._log_end_message(step, skip_logging, *args, **kwargs)\n\nclass MyMetaClass2(StepMetaClass):\n    @classmethod\n    def _validate_output(cls, step: Step, skip_validating: bool = False, *args, **kwargs):\n        # always add a dummy value to the output\n        step.output.dummy_value = \"dummy\"\n\nclass YourClassWithCustomMeta(Step, metaclass=MyMetaClass):\n    def execute(self):\n        self.log.info(f\"This is from the execute method of {self.__class__.__name__}\")\n\nclass YourClassWithCustomMeta2(Step, metaclass=MyMetaClass2):\n    def execute(self):\n        self.log.info(f\"This is from the execute method of {self.__class__.__name__}\")\n
"},{"location":"reference/concepts/step.html#sparkstep","title":"SparkStep","text":"The SparkStep
class is a subclass of Step
that is designed for steps that interact with Spark. It extends the Step
class with SparkSession support. Spark steps are expected to return a Spark DataFrame as output. The spark
property is available to access the active SparkSession instance. Output
in a SparkStep
is expected to be a DataFrame
although optional.
"},{"location":"reference/concepts/step.html#using-a-sparkstep","title":"Using a SparkStep","text":"Here's an example of how to use a SparkStep
:
class MySparkStep(SparkStep):\n input1: str = Field(...)\n\n class Output(StepOutput):\n output1: DataFrame = Field(...)\n\n def execute(self):\n # Your logic here\n df = self.spark.read.text(self.input1)\n self.output.output1 = df\n
To run a SparkStep
, you can do the following:
step = MySparkStep(input1=\"path/to/textfile\")\nstep.execute()\n
To access the output of a SparkStep
, you can do the following:
step = MySparkStep(input1=\"path/to/textfile\")\nstep.execute()\ndf = step.output.output1\ndf.show()\n
"},{"location":"reference/concepts/step.html#conclusion","title":"Conclusion","text":"In this document, we've covered the key features of the Step
class in the Koheesio framework, including its ability to define custom units of logic, manage inputs and outputs, and support for serialization. The automatic decoration of the execute
method provides several advantages that simplify step implementation and ensure consistency across all steps.
Whether you're defining a new operation in your data pipeline or managing the flow of data between steps, Step
provides a robust and efficient solution.
"},{"location":"reference/concepts/step.html#further-reading","title":"Further Reading","text":"For more information, you can refer to the following resources:
- Python Pydantic Documentation
- Python YAML Documentation
Refer to the API documentation for more details on the Step
class and its methods.
"},{"location":"reference/spark/readers.html","title":"Reader Module","text":"The Reader
module in Koheesio provides a set of classes for reading data from various sources. A Reader
is a type of SparkStep
that reads data from a source based on the input parameters and stores the result in self.output.df
for subsequent steps.
"},{"location":"reference/spark/readers.html#what-is-a-reader","title":"What is a Reader?","text":"A Reader
is a subclass of SparkStep
that reads data from a source and stores the result. The source could be a file, a database, a web API, or any other data source. The result is stored in a DataFrame, which is accessible through the df
property of the Reader
.
"},{"location":"reference/spark/readers.html#api-reference","title":"API Reference","text":"See API Reference for a detailed description of the Reader
class and its methods.
"},{"location":"reference/spark/readers.html#key-features-of-a-reader","title":"Key Features of a Reader","text":" - Read Method: The
Reader
class provides a read
method that calls the execute
method and returns the result. Essentially, calling .read()
is a shorthand for calling .execute().output.df
. This allows you to read data from a Reader
without having to call the execute
method directly. This is a convenience method that simplifies the usage of a Reader
.
Here's an example of how to use the .read()
method:
# Instantiate the Reader\nmy_reader = MyReader()\n\n# Use the .read() method to get the data as a DataFrame\ndf = my_reader.read()\n\n# Now df is a DataFrame with the data read by MyReader\n
In this example, MyReader
is a subclass of Reader
that you've defined. After creating an instance of MyReader
, you call the .read()
method to read the data and get it back as a DataFrame. The DataFrame df
now contains the data read by MyReader
.
- DataFrame Property: The
Reader
class provides a df
property as a shorthand for accessing self.output.df
. If self.output.df
is None
, the execute
method is run first. This property ensures that the data is loaded and ready to be used, even if the execute
method hasn't been explicitly called.
Here's an example of how to use the df
property:
# Instantiate the Reader\nmy_reader = MyReader()\n\n# Use the df property to get the data as a DataFrame\ndf = my_reader.df\n\n# Now df is a DataFrame with the data read by MyReader\n
In this example, MyReader
is a subclass of Reader
that you've defined. After creating an instance of MyReader
, you access the df
property to get the data as a DataFrame. The DataFrame df
now contains the data read by MyReader
.
- SparkSession: Every
Reader
has a SparkSession
available as self.spark
. This is the currently active SparkSession
, which can be used to perform distributed data processing tasks.
Here's an example of how to use the spark
property:
# Instantiate the Reader\nmy_reader = MyReader()\n\n# Use the spark property to get the SparkSession\nspark = my_reader.spark\n\n# Now spark is the SparkSession associated with MyReader\n
In this example, MyReader
is a subclass of Reader
that you've defined. After creating an instance of MyReader
, you access the spark
property to get the SparkSession
. The SparkSession
spark
can now be used to perform distributed data processing tasks.
"},{"location":"reference/spark/readers.html#how-to-define-a-reader","title":"How to Define a Reader?","text":"To define a Reader
, you create a subclass of the Reader
class and implement the execute
method. The execute
method should read from the source and store the result in self.output.df
. This is an abstract method, which means it must be implemented in any subclass of Reader
.
Here's an example of a Reader
:
class MyReader(Reader):\n def execute(self):\n # read data from source\n data = read_from_source()\n # store result in self.output.df\n self.output.df = data\n
"},{"location":"reference/spark/readers.html#understanding-inheritance-in-readers","title":"Understanding Inheritance in Readers","text":"Just like a Step
, a Reader
is defined as a subclass that inherits from the base Reader
class. This means it inherits all the properties and methods from the Reader
class and can add or override them as needed. The main method that needs to be overridden is the execute
method, which should implement the logic for reading data from the source and storing it in self.output.df
.
"},{"location":"reference/spark/readers.html#benefits-of-using-readers-in-data-pipelines","title":"Benefits of Using Readers in Data Pipelines","text":"Using Reader
classes in your data pipelines has several benefits:
-
Simplicity: Readers abstract away the details of reading data from various sources, allowing you to focus on the logic of your pipeline.
-
Consistency: By using Readers, you ensure that data is read in a consistent manner across different parts of your pipeline.
-
Flexibility: Readers can be easily swapped out for different data sources without changing the rest of your pipeline.
-
Efficiency: Readers automatically manage resources like connections and file handles, ensuring efficient use of resources.
By using the concept of a Reader
, you can create data pipelines that are simple, consistent, flexible, and efficient.
"},{"location":"reference/spark/readers.html#examples-of-reader-classes-in-koheesio","title":"Examples of Reader Classes in Koheesio","text":"Koheesio provides a variety of Reader
subclasses for reading data from different sources. Here are just a few examples:
-
Teradata Reader: A Reader
subclass for reading data from Teradata databases. It's defined in the koheesio/steps/readers/teradata.py
file.
-
Snowflake Reader: A Reader
subclass for reading data from Snowflake databases. It's defined in the koheesio/steps/readers/snowflake.py
file.
-
Box Reader: A Reader
subclass for reading data from Box. It's defined in the koheesio/steps/integrations/box.py
file.
These are just a few examples of the many Reader
subclasses available in Koheesio. Each Reader
subclass is designed to read data from a specific source. They all inherit from the base Reader
class and implement the execute
method to read data from their respective sources and store it in self.output.df
.
Please note that this is not an exhaustive list. Koheesio provides many more Reader
subclasses for a wide range of data sources. For a complete list, please refer to the Koheesio documentation or the source code.
More readers can be found in the koheesio/steps/readers
module.
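As a hedged usage sketch, reading with one of these classes typically looks like the following; the import path follows the file location listed above, and the constructor arguments are left as a placeholder because they depend on the specific reader:
from koheesio.steps.readers.snowflake import SnowflakeReader  # path per the file location listed above\n\n# fill in with the connection/query arguments required by your source\nreader = SnowflakeReader(...)\n\n# .read() is shorthand for .execute().output.df, as described earlier\ndf = reader.read()\ndf.show()\n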
"},{"location":"reference/spark/transformations.html","title":"Transformation Module","text":"The Transformation
module in Koheesio provides a set of classes for transforming data within a DataFrame. A Transformation
is a type of SparkStep
that takes a DataFrame as input, applies a transformation, and returns a DataFrame as output. The transformation logic is implemented in the execute
method of each Transformation
subclass.
"},{"location":"reference/spark/transformations.html#what-is-a-transformation","title":"What is a Transformation?","text":"A Transformation
is a subclass of SparkStep
that applies a transformation to a DataFrame and stores the result. The transformation could be any operation that modifies the data or structure of the DataFrame, such as adding a new column, filtering rows, or aggregating data.
Using Transformation
classes ensures that data is transformed in a consistent manner across different parts of your pipeline. This can help avoid errors and make your code easier to understand and maintain.
"},{"location":"reference/spark/transformations.html#api-reference","title":"API Reference","text":"See API Reference for a detailed description of the Transformation
classes and their methods.
"},{"location":"reference/spark/transformations.html#types-of-transformations","title":"Types of Transformations","text":"There are three main types of transformations in Koheesio:
-
Transformation
: This is the base class for all transformations. It takes a DataFrame as input and returns a DataFrame as output. The transformation logic is implemented in the execute
method.
-
ColumnsTransformation
: This is an extended Transformation
class with a preset validator for handling column(s) data. It standardizes the input for a single column or multiple columns. If more than one column is passed, the transformation will be run in a loop against all the given columns.
-
ColumnsTransformationWithTarget
: This is an extended ColumnsTransformation
class with an additional target_column
field. This field can be used to store the result of the transformation in a new column. If the target_column
is not provided, the result will be stored in the source column.
Each type of transformation has its own use cases and advantages. The right one to use depends on the specific requirements of your data pipeline.
"},{"location":"reference/spark/transformations.html#how-to-define-a-transformation","title":"How to Define a Transformation","text":"To define a Transformation
, you create a subclass of the Transformation
class and implement the execute
method. The execute
method should take a DataFrame from self.input.df
, apply a transformation, and store the result in self.output.df
.
Transformation
classes abstract away some of the details of transforming data, allowing you to focus on the logic of your pipeline. This can make your code cleaner and easier to read.
Here's an example of a Transformation
:
class MyTransformation(Transformation):\n def execute(self):\n # get data from self.input.df\n data = self.input.df\n # apply transformation\n transformed_data = apply_transformation(data)\n # store result in self.output.df\n self.output.df = transformed_data\n
In this example, MyTransformation
is a subclass of Transformation
that you've defined. The execute
method gets the data from self.input.df
, applies a transformation called apply_transformation
(undefined in this example), and stores the result in self.output.df
.
"},{"location":"reference/spark/transformations.html#how-to-define-a-columnstransformation","title":"How to Define a ColumnsTransformation","text":"To define a ColumnsTransformation
, you create a subclass of the ColumnsTransformation
class and implement the execute
method. The execute
method should apply a transformation to the specified columns of the DataFrame.
ColumnsTransformation
classes can be easily swapped out for different data transformations without changing the rest of your pipeline. This can make your pipeline more flexible and easier to modify or extend.
Here's an example of a ColumnsTransformation
:
from pyspark.sql import functions as f\nfrom koheesio.steps.transformations import ColumnsTransformation\n\nclass AddOne(ColumnsTransformation):\n def execute(self):\n for column in self.get_columns():\n self.output.df = self.df.withColumn(column, f.col(column) + 1)\n
In this example, AddOne
is a subclass of ColumnsTransformation
that you've defined. The execute
method adds 1 to each column in self.get_columns()
.
The ColumnsTransformation
class has a ColumnConfig
class that can be used to configure the behavior of the class. This class has the following fields:
run_for_all_data_type
: Allows to run the transformation for all columns of a given type. limit_data_type
: Allows to limit the transformation to a specific data type. data_type_strict_mode
: Toggles strict mode for data type validation. Will only work if limit_data_type
is set.
Note that data types need to be specified as a SparkDatatype
enum. Users should not have to interact with the ColumnConfig
class directly.
"},{"location":"reference/spark/transformations.html#how-to-define-a-columnstransformationwithtarget","title":"How to Define a ColumnsTransformationWithTarget","text":"To define a ColumnsTransformationWithTarget
, you create a subclass of the ColumnsTransformationWithTarget
class and implement the func
method. The func
method should return the transformation that will be applied to the column(s). The execute
method, which is already preset, will use the get_columns_with_target
method to loop over all the columns and apply this function to transform the DataFrame.
Here's an example of a ColumnsTransformationWithTarget
:
from pyspark.sql import Column\nfrom koheesio.steps.transformations import ColumnsTransformationWithTarget\n\nclass AddOneWithTarget(ColumnsTransformationWithTarget):\n def func(self, col: Column):\n return col + 1\n
In this example, AddOneWithTarget
is a subclass of ColumnsTransformationWithTarget
that you've defined. The func
method adds 1 to the values of a given column.
The ColumnsTransformationWithTarget
class has an additional target_column
field. This field can be used to store the result of the transformation in a new column. If the target_column
is not provided, the result will be stored in the source column. If more than one column is passed, the target_column
will be used as a suffix. Leaving this blank will result in the original columns being renamed.
The ColumnsTransformationWithTarget
class also has a get_columns_with_target
method. This method returns an iterator of the columns and handles the target_column
as well.
"},{"location":"reference/spark/transformations.html#key-features-of-a-transformation","title":"Key Features of a Transformation","text":" -
Execute Method: The Transformation
class provides an execute
method to implement in your subclass. This method should take a DataFrame from self.input.df
, apply a transformation, and store the result in self.output.df
.
For ColumnsTransformation
and ColumnsTransformationWithTarget
, the execute
method is already implemented in the base class. Instead of overriding execute
, you implement a func
method in your subclass. This func
method should return the transformation to be applied to each column. The execute
method will then apply this func to each column in a loop.
-
DataFrame Property: The Transformation
class provides a df
property as a shorthand for accessing self.input.df
. This property ensures that the data is ready to be transformed, even if the execute
method hasn't been explicitly called. This is useful for 'early validation' of the input data.
-
SparkSession: Every Transformation
has a SparkSession
available as self.spark
. This is the currently active SparkSession
, which can be used to perform distributed data processing tasks.
-
Columns Property: The ColumnsTransformation
and ColumnsTransformationWithTarget
classes provide a columns
property. This property standardizes the input for a single column or multiple columns. If more than one column is passed, the transformation will be run in a loop against all the given columns.
-
Target Column Property: The ColumnsTransformationWithTarget
class provides a target_column
property. This field can be used to store the result of the transformation in a new column. If the target_column
is not provided, the result will be stored in the source column.
"},{"location":"reference/spark/transformations.html#examples-of-transformation-classes-in-koheesio","title":"Examples of Transformation Classes in Koheesio","text":"Koheesio provides a variety of Transformation
subclasses for transforming data in different ways. Here are some examples:
-
DataframeLookup
: This transformation joins two dataframes together based on a list of join mappings. It allows you to specify the join type and join hint, and it supports selecting specific target columns from the right dataframe.
Here's an example of how to use the DataframeLookup
transformation:
from pyspark.sql import SparkSession\nfrom koheesio.steps.transformations import DataframeLookup, JoinMapping, TargetColumn, JoinType\n\nspark = SparkSession.builder.getOrCreate()\nleft_df = spark.createDataFrame([(1, \"A\"), (2, \"B\")], [\"id\", \"value\"])\nright_df = spark.createDataFrame([(1, \"A\"), (3, \"C\")], [\"id\", \"value\"])\n\nlookup = DataframeLookup(\n df=left_df,\n other=right_df,\n on=JoinMapping(source_column=\"id\", other_column=\"id\"),\n targets=TargetColumn(target_column=\"value\", target_column_alias=\"right_value\"),\n how=JoinType.LEFT,\n)\n\noutput_df = lookup.execute().df\n
-
HashUUID5
: This transformation is a subclass of Transformation
and provides an interface to generate a UUID5 hash for each row in the DataFrame. The hash is generated based on the values of the specified source columns.
Here's an example of how to use the HashUUID5
transformation:
from pyspark.sql import SparkSession\nfrom koheesio.steps.transformations import HashUUID5\n\nspark = SparkSession.builder.getOrCreate()\ndf = spark.createDataFrame([(1, \"A\"), (2, \"B\")], [\"id\", \"value\"])\n\nhash_transform = HashUUID5(\n df=df,\n source_columns=[\"id\", \"value\"],\n target_column=\"hash\"\n)\n\noutput_df = hash_transform.execute().df\n
In this example, HashUUID5
is a subclass of Transformation
. After creating an instance of HashUUID5
, you call the execute
method to apply the transformation. The execute
method generates a UUID5 hash for each row in the DataFrame based on the values of the id
and value
columns and stores the result in a new column named hash
.
"},{"location":"reference/spark/transformations.html#benefits-of-using-koheesio-transformations","title":"Benefits of using Koheesio Transformations","text":"Using a Koheesio Transformation
over plain Spark provides several benefits:
-
Consistency: By using Transformation
classes, you ensure that data is transformed in a consistent manner across different parts of your pipeline. This can help avoid errors and make your code easier to understand and maintain.
-
Abstraction: Transformation
classes abstract away the details of transforming data, allowing you to focus on the logic of your pipeline. This can make your code cleaner and easier to read.
-
Flexibility: Transformation
classes can be easily swapped out for different data transformations without changing the rest of your pipeline. This can make your pipeline more flexible and easier to modify or extend.
-
Early Input Validation: As a Transformation
is a type of SparkStep
, which in turn is a Step
and a type of Pydantic BaseModel
, all inputs are validated when an instance of a Transformation
class is created. This early validation helps catch errors related to invalid input, such as an invalid column name, before the PySpark pipeline starts executing. This can help avoid unnecessary computation and make your data pipelines more robust and reliable.
-
Ease of Testing: Transformation
classes are designed to be easily testable. This can make it easier to write unit tests for your data pipeline, helping to ensure its correctness and reliability.
-
Robustness: Koheesio has been extensively tested with hundreds of unit tests, ensuring that the Transformation
classes work as expected under a wide range of conditions. This makes your data pipelines more robust and less likely to fail due to unexpected inputs or edge cases.
By using the concept of a Transformation
, you can create data pipelines that are simple, consistent, flexible, efficient, and reliable.
"},{"location":"reference/spark/transformations.html#advanced-usage-of-transformations","title":"Advanced Usage of Transformations","text":"Transformations can be combined and chained together to create complex data processing pipelines. Here's an example of how to chain transformations:
from pyspark.sql import SparkSession\nfrom koheesio.steps.transformations import HashUUID5\nfrom koheesio.steps.transformations import DataframeLookup, JoinMapping, TargetColumn, JoinType\n\n# Create a SparkSession\nspark = SparkSession.builder.getOrCreate()\n\n# Define two DataFrames\ndf1 = spark.createDataFrame([(1, \"A\"), (2, \"B\")], [\"id\", \"value\"])\ndf2 = spark.createDataFrame([(1, \"C\"), (3, \"D\")], [\"id\", \"value\"])\n\n# Define the first transformation\nlookup = DataframeLookup(\n other=df2,\n on=JoinMapping(source_column=\"id\", other_column=\"id\"),\n targets=TargetColumn(target_column=\"value\", target_column_alias=\"right_value\"),\n how=JoinType.LEFT,\n)\n\n# Apply the first transformation\noutput_df = lookup.transform(df1)\n\n# Define the second transformation\nhash_transform = HashUUID5(\n source_columns=[\"id\", \"value\", \"right_value\"],\n target_column=\"hash\"\n)\n\n# Apply the second transformation\noutput_df2 = hash_transform.transform(output_df)\n
In this example, DataframeLookup
is a subclass of ColumnsTransformation
and HashUUID5
is a subclass of Transformation
. After creating instances of DataframeLookup
and HashUUID5
, you call the transform
method to apply each transformation. The transform
method of DataframeLookup
performs a left join with df2
on the id
column and adds the value
column from df2
to the result DataFrame as right_value
. The transform
method of HashUUID5
generates a UUID5 hash for each row in the DataFrame based on the values of the id
, value
, and right_value
columns and stores the result in a new column named hash
.
"},{"location":"reference/spark/transformations.html#troubleshooting-transformations","title":"Troubleshooting Transformations","text":"If you encounter an error when using a transformation, here are some steps you can take to troubleshoot:
-
Check the Input Data: Make sure the input DataFrame to the transformation is correct. You can use the show
method of the DataFrame to print the first few rows of the DataFrame.
-
Check the Transformation Parameters: Make sure the parameters passed to the transformation are correct. For example, if you're using a DataframeLookup
, make sure the join mappings and target columns are correctly specified.
-
Check the Transformation Logic: If the input data and parameters are correct, there might be an issue with the transformation logic. You can use PySpark's logging utilities to log intermediate results and debug the transformation logic.
-
Check the Output Data: If the transformation executes without errors but the output data is not as expected, you can use the show
method of the DataFrame to print the first few rows of the output DataFrame. This can help you identify any issues with the transformation logic.
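Most of these checks boil down to standard PySpark inspection calls; here is a short sketch, where my_transformation and input_df are placeholders for your own objects:
# 1. Inspect the input data\ninput_df.show(5)\ninput_df.printSchema()\n\n# 2./3. Run the transformation with known-good parameters and inspect intermediate results\noutput_df = my_transformation.transform(input_df)\n\n# 4. Inspect the output data\noutput_df.show(5)\noutput_df.printSchema()\n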
"},{"location":"reference/spark/transformations.html#conclusion","title":"Conclusion","text":"The Transformation
module in Koheesio provides a powerful and flexible way to transform data in a DataFrame. By using Transformation
classes, you can create data pipelines that are simple, consistent, flexible, efficient, and reliable. Whether you're performing simple transformations like adding a new column, or complex transformations like joining multiple DataFrames, the Transformation
module has you covered.
"},{"location":"reference/spark/writers.html","title":"Writer Module","text":"The Writer
module in Koheesio provides a set of classes for writing data to various destinations. A Writer
is a type of SparkStep
that takes data from self.input.df
and writes it to a destination based on the output parameters.
"},{"location":"reference/spark/writers.html#what-is-a-writer","title":"What is a Writer?","text":"A Writer
is a subclass of SparkStep
that writes data to a destination. The data to be written is taken from a DataFrame, which is accessible through the df
property of the Writer
.
"},{"location":"reference/spark/writers.html#how-to-define-a-writer","title":"How to Define a Writer?","text":"To define a Writer
, you create a subclass of the Writer
class and implement the execute
method. The execute
method should take data from self.input.df
and write it to the destination.
Here's an example of a Writer
:
class MyWriter(Writer):\n def execute(self):\n # get data from self.input.df\n data = self.input.df\n # write data to destination\n write_to_destination(data)\n
"},{"location":"reference/spark/writers.html#key-features-of-a-writer","title":"Key Features of a Writer","text":" -
Write Method: The Writer
class provides a write
method that calls the execute
method and writes the data to the destination. Essentially, calling .write()
is a shorthand for calling .execute() and letting the Writer handle the write. This allows you to write data with a Writer
without having to call the execute
method directly. This is a convenience method that simplifies the usage of a Writer
.
Here's an example of how to use the .write()
method:
# Instantiate the Writer\nmy_writer = MyWriter()\n\n# Use the .write() method to write the data\nmy_writer.write()\n\n# The data from MyWriter's DataFrame is now written to the destination\n
In this example, MyWriter
is a subclass of Writer
that you've defined. After creating an instance of MyWriter
, you call the .write()
method to write the data to the destination. The data from MyWriter
's DataFrame is now written to the destination.
-
DataFrame Property: The Writer
class provides a df
property as a shorthand for accessing self.input.df
. This property ensures that the data is ready to be written, even if the execute
method hasn't been explicitly called.
Here's an example of how to use the df
property:
# Instantiate the Writer\nmy_writer = MyWriter()\n\n# Use the df property to get the data as a DataFrame\ndf = my_writer.df\n\n# Now df is a DataFrame with the data that will be written by MyWriter\n
In this example, MyWriter
is a subclass of Writer
that you've defined. After creating an instance of MyWriter
, you access the df
property to get the data as a DataFrame. The DataFrame df
now contains the data that will be written by MyWriter
.
-
SparkSession: Every Writer
has a SparkSession
available as self.spark
. This is the currently active SparkSession
, which can be used to perform distributed data processing tasks.
Here's an example of how to use the spark
property:
# Instantiate the Writer\nmy_writer = MyWriter()\n\n# Use the spark property to get the SparkSession\nspark = my_writer.spark\n\n# Now spark is the SparkSession associated with MyWriter\n
In this example, MyWriter
is a subclass of Writer
that you've defined. After creating an instance of MyWriter
, you access the spark
property to get the SparkSession
. The SparkSession
spark
can now be used to perform distributed data processing tasks.
"},{"location":"reference/spark/writers.html#understanding-inheritance-in-writers","title":"Understanding Inheritance in Writers","text":"Just like a Step
, a Writer
is defined as a subclass that inherits from the base Writer
class. This means it inherits all the properties and methods from the Writer
class and can add or override them as needed. The main method that needs to be overridden is the execute
method, which should implement the logic for writing data from self.input.df
to the destination.
"},{"location":"reference/spark/writers.html#examples-of-writer-classes-in-koheesio","title":"Examples of Writer Classes in Koheesio","text":"Koheesio provides a variety of Writer
subclasses for writing data to different destinations. Here are just a few examples:
BoxFileWriter
DeltaTableStreamWriter
DeltaTableWriter
DummyWriter
ForEachBatchStreamWriter
KafkaWriter
SnowflakeWriter
StreamWriter
Please note that this is not an exhaustive list. Koheesio provides many more Writer
subclasses for a wide range of data destinations. For a complete list, please refer to the Koheesio documentation or the source code.
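As a hedged sketch of how one of these writers might be used: the import path matches the partitioning example later in this documentation, while passing the DataFrame through the df field and the table argument are assumptions to be checked against the API reference:
from koheesio.steps.writers.delta import DeltaTableWriter\n\n# some_df is a placeholder for an existing Spark DataFrame\nwriter = DeltaTableWriter(table=\"my_table\", df=some_df)  # supplying df as an input field is an assumption\nwriter.write()  # runs execute() and writes the DataFrame to the Delta table\n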
"},{"location":"reference/spark/writers.html#benefits-of-using-writers-in-data-pipelines","title":"Benefits of Using Writers in Data Pipelines","text":"Using Writer
classes in your data pipelines has several benefits:
- Simplicity: Writers abstract away the details of writing data to various destinations, allowing you to focus on the logic of your pipeline.
- Consistency: By using Writers, you ensure that data is written in a consistent manner across different parts of your pipeline.
- Flexibility: Writers can be easily swapped out for different data destinations without changing the rest of your pipeline.
- Efficiency: Writers automatically manage resources like connections and file handles, ensuring efficient use of resources.
- Early Input Validation: As a
Writer
is a type of SparkStep
, which in turn is a Step
and a type of Pydantic BaseModel
, all inputs are validated when an instance of a Writer
class is created. This early validation helps catch errors related to invalid input, such as an invalid URL for a database, before the PySpark pipeline starts executing. This can help avoid unnecessary computation and make your data pipelines more robust and reliable.
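A minimal sketch of this behaviour (the JdbcTableWriter class and its url field are hypothetical): constructing the Writer with a missing required field raises a Pydantic ValidationError immediately, before any Spark job runs:
from pydantic import ValidationError\nfrom koheesio.spark.writers import Writer\n\nclass JdbcTableWriter(Writer):\n \"\"\"Hypothetical Writer that needs a database url.\"\"\"\n url: str\n\n def execute(self):\n ... # write self.df to the database at self.url\n\ntry:\n JdbcTableWriter() # 'url' is missing\nexcept ValidationError as e:\n print(e) # validation fails here, before any Spark code is executed\n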
By using the concept of a Writer
, you can create data pipelines that are simple, consistent, flexible, efficient, and reliable.
"},{"location":"tutorials/advanced-data-processing.html","title":"Advanced Data Processing with Koheesio","text":"In this guide, we will explore some advanced data processing techniques using Koheesio. We will cover topics such as complex transformations, handling large datasets, and optimizing performance.
"},{"location":"tutorials/advanced-data-processing.html#complex-transformations","title":"Complex Transformations","text":"Koheesio provides a variety of built-in transformations, but sometimes you may need to perform more complex operations on your data. In such cases, you can create custom transformations.
Here's an example of a custom transformation that normalizes a column in a DataFrame:
from pyspark.sql import DataFrame\nfrom koheesio.spark.transformations.transform import Transform\n\ndef normalize_column(df: DataFrame, column: str) -> DataFrame:\n max_value = df.agg({column: \"max\"}).collect()[0][0]\n min_value = df.agg({column: \"min\"}).collect()[0][0]\n return df.withColumn(column, (df[column] - min_value) / (max_value - min_value))\n\n\nclass NormalizeColumnTransform(Transform):\n column: str\n\n def transform(self, df: DataFrame) -> DataFrame:\n return normalize_column(df, self.column)\n
"},{"location":"tutorials/advanced-data-processing.html#handling-large-datasets","title":"Handling Large Datasets","text":"When working with large datasets, it's important to manage resources effectively to ensure good performance. Koheesio provides several features to help with this.
"},{"location":"tutorials/advanced-data-processing.html#partitioning","title":"Partitioning","text":"Partitioning is a technique that divides your data into smaller, more manageable pieces, called partitions. Koheesio allows you to specify the partitioning scheme for your data when writing it to a target.
from koheesio.steps.writers.delta import DeltaTableWriter\nfrom koheesio.tasks.etl_task import EtlTask\n\nclass MyTask(EtlTask):\n target = DeltaTableWriter(table=\"my_table\", partitionBy=[\"column1\", \"column2\"])\n
"},{"location":"tutorials/getting-started.html","title":"Getting Started with Koheesio","text":""},{"location":"tutorials/getting-started.html#requirements","title":"Requirements","text":" - Python 3.9+
"},{"location":"tutorials/getting-started.html#installation","title":"Installation","text":""},{"location":"tutorials/getting-started.html#poetry","title":"Poetry","text":"If you're using Poetry, add the following entry to the pyproject.toml
file:
pyproject.toml[[tool.poetry.source]]\nname = \"nike\"\nurl = \"https://artifactory.nike.com/artifactory/api/pypi/python-virtual/simple\"\nsecondary = true\n
poetry add koheesio\n
"},{"location":"tutorials/getting-started.html#pip","title":"pip","text":"If you're using pip, run the following command to install Koheesio:
pip install koheesio\n
"},{"location":"tutorials/getting-started.html#basic-usage","title":"Basic Usage","text":"Once you've installed Koheesio, you can start using it in your Python scripts. Here's a basic example:
from koheesio import Step\n\n# Define a step\nclass MyStep(Step):\n def execute(self):\n # Your step logic here\n pass\n\n# Create an instance of the step\nstep = MyStep()\n\n# Run the step\nstep.execute()\n
"},{"location":"tutorials/getting-started.html#advanced-usage","title":"Advanced Usage","text":"from pyspark.sql.functions import lit\nfrom pyspark.sql import DataFrame, SparkSession\n\n# Step 1: import Koheesio dependencies\nfrom koheesio.context import Context\nfrom koheesio.steps.readers.dummy import DummyReader\nfrom koheesio.steps.transformations.camel_to_snake import CamelToSnakeTransformation\nfrom koheesio.steps.writers.dummy import DummyWriter\nfrom koheesio.tasks.etl_task import EtlTask\n\n# Step 2: Set up a SparkSession\nspark = SparkSession.builder.getOrCreate()\n\n# Step 3: Configure your Context\ncontext = Context({\n \"source\": DummyReader(),\n \"transformations\": [CamelToSnakeTransformation()],\n \"target\": DummyWriter(),\n \"my_favorite_movie\": \"inception\",\n})\n\n# Step 4: Create a Task\nclass MyFavoriteMovieTask(EtlTask):\n my_favorite_movie: str\n\n def transform(self, df: DataFrame = None) -> DataFrame:\n df = df.withColumn(\"MyFavoriteMovie\", lit(self.my_favorite_movie))\n return super().transform(df)\n\n# Step 5: Run your Task\ntask = MyFavoriteMovieTask(**context)\ntask.run()\n
"},{"location":"tutorials/getting-started.html#contributing","title":"Contributing","text":"If you want to contribute to Koheesio, check out the CONTRIBUTING.md
file in this repository. It contains guidelines for contributing, including how to submit issues and pull requests.
"},{"location":"tutorials/getting-started.html#testing","title":"Testing","text":"To run the tests for Koheesio, use the following command:
make dev-test\n
This will run all the tests in the tests
directory.
"},{"location":"tutorials/hello-world.html","title":"Simple Examples","text":""},{"location":"tutorials/hello-world.html#creating-a-custom-step","title":"Creating a Custom Step","text":"This example demonstrates how to use the SparkStep
class from the koheesio
library to create a custom step named HelloWorldStep
.
"},{"location":"tutorials/hello-world.html#code","title":"Code","text":"from koheesio.steps.step import SparkStep\n\nclass HelloWorldStep(SparkStep):\n message: str\n\n def execute(self) -> SparkStep.Output:\n # create a DataFrame with a single row containing the message\n self.output.df = self.spark.createDataFrame([(1, self.message)], [\"id\", \"message\"])\n
"},{"location":"tutorials/hello-world.html#usage","title":"Usage","text":"hello_world_step = HelloWorldStep(message=\"Hello, World!\")\nhello_world_step.execute()\n\nhello_world_step.output.df.show()\n
"},{"location":"tutorials/hello-world.html#understanding-the-code","title":"Understanding the Code","text":"The HelloWorldStep
class is a SparkStep
in Koheesio, designed to generate a DataFrame with a single row containing a custom message. Here's a more detailed overview:
HelloWorldStep
inherits from SparkStep
, a fundamental building block in Koheesio for creating data processing steps with Apache Spark. - It has a
message
attribute. When creating an instance of HelloWorldStep
, you can pass a custom message that will be used in the DataFrame. SparkStep
has a spark
attribute, which is the active SparkSession. This is the entry point for any Spark functionality, allowing the step to interact with the Spark cluster. SparkStep
also includes an Output
class, used to store the output of the step. In this case, Output
has a df
attribute to store the output DataFrame. - The
execute
method creates a DataFrame with the custom message and stores it in output.df
. It doesn't return a value explicitly; instead, the output DataFrame can be accessed via output.df
. - Koheesio uses pydantic for automatic validation of the step's input and output, ensuring they are correctly defined and of the correct types.
Note: Pydantic is a data validation library that provides a way to validate that the data (in this case, the input and output of the step) conforms to the expected format.
"},{"location":"tutorials/hello-world.html#creating-a-custom-task","title":"Creating a Custom Task","text":"This example demonstrates how to use the EtlTask
from the koheesio
library to create a custom task named MyFavoriteMovieTask
.
"},{"location":"tutorials/hello-world.html#code_1","title":"Code","text":"from typing import Any\nfrom pyspark.sql import DataFrame, functions as f\nfrom koheesio.steps.transformations import Transform\nfrom koheesio.tasks.etl_task import EtlTask\n\n\ndef add_column(df: DataFrame, target_column: str, value: Any):\n return df.withColumn(target_column, f.lit(value))\n\n\nclass MyFavoriteMovieTask(EtlTask):\n my_favorite_movie: str\n\n def transform(self, df: Optional[DataFrame] = None) -> DataFrame:\n df = df or self.extract()\n\n # pre-transformations specific to this class\n pre_transformations = [\n Transform(add_column, target_column=\"myFavoriteMovie\", value=self.my_favorite_movie)\n ]\n\n # execute transformations one by one\n for t in pre_transformations:\n df = t.transform(df)\n\n self.output.transform_df = df\n return df\n
"},{"location":"tutorials/hello-world.html#configuration","title":"Configuration","text":"Here is the sample.yaml
configuration file used in this example:
raw_layer:\n catalog: development\n schema: my_favorite_team\n table: some_random_table\nmovies:\n favorite: Office Space\nhash_settings:\n source_columns:\n - id\n - foo\n target_column: hash_uuid5\nsource:\n range: 4\n
"},{"location":"tutorials/hello-world.html#usage_1","title":"Usage","text":"from pyspark.sql import SparkSession\nfrom koheesio.context import Context\nfrom koheesio.steps.readers import DummyReader\nfrom koheesio.steps.writers.dummy import DummyWriter\n\ncontext = Context.from_yaml(\"sample.yaml\")\n\nSparkSession.builder.getOrCreate()\n\nmy_fav_mov_task = MyFavoriteMovieTask(\n source=DummyReader(**context.raw_layer),\n target=DummyWriter(truncate=False),\n my_favorite_movie=context.movies.favorite,\n)\nmy_fav_mov_task.execute()\n
"},{"location":"tutorials/hello-world.html#understanding-the-code_1","title":"Understanding the Code","text":"This example creates a MyFavoriteMovieTask
that adds a column named myFavoriteMovie
to the DataFrame. The value for this column is provided when the task is instantiated.
The MyFavoriteMovieTask
class is a custom task that extends the EtlTask
from the koheesio
library. It demonstrates how to add a custom transformation to a DataFrame. Here's a detailed breakdown:
-
MyFavoriteMovieTask
inherits from EtlTask
, a base class in Koheesio for creating Extract-Transform-Load (ETL) tasks with Apache Spark.
-
It has a my_favorite_movie
attribute. When creating an instance of MyFavoriteMovieTask
, you can pass a custom movie title that will be used in the DataFrame.
-
The transform
method is where the main logic of the task is implemented. It first extracts the data (if not already provided), then applies a series of transformations to the DataFrame.
-
In this case, the transformation is adding a new column to the DataFrame named myFavoriteMovie
, with the value set to the my_favorite_movie
attribute. This is done using the add_column
function and the Transform
class from Koheesio.
-
The transformed DataFrame is then stored in self.output.transform_df
.
-
The sample.yaml
configuration file is used to provide the context for the task, including the source data and the favorite movie title.
-
In the usage example, an instance of MyFavoriteMovieTask
is created with a DummyReader
as the source, a DummyWriter
as the target, and the favorite movie title from the context. The task is then executed, which runs the transformations and stores the result in self.output.transform_df
.
"},{"location":"tutorials/learn-koheesio.html","title":"Learn Koheesio","text":"Koheesio is designed to simplify the development of data engineering pipelines. It provides a structured way to define and execute data processing tasks, making it easier to build, test, and maintain complex data workflows.
"},{"location":"tutorials/learn-koheesio.html#core-concepts","title":"Core Concepts","text":"Koheesio is built around several core concepts:
- Step: The fundamental unit of work in Koheesio. It represents a single operation in a data pipeline, taking in inputs and producing outputs.
See the Step documentation for more information.
- Context: A configuration class used to set up the environment for a Task. It can be used to share variables across tasks and adapt the behavior of a Task based on its environment.
See the Context documentation for more information.
- Logger: A class for logging messages at different levels.
See the Logger documentation for more information.
The Logger and Context classes provide support, enabling detailed logging of the pipeline's execution and customization of the pipeline's behavior based on the environment, respectively.
"},{"location":"tutorials/learn-koheesio.html#implementations","title":"Implementations","text":"In the context of Koheesio, an implementation refers to a specific way of executing Steps, the fundamental units of work in Koheesio. Each implementation uses a different technology or approach to process data along with its own set of Steps, designed to work with the specific technology or approach used by the implementation.
For example, the Spark implementation includes Steps for reading data from a Spark DataFrame, transforming the data using Spark operations, and writing the data to a Spark-supported destination.
Currently, Koheesio supports two implementations: Spark, and AsyncIO.
"},{"location":"tutorials/learn-koheesio.html#spark","title":"Spark","text":"Requires: Apache Spark (pyspark) Installation: pip install koheesio[spark]
Module: koheesio.spark
This implementation uses Apache Spark, a powerful open-source unified analytics engine for large-scale data processing.
Steps that use this implementation can leverage Spark's capabilities for distributed data processing, making it suitable for handling large volumes of data. The Spark implementation includes the following types of Steps:
-
Reader: from koheesio.spark.readers import Reader
A type of Step that reads data from a source and stores the result (to make it available for subsequent steps). For more information, see the Reader documentation.
-
Writer: from koheesio.spark.writers import Writer
This controls how data is written to the output in both batch and streaming contexts. For more information, see the Writer documentation.
-
Transformation: from koheesio.spark.transformations import Transformation
A type of Step that takes a DataFrame as input and returns a DataFrame as output. For more information, see the Transformation documentation.
In any given pipeline, you can expect to use Readers, Writers, and Transformations to express the ETL logic. Readers are responsible for extracting data from various sources, such as databases, files, or APIs. Transformations then process this data, performing operations like filtering, aggregation, or conversion. Finally, Writers handle the loading of the transformed data to the desired destination, which could be a database, a file, or a data stream.
"},{"location":"tutorials/learn-koheesio.html#async","title":"Async","text":"Module: koheesio.asyncio
This implementation uses Python's asyncio library for writing single-threaded concurrent code using coroutines, multiplexing I/O access over sockets and other resources, running network clients and servers, and other related primitives. Steps that use this implementation can perform data processing tasks asynchronously, which can be beneficial for IO-bound tasks.
"},{"location":"tutorials/learn-koheesio.html#best-practices","title":"Best Practices","text":"Here are some best practices for using Koheesio:
-
Use Context: The Context
class in Koheesio is designed to behave like a dictionary, but with added features. It's a good practice to use Context
to customize the behavior of a task. This allows you to share variables across tasks and adapt the behavior of a task based on its environment; for example, by changing the source or target of the data between development and production environments.
-
Modular Design: Each step in the pipeline (reading, transformation, writing) should be encapsulated in its own class, making the code easier to understand and maintain. This also promotes re-usability as steps can be reused across different tasks.
-
Error Handling: Koheesio provides a consistent way to handle errors and exceptions in data processing tasks. Make sure to leverage this feature to make your pipelines robust and fault-tolerant.
-
Logging: Use the built-in logging feature in Koheesio to log information and errors in data processing tasks. This can be very helpful for debugging and monitoring the pipeline. Koheesio sets the log level to WARNING
by default, but you can change it to INFO
or DEBUG
as needed; see the sketch after this list.
-
Testing: Each step can be tested independently, making it easier to write unit tests. It's a good practice to write tests for your steps to ensure they are working as expected.
-
Use Transformations: The Transform
class in Koheesio allows you to define transformations on your data. It's a good practice to encapsulate your transformation logic in Transform
classes for better readability and maintainability.
-
Consistent Structure: Koheesio enforces a consistent structure for data processing tasks. Stick to this structure to make your codebase easier to understand for new developers.
-
Use Readers and Writers: Use the built-in Reader
and Writer
classes in Koheesio to handle data extraction and loading. This not only simplifies your code but also makes it more robust and efficient.
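Sketch for the logging point above, assuming Koheesio's loggers are registered with Python's standard logging module under the \"koheesio\" name (an assumption in this sketch):
import logging\n\n# raise the log level for Koheesio's loggers while debugging\n# (both the use of standard logging and the \"koheesio\" logger name are assumptions here)\nlogging.getLogger(\"koheesio\").setLevel(logging.DEBUG)\n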
Remember, these are general best practices and might need to be adapted based on your specific use case and requirements.
"},{"location":"tutorials/learn-koheesio.html#pydantic","title":"Pydantic","text":"Koheesio Steps are Pydantic models, which means they can be validated and serialized. This makes it easy to define the inputs and outputs of a Step, and to validate them before running the Step. Pydantic models also provide a consistent way to define the schema of the data that a Step expects and produces, making it easier to understand and maintain the code.
Learn more about Pydantic here.
"},{"location":"tutorials/onboarding.html","title":"Onboarding","text":"tags: - doctype/how-to
"},{"location":"tutorials/onboarding.html#onboarding-to-koheesio","title":"Onboarding to Koheesio","text":"Koheesio is a Python library that simplifies the development of data engineering pipelines. It provides a structured way to define and execute data processing tasks, making it easier to build, test, and maintain complex data workflows.
This guide will walk you through the process of transforming a traditional Spark application into a Koheesio pipeline along with explaining the advantages of using Koheesio over raw Spark.
"},{"location":"tutorials/onboarding.html#traditional-spark-application","title":"Traditional Spark Application","text":"First let's create a simple Spark application that you might use to process data.
The following Spark application reads a CSV file, performs a transformation, and writes the result to a Delta table. The transformation includes filtering data where age is greater than 18 and performing an aggregation to calculate the average salary per country. The result is then written to a Delta table partitioned by country.
from pyspark.sql import SparkSession\nfrom pyspark.sql.functions import col, avg\n\nspark = SparkSession.builder.getOrCreate()\n\n# Read data from CSV file\ndf = spark.read.csv(\"input.csv\", header=True, inferSchema=True)\n\n# Filter data where age is greater than 18\ndf = df.filter(col(\"age\") > 18)\n\n# Perform aggregation\ndf = df.groupBy(\"country\").agg(avg(\"salary\").alias(\"average_salary\"))\n\n# Write data to Delta table with partitioning\ndf.write.format(\"delta\").partitionBy(\"country\").save(\"/path/to/delta_table\")\n
"},{"location":"tutorials/onboarding.html#transforming-to-koheesio","title":"Transforming to Koheesio","text":"The same pipeline can be rewritten using Koheesio's EtlTask
. In this version, each step (reading, transformations, writing) is encapsulated in its own class, making the code easier to understand and maintain.
First, a CsvReader
is defined to read the input CSV file. Then, a DeltaTableWriter
is defined to write the result to a Delta table partitioned by country.
Two transformations are defined: 1. one to filter data where age is greater than 18 2. and, another to calculate the average salary per country.
These transformations are then passed to an EtlTask
along with the reader and writer. Finally, the EtlTask
is executed to run the pipeline.
from koheesio.spark.etl_task import EtlTask\nfrom koheesio.spark.readers.file_loader import CsvReader\nfrom koheesio.spark.writers.delta.batch import DeltaTableWriter\nfrom koheesio.spark.transformations.transform import Transform\nfrom pyspark.sql.functions import col, avg\n\n# Define reader\nreader = CsvReader(path=\"input.csv\", header=True, inferSchema=True)\n\n# Define writer\nwriter = DeltaTableWriter(table=\"delta_table\", partition_by=[\"country\"])\n\n# Define transformations\nage_transformation = Transform(\n func=lambda df: df.filter(col(\"age\") > 18)\n)\navg_salary_per_country = Transform(\n func=lambda df: df.groupBy(\"country\").agg(avg(\"salary\").alias(\"average_salary\"))\n)\n\n# Define and execute EtlTask\ntask = EtlTask(\n source=reader, \n target=writer, \n transformations=[\n age_transformation,\n avg_salary_per_country\n ]\n)\ntask.execute()\n
This approach with Koheesio provides several advantages. It makes the code more modular and easier to test. Each step can be tested independently and reused across different tasks. It also makes the pipeline more readable and easier to maintain."},{"location":"tutorials/onboarding.html#advantages-of-koheesio","title":"Advantages of Koheesio","text":"Using Koheesio instead of raw Spark has several advantages:
- Modularity: Each step in the pipeline (reading, transformation, writing) is encapsulated in its own class, making the code easier to understand and maintain.
- Reusability: Steps can be reused across different tasks, reducing code duplication.
- Testability: Each step can be tested independently, making it easier to write unit tests.
- Flexibility: The behavior of a task can be customized using a
Context
class. - Consistency: Koheesio enforces a consistent structure for data processing tasks, making it easier for new developers to understand the codebase.
- Error Handling: Koheesio provides a consistent way to handle errors and exceptions in data processing tasks.
- Logging: Koheesio provides a consistent way to log information and errors in data processing tasks.
In contrast, using the plain PySpark API for transformations can lead to more verbose and less structured code, which can be harder to understand, maintain, and test. It also doesn't provide the same level of error handling, logging, and flexibility as the Koheesio Transform class.
"},{"location":"tutorials/onboarding.html#using-a-context-class","title":"Using a Context Class","text":"Here's a simple example of how to use a Context
class to customize the behavior of a task. The Context class in Koheesio is designed to behave like a dictionary, but with added features.
from koheesio import Context\nfrom koheesio.spark.etl_task import EtlTask\nfrom koheesio.spark.readers.file_loader import CsvReader\nfrom koheesio.spark.writers.delta import DeltaTableWriter\nfrom koheesio.spark.transformations.transform import Transform\n\ncontext = Context({ # this could be stored in a JSON or YAML\n \"age_threshold\": 18,\n \"reader_options\": {\n \"path\": \"input.csv\",\n \"header\": True,\n \"inferSchema\": True\n },\n \"writer_options\": {\n \"table\": \"delta_table\",\n \"partition_by\": [\"country\"]\n }\n})\n\ntask = EtlTask(\n source = CsvReader(**context.reader_options),\n target = DeltaTableWriter(**context.writer_options),\n transformations = [\n Transform(func=lambda df: df.filter(df[\"age\"] > context.age_threshold))\n ]\n)\n\ntask.execute()\n
In this example, we're using CsvReader
to read the input data, DeltaTableWriter
to write the output data, and a Transform
step to filter the data based on the age threshold. The options for the reader and writer are stored in a Context
object, which can be easily updated or loaded from a JSON or YAML file.
"},{"location":"tutorials/testing-koheesio-steps.html","title":"Testing Koheesio Tasks","text":"Testing is a crucial part of any software development process. Koheesio provides a structured way to define and execute data processing tasks, which makes it easier to build, test, and maintain complex data workflows. This guide will walk you through the process of testing Koheesio tasks.
"},{"location":"tutorials/testing-koheesio-steps.html#unit-testing","title":"Unit Testing","text":"Unit testing involves testing individual components of the software in isolation. In the context of Koheesio, this means testing individual tasks or steps.
Here's an example of how to unit test a Koheesio task:
from koheesio.tasks.etl_task import EtlTask\nfrom koheesio.steps.readers import DummyReader\nfrom koheesio.steps.writers.dummy import DummyWriter\nfrom koheesio.steps.transformations import Transform\nfrom pyspark.sql import SparkSession, DataFrame\nfrom pyspark.sql.functions import col\n\n\ndef filter_age(df: DataFrame) -> DataFrame:\n return df.filter(col(\"Age\") > 18)\n\n\ndef test_etl_task():\n # Initialize SparkSession\n spark = SparkSession.builder.getOrCreate()\n\n # Create a DataFrame for testing\n data = [(\"John\", 19), (\"Anna\", 20), (\"Tom\", 18)]\n df = spark.createDataFrame(data, [\"Name\", \"Age\"])\n\n # Define the task\n task = EtlTask(\n source=DummyReader(df=df),\n target=DummyWriter(),\n transformations=[\n Transform(filter_age)\n ]\n )\n\n # Execute the task\n task.execute()\n\n # Assert the result\n result_df = task.output.df\n assert result_df.count() == 2\n assert result_df.filter(\"Name == 'Tom'\").count() == 0\n
In this example, we're testing an EtlTask that reads data from a DataFrame, applies a filter transformation, and writes the result to another DataFrame. The test asserts that the task correctly filters out rows where the age is less than or equal to 18.
"},{"location":"tutorials/testing-koheesio-steps.html#integration-testing","title":"Integration Testing","text":"Integration testing involves testing the interactions between different components of the software. In the context of Koheesio, this means testing the entirety of data flowing through one or more tasks.
We'll create a simple test for a hypothetical EtlTask that uses DeltaReader and DeltaWriter. We'll use pytest and unittest.mock to mock the responses of the reader and writer. First, let's assume that you have an EtlTask defined in a module named my_module. This task reads data from a Delta table, applies some transformations, and writes the result to another Delta table.
Here's an example of how to write an integration test for this task:
# my_module.py\nfrom koheesio.tasks.etl_task import EtlTask\nfrom koheesio.spark.readers.delta import DeltaReader\nfrom koheesio.steps.writers.delta import DeltaWriter\nfrom koheesio.steps.transformations import Transform\nfrom koheesio.context import Context\nfrom pyspark.sql.functions import col\n\n\ndef filter_age(df):\n return df.filter(col(\"Age\") > 18)\n\n\ncontext = Context({\n \"reader_options\": {\n \"table\": \"input_table\"\n },\n \"writer_options\": {\n \"table\": \"output_table\"\n }\n})\n\ntask = EtlTask(\n source=DeltaReader(**context.reader_options),\n target=DeltaWriter(**context.writer_options),\n transformations=[\n Transform(filter_age)\n ]\n)\n
Now, let's create a test for this task. We'll use pytest and unittest.mock to mock the responses of the reader and writer. We'll also use a pytest fixture to create a test context and a test DataFrame.
# test_my_module.py\nimport pytest\nfrom unittest.mock import MagicMock, patch\nfrom pyspark.sql import SparkSession\nfrom koheesio.context import Context\nfrom koheesio.steps.readers import Reader\nfrom koheesio.steps.writers import Writer\n\nfrom my_module import task\n\n@pytest.fixture(scope=\"module\")\ndef spark():\n return SparkSession.builder.getOrCreate()\n\n@pytest.fixture(scope=\"module\")\ndef test_context():\n return Context({\n \"reader_options\": {\n \"table\": \"test_input_table\"\n },\n \"writer_options\": {\n \"table\": \"test_output_table\"\n }\n })\n\n@pytest.fixture(scope=\"module\")\ndef test_df(spark):\n data = [(\"John\", 19), (\"Anna\", 20), (\"Tom\", 18)]\n return spark.createDataFrame(data, [\"Name\", \"Age\"])\n\ndef test_etl_task(spark, test_context, test_df):\n # Mock the read method of the Reader class\n with patch.object(Reader, \"read\", return_value=test_df):\n # Mock the write method of the Writer class\n with patch.object(Writer, \"write\") as mock_write:\n # Execute the task\n task.execute()\n\n # Assert the result\n result_df = task.output.df\n assert result_df.count() == 2\n assert result_df.filter(\"Name == 'Tom'\").count() == 0\n\n # Assert that the reader and writer were called with the correct arguments\n Reader.read.assert_called_once_with(**test_context.reader_options)\n mock_write.assert_called_once_with(**test_context.writer_options)\n
In this test, we're mocking the DeltaReader and DeltaWriter to return a test DataFrame and check that they're called with the correct arguments. We're also asserting that the task correctly filters out rows where the age is less than or equal to 18.
"},{"location":"misc/tags.html","title":"{{ page.title }}","text":""},{"location":"misc/tags.html#doctypeexplanation","title":"doctype/explanation","text":" - Approach documentation
"},{"location":"misc/tags.html#doctypehow-to","title":"doctype/how-to","text":" - How to
"}]}
\ No newline at end of file
+{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"index.html","title":"Home","text":""},{"location":"index.html#koheesio","title":"Koheesio","text":"CI/CD Package Meta Koheesio, named after the Finnish word for cohesion, is a robust Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.
The framework is versatile, aiming to support multiple implementations and working seamlessly with various data processing libraries or frameworks. This ensures that Koheesio can handle any data processing task, regardless of the underlying technology or data scale.
Koheesio uses Pydantic for strong typing, data validation, and settings management, ensuring a high level of type safety and structured configurations within pipeline components.
Koheesio's goal is to ensure predictable pipeline execution through a solid foundation of well-tested code and a rich set of features, making it an excellent choice for developers and organizations seeking to build robust and adaptable Data Pipelines.
"},{"location":"index.html#what-sets-koheesio-apart-from-other-libraries","title":"What sets Koheesio apart from other libraries?\"","text":"Koheesio encapsulates years of data engineering expertise, fostering a collaborative and innovative community. While similar libraries exist, Koheesio's focus on data pipelines, integration with PySpark, and specific design for tasks like data transformation, ETL jobs, data validation, and large-scale data processing sets it apart.
Koheesio aims to provide a rich set of features including readers, writers, and transformations for any type of Data processing. Koheesio is not in competition with other libraries. Its aim is to offer wide-ranging support and focus on utility in a multitude of scenarios. Our preference is for integration, not competition...
We invite contributions from all, promoting collaboration and innovation in the data engineering community.
"},{"location":"index.html#koheesio-core-components","title":"Koheesio Core Components","text":"Here are the key components included in Koheesio:
- Step: This is the fundamental unit of work in Koheesio. It represents a single operation in a data pipeline, taking in inputs and producing outputs.
\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 1 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Output 1 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u2502 \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 2 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Step \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Output 2 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u2502 \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 3 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Output 3 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n
- Context: This is a configuration class used to set up the environment for a Task. It can be used to share variables across tasks and adapt the behavior of a Task based on its environment.
- Logger: This is a class for logging messages at different levels.
"},{"location":"index.html#installation","title":"Installation","text":"You can install Koheesio using either pip or poetry.
"},{"location":"index.html#using-pip","title":"Using Pip","text":"To install Koheesio using pip, run the following command in your terminal:
pip install koheesio\n
"},{"location":"index.html#using-hatch","title":"Using Hatch","text":"If you're using Hatch for package management, you can add Koheesio to your project by simply adding koheesio to your pyproject.toml
.
"},{"location":"index.html#using-poetry","title":"Using Poetry","text":"If you're using poetry for package management, you can add Koheesio to your project with the following command:
poetry add koheesio\n
or add the following line to your pyproject.toml
(under [tool.poetry.dependencies]
), making sure to replace ...
with the version you want to have installed:
koheesio = {version = \"...\"}\n
"},{"location":"index.html#extras","title":"Extras","text":"Koheesio also provides some additional features that can be useful in certain scenarios. These include:
-
Spark Expectations: Available through the koheesio.steps.integration.spark.dq.spark_expectations
module; installable through the se
extra.
- SE Provides Data Quality checks for Spark DataFrames.
- For more information, refer to the Spark Expectations docs.
-
Box: Available through the koheesio.steps.integration.box
module; installable through the box
extra.
- Box is a cloud content management and file sharing service for businesses.
-
SFTP: Available through the koheesio.steps.integration.spark.sftp
module; installable through the sftp
extra.
- SFTP is a network protocol used for secure file transfer over a secure shell.
Note: Some of the steps require extra dependencies. See the Extras section for additional info. Extras can be added to Poetry by adding extras=['name_of_the_extra']
to the toml entry mentioned above
"},{"location":"index.html#contributing","title":"Contributing","text":""},{"location":"index.html#how-to-contribute","title":"How to Contribute","text":"We welcome contributions to our project! Here's a brief overview of our development process:
-
Code Standards: We use pylint
, black
, and mypy
to maintain code standards. Please ensure your code passes these checks by running make check
. No errors or warnings should be reported by the linter before you submit a pull request.
-
Testing: We use pytest
for testing. Run the tests with make test
and ensure all tests pass before submitting a pull request.
-
Release Process: We aim for frequent releases. Typically when we have a new feature or bugfix, a developer with admin rights will create a new release on GitHub and publish the new version to PyPI.
For more detailed information, please refer to our contribution guidelines. We also adhere to Nike's Code of Conduct and Nike's Individual Contributor License Agreement.
"},{"location":"index.html#additional-resources","title":"Additional Resources","text":" - General GitHub documentation
- GitHub pull request documentation
- Nike OSS
"},{"location":"api_reference/index.html","title":"API Reference","text":""},{"location":"api_reference/index.html#koheesio.ABOUT","title":"koheesio.ABOUT module-attribute
","text":"ABOUT = _about()\n
"},{"location":"api_reference/index.html#koheesio.VERSION","title":"koheesio.VERSION module-attribute
","text":"VERSION = __version__\n
"},{"location":"api_reference/index.html#koheesio.BaseModel","title":"koheesio.BaseModel","text":"Base model for all models.
Extends pydantic BaseModel with some additional configuration. To be used as a base class for all models in Koheesio instead of pydantic.BaseModel.
Additional methods and properties: Different Modes This BaseModel class supports lazy mode. This means that validation of the items stored in the class can be called at will instead of being forced to run it upfront.
-
Normal mode: you need to know the values ahead of time
normal_mode = YourOwnModel(a=\"foo\", b=42)\n
-
Lazy mode: being able to defer the validation until later
lazy_mode = YourOwnModel.lazy()\nlazy_mode.a = \"foo\"\nlazy_mode.b = 42\nlazy_mode.validate_output()\n
The prime advantage of using lazy mode is that you don't have to know all your outputs up front, and can add them as they become available. All while still being able to validate that you have collected all your output at the end. -
With statements: With statements are also allowed. The validate_output
method from the earlier example will run upon exit of the with-statement.
with YourOwnModel.lazy() as with_output:\n with_output.a = \"foo\"\n with_output.b = 42\n
Note: that a lazy mode BaseModel object is required to work with a with-statement.
Examples:
from koheesio.models import BaseModel\n\n\nclass Person(BaseModel):\n name: str\n age: int\n\n\n# Using the lazy method to create an instance without immediate validation\nperson = Person.lazy()\n\n# Setting attributes\nperson.name = \"John Doe\"\nperson.age = 30\n\n# Now we validate the instance\nperson.validate_output()\n\nprint(person)\n
In this example, the Person instance is created without immediate validation. The attributes name and age are set afterward. The validate_output
method is then called to validate the instance.
Koheesio specific configuration: Koheesio models are configured differently from Pydantic defaults. The following configuration is used:
-
extra=\"allow\"
This setting allows for extra fields that are not specified in the model definition. If a field is present in the data but not in the model, it will not raise an error. Pydantic default is \"ignore\", which means that extra attributes are ignored.
-
arbitrary_types_allowed=True
This setting allows for fields in the model to be of any type. This is useful when you want to include fields in your model that are not standard Python types. Pydantic default is False, which means that fields must be of a standard Python type.
-
populate_by_name=True
This setting allows an aliased field to be populated by its name as given by the model attribute, as well as the alias. This was known as allow_population_by_field_name in pydantic v1. Pydantic default is False, which means that fields can only be populated by their alias.
-
validate_assignment=False
This setting determines whether the model should be revalidated when the data is changed. If set to True
, every time a field is assigned a new value, the entire model is validated again.
Pydantic default is (also) False
, which means that the model is not revalidated when the data is changed. The default behavior of Pydantic is to validate the data when the model is created. In case the user changes the data after the model is created, the model is not revalidated.
-
revalidate_instances=\"subclass-instances\"
This setting determines whether to revalidate models during validation if the instance is a subclass of the model. This is important as inheritance is used a lot in Koheesio. Pydantic default is never
, which means that the model and dataclass instances are not revalidated during validation.
-
validate_default=True
This setting determines whether to validate default values during validation. When set to True, default values are checked during the validation process. We opt to set this to True, as we are attempting to make the sure that the data is valid prior to running / executing any Step. Pydantic default is False, which means that default values are not validated during validation.
-
frozen=False
This setting determines whether the model is immutable. If set to True, once a model is created, its fields cannot be changed. Pydantic default is also False, which means that the model is mutable.
-
coerce_numbers_to_str=True
This setting determines whether to convert number fields to strings. When set to True, enables automatic coercion of any Number
type to str
. Pydantic doesn't allow number types (int
, float
, Decimal
) to be coerced as type str
by default.
-
use_enum_values=True
This setting determines whether to use the values of Enum fields. If set to True, the actual value of the Enum is used instead of the reference. Pydantic default is False, which means that the reference to the Enum is used.
"},{"location":"api_reference/index.html#koheesio.BaseModel--fields","title":"Fields","text":"Every Koheesio BaseModel has two fields: name
and description
. These fields are used to provide a name and a description to the model.
-
name
: This is the name of the Model. If not provided, it defaults to the class name.
-
description
: This is the description of the Model. It has several default behaviors:
- If not provided, it defaults to the docstring of the class.
- If the docstring is not provided, it defaults to the name of the class.
- For multi-line descriptions, it has the following behaviors:
- Only the first non-empty line is used.
- Empty lines are removed.
- Only the first 3 lines are considered.
- Only the first 120 characters are considered.
"},{"location":"api_reference/index.html#koheesio.BaseModel--validators","title":"Validators","text":" _set_name_and_description
: Set the name and description of the Model as per the rules mentioned above.
"},{"location":"api_reference/index.html#koheesio.BaseModel--properties","title":"Properties","text":" log
: Returns a logger with the name of the class.
"},{"location":"api_reference/index.html#koheesio.BaseModel--class-methods","title":"Class Methods","text":" from_basemodel
: Returns a new BaseModel instance based on the data of another BaseModel. from_context
: Creates BaseModel instance from a given Context. from_dict
: Creates BaseModel instance from a given dictionary. from_json
: Creates BaseModel instance from a given JSON string. from_toml
: Creates BaseModel object from a given toml file. from_yaml
: Creates BaseModel object from a given yaml file. lazy
: Constructs the model without doing validation.
"},{"location":"api_reference/index.html#koheesio.BaseModel--dunder-methods","title":"Dunder Methods","text":" __add__
: Allows to add two BaseModel instances together. __enter__
: Allows for using the model in a with-statement. __exit__
: Allows for using the model in a with-statement. __setitem__
: Set Item dunder method for BaseModel. __getitem__
: Get Item dunder method for BaseModel.
"},{"location":"api_reference/index.html#koheesio.BaseModel--instance-methods","title":"Instance Methods","text":" hasattr
: Check if given key is present in the model. get
: Get an attribute of the model, but don't fail if not present. merge
: Merge key,value map with self. set
: Allows for subscribing / assigning to class[key]
. to_context
: Converts the BaseModel instance to a Context object. to_dict
: Converts the BaseModel instance to a dictionary. to_json
: Converts the BaseModel instance to a JSON string. to_yaml
: Converts the BaseModel instance to a YAML string.
"},{"location":"api_reference/index.html#koheesio.BaseModel.description","title":"description class-attribute
instance-attribute
","text":"description: Optional[str] = Field(default=None, description='Description of the Model')\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.log","title":"log property
","text":"log: Logger\n
Returns a logger with the name of the class
"},{"location":"api_reference/index.html#koheesio.BaseModel.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(extra='allow', arbitrary_types_allowed=True, populate_by_name=True, validate_assignment=False, revalidate_instances='subclass-instances', validate_default=True, frozen=False, coerce_numbers_to_str=True, use_enum_values=True)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.name","title":"name class-attribute
instance-attribute
","text":"name: Optional[str] = Field(default=None, description='Name of the Model')\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_basemodel","title":"from_basemodel classmethod
","text":"from_basemodel(basemodel: BaseModel, **kwargs)\n
Returns a new BaseModel instance based on the data of another BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_basemodel(cls, basemodel: BaseModel, **kwargs):\n \"\"\"Returns a new BaseModel instance based on the data of another BaseModel\"\"\"\n kwargs = {**basemodel.model_dump(), **kwargs}\n return cls(**kwargs)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_context","title":"from_context classmethod
","text":"from_context(context: Context) -> BaseModel\n
Creates BaseModel instance from a given Context
You have to make sure that the Context object has the necessary attributes to create the model.
Examples:
class SomeStep(BaseModel):\n foo: str\n\n\ncontext = Context(foo=\"bar\")\nsome_step = SomeStep.from_context(context)\nprint(some_step.foo) # prints 'bar'\n
Parameters:
Name Type Description Default context
Context
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_context(cls, context: Context) -> BaseModel:\n \"\"\"Creates BaseModel instance from a given Context\n\n You have to make sure that the Context object has the necessary attributes to create the model.\n\n Examples\n --------\n ```python\n class SomeStep(BaseModel):\n foo: str\n\n\n context = Context(foo=\"bar\")\n some_step = SomeStep.from_context(context)\n print(some_step.foo) # prints 'bar'\n ```\n\n Parameters\n ----------\n context: Context\n\n Returns\n -------\n BaseModel\n \"\"\"\n return cls(**context)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_dict","title":"from_dict classmethod
","text":"from_dict(data: Dict[str, Any]) -> BaseModel\n
Creates BaseModel instance from a given dictionary
Parameters:
Name Type Description Default data
Dict[str, Any]
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_dict(cls, data: Dict[str, Any]) -> BaseModel:\n \"\"\"Creates BaseModel instance from a given dictionary\n\n Parameters\n ----------\n data: Dict[str, Any]\n\n Returns\n -------\n BaseModel\n \"\"\"\n return cls(**data)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_json","title":"from_json classmethod
","text":"from_json(json_file_or_str: Union[str, Path]) -> BaseModel\n
Creates BaseModel instance from a given JSON string
BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored in the BaseModel object, which is not possible with the standard json library.
See Also Context.from_json : Deserializes a JSON string to a Context object
Parameters:
Name Type Description Default json_file_or_str
Union[str, Path]
Pathlike string or Path that points to the json file or string containing json
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_json(cls, json_file_or_str: Union[str, Path]) -> BaseModel:\n \"\"\"Creates BaseModel instance from a given JSON string\n\n BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses\n jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored\n in the BaseModel object, which is not possible with the standard json library.\n\n See Also\n --------\n Context.from_json : Deserializes a JSON string to a Context object\n\n Parameters\n ----------\n json_file_or_str : Union[str, Path]\n Pathlike string or Path that points to the json file or string containing json\n\n Returns\n -------\n BaseModel\n \"\"\"\n _context = Context.from_json(json_file_or_str)\n return cls.from_context(_context)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_toml","title":"from_toml classmethod
","text":"from_toml(toml_file_or_str: Union[str, Path]) -> BaseModel\n
Creates BaseModel object from a given toml file
Note: BaseModel offloads the serialization and deserialization of the TOML string to Context class.
Parameters:
Name Type Description Default toml_file_or_str
Union[str, Path]
Pathlike string or Path that points to the toml file, or string containing toml
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_toml(cls, toml_file_or_str: Union[str, Path]) -> BaseModel:\n \"\"\"Creates BaseModel object from a given toml file\n\n Note: BaseModel offloads the serialization and deserialization of the TOML string to Context class.\n\n Parameters\n ----------\n toml_file_or_str: str or Path\n Pathlike string or Path that points to the toml file, or string containing toml\n\n Returns\n -------\n BaseModel\n \"\"\"\n _context = Context.from_toml(toml_file_or_str)\n return cls.from_context(_context)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.from_yaml","title":"from_yaml classmethod
","text":"from_yaml(yaml_file_or_str: str) -> BaseModel\n
Creates BaseModel object from a given yaml file
Note: BaseModel offloads the serialization and deserialization of the YAML string to Context class.
Parameters:
Name Type Description Default yaml_file_or_str
str
Pathlike string or Path that points to the yaml file, or string containing yaml
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_yaml(cls, yaml_file_or_str: str) -> BaseModel:\n \"\"\"Creates BaseModel object from a given yaml file\n\n Note: BaseModel offloads the serialization and deserialization of the YAML string to Context class.\n\n Parameters\n ----------\n yaml_file_or_str: str or Path\n Pathlike string or Path that points to the yaml file, or string containing yaml\n\n Returns\n -------\n BaseModel\n \"\"\"\n _context = Context.from_yaml(yaml_file_or_str)\n return cls.from_context(_context)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.get","title":"get","text":"get(key: str, default: Optional[Any] = None)\n
Get an attribute of the model, but don't fail if not present
Similar to dict.get()
Examples:
step_output = StepOutput(foo=\"bar\")\nstep_output.get(\"foo\") # returns 'bar'\nstep_output.get(\"non_existent_key\", \"oops\") # returns 'oops'\n
Parameters:
Name Type Description Default key
str
name of the key to get
required default
Optional[Any]
Default value in case the attribute does not exist
None
Returns:
Type Description Any
The value of the attribute
Source code in src/koheesio/models/__init__.py
def get(self, key: str, default: Optional[Any] = None):\n \"\"\"Get an attribute of the model, but don't fail if not present\n\n Similar to dict.get()\n\n Examples\n --------\n ```python\n step_output = StepOutput(foo=\"bar\")\n step_output.get(\"foo\") # returns 'bar'\n step_output.get(\"non_existent_key\", \"oops\") # returns 'oops'\n ```\n\n Parameters\n ----------\n key: str\n name of the key to get\n default: Optional[Any]\n Default value in case the attribute does not exist\n\n Returns\n -------\n Any\n The value of the attribute\n \"\"\"\n if self.hasattr(key):\n return self.__getitem__(key)\n return default\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.hasattr","title":"hasattr","text":"hasattr(key: str) -> bool\n
Check if given key is present in the model
Parameters:
Name Type Description Default key
str
required Returns:
Type Description bool
Source code in src/koheesio/models/__init__.py
def hasattr(self, key: str) -> bool:\n \"\"\"Check if given key is present in the model\n\n Parameters\n ----------\n key: str\n\n Returns\n -------\n bool\n \"\"\"\n return hasattr(self, key)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.lazy","title":"lazy classmethod
","text":"lazy()\n
Constructs the model without doing validation
Essentially an alias to BaseModel.construct()
Source code in src/koheesio/models/__init__.py
@classmethod\ndef lazy(cls):\n \"\"\"Constructs the model without doing validation\n\n Essentially an alias to BaseModel.construct()\n \"\"\"\n return cls.model_construct()\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.merge","title":"merge","text":"merge(other: Union[Dict, BaseModel])\n
Merge key,value map with self
Functionally similar to adding two dicts together; like running {**dict_a, **dict_b}
.
Examples:
step_output = StepOutput(foo=\"bar\")\nstep_output.merge({\"lorem\": \"ipsum\"}) # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n
Parameters:
Name Type Description Default other
Union[Dict, BaseModel]
Dict or another instance of a BaseModel class that will be added to self
required Source code in src/koheesio/models/__init__.py
def merge(self, other: Union[Dict, BaseModel]):\n \"\"\"Merge key,value map with self\n\n Functionally similar to adding two dicts together; like running `{**dict_a, **dict_b}`.\n\n Examples\n --------\n ```python\n step_output = StepOutput(foo=\"bar\")\n step_output.merge({\"lorem\": \"ipsum\"}) # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n ```\n\n Parameters\n ----------\n other: Union[Dict, BaseModel]\n Dict or another instance of a BaseModel class that will be added to self\n \"\"\"\n if isinstance(other, BaseModel):\n other = other.model_dump() # ensures we really have a dict\n\n for k, v in other.items():\n self.set(k, v)\n\n return self\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.set","title":"set","text":"set(key: str, value: Any)\n
Allows for subscribing / assigning to class[key]
.
Examples:
step_output = StepOutput(foo=\"bar\")\nstep_output.set(foo\", \"baz\") # overwrites 'foo' to be 'baz'\n
Parameters:
Name Type Description Default key
str
The key of the attribute to assign to
required value
Any
Value that should be assigned to the given key
required Source code in src/koheesio/models/__init__.py
def set(self, key: str, value: Any):\n \"\"\"Allows for subscribing / assigning to `class[key]`.\n\n Examples\n --------\n ```python\n step_output = StepOutput(foo=\"bar\")\n step_output.set(foo\", \"baz\") # overwrites 'foo' to be 'baz'\n ```\n\n Parameters\n ----------\n key: str\n The key of the attribute to assign to\n value: Any\n Value that should be assigned to the given key\n \"\"\"\n self.__setitem__(key, value)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.to_context","title":"to_context","text":"to_context() -> Context\n
Converts the BaseModel instance to a Context object
Returns:
Type Description Context
Source code in src/koheesio/models/__init__.py
def to_context(self) -> Context:\n \"\"\"Converts the BaseModel instance to a Context object\n\n Returns\n -------\n Context\n \"\"\"\n return Context(**self.to_dict())\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.to_dict","title":"to_dict","text":"to_dict() -> Dict[str, Any]\n
Converts the BaseModel instance to a dictionary
Returns:
Type Description Dict[str, Any]
Source code in src/koheesio/models/__init__.py
def to_dict(self) -> Dict[str, Any]:\n \"\"\"Converts the BaseModel instance to a dictionary\n\n Returns\n -------\n Dict[str, Any]\n \"\"\"\n return self.model_dump()\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.to_json","title":"to_json","text":"to_json(pretty: bool = False)\n
Converts the BaseModel instance to a JSON string
BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored in the BaseModel object, which is not possible with the standard json library.
See Also Context.to_json : Serializes a Context object to a JSON string
Parameters:
Name Type Description Default pretty
bool
Toggles whether to return a pretty json string or not
False
Returns:
Type Description str
containing all parameters of the BaseModel instance
Source code in src/koheesio/models/__init__.py
def to_json(self, pretty: bool = False):\n \"\"\"Converts the BaseModel instance to a JSON string\n\n BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses\n jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored\n in the BaseModel object, which is not possible with the standard json library.\n\n See Also\n --------\n Context.to_json : Serializes a Context object to a JSON string\n\n Parameters\n ----------\n pretty : bool, optional, default=False\n Toggles whether to return a pretty json string or not\n\n Returns\n -------\n str\n containing all parameters of the BaseModel instance\n \"\"\"\n _context = self.to_context()\n return _context.to_json(pretty=pretty)\n
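As a brief usage sketch (hypothetical field; the exact string depends on the fields present on the model and on jsonpickle):
step_output = StepOutput(foo=\"bar\")\nstep_output.to_json() # jsonpickle-encoded string containing all fields of the model\nstep_output.to_json(pretty=True) # same content, indented with 4 spaces\n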
"},{"location":"api_reference/index.html#koheesio.BaseModel.to_yaml","title":"to_yaml","text":"to_yaml(clean: bool = False) -> str\n
Converts the BaseModel instance to a YAML string
BaseModel offloads the serialization and deserialization of the YAML string to Context class.
Parameters:
Name Type Description Default clean
bool
Toggles whether to remove !!python/object:...
from yaml or not. Default: False
False
Returns:
Type Description str
containing all parameters of the BaseModel instance
Source code in src/koheesio/models/__init__.py
def to_yaml(self, clean: bool = False) -> str:\n \"\"\"Converts the BaseModel instance to a YAML string\n\n BaseModel offloads the serialization and deserialization of the YAML string to Context class.\n\n Parameters\n ----------\n clean: bool\n Toggles whether to remove `!!python/object:...` from yaml or not.\n Default: False\n\n Returns\n -------\n str\n containing all parameters of the BaseModel instance\n \"\"\"\n _context = self.to_context()\n return _context.to_yaml(clean=clean)\n
"},{"location":"api_reference/index.html#koheesio.BaseModel.validate","title":"validate","text":"validate() -> BaseModel\n
Validate the BaseModel instance
This method is used to validate the BaseModel instance. It is used in conjunction with the lazy method to validate the instance after all the attributes have been set.
This method is intended to be used with the lazy
method. The lazy
method is used to create an instance of the BaseModel without immediate validation. The validate
method is then used to validate the instance after.
Note: in the Pydantic BaseModel, the validate
method emits a deprecation warning. This is because Pydantic recommends using the validate_model
method instead. However, we are using the validate
method here in a different context and a slightly different way.
Examples:
class FooModel(BaseModel):\n foo: str\n lorem: str\n\n\nfoo_model = FooModel.lazy()\nfoo_model.foo = \"bar\"\nfoo_model.lorem = \"ipsum\"\nfoo_model.validate()\n
In this example, the foo_model
instance is created without immediate validation. The attributes foo and lorem are set afterward. The validate
method is then called to validate the instance. Returns:
Type Description BaseModel
The BaseModel instance
Source code in src/koheesio/models/__init__.py
def validate(self) -> BaseModel:\n \"\"\"Validate the BaseModel instance\n\n This method is used to validate the BaseModel instance. It is used in conjunction with the lazy method to\n validate the instance after all the attributes have been set.\n\n This method is intended to be used with the `lazy` method. The `lazy` method is used to create an instance of\n the BaseModel without immediate validation. The `validate` method is then used to validate the instance after.\n\n > Note: in the Pydantic BaseModel, the `validate` method throws a deprecated warning. This is because Pydantic\n recommends using the `validate_model` method instead. However, we are using the `validate` method here in a\n different context and a slightly different way.\n\n Examples\n --------\n ```python\n class FooModel(BaseModel):\n foo: str\n lorem: str\n\n\n foo_model = FooModel.lazy()\n foo_model.foo = \"bar\"\n foo_model.lorem = \"ipsum\"\n foo_model.validate()\n ```\n In this example, the `foo_model` instance is created without immediate validation. The attributes foo and lorem\n are set afterward. The `validate` method is then called to validate the instance.\n\n Returns\n -------\n BaseModel\n The BaseModel instance\n \"\"\"\n return self.model_validate(self.model_dump())\n
"},{"location":"api_reference/index.html#koheesio.Context","title":"koheesio.Context","text":"Context(*args, **kwargs)\n
The Context class is a key component of the Koheesio framework, designed to manage configuration data and shared variables across tasks and steps in your application. It behaves much like a dictionary, but with added functionalities.
Key Features - Nested keys: Supports accessing and adding nested keys similar to dictionary keys.
- Recursive merging: Merges two Contexts together, with the incoming Context having priority.
- Serialization/Deserialization: Easily created from a yaml, toml, or json file, or a dictionary, and can be converted back to a dictionary.
- Handling complex Python objects: Uses jsonpickle for serialization and deserialization of complex Python objects to and from JSON.
For a comprehensive guide on the usage, examples, and additional features of the Context class, please refer to the reference/concepts/context section of the Koheesio documentation.
Methods:
Name Description add
Add a key/value pair to the context.
get
Get value of a given key.
get_item
Acts just like .get
, except that it returns the key also.
contains
Check if the context contains a given key.
merge
Merge this context with the context of another, where the incoming context has priority.
to_dict
Returns all parameters of the context as a dict.
from_dict
Creates Context object from the given dict.
from_yaml
Creates Context object from a given yaml file.
from_json
Creates Context object from a given json file.
Dunder methods - __iter__()
: Allows for iteration across a Context. __len__()
: Returns the length of the Context. __getitem__(item)
: Makes class subscriptable.
Inherited from Mapping items()
: Returns all items of the Context. keys()
: Returns all keys of the Context. values()
: Returns all values of the Context.
Source code in src/koheesio/context.py
def __init__(self, *args, **kwargs):\n \"\"\"Initializes the Context object with given arguments.\"\"\"\n for arg in args:\n if isinstance(arg, dict):\n kwargs.update(arg)\n if isinstance(arg, Context):\n kwargs.update(arg.to_dict())\n\n for key, value in kwargs.items():\n self.__dict__[key] = self.process_value(value)\n
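As an illustrative sketch of typical usage (hypothetical values), a Context can be built from a plain dict, extended, and queried with dotted keys:
from koheesio.context import Context\n\ncontext = Context({\"env\": \"dev\", \"db\": {\"host\": \"localhost\", \"port\": 5432}})\ncontext.add(\"run_id\", \"abc123\") # add a top-level key\ncontext.get(\"db.host\") # nested keys use dotted notation -> 'localhost'\ncontext.to_dict() # convert back to a plain dictionary\n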
"},{"location":"api_reference/index.html#koheesio.Context.add","title":"add","text":"add(key: str, value: Any) -> Context\n
Add a key/value pair to the context
Source code in src/koheesio/context.py
def add(self, key: str, value: Any) -> Context:\n \"\"\"Add a key/value pair to the context\"\"\"\n self.__dict__[key] = value\n return self\n
"},{"location":"api_reference/index.html#koheesio.Context.contains","title":"contains","text":"contains(key: str) -> bool\n
Check if the context contains a given key
Parameters:
Name Type Description Default key
str
required Returns:
Type Description bool
Source code in src/koheesio/context.py
def contains(self, key: str) -> bool:\n \"\"\"Check if the context contains a given key\n\n Parameters\n ----------\n key: str\n\n Returns\n -------\n bool\n \"\"\"\n try:\n self.get(key, safe=False)\n return True\n except KeyError:\n return False\n
"},{"location":"api_reference/index.html#koheesio.Context.from_dict","title":"from_dict classmethod
","text":"from_dict(kwargs: dict) -> Context\n
Creates Context object from the given dict
Parameters:
Name Type Description Default kwargs
dict
required Returns:
Type Description Context
Source code in src/koheesio/context.py
@classmethod\ndef from_dict(cls, kwargs: dict) -> Context:\n \"\"\"Creates Context object from the given dict\n\n Parameters\n ----------\n kwargs: dict\n\n Returns\n -------\n Context\n \"\"\"\n return cls(kwargs)\n
"},{"location":"api_reference/index.html#koheesio.Context.from_json","title":"from_json classmethod
","text":"from_json(json_file_or_str: Union[str, Path]) -> Context\n
Creates Context object from a given json file
Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be stored in the Context object, which is not possible with the standard json library.
Why jsonpickle? (from https://jsonpickle.github.io/)
Data serialized with python\u2019s pickle (or cPickle or dill) is not easily readable outside of python. Using the json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports json.
Security (from https://jsonpickle.github.io/)
jsonpickle should be treated the same as the Python stdlib pickle module from a security perspective.
Parameters:
Name Type Description Default json_file_or_str
Union[str, Path]
Pathlike string or Path that points to the json file or string containing json
required Returns:
Type Description Context
Source code in src/koheesio/context.py
@classmethod\ndef from_json(cls, json_file_or_str: Union[str, Path]) -> Context:\n \"\"\"Creates Context object from a given json file\n\n Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be\n stored in the Context object, which is not possible with the standard json library.\n\n Why jsonpickle?\n ---------------\n (from https://jsonpickle.github.io/)\n\n > Data serialized with python\u2019s pickle (or cPickle or dill) is not easily readable outside of python. Using the\n json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex\n data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports\n json.\n\n Security\n --------\n (from https://jsonpickle.github.io/)\n\n > jsonpickle should be treated the same as the Python stdlib pickle module from a security perspective.\n\n ### ! Warning !\n > The jsonpickle module is not secure. Only unpickle data you trust.\n It is possible to construct malicious pickle data which will execute arbitrary code during unpickling.\n Never unpickle data that could have come from an untrusted source, or that could have been tampered with.\n Consider signing data with an HMAC if you need to ensure that it has not been tampered with.\n Safer deserialization approaches, such as reading JSON directly, may be more appropriate if you are processing\n untrusted data.\n\n Parameters\n ----------\n json_file_or_str : Union[str, Path]\n Pathlike string or Path that points to the json file or string containing json\n\n Returns\n -------\n Context\n \"\"\"\n json_str = json_file_or_str\n\n # check if json_str is pathlike\n if (json_file := Path(json_file_or_str)).exists():\n json_str = json_file.read_text(encoding=\"utf-8\")\n\n json_dict = jsonpickle.loads(json_str)\n return cls.from_dict(json_dict)\n
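A minimal sketch with a hypothetical, trusted JSON string (see the warning above); a path to a .json file is handled the same way:
context = Context.from_json('{\"env\": \"dev\", \"retries\": 3}')\ncontext.get(\"retries\") # -> 3\n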
"},{"location":"api_reference/index.html#koheesio.Context.from_json--warning","title":"! Warning !","text":"The jsonpickle module is not secure. Only unpickle data you trust. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with. Consider signing data with an HMAC if you need to ensure that it has not been tampered with. Safer deserialization approaches, such as reading JSON directly, may be more appropriate if you are processing untrusted data.
"},{"location":"api_reference/index.html#koheesio.Context.from_toml","title":"from_toml classmethod
","text":"from_toml(toml_file_or_str: Union[str, Path]) -> Context\n
Creates Context object from a given toml file
Parameters:
Name Type Description Default toml_file_or_str
Union[str, Path]
Pathlike string or Path that points to the toml file or string containing toml
required Returns:
Type Description Context
Source code in src/koheesio/context.py
@classmethod\ndef from_toml(cls, toml_file_or_str: Union[str, Path]) -> Context:\n \"\"\"Creates Context object from a given toml file\n\n Parameters\n ----------\n toml_file_or_str: Union[str, Path]\n Pathlike string or Path that points to the toml file or string containing toml\n\n Returns\n -------\n Context\n \"\"\"\n toml_str = toml_file_or_str\n\n # check if toml_str is pathlike\n if (toml_file := Path(toml_file_or_str)).exists():\n toml_str = toml_file.read_text(encoding=\"utf-8\")\n\n toml_dict = tomli.loads(toml_str)\n return cls.from_dict(toml_dict)\n
"},{"location":"api_reference/index.html#koheesio.Context.from_yaml","title":"from_yaml classmethod
","text":"from_yaml(yaml_file_or_str: str) -> Context\n
Creates Context object from a given yaml file
Parameters:
Name Type Description Default yaml_file_or_str
str
Pathlike string or Path that points to the yaml file, or string containing yaml
required Returns:
Type Description Context
Source code in src/koheesio/context.py
@classmethod\ndef from_yaml(cls, yaml_file_or_str: str) -> Context:\n \"\"\"Creates Context object from a given yaml file\n\n Parameters\n ----------\n yaml_file_or_str: str or Path\n Pathlike string or Path that points to the yaml file, or string containing yaml\n\n Returns\n -------\n Context\n \"\"\"\n yaml_str = yaml_file_or_str\n\n # check if yaml_str is pathlike\n if (yaml_file := Path(yaml_file_or_str)).exists():\n yaml_str = yaml_file.read_text(encoding=\"utf-8\")\n\n # Bandit: disable yaml.load warning\n yaml_dict = yaml.load(yaml_str, Loader=yaml.Loader) # nosec B506: yaml_load\n\n return cls.from_dict(yaml_dict)\n
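A minimal sketch with a hypothetical inline YAML string; a path to a .yml file works the same way:
context = Context.from_yaml(\"\"\"\nenv: dev\ndb:\n host: localhost\n\"\"\")\ncontext.get(\"db.host\") # -> 'localhost'\n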
"},{"location":"api_reference/index.html#koheesio.Context.get","title":"get","text":"get(key: str, default: Any = None, safe: bool = True) -> Any\n
Get value of a given key
The key can either be an actual key (top level) or the key of a nested value. Behaves a lot like a dict's .get()
method otherwise.
Parameters:
Name Type Description Default key
str
Can be a real key, or can be a dotted notation of a nested key
required default
Any
Default value to return
None
safe
bool
Toggles whether to fail or not when item cannot be found
True
Returns:
Type Description Any
Value of the requested item
Example Example of a nested call:
context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\ncontext.get(\"a.b\")\n
Returns c
Source code in src/koheesio/context.py
def get(self, key: str, default: Any = None, safe: bool = True) -> Any:\n \"\"\"Get value of a given key\n\n The key can either be an actual key (top level) or the key of a nested value.\n Behaves a lot like a dict's `.get()` method otherwise.\n\n Parameters\n ----------\n key:\n Can be a real key, or can be a dotted notation of a nested key\n default:\n Default value to return\n safe:\n Toggles whether to fail or not when item cannot be found\n\n Returns\n -------\n Any\n Value of the requested item\n\n Example\n -------\n Example of a nested call:\n\n ```python\n context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\n context.get(\"a.b\")\n ```\n\n Returns `c`\n \"\"\"\n try:\n if \".\" not in key:\n return self.__dict__[key]\n\n # handle nested keys\n nested_keys = key.split(\".\")\n value = self # parent object\n for k in nested_keys:\n value = value[k] # iterate through nested values\n return value\n\n except (AttributeError, KeyError, TypeError) as e:\n if not safe:\n raise KeyError(f\"requested key '{key}' does not exist in {self}\") from e\n return default\n
"},{"location":"api_reference/index.html#koheesio.Context.get_all","title":"get_all","text":"get_all() -> dict\n
alias to to_dict()
Source code in src/koheesio/context.py
def get_all(self) -> dict:\n \"\"\"alias to to_dict()\"\"\"\n return self.to_dict()\n
"},{"location":"api_reference/index.html#koheesio.Context.get_item","title":"get_item","text":"get_item(key: str, default: Any = None, safe: bool = True) -> Dict[str, Any]\n
Acts just like .get
, except that it returns the key also
Returns:
Type Description Dict[str, Any]
key/value-pair of the requested item
Example Example of a nested call:
context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\ncontext.get_item(\"a.b\")\n
Returns {'a.b': 'c'}
Source code in src/koheesio/context.py
def get_item(self, key: str, default: Any = None, safe: bool = True) -> Dict[str, Any]:\n \"\"\"Acts just like `.get`, except that it returns the key also\n\n Returns\n -------\n Dict[str, Any]\n key/value-pair of the requested item\n\n Example\n -------\n Example of a nested call:\n\n ```python\n context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\n context.get_item(\"a.b\")\n ```\n\n Returns `{'a.b': 'c'}`\n \"\"\"\n value = self.get(key, default, safe)\n return {key: value}\n
"},{"location":"api_reference/index.html#koheesio.Context.merge","title":"merge","text":"merge(context: Context, recursive: bool = False) -> Context\n
Merge this context with the context of another, where the incoming context has priority.
Parameters:
Name Type Description Default context
Context
Another Context class
required recursive
bool
Recursively merge two dictionaries to an arbitrary depth
False
Returns:
Type Description Context
updated context
Source code in src/koheesio/context.py
def merge(self, context: Context, recursive: bool = False) -> Context:\n \"\"\"Merge this context with the context of another, where the incoming context has priority.\n\n Parameters\n ----------\n context: Context\n Another Context class\n recursive: bool\n Recursively merge two dictionaries to an arbitrary depth\n\n Returns\n -------\n Context\n updated context\n \"\"\"\n if recursive:\n return Context.from_dict(self._recursive_merge(target_context=self, merge_context=context).to_dict())\n\n # just merge on the top level keys\n return Context.from_dict({**self.to_dict(), **context.to_dict()})\n
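A sketch of the difference between a plain and a recursive merge (hypothetical values):
base = Context({\"db\": {\"host\": \"localhost\", \"port\": 5432}})\noverride = Context({\"db\": {\"host\": \"prod-db\"}})\nbase.merge(override).to_dict() # top-level merge: the whole 'db' value is replaced -> {'db': {'host': 'prod-db'}}\nbase.merge(override, recursive=True).to_dict() # nested merge: 'port' is kept alongside the overridden 'host'\n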
"},{"location":"api_reference/index.html#koheesio.Context.process_value","title":"process_value","text":"process_value(value: Any) -> Any\n
Processes the given value, converting dictionaries to Context objects as needed.
Source code in src/koheesio/context.py
def process_value(self, value: Any) -> Any:\n \"\"\"Processes the given value, converting dictionaries to Context objects as needed.\"\"\"\n if isinstance(value, dict):\n return self.from_dict(value)\n\n if isinstance(value, (list, set)):\n return [self.from_dict(v) if isinstance(v, dict) else v for v in value]\n\n return value\n
"},{"location":"api_reference/index.html#koheesio.Context.to_dict","title":"to_dict","text":"to_dict() -> Dict[str, Any]\n
Returns all parameters of the context as a dict
Returns:
Type Description dict
containing all parameters of the context
Source code in src/koheesio/context.py
def to_dict(self) -> Dict[str, Any]:\n \"\"\"Returns all parameters of the context as a dict\n\n Returns\n -------\n dict\n containing all parameters of the context\n \"\"\"\n result = {}\n\n for key, value in self.__dict__.items():\n if isinstance(value, Context):\n result[key] = value.to_dict()\n elif isinstance(value, list):\n result[key] = [e.to_dict() if isinstance(e, Context) else e for e in value]\n else:\n result[key] = value\n\n return result\n
"},{"location":"api_reference/index.html#koheesio.Context.to_json","title":"to_json","text":"to_json(pretty: bool = False) -> str\n
Returns all parameters of the context as a json string
Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be stored in the Context object, which is not possible with the standard json library.
Why jsonpickle? (from https://jsonpickle.github.io/)
Data serialized with python's pickle (or cPickle or dill) is not easily readable outside of python. Using the json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports json.
Parameters:
Name Type Description Default pretty
bool
Toggles whether to return a pretty json string or not
False
Returns:
Type Description str
containing all parameters of the context
Source code in src/koheesio/context.py
def to_json(self, pretty: bool = False) -> str:\n \"\"\"Returns all parameters of the context as a json string\n\n Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be\n stored in the Context object, which is not possible with the standard json library.\n\n Why jsonpickle?\n ---------------\n (from https://jsonpickle.github.io/)\n\n > Data serialized with python's pickle (or cPickle or dill) is not easily readable outside of python. Using the\n json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex\n data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports\n json.\n\n Parameters\n ----------\n pretty : bool, optional, default=False\n Toggles whether to return a pretty json string or not\n\n Returns\n -------\n str\n containing all parameters of the context\n \"\"\"\n d = self.to_dict()\n return jsonpickle.dumps(d, indent=4) if pretty else jsonpickle.dumps(d)\n
"},{"location":"api_reference/index.html#koheesio.Context.to_yaml","title":"to_yaml","text":"to_yaml(clean: bool = False) -> str\n
Returns all parameters of the context as a yaml string
Parameters:
Name Type Description Default clean
bool
Toggles whether to remove !!python/object:...
from yaml or not. Default: False
False
Returns:
Type Description str
containing all parameters of the context
Source code in src/koheesio/context.py
def to_yaml(self, clean: bool = False) -> str:\n \"\"\"Returns all parameters of the context as a yaml string\n\n Parameters\n ----------\n clean: bool\n Toggles whether to remove `!!python/object:...` from yaml or not.\n Default: False\n\n Returns\n -------\n str\n containing all parameters of the context\n \"\"\"\n # sort_keys=False to preserve order of keys\n yaml_str = yaml.dump(self.to_dict(), sort_keys=False)\n\n # remove `!!python/object:...` from yaml\n if clean:\n remove_pattern = re.compile(r\"!!python/object:.*?\\n\")\n yaml_str = re.sub(remove_pattern, \"\\n\", yaml_str)\n\n return yaml_str\n
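A minimal usage sketch (hypothetical values); clean=True only changes the output when non-primitive Python objects are stored in the Context:
context = Context({\"env\": \"dev\", \"retries\": 3})\ncontext.to_yaml() # 'env: dev' and 'retries: 3', each on its own line\ncontext.to_yaml(clean=True) # identical here; strips '!!python/object:...' tags when complex objects are present\n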
"},{"location":"api_reference/index.html#koheesio.ExtraParamsMixin","title":"koheesio.ExtraParamsMixin","text":"Mixin class that adds support for arbitrary keyword arguments to Pydantic models.
The keyword arguments are extracted from the model's values
and moved to a params
dictionary.
"},{"location":"api_reference/index.html#koheesio.ExtraParamsMixin.extra_params","title":"extra_params cached
property
","text":"extra_params: Dict[str, Any]\n
Extract params (passed as arbitrary kwargs) from values and move them to params dict
"},{"location":"api_reference/index.html#koheesio.ExtraParamsMixin.params","title":"params class-attribute
instance-attribute
","text":"params: Dict[str, Any] = Field(default_factory=dict)\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory","title":"koheesio.LoggingFactory","text":"LoggingFactory(name: Optional[str] = None, env: Optional[str] = None, level: Optional[str] = None, logger_id: Optional[str] = None)\n
Logging factory to be used to generate logger instances.
Parameters:
Name Type Description Default name
Optional[str]
None
env
Optional[str]
None
logger_id
Optional[str]
None
Source code in src/koheesio/logger.py
def __init__(\n self,\n name: Optional[str] = None,\n env: Optional[str] = None,\n level: Optional[str] = None,\n logger_id: Optional[str] = None,\n):\n \"\"\"Logging factory to be used in pipeline.Prepare logger instance.\n\n Parameters\n ----------\n name logger name.\n env environment (\"local\", \"qa\", \"prod).\n logger_id unique identifier for the logger.\n \"\"\"\n\n LoggingFactory.LOGGER_NAME = name or LoggingFactory.LOGGER_NAME\n LoggerIDFilter.LOGGER_ID = logger_id or LoggerIDFilter.LOGGER_ID\n LoggingFactory.LOGGER_FILTER = LoggingFactory.LOGGER_FILTER or LoggerIDFilter()\n LoggingFactory.ENV = env or LoggingFactory.ENV\n\n console_handler = logging.StreamHandler(sys.stdout if LoggingFactory.ENV == \"local\" else sys.stderr)\n console_handler.setFormatter(LoggingFactory.LOGGER_FORMATTER)\n console_handler.addFilter(LoggingFactory.LOGGER_FILTER)\n # WARNING is default level for root logger in python\n logging.basicConfig(level=logging.WARNING, handlers=[console_handler], force=True)\n\n LoggingFactory.CONSOLE_HANDLER = console_handler\n\n logger = getLogger(LoggingFactory.LOGGER_NAME)\n logger.setLevel(level or LoggingFactory.LOGGER_LEVEL)\n LoggingFactory.LOGGER = logger\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.CONSOLE_HANDLER","title":"CONSOLE_HANDLER class-attribute
instance-attribute
","text":"CONSOLE_HANDLER: Optional[Handler] = None\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.ENV","title":"ENV class-attribute
instance-attribute
","text":"ENV: Optional[str] = None\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER","title":"LOGGER class-attribute
instance-attribute
","text":"LOGGER: Optional[Logger] = None\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_ENV","title":"LOGGER_ENV class-attribute
instance-attribute
","text":"LOGGER_ENV: str = 'local'\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_FILTER","title":"LOGGER_FILTER class-attribute
instance-attribute
","text":"LOGGER_FILTER: Optional[Filter] = None\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_FORMAT","title":"LOGGER_FORMAT class-attribute
instance-attribute
","text":"LOGGER_FORMAT: str = '[%(logger_id)s] [%(asctime)s] [%(levelname)s] [%(name)s] {%(module)s.py:%(funcName)s:%(lineno)d} - %(message)s'\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_FORMATTER","title":"LOGGER_FORMATTER class-attribute
instance-attribute
","text":"LOGGER_FORMATTER: Formatter = Formatter(LOGGER_FORMAT)\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_LEVEL","title":"LOGGER_LEVEL class-attribute
instance-attribute
","text":"LOGGER_LEVEL: str = get('KOHEESIO_LOGGING_LEVEL', 'WARNING')\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.LOGGER_NAME","title":"LOGGER_NAME class-attribute
instance-attribute
","text":"LOGGER_NAME: str = 'koheesio'\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.add_handlers","title":"add_handlers staticmethod
","text":"add_handlers(handlers: List[Tuple[str, Dict]]) -> None\n
Add handlers to existing root logger.
Parameters:
Name Type Description Default handler_class
required handlers_config
required Source code in src/koheesio/logger.py
@staticmethod\ndef add_handlers(handlers: List[Tuple[str, Dict]]) -> None:\n \"\"\"Add handlers to existing root logger.\n\n Parameters\n ----------\n handler_class handler module and class for importing.\n handlers_config configuration for handler.\n\n \"\"\"\n for handler_module_class, handler_conf in handlers:\n handler_class: logging.Handler = import_class(handler_module_class)\n handler_level = handler_conf.pop(\"level\") if \"level\" in handler_conf else \"WARNING\"\n # noinspection PyCallingNonCallable\n handler = handler_class(**handler_conf)\n handler.setLevel(handler_level)\n handler.addFilter(LoggingFactory.LOGGER_FILTER)\n handler.setFormatter(LoggingFactory.LOGGER_FORMATTER)\n LoggingFactory.LOGGER.addHandler(handler)\n
"},{"location":"api_reference/index.html#koheesio.LoggingFactory.get_logger","title":"get_logger staticmethod
","text":"get_logger(name: str, inherit_from_koheesio: bool = False) -> Logger\n
Provide logger. If inherit_from_koheesio is set, the logger inherits from LoggingFactory.LOGGER_NAME.
Parameters:
Name Type Description Default name
str
required inherit_from_koheesio
bool
False
Returns:
Name Type Description logger
Logger
Source code in src/koheesio/logger.py
@staticmethod\ndef get_logger(name: str, inherit_from_koheesio: bool = False) -> Logger:\n \"\"\"Provide logger. If inherit_from_koheesio then inherit from LoggingFactory.PIPELINE_LOGGER_NAME.\n\n Parameters\n ----------\n name: Name of logger.\n inherit_from_koheesio: Inherit logger from koheesio\n\n Returns\n -------\n logger: Logger\n\n \"\"\"\n if inherit_from_koheesio:\n LoggingFactory.__check_koheesio_logger_initialized()\n name = f\"{LoggingFactory.LOGGER_NAME}.{name}\"\n\n return getLogger(name)\n
"},{"location":"api_reference/index.html#koheesio.Step","title":"koheesio.Step","text":"Base class for a step
A custom unit of logic that can be executed.
The Step class is designed to be subclassed. To create a new step, one would subclass Step and implement the def execute(self)
method, specifying the expected inputs and outputs.
Note: since the Step class is meta classed, the execute method is wrapped with the do_execute
function making it always return the Step's output. Hence, an explicit return is not needed when implementing execute.
Methods and Attributes The Step class has several attributes and methods.
Background A Step is an atomic operation and serves as the building block of data pipelines built with the framework. Tasks typically consist of a series of Steps.
A step can be seen as an operation on a set of inputs, that returns a set of outputs. This however does not imply that steps are stateless (e.g. data writes)!
The diagram serves to illustrate the concept of a Step:
\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 1 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 1 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u2502 \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 2 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Step \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 2 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u2502 \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 3 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 3 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n
Steps are built on top of Pydantic, which is a data validation and settings management using python type annotations. This allows for the automatic validation of the inputs and outputs of a Step.
- Step inherits from BaseModel, which is a Pydantic class used to define data models. This allows Step to automatically validate data against the defined fields and their types.
- Step is metaclassed by StepMetaClass, which is a custom metaclass that wraps the
execute
method of the Step class with the _execute_wrapper
function. This ensures that the execute
method always returns the output of the Step along with providing logging and validation of the output. - Step has an
Output
class, which is a subclass of StepOutput
. This class is used to validate the output of the Step. The Output
class is defined as an inner class of the Step class. The Output
class can be accessed through the Step.Output
attribute. - The
Output
class can be extended to add additional fields to the output of the Step.
Examples:
class MyStep(Step):\n a: str # input\n\n class Output(StepOutput): # output\n b: str\n\n def execute(self) -> MyStep.Output:\n self.output.b = f\"{self.a}-some-suffix\"\n
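Building on the example above, a brief sketch of running the step; execute returns the Output thanks to the metaclass wrapper, so no explicit return is needed:
step = MyStep(a=\"foo\")\nstep.execute() # returns MyStep.Output\nstep.output.b # -> 'foo-some-suffix'\n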
"},{"location":"api_reference/index.html#koheesio.Step--input","title":"INPUT","text":"The following fields are available by default on the Step class: - name
: Name of the Step. If not set, the name of the class will be used. - description
: Description of the Step. If not set, the docstring of the class will be used. If the docstring contains multiple lines, only the first line will be used.
When subclassing a Step, any additional pydantic field will be treated as input
to the Step. See also the explanation on the .execute()
method below.
"},{"location":"api_reference/index.html#koheesio.Step--output","title":"OUTPUT","text":"Every Step has an Output
class, which is a subclass of StepOutput
. This class is used to validate the output of the Step. The Output
class is defined as an inner class of the Step class. The Output
class can be accessed through the Step.Output
attribute. The Output
class can be extended to add additional fields to the output of the Step. See also the explanation on the .execute()
.
Output
: A nested class representing the output of the Step used to validate the output of the Step and based on the StepOutput class. output
: Allows you to interact with the Output of the Step lazily (see above and StepOutput)
When subclassing a Step, any additional pydantic field added to the nested Output
class will be treated as output
of the Step. See also the description of StepOutput
for more information.
"},{"location":"api_reference/index.html#koheesio.Step--methods","title":"Methods:","text":" execute
: Abstract method to implement for new steps. - The Inputs of the step can be accessed, using
self.input_name
. - The output of the step can be accessed, using
self.output.output_name
.
run
: Alias to .execute() method. You can use this to run the step, but execute is preferred. to_yaml
: YAML dump the step get_description
: Get the description of the Step
When subclassing a Step, execute
is the only method that needs to be implemented. Any additional method added to the class will be treated as a method of the Step.
Note: since the Step class is meta-classed, the execute method is automatically wrapped with the do_execute
function making it always return a StepOutput. See also the explanation on the do_execute
function.
"},{"location":"api_reference/index.html#koheesio.Step--class-methods","title":"class methods:","text":" from_step
: Returns a new Step instance based on the data of another Step instance. for example: MyStep.from_step(other_step, a=\"foo\")
get_description
: Get the description of the Step
"},{"location":"api_reference/index.html#koheesio.Step--dunder-methods","title":"dunder methods:","text":" __getattr__
: Allows input to be accessed through self.input_name
__repr__
and __str__
: String representation of a step
"},{"location":"api_reference/index.html#koheesio.Step.output","title":"output property
writable
","text":"output: Output\n
Interact with the output of the Step
"},{"location":"api_reference/index.html#koheesio.Step.Output","title":"Output","text":"Output class for Step
"},{"location":"api_reference/index.html#koheesio.Step.execute","title":"execute abstractmethod
","text":"execute()\n
Abstract method to implement for new steps.
The Inputs of the step can be accessed, using self.input_name
Note: since the Step class is meta-classed, the execute method is wrapped with the do_execute
function making it always return the Steps output
Source code in src/koheesio/steps/__init__.py
@abstractmethod\ndef execute(self):\n \"\"\"Abstract method to implement for new steps.\n\n The Inputs of the step can be accessed, using `self.input_name`\n\n Note: since the Step class is meta-classed, the execute method is wrapped with the `do_execute` function making\n it always return the Steps output\n \"\"\"\n raise NotImplementedError\n
"},{"location":"api_reference/index.html#koheesio.Step.from_step","title":"from_step classmethod
","text":"from_step(step: Step, **kwargs)\n
Returns a new Step instance based on the data of another Step or BaseModel instance
Source code in src/koheesio/steps/__init__.py
@classmethod\ndef from_step(cls, step: Step, **kwargs):\n \"\"\"Returns a new Step instance based on the data of another Step or BaseModel instance\"\"\"\n return cls.from_basemodel(step, **kwargs)\n
"},{"location":"api_reference/index.html#koheesio.Step.repr_json","title":"repr_json","text":"repr_json(simple=False) -> str\n
dump the step to json, meant for representation
Note: use to_json if you want to dump the step to json for serialization This method is meant for representation purposes only!
Examples:
>>> step = MyStep(a=\"foo\")\n>>> print(step.repr_json())\n{\"input\": {\"a\": \"foo\"}}\n
Parameters:
Name Type Description Default simple
When toggled to True, a briefer output will be produced. This is friendlier for logging purposes
False
Returns:
Type Description str
A string, which is valid json
Source code in src/koheesio/steps/__init__.py
def repr_json(self, simple=False) -> str:\n \"\"\"dump the step to json, meant for representation\n\n Note: use to_json if you want to dump the step to json for serialization\n This method is meant for representation purposes only!\n\n Examples\n --------\n ```python\n >>> step = MyStep(a=\"foo\")\n >>> print(step.repr_json())\n {\"input\": {\"a\": \"foo\"}}\n ```\n\n Parameters\n ----------\n simple: bool\n When toggled to True, a briefer output will be produced. This is friendlier for logging purposes\n\n Returns\n -------\n str\n A string, which is valid json\n \"\"\"\n model_dump_options = dict(warnings=\"none\", exclude_unset=True)\n\n _result = {}\n\n # extract input\n _input = self.model_dump(**model_dump_options)\n\n # remove name and description from input and add to result if simple is not set\n name = _input.pop(\"name\", None)\n description = _input.pop(\"description\", None)\n if not simple:\n if name:\n _result[\"name\"] = name\n if description:\n _result[\"description\"] = description\n else:\n model_dump_options[\"exclude\"] = {\"name\", \"description\"}\n\n # extract output\n _output = self.output.model_dump(**model_dump_options)\n\n # add output to result\n if _output:\n _result[\"output\"] = _output\n\n # add input to result\n _result[\"input\"] = _input\n\n class MyEncoder(json.JSONEncoder):\n \"\"\"Custom JSON Encoder to handle non-serializable types\"\"\"\n\n def default(self, o: Any) -> Any:\n try:\n return super().default(o)\n except TypeError:\n return o.__class__.__name__\n\n # Use MyEncoder when converting the dictionary to a JSON string\n json_str = json.dumps(_result, cls=MyEncoder)\n\n return json_str\n
"},{"location":"api_reference/index.html#koheesio.Step.repr_yaml","title":"repr_yaml","text":"repr_yaml(simple=False) -> str\n
dump the step to yaml, meant for representation
Note: use to_yaml if you want to dump the step to yaml for serialization This method is meant for representation purposes only!
Examples:
>>> step = MyStep(a=\"foo\")\n>>> print(step.repr_yaml())\ninput:\n a: foo\n
Parameters:
Name Type Description Default simple
When toggled to True, a briefer output will be produced. This is friendlier for logging purposes
False
Returns:
Type Description str
A string, which is valid yaml
Source code in src/koheesio/steps/__init__.py
def repr_yaml(self, simple=False) -> str:\n \"\"\"dump the step to yaml, meant for representation\n\n Note: use to_yaml if you want to dump the step to yaml for serialization\n This method is meant for representation purposes only!\n\n Examples\n --------\n ```python\n >>> step = MyStep(a=\"foo\")\n >>> print(step.repr_yaml())\n input:\n a: foo\n ```\n\n Parameters\n ----------\n simple: bool\n When toggled to True, a briefer output will be produced. This is friendlier for logging purposes\n\n Returns\n -------\n str\n A string, which is valid yaml\n \"\"\"\n json_str = self.repr_json(simple=simple)\n\n # Parse the JSON string back into a dictionary\n _result = json.loads(json_str)\n\n return yaml.dump(_result)\n
"},{"location":"api_reference/index.html#koheesio.Step.run","title":"run","text":"run()\n
Alias to .execute()
Source code in src/koheesio/steps/__init__.py
def run(self):\n \"\"\"Alias to .execute()\"\"\"\n return self.execute()\n
"},{"location":"api_reference/index.html#koheesio.StepOutput","title":"koheesio.StepOutput","text":"Class for the StepOutput model
Usage Setting up the StepOutputs class is done like this:
class YourOwnOutput(StepOutput):\n a: str\n b: int\n
"},{"location":"api_reference/index.html#koheesio.StepOutput.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(validate_default=False, defer_build=True)\n
"},{"location":"api_reference/index.html#koheesio.StepOutput.validate_output","title":"validate_output","text":"validate_output() -> StepOutput\n
Validate the output of the Step
Essentially, this method is a wrapper around the validate method of the BaseModel class
Source code in src/koheesio/steps/__init__.py
def validate_output(self) -> StepOutput:\n \"\"\"Validate the output of the Step\n\n Essentially, this method is a wrapper around the validate method of the BaseModel class\n \"\"\"\n validated_model = self.validate()\n return StepOutput.from_basemodel(validated_model)\n
"},{"location":"api_reference/index.html#koheesio.print_logo","title":"koheesio.print_logo","text":"print_logo()\n
Source code in src/koheesio/__init__.py
def print_logo():\n global _logo_printed\n global _koheesio_print_logo\n\n if not _logo_printed and _koheesio_print_logo:\n print(ABOUT)\n _logo_printed = True\n
"},{"location":"api_reference/context.html","title":"Context","text":"The Context module is a part of the Koheesio framework and is primarily used for managing the environment configuration where a Task or Step runs. It helps in adapting the behavior of a Task/Step based on the environment it operates in, thereby avoiding the repetition of configuration values across different tasks.
The Context class, which is a key component of this module, functions similarly to a dictionary but with additional features. It supports operations like handling nested keys, recursive merging of contexts, and serialization/deserialization to and from various formats like JSON, YAML, and TOML.
For a comprehensive guide on the usage, examples, and additional features of the Context class, please refer to the reference/concepts/context section of the Koheesio documentation.
"},{"location":"api_reference/context.html#koheesio.context.Context","title":"koheesio.context.Context","text":"Context(*args, **kwargs)\n
The Context class is a key component of the Koheesio framework, designed to manage configuration data and shared variables across tasks and steps in your application. It behaves much like a dictionary, but with added functionalities.
Key Features - Nested keys: Supports accessing and adding nested keys similar to dictionary keys.
- Recursive merging: Merges two Contexts together, with the incoming Context having priority.
- Serialization/Deserialization: Easily created from a yaml, toml, or json file, or a dictionary, and can be converted back to a dictionary.
- Handling complex Python objects: Uses jsonpickle for serialization and deserialization of complex Python objects to and from JSON.
For a comprehensive guide on the usage, examples, and additional features of the Context class, please refer to the reference/concepts/context section of the Koheesio documentation.
Methods:
Name Description add
Add a key/value pair to the context.
get
Get value of a given key.
get_item
Acts just like .get
, except that it returns the key also.
contains
Check if the context contains a given key.
merge
Merge this context with the context of another, where the incoming context has priority.
to_dict
Returns all parameters of the context as a dict.
from_dict
Creates Context object from the given dict.
from_yaml
Creates Context object from a given yaml file.
from_json
Creates Context object from a given json file.
Dunder methods - __iter__()
: Allows for iteration across a Context. __len__()
: Returns the length of the Context. __getitem__(item)
: Makes class subscriptable.
Inherited from Mapping items()
: Returns all items of the Context. keys()
: Returns all keys of the Context. values()
: Returns all values of the Context.
Source code in src/koheesio/context.py
def __init__(self, *args, **kwargs):\n \"\"\"Initializes the Context object with given arguments.\"\"\"\n for arg in args:\n if isinstance(arg, dict):\n kwargs.update(arg)\n if isinstance(arg, Context):\n kwargs.update(arg.to_dict())\n\n for key, value in kwargs.items():\n self.__dict__[key] = self.process_value(value)\n
"},{"location":"api_reference/context.html#koheesio.context.Context.add","title":"add","text":"add(key: str, value: Any) -> Context\n
Add a key/value pair to the context
Source code in src/koheesio/context.py
def add(self, key: str, value: Any) -> Context:\n \"\"\"Add a key/value pair to the context\"\"\"\n self.__dict__[key] = value\n return self\n
"},{"location":"api_reference/context.html#koheesio.context.Context.contains","title":"contains","text":"contains(key: str) -> bool\n
Check if the context contains a given key
Parameters:
Name Type Description Default key
str
required Returns:
Type Description bool
Source code in src/koheesio/context.py
def contains(self, key: str) -> bool:\n \"\"\"Check if the context contains a given key\n\n Parameters\n ----------\n key: str\n\n Returns\n -------\n bool\n \"\"\"\n try:\n self.get(key, safe=False)\n return True\n except KeyError:\n return False\n
"},{"location":"api_reference/context.html#koheesio.context.Context.from_dict","title":"from_dict classmethod
","text":"from_dict(kwargs: dict) -> Context\n
Creates Context object from the given dict
Parameters:
Name Type Description Default kwargs
dict
required Returns:
Type Description Context
Source code in src/koheesio/context.py
@classmethod\ndef from_dict(cls, kwargs: dict) -> Context:\n \"\"\"Creates Context object from the given dict\n\n Parameters\n ----------\n kwargs: dict\n\n Returns\n -------\n Context\n \"\"\"\n return cls(kwargs)\n
"},{"location":"api_reference/context.html#koheesio.context.Context.from_json","title":"from_json classmethod
","text":"from_json(json_file_or_str: Union[str, Path]) -> Context\n
Creates Context object from a given json file
Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be stored in the Context object, which is not possible with the standard json library.
Why jsonpickle? (from https://jsonpickle.github.io/)
Data serialized with python\u2019s pickle (or cPickle or dill) is not easily readable outside of python. Using the json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports json.
Security (from https://jsonpickle.github.io/)
jsonpickle should be treated the same as the Python stdlib pickle module from a security perspective.
Parameters:
Name Type Description Default json_file_or_str
Union[str, Path]
Pathlike string or Path that points to the json file or string containing json
required Returns:
Type Description Context
Source code in src/koheesio/context.py
@classmethod\ndef from_json(cls, json_file_or_str: Union[str, Path]) -> Context:\n \"\"\"Creates Context object from a given json file\n\n Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be\n stored in the Context object, which is not possible with the standard json library.\n\n Why jsonpickle?\n ---------------\n (from https://jsonpickle.github.io/)\n\n > Data serialized with python\u2019s pickle (or cPickle or dill) is not easily readable outside of python. Using the\n json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex\n data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports\n json.\n\n Security\n --------\n (from https://jsonpickle.github.io/)\n\n > jsonpickle should be treated the same as the Python stdlib pickle module from a security perspective.\n\n ### ! Warning !\n > The jsonpickle module is not secure. Only unpickle data you trust.\n It is possible to construct malicious pickle data which will execute arbitrary code during unpickling.\n Never unpickle data that could have come from an untrusted source, or that could have been tampered with.\n Consider signing data with an HMAC if you need to ensure that it has not been tampered with.\n Safer deserialization approaches, such as reading JSON directly, may be more appropriate if you are processing\n untrusted data.\n\n Parameters\n ----------\n json_file_or_str : Union[str, Path]\n Pathlike string or Path that points to the json file or string containing json\n\n Returns\n -------\n Context\n \"\"\"\n json_str = json_file_or_str\n\n # check if json_str is pathlike\n if (json_file := Path(json_file_or_str)).exists():\n json_str = json_file.read_text(encoding=\"utf-8\")\n\n json_dict = jsonpickle.loads(json_str)\n return cls.from_dict(json_dict)\n
"},{"location":"api_reference/context.html#koheesio.context.Context.from_json--warning","title":"! Warning !","text":"The jsonpickle module is not secure. Only unpickle data you trust. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with. Consider signing data with an HMAC if you need to ensure that it has not been tampered with. Safer deserialization approaches, such as reading JSON directly, may be more appropriate if you are processing untrusted data.
"},{"location":"api_reference/context.html#koheesio.context.Context.from_toml","title":"from_toml classmethod
","text":"from_toml(toml_file_or_str: Union[str, Path]) -> Context\n
Creates Context object from a given toml file
Parameters:
Name Type Description Default toml_file_or_str
Union[str, Path]
Pathlike string or Path that points to the toml file or string containing toml
required Returns:
Type Description Context
Source code in src/koheesio/context.py
@classmethod\ndef from_toml(cls, toml_file_or_str: Union[str, Path]) -> Context:\n \"\"\"Creates Context object from a given toml file\n\n Parameters\n ----------\n toml_file_or_str: Union[str, Path]\n Pathlike string or Path that points to the toml file or string containing toml\n\n Returns\n -------\n Context\n \"\"\"\n toml_str = toml_file_or_str\n\n # check if toml_str is pathlike\n if (toml_file := Path(toml_file_or_str)).exists():\n toml_str = toml_file.read_text(encoding=\"utf-8\")\n\n toml_dict = tomli.loads(toml_str)\n return cls.from_dict(toml_dict)\n
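A minimal sketch with a hypothetical inline TOML string; a path to a .toml file is handled the same way:
context = Context.from_toml(\"\"\"\n[db]\nhost = \"localhost\"\nport = 5432\n\"\"\")\ncontext.get(\"db.port\") # -> 5432\n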
"},{"location":"api_reference/context.html#koheesio.context.Context.from_yaml","title":"from_yaml classmethod
","text":"from_yaml(yaml_file_or_str: str) -> Context\n
Creates Context object from a given yaml file
Parameters:
Name Type Description Default yaml_file_or_str
str
Pathlike string or Path that points to the yaml file, or string containing yaml
required Returns:
Type Description Context
Source code in src/koheesio/context.py
@classmethod\ndef from_yaml(cls, yaml_file_or_str: str) -> Context:\n \"\"\"Creates Context object from a given yaml file\n\n Parameters\n ----------\n yaml_file_or_str: str or Path\n Pathlike string or Path that points to the yaml file, or string containing yaml\n\n Returns\n -------\n Context\n \"\"\"\n yaml_str = yaml_file_or_str\n\n # check if yaml_str is pathlike\n if (yaml_file := Path(yaml_file_or_str)).exists():\n yaml_str = yaml_file.read_text(encoding=\"utf-8\")\n\n # Bandit: disable yaml.load warning\n yaml_dict = yaml.load(yaml_str, Loader=yaml.Loader) # nosec B506: yaml_load\n\n return cls.from_dict(yaml_dict)\n
"},{"location":"api_reference/context.html#koheesio.context.Context.get","title":"get","text":"get(key: str, default: Any = None, safe: bool = True) -> Any\n
Get value of a given key
The key can either be an actual key (top level) or the key of a nested value. Behaves a lot like a dict's .get()
method otherwise.
Parameters:
Name Type Description Default key
str
Can be a real key, or can be a dotted notation of a nested key
required default
Any
Default value to return
None
safe
bool
Toggles whether to fail or not when item cannot be found
True
Returns:
Type Description Any
Value of the requested item
Example Example of a nested call:
context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\ncontext.get(\"a.b\")\n
Returns c
Source code in src/koheesio/context.py
def get(self, key: str, default: Any = None, safe: bool = True) -> Any:\n \"\"\"Get value of a given key\n\n The key can either be an actual key (top level) or the key of a nested value.\n Behaves a lot like a dict's `.get()` method otherwise.\n\n Parameters\n ----------\n key:\n Can be a real key, or can be a dotted notation of a nested key\n default:\n Default value to return\n safe:\n Toggles whether to fail or not when item cannot be found\n\n Returns\n -------\n Any\n Value of the requested item\n\n Example\n -------\n Example of a nested call:\n\n ```python\n context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\n context.get(\"a.b\")\n ```\n\n Returns `c`\n \"\"\"\n try:\n if \".\" not in key:\n return self.__dict__[key]\n\n # handle nested keys\n nested_keys = key.split(\".\")\n value = self # parent object\n for k in nested_keys:\n value = value[k] # iterate through nested values\n return value\n\n except (AttributeError, KeyError, TypeError) as e:\n if not safe:\n raise KeyError(f\"requested key '{key}' does not exist in {self}\") from e\n return default\n
"},{"location":"api_reference/context.html#koheesio.context.Context.get_all","title":"get_all","text":"get_all() -> dict\n
alias to to_dict()
Source code in src/koheesio/context.py
def get_all(self) -> dict:\n \"\"\"alias to to_dict()\"\"\"\n return self.to_dict()\n
"},{"location":"api_reference/context.html#koheesio.context.Context.get_item","title":"get_item","text":"get_item(key: str, default: Any = None, safe: bool = True) -> Dict[str, Any]\n
Acts just like .get
, except that it returns the key also
Returns:
Type Description Dict[str, Any]
key/value-pair of the requested item
Example Example of a nested call:
context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\ncontext.get_item(\"a.b\")\n
Returns {'a.b': 'c'}
Source code in src/koheesio/context.py
def get_item(self, key: str, default: Any = None, safe: bool = True) -> Dict[str, Any]:\n \"\"\"Acts just like `.get`, except that it returns the key also\n\n Returns\n -------\n Dict[str, Any]\n key/value-pair of the requested item\n\n Example\n -------\n Example of a nested call:\n\n ```python\n context = Context({\"a\": {\"b\": \"c\", \"d\": \"e\"}, \"f\": \"g\"})\n context.get_item(\"a.b\")\n ```\n\n Returns `{'a.b': 'c'}`\n \"\"\"\n value = self.get(key, default, safe)\n return {key: value}\n
"},{"location":"api_reference/context.html#koheesio.context.Context.merge","title":"merge","text":"merge(context: Context, recursive: bool = False) -> Context\n
Merge this context with the context of another, where the incoming context has priority.
Parameters:
Name Type Description Default context
Context
Another Context class
required recursive
bool
Recursively merge two dictionaries to an arbitrary depth
False
Returns:
Type Description Context
updated context
Source code in src/koheesio/context.py
def merge(self, context: Context, recursive: bool = False) -> Context:\n \"\"\"Merge this context with the context of another, where the incoming context has priority.\n\n Parameters\n ----------\n context: Context\n Another Context class\n recursive: bool\n Recursively merge two dictionaries to an arbitrary depth\n\n Returns\n -------\n Context\n updated context\n \"\"\"\n if recursive:\n return Context.from_dict(self._recursive_merge(target_context=self, merge_context=context).to_dict())\n\n # just merge on the top level keys\n return Context.from_dict({**self.to_dict(), **context.to_dict()})\n
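A short sketch with illustrative keys, contrasting the default top-level merge with recursive=True (incoming values are expected to win on conflicts, while unrelated nested keys are kept):
from koheesio.context import Context\n\nbase = Context({\"db\": {\"host\": \"localhost\", \"port\": 5432}})\noverride = Context({\"db\": {\"port\": 5433}})\n\n# top-level merge: the incoming 'db' mapping replaces the existing one entirely\nbase.merge(override).get(\"db.host\", default=\"missing\")  # -> 'missing'\n\n# recursive merge: nested keys are combined, so 'host' survives and 'port' is overridden\nbase.merge(override, recursive=True).get(\"db.host\")  # -> 'localhost'\n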
"},{"location":"api_reference/context.html#koheesio.context.Context.process_value","title":"process_value","text":"process_value(value: Any) -> Any\n
Processes the given value, converting dictionaries to Context objects as needed.
Source code in src/koheesio/context.py
def process_value(self, value: Any) -> Any:\n \"\"\"Processes the given value, converting dictionaries to Context objects as needed.\"\"\"\n if isinstance(value, dict):\n return self.from_dict(value)\n\n if isinstance(value, (list, set)):\n return [self.from_dict(v) if isinstance(v, dict) else v for v in value]\n\n return value\n
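A tiny illustration of the conversion behaviour (values are arbitrary):
from koheesio.context import Context\n\ncontext = Context({\"f\": \"g\"})\ncontext.process_value({\"a\": 1})       # returned as a Context instance\ncontext.process_value([{\"a\": 1}, 2])  # dicts inside lists/sets are converted too\ncontext.process_value(\"unchanged\")    # any other value passes through untouched\n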
"},{"location":"api_reference/context.html#koheesio.context.Context.to_dict","title":"to_dict","text":"to_dict() -> Dict[str, Any]\n
Returns all parameters of the context as a dict
Returns:
Type Description dict
containing all parameters of the context
Source code in src/koheesio/context.py
def to_dict(self) -> Dict[str, Any]:\n \"\"\"Returns all parameters of the context as a dict\n\n Returns\n -------\n dict\n containing all parameters of the context\n \"\"\"\n result = {}\n\n for key, value in self.__dict__.items():\n if isinstance(value, Context):\n result[key] = value.to_dict()\n elif isinstance(value, list):\n result[key] = [e.to_dict() if isinstance(e, Context) else e for e in value]\n else:\n result[key] = value\n\n return result\n
"},{"location":"api_reference/context.html#koheesio.context.Context.to_json","title":"to_json","text":"to_json(pretty: bool = False) -> str\n
Returns all parameters of the context as a json string
Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be stored in the Context object, which is not possible with the standard json library.
Why jsonpickle? (from https://jsonpickle.github.io/)
Data serialized with python's pickle (or cPickle or dill) is not easily readable outside of python. Using the json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports json.
Parameters:
Name Type Description Default pretty
bool
Toggles whether to return a pretty json string or not
False
Returns:
Type Description str
containing all parameters of the context
Source code in src/koheesio/context.py
def to_json(self, pretty: bool = False) -> str:\n \"\"\"Returns all parameters of the context as a json string\n\n Note: jsonpickle is used to serialize/deserialize the Context object. This is done to allow for objects to be\n stored in the Context object, which is not possible with the standard json library.\n\n Why jsonpickle?\n ---------------\n (from https://jsonpickle.github.io/)\n\n > Data serialized with python's pickle (or cPickle or dill) is not easily readable outside of python. Using the\n json format, jsonpickle allows simple data types to be stored in a human-readable format, and more complex\n data types such as numpy arrays and pandas dataframes, to be machine-readable on any platform that supports\n json.\n\n Parameters\n ----------\n pretty : bool, optional, default=False\n Toggles whether to return a pretty json string or not\n\n Returns\n -------\n str\n containing all parameters of the context\n \"\"\"\n d = self.to_dict()\n return jsonpickle.dumps(d, indent=4) if pretty else jsonpickle.dumps(d)\n
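As a small illustration (made-up keys), jsonpickle lets values that plain json cannot handle, such as a datetime, survive serialization:
from datetime import datetime\n\nfrom koheesio.context import Context\n\ncontext = Context({\"name\": \"pipeline_a\", \"started_at\": datetime(2024, 1, 1)})\nprint(context.to_json(pretty=True))  # the datetime is encoded with jsonpickle's object markers\n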
"},{"location":"api_reference/context.html#koheesio.context.Context.to_yaml","title":"to_yaml","text":"to_yaml(clean: bool = False) -> str\n
Returns all parameters of the context as a yaml string
Parameters:
Name Type Description Default clean
bool
Toggles whether to remove !!python/object:...
from yaml or not. Default: False
False
Returns:
Type Description str
containing all parameters of the context
Source code in src/koheesio/context.py
def to_yaml(self, clean: bool = False) -> str:\n \"\"\"Returns all parameters of the context as a yaml string\n\n Parameters\n ----------\n clean: bool\n Toggles whether to remove `!!python/object:...` from yaml or not.\n Default: False\n\n Returns\n -------\n str\n containing all parameters of the context\n \"\"\"\n # sort_keys=False to preserve order of keys\n yaml_str = yaml.dump(self.to_dict(), sort_keys=False)\n\n # remove `!!python/object:...` from yaml\n if clean:\n remove_pattern = re.compile(r\"!!python/object:.*?\\n\")\n yaml_str = re.sub(remove_pattern, \"\\n\", yaml_str)\n\n return yaml_str\n
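A quick sketch of the clean flag (illustrative values); clean=True strips any !!python/object:... tags that yaml.dump may have emitted:
from koheesio.context import Context\n\ncontext = Context({\"a\": {\"b\": \"c\"}, \"f\": \"g\"})\nprint(context.to_yaml())            # key order is preserved (sort_keys=False)\nprint(context.to_yaml(clean=True))  # same output with any python object tags removed\n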
"},{"location":"api_reference/intro_api.html","title":"Intro api","text":""},{"location":"api_reference/intro_api.html#api-reference","title":"API Reference","text":"You can navigate the API by clicking on the modules listed on the left to access the documentation.
"},{"location":"api_reference/logger.html","title":"Logger","text":"Loggers are used to log messages from your application.
For a comprehensive guide on the usage, examples, and additional features of the logging classes, please refer to the reference/concepts/logging section of the Koheesio documentation.
Classes:
Name Description LoggingFactory
Logging factory to be used to generate logger instances.
Masked
Represents a masked value.
MaskedString
Represents a masked string value.
MaskedInt
Represents a masked integer value.
MaskedFloat
Represents a masked float value.
MaskedDict
Represents a masked dictionary value.
LoggerIDFilter
Filter which injects run_id information into the log.
Functions:
Name Description warn
Issue a warning.
"},{"location":"api_reference/logger.html#koheesio.logger.T","title":"koheesio.logger.T module-attribute
","text":"T = TypeVar('T')\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggerIDFilter","title":"koheesio.logger.LoggerIDFilter","text":"Filter which injects run_id information into the log.
"},{"location":"api_reference/logger.html#koheesio.logger.LoggerIDFilter.LOGGER_ID","title":"LOGGER_ID class-attribute
instance-attribute
","text":"LOGGER_ID: str = str(uuid4())\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggerIDFilter.filter","title":"filter","text":"filter(record)\n
Source code in src/koheesio/logger.py
def filter(self, record):\n record.logger_id = LoggerIDFilter.LOGGER_ID\n\n return True\n
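A minimal sketch (handler and format are illustrative) of attaching the filter so that %(logger_id)s becomes available to a formatter:
import logging\n\nfrom koheesio.logger import LoggerIDFilter\n\nhandler = logging.StreamHandler()\nhandler.addFilter(LoggerIDFilter())  # injects record.logger_id on every record\nhandler.setFormatter(logging.Formatter(\"[%(logger_id)s] %(levelname)s %(message)s\"))\n\nlogger = logging.getLogger(\"example\")\nlogger.addHandler(handler)\nlogger.warning(\"hello\")  # the record now carries the shared LOGGER_ID\n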
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory","title":"koheesio.logger.LoggingFactory","text":"LoggingFactory(name: Optional[str] = None, env: Optional[str] = None, level: Optional[str] = None, logger_id: Optional[str] = None)\n
Logging factory to be used to generate logger instances.
Parameters:
Name Type Description Default name
Optional[str]
None
env
Optional[str]
None
level
Optional[str]
None
logger_id
Optional[str]
None
Source code in src/koheesio/logger.py
def __init__(\n    self,\n    name: Optional[str] = None,\n    env: Optional[str] = None,\n    level: Optional[str] = None,\n    logger_id: Optional[str] = None,\n):\n    \"\"\"Logging factory to be used in pipeline. Prepare logger instance.\n\n    Parameters\n    ----------\n    name logger name.\n    env environment (\"local\", \"qa\", \"prod\").\n    level logging level.\n    logger_id unique identifier for the logger.\n    \"\"\"\n\n    LoggingFactory.LOGGER_NAME = name or LoggingFactory.LOGGER_NAME\n    LoggerIDFilter.LOGGER_ID = logger_id or LoggerIDFilter.LOGGER_ID\n    LoggingFactory.LOGGER_FILTER = LoggingFactory.LOGGER_FILTER or LoggerIDFilter()\n    LoggingFactory.ENV = env or LoggingFactory.ENV\n\n    console_handler = logging.StreamHandler(sys.stdout if LoggingFactory.ENV == \"local\" else sys.stderr)\n    console_handler.setFormatter(LoggingFactory.LOGGER_FORMATTER)\n    console_handler.addFilter(LoggingFactory.LOGGER_FILTER)\n    # WARNING is default level for root logger in python\n    logging.basicConfig(level=logging.WARNING, handlers=[console_handler], force=True)\n\n    LoggingFactory.CONSOLE_HANDLER = console_handler\n\n    logger = getLogger(LoggingFactory.LOGGER_NAME)\n    logger.setLevel(level or LoggingFactory.LOGGER_LEVEL)\n    LoggingFactory.LOGGER = logger\n
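A usage sketch, assuming the factory is constructed once per application (the name and level below are illustrative):
from koheesio.logger import LoggingFactory\n\nfactory = LoggingFactory(name=\"my_pipeline\", env=\"local\", level=\"INFO\")\nlogger = LoggingFactory.get_logger(name=\"ingest\", inherit_from_koheesio=True)\nlogger.info(\"reading source data\")  # emitted through the configured console handler\n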
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.CONSOLE_HANDLER","title":"CONSOLE_HANDLER class-attribute
instance-attribute
","text":"CONSOLE_HANDLER: Optional[Handler] = None\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.ENV","title":"ENV class-attribute
instance-attribute
","text":"ENV: Optional[str] = None\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER","title":"LOGGER class-attribute
instance-attribute
","text":"LOGGER: Optional[Logger] = None\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_ENV","title":"LOGGER_ENV class-attribute
instance-attribute
","text":"LOGGER_ENV: str = 'local'\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_FILTER","title":"LOGGER_FILTER class-attribute
instance-attribute
","text":"LOGGER_FILTER: Optional[Filter] = None\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_FORMAT","title":"LOGGER_FORMAT class-attribute
instance-attribute
","text":"LOGGER_FORMAT: str = '[%(logger_id)s] [%(asctime)s] [%(levelname)s] [%(name)s] {%(module)s.py:%(funcName)s:%(lineno)d} - %(message)s'\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_FORMATTER","title":"LOGGER_FORMATTER class-attribute
instance-attribute
","text":"LOGGER_FORMATTER: Formatter = Formatter(LOGGER_FORMAT)\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_LEVEL","title":"LOGGER_LEVEL class-attribute
instance-attribute
","text":"LOGGER_LEVEL: str = get('KOHEESIO_LOGGING_LEVEL', 'WARNING')\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.LOGGER_NAME","title":"LOGGER_NAME class-attribute
instance-attribute
","text":"LOGGER_NAME: str = 'koheesio'\n
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.add_handlers","title":"add_handlers staticmethod
","text":"add_handlers(handlers: List[Tuple[str, Dict]]) -> None\n
Add handlers to existing root logger.
Parameters:
Name Type Description Default handlers
List[Tuple[str, Dict]]
List of tuples pairing a handler's module and class path with its configuration dict.
required Source code in src/koheesio/logger.py
@staticmethod\ndef add_handlers(handlers: List[Tuple[str, Dict]]) -> None:\n    \"\"\"Add handlers to existing root logger.\n\n    Parameters\n    ----------\n    handlers list of tuples with the handler module and class for importing, and the configuration for that handler.\n\n    \"\"\"\n    for handler_module_class, handler_conf in handlers:\n        handler_class: logging.Handler = import_class(handler_module_class)\n        handler_level = handler_conf.pop(\"level\") if \"level\" in handler_conf else \"WARNING\"\n        # noinspection PyCallingNonCallable\n        handler = handler_class(**handler_conf)\n        handler.setLevel(handler_level)\n        handler.addFilter(LoggingFactory.LOGGER_FILTER)\n        handler.setFormatter(LoggingFactory.LOGGER_FORMATTER)\n        LoggingFactory.LOGGER.addHandler(handler)\n
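A sketch of the expected configuration shape (file name and level are illustrative): each entry pairs an importable handler class path with its constructor kwargs, plus an optional 'level' key:
from koheesio.logger import LoggingFactory\n\nLoggingFactory(name=\"my_pipeline\", env=\"local\")  # construct the factory first so LOGGER and LOGGER_FILTER are initialized\nLoggingFactory.add_handlers(\n    [\n        (\"logging.FileHandler\", {\"filename\": \"pipeline.log\", \"level\": \"INFO\"}),\n    ]\n)\n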
"},{"location":"api_reference/logger.html#koheesio.logger.LoggingFactory.get_logger","title":"get_logger staticmethod
","text":"get_logger(name: str, inherit_from_koheesio: bool = False) -> Logger\n
Provide logger. If inherit_from_koheesio then inherit from LoggingFactory.LOGGER_NAME.
Parameters:
Name Type Description Default name
str
required inherit_from_koheesio
bool
False
Returns:
Name Type Description logger
Logger
Source code in src/koheesio/logger.py
@staticmethod\ndef get_logger(name: str, inherit_from_koheesio: bool = False) -> Logger:\n    \"\"\"Provide logger. If inherit_from_koheesio then inherit from LoggingFactory.LOGGER_NAME.\n\n    Parameters\n    ----------\n    name: Name of logger.\n    inherit_from_koheesio: Inherit logger from koheesio\n\n    Returns\n    -------\n    logger: Logger\n\n    \"\"\"\n    if inherit_from_koheesio:\n        LoggingFactory.__check_koheesio_logger_initialized()\n        name = f\"{LoggingFactory.LOGGER_NAME}.{name}\"\n\n    return getLogger(name)\n
"},{"location":"api_reference/logger.html#koheesio.logger.Masked","title":"koheesio.logger.Masked","text":"Masked(value: T)\n
Represents a masked value.
Parameters:
Name Type Description Default value
T
The value to be masked.
required Attributes:
Name Type Description _value
T
The original value.
Methods:
Name Description __repr__
Returns a string representation of the masked value.
__str__
Returns a string representation of the masked value.
__get_validators__
Returns a generator of validators for the masked value.
validate
Validates the masked value.
Source code in src/koheesio/logger.py
def __init__(self, value: T):\n self._value = value\n
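A rough sketch of the intent (the exact masked representation is not spelled out here): wrap a secret before it ends up in a log line so the raw value is not printed:
from koheesio.logger import MaskedString\n\ntoken = MaskedString(\"super-secret-token\")\nprint(f\"using token: {token}\")  # __str__ returns a masked representation instead of the raw secret\n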
"},{"location":"api_reference/logger.html#koheesio.logger.Masked.validate","title":"validate classmethod
","text":"validate(v: Any, _values)\n
Validate the input value and return an instance of the class.
Parameters:
Name Type Description Default v
Any
The input value to validate.
required _values
Any
Additional values used for validation.
required Returns:
Name Type Description instance
cls
An instance of the class.
Source code in src/koheesio/logger.py
@classmethod\ndef validate(cls, v: Any, _values):\n \"\"\"\n Validate the input value and return an instance of the class.\n\n Parameters\n ----------\n v : Any\n The input value to validate.\n _values : Any\n Additional values used for validation.\n\n Returns\n -------\n instance : cls\n An instance of the class.\n\n \"\"\"\n return cls(v)\n
"},{"location":"api_reference/logger.html#koheesio.logger.MaskedDict","title":"koheesio.logger.MaskedDict","text":"MaskedDict(value: T)\n
Represents a masked dictionary value.
Source code in src/koheesio/logger.py
def __init__(self, value: T):\n self._value = value\n
"},{"location":"api_reference/logger.html#koheesio.logger.MaskedFloat","title":"koheesio.logger.MaskedFloat","text":"MaskedFloat(value: T)\n
Represents a masked float value.
Source code in src/koheesio/logger.py
def __init__(self, value: T):\n self._value = value\n
"},{"location":"api_reference/logger.html#koheesio.logger.MaskedInt","title":"koheesio.logger.MaskedInt","text":"MaskedInt(value: T)\n
Represents a masked integer value.
Source code in src/koheesio/logger.py
def __init__(self, value: T):\n self._value = value\n
"},{"location":"api_reference/logger.html#koheesio.logger.MaskedString","title":"koheesio.logger.MaskedString","text":"MaskedString(value: T)\n
Represents a masked string value.
Source code in src/koheesio/logger.py
def __init__(self, value: T):\n self._value = value\n
"},{"location":"api_reference/utils.html","title":"Utils","text":"Utility functions
"},{"location":"api_reference/utils.html#koheesio.utils.convert_str_to_bool","title":"koheesio.utils.convert_str_to_bool","text":"convert_str_to_bool(value) -> Any\n
Converts a string to a boolean if the string is either 'true' or 'false'
Source code in src/koheesio/utils.py
def convert_str_to_bool(value) -> Any:\n \"\"\"Converts a string to a boolean if the string is either 'true' or 'false'\"\"\"\n if isinstance(value, str) and (v := value.lower()) in [\"true\", \"false\"]:\n value = v == \"true\"\n return value\n
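For illustration: only the literal strings 'true' and 'false' (case-insensitive) are converted, everything else is returned unchanged:
from koheesio.utils import convert_str_to_bool\n\nconvert_str_to_bool(\"True\")   # -> True\nconvert_str_to_bool(\"false\")  # -> False\nconvert_str_to_bool(\"yes\")    # -> 'yes' (unchanged)\nconvert_str_to_bool(1)        # -> 1 (unchanged)\n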
"},{"location":"api_reference/utils.html#koheesio.utils.get_args_for_func","title":"koheesio.utils.get_args_for_func","text":"get_args_for_func(func: Callable, params: Dict) -> Tuple[Callable, Dict[str, Any]]\n
Helper function that matches keyword arguments (params) on a given function
This function uses inspect to extract the signature on the passed Callable, and then uses functools.partial to construct a new Callable (partial) function on which the input was mapped.
Example input_dict = {\"a\": \"foo\", \"b\": \"bar\"}\n\n\ndef example_func(a: str):\n return a\n\n\nfunc, kwargs = get_args_for_func(example_func, input_dict)\n
In this example, - func
would be a callable with the input mapped toward it (i.e. can be called like any normal function) - kwargs
would be a dict holding just the output needed to be able to run the function (e.g. {\"a\": \"foo\"})
Parameters:
Name Type Description Default func
Callable
The function to inspect
required params
Dict
Dictionary with keyword values that will be mapped on the 'func'
required Returns:
Type Description Tuple[Callable, Dict[str, Any]]
- Callable a partial() func with the found keyword values mapped toward it
- Dict[str, Any] the keyword args that match the func
Source code in src/koheesio/utils.py
def get_args_for_func(func: Callable, params: Dict) -> Tuple[Callable, Dict[str, Any]]:\n \"\"\"Helper function that matches keyword arguments (params) on a given function\n\n This function uses inspect to extract the signature on the passed Callable, and then uses functools.partial to\n construct a new Callable (partial) function on which the input was mapped.\n\n Example\n -------\n ```python\n input_dict = {\"a\": \"foo\", \"b\": \"bar\"}\n\n\n def example_func(a: str):\n return a\n\n\n func, kwargs = get_args_for_func(example_func, input_dict)\n ```\n\n In this example,\n - `func` would be a callable with the input mapped toward it (i.e. can be called like any normal function)\n - `kwargs` would be a dict holding just the output needed to be able to run the function (e.g. {\"a\": \"foo\"})\n\n Parameters\n ----------\n func: Callable\n The function to inspect\n params: Dict\n Dictionary with keyword values that will be mapped on the 'func'\n\n Returns\n -------\n Tuple[Callable, Dict[str, Any]]\n - Callable\n a partial() func with the found keyword values mapped toward it\n - Dict[str, Any]\n the keyword args that match the func\n \"\"\"\n _kwargs = {k: v for k, v in params.items() if k in inspect.getfullargspec(func).args}\n return (\n partial(func, **_kwargs),\n _kwargs,\n )\n
"},{"location":"api_reference/utils.html#koheesio.utils.get_project_root","title":"koheesio.utils.get_project_root","text":"get_project_root() -> Path\n
Returns project root path.
Source code in src/koheesio/utils.py
def get_project_root() -> Path:\n \"\"\"Returns project root path.\"\"\"\n cmd = Path(__file__)\n return Path([i for i in cmd.parents if i.as_uri().endswith(\"src\")][0]).parent\n
"},{"location":"api_reference/utils.html#koheesio.utils.get_random_string","title":"koheesio.utils.get_random_string","text":"get_random_string(length: int = 64, prefix: Optional[str] = None) -> str\n
Generate a random string of specified length
Source code in src/koheesio/utils.py
def get_random_string(length: int = 64, prefix: Optional[str] = None) -> str:\n \"\"\"Generate a random string of specified length\"\"\"\n if prefix:\n return f\"{prefix}_{uuid.uuid4().hex}\"[0:length]\n return f\"{uuid.uuid4().hex}\"[0:length]\n
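A small illustration (outputs shown are examples only); the prefix is joined with an underscore and the result is truncated to length characters:
from koheesio.utils import get_random_string\n\nget_random_string(length=12)                # e.g. '1f3a9c0d2b4e'\nget_random_string(length=12, prefix=\"tmp\")  # e.g. 'tmp_1f3a9c0d'\n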
"},{"location":"api_reference/utils.html#koheesio.utils.import_class","title":"koheesio.utils.import_class","text":"import_class(module_class: str) -> Any\n
Import class and module based on provided string.
Parameters:
Name Type Description Default module_class
str
required Returns:
Type Description object Class from specified input string.
Source code in src/koheesio/utils.py
def import_class(module_class: str) -> Any:\n \"\"\"Import class and module based on provided string.\n\n Parameters\n ----------\n module_class module+class to be imported.\n\n Returns\n -------\n object Class from specified input string.\n\n \"\"\"\n module_path, class_name = module_class.rsplit(\".\", 1)\n module = import_module(module_path)\n\n return getattr(module, class_name)\n
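A quick sketch using a standard-library class as the target:
from koheesio.utils import import_class\n\nhandler_cls = import_class(\"logging.StreamHandler\")\nhandler = handler_cls()  # equivalent to logging.StreamHandler()\n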
"},{"location":"api_reference/asyncio/index.html","title":"Asyncio","text":"This module provides classes for asynchronous steps in the koheesio package.
"},{"location":"api_reference/asyncio/index.html#koheesio.asyncio.AsyncStep","title":"koheesio.asyncio.AsyncStep","text":"Asynchronous step class that inherits from Step and uses the AsyncStepMetaClass metaclass.
Attributes:
Name Type Description Output
AsyncStepOutput
The output class for the asynchronous step.
"},{"location":"api_reference/asyncio/index.html#koheesio.asyncio.AsyncStep.Output","title":"Output","text":"Output class for asyncio step.
This class represents the output of the asyncio step. It inherits from the AsyncStepOutput class.
"},{"location":"api_reference/asyncio/index.html#koheesio.asyncio.AsyncStepMetaClass","title":"koheesio.asyncio.AsyncStepMetaClass","text":"Metaclass for asynchronous steps.
This metaclass is used to define asynchronous steps in the Koheesio framework. It inherits from the StepMetaClass and provides additional functionality for executing asynchronous steps.
Attributes: None
Methods: _execute_wrapper: Wrapper method for executing asynchronous steps.
"},{"location":"api_reference/asyncio/index.html#koheesio.asyncio.AsyncStepOutput","title":"koheesio.asyncio.AsyncStepOutput","text":"Represents the output of an asynchronous step.
This class extends the base Step.Output
class and provides additional functionality for merging key-value maps.
Attributes:
Name Type Description ...
Methods:
Name Description merge
Merge key-value map with self.
"},{"location":"api_reference/asyncio/index.html#koheesio.asyncio.AsyncStepOutput.merge","title":"merge","text":"merge(other: Union[Dict, StepOutput])\n
Merge key,value map with self
Examples:
step_output = StepOutput(foo=\"bar\")\nstep_output.merge({\"lorem\": \"ipsum\"}) # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n
Functionally similar to adding two dicts together; like running {**dict_a, **dict_b}
.
Parameters:
Name Type Description Default other
Union[Dict, StepOutput]
Dict or another instance of a StepOutputs class that will be added to self
required Source code in src/koheesio/asyncio/__init__.py
def merge(self, other: Union[Dict, StepOutput]):\n \"\"\"Merge key,value map with self\n\n Examples\n --------\n ```python\n step_output = StepOutput(foo=\"bar\")\n step_output.merge({\"lorem\": \"ipsum\"}) # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n ```\n\n Functionally similar to adding two dicts together; like running `{**dict_a, **dict_b}`.\n\n Parameters\n ----------\n other: Union[Dict, StepOutput]\n Dict or another instance of a StepOutputs class that will be added to self\n \"\"\"\n if isinstance(other, StepOutput):\n other = other.model_dump() # ensures we really have a dict\n\n if not iscoroutine(other):\n for k, v in other.items():\n self.set(k, v)\n\n return self\n
"},{"location":"api_reference/asyncio/http.html","title":"Http","text":"This module contains async implementation of HTTP step.
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpGetStep","title":"koheesio.asyncio.http.AsyncHttpGetStep","text":"Represents an asynchronous HTTP GET step.
This class inherits from the AsyncHttpStep class and specifies the HTTP method as GET.
Attributes: method (HttpMethod): The HTTP method for the step, set to HttpMethod.GET.
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpGetStep.method","title":"method class-attribute
instance-attribute
","text":"method: HttpMethod = GET\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep","title":"koheesio.asyncio.http.AsyncHttpStep","text":"Asynchronous HTTP step for making HTTP requests using aiohttp.
Parameters:
Name Type Description Default client_session
Optional[ClientSession]
Aiohttp ClientSession.
required url
List[URL]
List of yarl.URL.
required retry_options
Optional[RetryOptionsBase]
Retry options for the request.
required connector
Optional[BaseConnector]
Connector for the aiohttp request.
required headers
Optional[Dict[str, Union[str, SecretStr]]]
Request headers.
required Output responses_urls : Optional[List[Tuple[Dict[str, Any], yarl.URL]]] List of responses from the API and request URL.
Examples:
>>> import asyncio\n>>> from aiohttp import ClientSession\n>>> from aiohttp.connector import TCPConnector\n>>> from aiohttp_retry import ExponentialRetry\n>>> from koheesio.asyncio.http import AsyncHttpStep\n>>> from yarl import URL\n>>> from typing import Dict, Any, Union, List, Tuple\n>>>\n>>> # Initialize the AsyncHttpStep\n>>> async def main():\n>>>     session = ClientSession()\n>>>     urls = [URL('https://example.com/api/1'), URL('https://example.com/api/2')]\n>>>     retry_options = ExponentialRetry()\n>>>     connector = TCPConnector(limit=10)\n>>>     headers = {'Content-Type': 'application/json'}\n>>>     step = AsyncHttpStep(\n>>>         client_session=session,\n>>>         url=urls,\n>>>         retry_options=retry_options,\n>>>         connector=connector,\n>>>         headers=headers\n>>>     )\n>>>\n>>>     # Execute the step\n>>>     responses_urls = await step.get()\n>>>\n>>>     return responses_urls\n>>>\n>>> # Run the main function\n>>> responses_urls = asyncio.run(main())\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.client_session","title":"client_session class-attribute
instance-attribute
","text":"client_session: Optional[ClientSession] = Field(default=None, description='Aiohttp ClientSession', exclude=True)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.connector","title":"connector class-attribute
instance-attribute
","text":"connector: Optional[BaseConnector] = Field(default=None, description='Connector for the aiohttp request', exclude=True)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.headers","title":"headers class-attribute
instance-attribute
","text":"headers: Dict[str, Union[str, SecretStr]] = Field(default_factory=dict, description='Request headers', alias='header', exclude=True)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.method","title":"method class-attribute
instance-attribute
","text":"method: Union[str, HttpMethod] = Field(default=GET, description=\"What type of Http call to perform. One of 'get', 'post', 'put', 'delete'. Defaults to 'get'.\")\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.retry_options","title":"retry_options class-attribute
instance-attribute
","text":"retry_options: Optional[RetryOptionsBase] = Field(default=None, description='Retry options for the request', exclude=True)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.timeout","title":"timeout class-attribute
instance-attribute
","text":"timeout: None = Field(default=None, description='[Optional] Request timeout')\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.url","title":"url class-attribute
instance-attribute
","text":"url: List[URL] = Field(default=None, alias='urls', description='Expecting list, as there is no value in executing async request for one value.\\n yarl.URL is preferable, because params/data can be injected into URL instance', exclude=True)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.Output","title":"Output","text":"Output class for Step
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.Output.responses_urls","title":"responses_urls class-attribute
instance-attribute
","text":"responses_urls: Optional[List[Tuple[Dict[str, Any], URL]]] = Field(default=None, description='List of responses from the API and request URL', repr=False)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.delete","title":"delete async
","text":"delete() -> List[Tuple[Dict[str, Any], URL]]\n
Make DELETE requests.
Returns:
Type Description List[Tuple[Dict[str, Any], URL]]
A list of response data and corresponding request URLs.
Source code in src/koheesio/asyncio/http.py
async def delete(self) -> List[Tuple[Dict[str, Any], yarl.URL]]:\n \"\"\"\n Make DELETE requests.\n\n Returns\n -------\n List[Tuple[Dict[str, Any], yarl.URL]]\n A list of response data and corresponding request URLs.\n \"\"\"\n tasks = self.__tasks_generator(method=HttpMethod.DELETE)\n responses_urls = await self._execute(tasks=tasks)\n\n return responses_urls\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.execute","title":"execute","text":"execute() -> Output\n
Execute the step.
Raises:
Type Description ValueError
If the specified HTTP method is not implemented in AsyncHttpStep.
Source code in src/koheesio/asyncio/http.py
def execute(self) -> AsyncHttpStep.Output:\n \"\"\"\n Execute the step.\n\n Raises\n ------\n ValueError\n If the specified HTTP method is not implemented in AsyncHttpStep.\n \"\"\"\n # By design asyncio does not allow its event loop to be nested. This presents a practical problem:\n # When in an environment where the event loop is already running\n # it\u2019s impossible to run tasks and wait for the result.\n # Trying to do so will give the error \u201cRuntimeError: This event loop is already running\u201d.\n # The issue pops up in various environments, such as web servers, GUI applications and in\n # Jupyter/DataBricks notebooks.\n nest_asyncio.apply()\n\n map_method_func = {\n HttpMethod.GET: self.get,\n HttpMethod.POST: self.post,\n HttpMethod.PUT: self.put,\n HttpMethod.DELETE: self.delete,\n }\n\n if self.method not in map_method_func:\n raise ValueError(f\"Method {self.method} not implemented in AsyncHttpStep.\")\n\n self.output.responses_urls = asyncio.run(map_method_func[self.method]())\n\n return self.output\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.get","title":"get async
","text":"get() -> List[Tuple[Dict[str, Any], URL]]\n
Make GET requests.
Returns:
Type Description List[Tuple[Dict[str, Any], URL]]
A list of response data and corresponding request URLs.
Source code in src/koheesio/asyncio/http.py
async def get(self) -> List[Tuple[Dict[str, Any], yarl.URL]]:\n \"\"\"\n Make GET requests.\n\n Returns\n -------\n List[Tuple[Dict[str, Any], yarl.URL]]\n A list of response data and corresponding request URLs.\n \"\"\"\n tasks = self.__tasks_generator(method=HttpMethod.GET)\n responses_urls = await self._execute(tasks=tasks)\n\n return responses_urls\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.get_headers","title":"get_headers","text":"get_headers()\n
Get the request headers.
Returns:
Type Description Optional[Dict[str, Union[str, SecretStr]]]
The request headers.
Source code in src/koheesio/asyncio/http.py
def get_headers(self):\n \"\"\"\n Get the request headers.\n\n Returns\n -------\n Optional[Dict[str, Union[str, SecretStr]]]\n The request headers.\n \"\"\"\n _headers = None\n\n if self.headers:\n _headers = {k: v.get_secret_value() if isinstance(v, SecretStr) else v for k, v in self.headers.items()}\n\n for k, v in self.headers.items():\n if isinstance(v, SecretStr):\n self.headers[k] = v.get_secret_value()\n\n return _headers or self.headers\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.get_options","title":"get_options","text":"get_options()\n
Get the options of the step.
Source code in src/koheesio/asyncio/http.py
def get_options(self):\n \"\"\"\n Get the options of the step.\n \"\"\"\n warnings.warn(\"get_options is not implemented in AsyncHttpStep.\")\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.post","title":"post async
","text":"post() -> List[Tuple[Dict[str, Any], URL]]\n
Make POST requests.
Returns:
Type Description List[Tuple[Dict[str, Any], URL]]
A list of response data and corresponding request URLs.
Source code in src/koheesio/asyncio/http.py
async def post(self) -> List[Tuple[Dict[str, Any], yarl.URL]]:\n \"\"\"\n Make POST requests.\n\n Returns\n -------\n List[Tuple[Dict[str, Any], yarl.URL]]\n A list of response data and corresponding request URLs.\n \"\"\"\n tasks = self.__tasks_generator(method=HttpMethod.POST)\n responses_urls = await self._execute(tasks=tasks)\n\n return responses_urls\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.put","title":"put async
","text":"put() -> List[Tuple[Dict[str, Any], URL]]\n
Make PUT requests.
Returns:
Type Description List[Tuple[Dict[str, Any], URL]]
A list of response data and corresponding request URLs.
Source code in src/koheesio/asyncio/http.py
async def put(self) -> List[Tuple[Dict[str, Any], yarl.URL]]:\n \"\"\"\n Make PUT requests.\n\n Returns\n -------\n List[Tuple[Dict[str, Any], yarl.URL]]\n A list of response data and corresponding request URLs.\n \"\"\"\n tasks = self.__tasks_generator(method=HttpMethod.PUT)\n responses_urls = await self._execute(tasks=tasks)\n\n return responses_urls\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.request","title":"request async
","text":"request(method: HttpMethod, url: URL, **kwargs) -> Tuple[Dict[str, Any], URL]\n
Make an HTTP request.
Parameters:
Name Type Description Default method
HttpMethod
The HTTP method to use for the request.
required url
URL
The URL to make the request to.
required kwargs
Any
Additional keyword arguments to pass to the request.
{}
Returns:
Type Description Tuple[Dict[str, Any], URL]
A tuple containing the response data and the request URL.
Source code in src/koheesio/asyncio/http.py
async def request(\n self,\n method: HttpMethod,\n url: yarl.URL,\n **kwargs,\n) -> Tuple[Dict[str, Any], yarl.URL]:\n \"\"\"\n Make an HTTP request.\n\n Parameters\n ----------\n method : HttpMethod\n The HTTP method to use for the request.\n url : yarl.URL\n The URL to make the request to.\n kwargs : Any\n Additional keyword arguments to pass to the request.\n\n Returns\n -------\n Tuple[Dict[str, Any], yarl.URL]\n A tuple containing the response data and the request URL.\n \"\"\"\n async with self.__retry_client.request(method=method, url=url, **kwargs) as response:\n res = await response.json()\n\n return (res, response.request_info.url)\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.set_outputs","title":"set_outputs","text":"set_outputs(response)\n
Set the outputs of the step.
Parameters:
Name Type Description Default response
Any
The response data.
required Source code in src/koheesio/asyncio/http.py
def set_outputs(self, response):\n \"\"\"\n Set the outputs of the step.\n\n Parameters\n ----------\n response : Any\n The response data.\n \"\"\"\n warnings.warn(\"set outputs is not implemented in AsyncHttpStep.\")\n
"},{"location":"api_reference/asyncio/http.html#koheesio.asyncio.http.AsyncHttpStep.validate_timeout","title":"validate_timeout","text":"validate_timeout(timeout)\n
Validate the 'timeout' field.
Parameters:
Name Type Description Default timeout
Any
The value of the 'timeout' field.
required Raises:
Type Description ValueError
If a 'timeout' value is provided; timeouts are not allowed in AsyncHttpStep and should be set through retry_options.
Source code in src/koheesio/asyncio/http.py
@field_validator(\"timeout\")\ndef validate_timeout(cls, timeout):\n    \"\"\"\n    Validate the 'timeout' field.\n\n    Parameters\n    ----------\n    timeout : Any\n        The value of the 'timeout' field.\n\n    Raises\n    ------\n    ValueError\n        If a 'timeout' value is provided; timeouts are not allowed in AsyncHttpStep and should be set through retry_options.\n    \"\"\"\n    if timeout:\n        raise ValueError(\"timeout is not allowed in AsyncHttpStep. Provide timeout through retry_options.\")\n
"},{"location":"api_reference/integrations/index.html","title":"Integrations","text":"Nothing to see here, move along.
"},{"location":"api_reference/integrations/box.html","title":"Box","text":"Box Module
The module is used to facilitate various interactions with Box service. The implementation is based on the functionalities available in Box Python SDK: https://github.com/box/box-python-sdk
Prerequisites - Box Application is created in the developer portal using the JWT auth method (Developer Portal - My Apps - Create)
- Application is authorized for the enterprise (Developer Portal - MyApp - Authorization)
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box","title":"koheesio.integrations.box.Box","text":"Box(**data)\n
Configuration details required for the authentication can be obtained in the Box Developer Portal by generating the Public / Private key pair in \"Application Name -> Configuration -> Add and Manage Public Keys\".
The downloaded JSON file will look like this:
{\n \"boxAppSettings\": {\n \"clientID\": \"client_id\",\n \"clientSecret\": \"client_secret\",\n \"appAuth\": {\n \"publicKeyID\": \"public_key_id\",\n \"privateKey\": \"private_key\",\n \"passphrase\": \"pass_phrase\"\n }\n },\n \"enterpriseID\": \"123456\"\n}\n
This class is used as a base for the rest of the Box integrations; however, it can also be used on its own to obtain the Box client, which is created at class initialization. Examples:
b = Box(\n client_id=\"client_id\",\n client_secret=\"client_secret\",\n enterprise_id=\"enterprise_id\",\n jwt_key_id=\"jwt_key_id\",\n rsa_private_key_data=\"rsa_private_key_data\",\n rsa_private_key_passphrase=\"rsa_private_key_passphrase\",\n)\nb.client\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.auth_options","title":"auth_options property
","text":"auth_options\n
Get a dictionary of authentication options, that can be handily used in the child classes
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.client","title":"client class-attribute
instance-attribute
","text":"client: SkipValidation[Client] = None\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.client_id","title":"client_id class-attribute
instance-attribute
","text":"client_id: Union[SecretStr, SecretBytes] = Field(default=..., alias='clientID', description='Client ID from the Box Developer console.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.client_secret","title":"client_secret class-attribute
instance-attribute
","text":"client_secret: Union[SecretStr, SecretBytes] = Field(default=..., alias='clientSecret', description='Client Secret from the Box Developer console.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.enterprise_id","title":"enterprise_id class-attribute
instance-attribute
","text":"enterprise_id: Union[SecretStr, SecretBytes] = Field(default=..., alias='enterpriseID', description='Enterprise ID from the Box Developer console.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.jwt_key_id","title":"jwt_key_id class-attribute
instance-attribute
","text":"jwt_key_id: Union[SecretStr, SecretBytes] = Field(default=..., alias='publicKeyID', description='PublicKeyID for the public/private generated key pair.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.rsa_private_key_data","title":"rsa_private_key_data class-attribute
instance-attribute
","text":"rsa_private_key_data: Union[SecretStr, SecretBytes] = Field(default=..., alias='privateKey', description='Private key generated in the app management console.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.rsa_private_key_passphrase","title":"rsa_private_key_passphrase class-attribute
instance-attribute
","text":"rsa_private_key_passphrase: Union[SecretStr, SecretBytes] = Field(default=..., alias='passphrase', description='Private key passphrase generated in the app management console.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/integrations/box.py
def execute(self):\n # Plug to be able to unit test ABC\n pass\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.Box.init_client","title":"init_client","text":"init_client()\n
Set up the Box client.
Source code in src/koheesio/integrations/box.py
def init_client(self):\n \"\"\"Set up the Box client.\"\"\"\n if not self.client:\n self.client = Client(JWTAuth(**self.auth_options))\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvFileReader","title":"koheesio.integrations.box.BoxCsvFileReader","text":"BoxCsvFileReader(**data)\n
Class facilitates reading one or multiple CSV files with the same structure directly from Box and producing Spark Dataframe.
Notes To manually identify the ID of the file in Box, open the file through Web UI, and copy ID from the page URL, e.g. https://foo.ent.box.com/file/1234567890 , where 1234567890 is the ID.
Examples:
from koheesio.steps.integrations.box import BoxCsvFileReader\nfrom pyspark.sql.types import StructType\n\nschema = StructType(...)\nb = BoxCsvFileReader(\n client_id=\"\",\n client_secret=\"\",\n enterprise_id=\"\",\n jwt_key_id=\"\",\n rsa_private_key_data=\"\",\n rsa_private_key_passphrase=\"\",\n file=[\"1\", \"2\"],\n schema=schema,\n).execute()\nb.df.show()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvFileReader.file","title":"file class-attribute
instance-attribute
","text":"file: Union[str, list[str]] = Field(default=..., description='ID or list of IDs for the files to read.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvFileReader.execute","title":"execute","text":"execute()\n
Loop through the list of provided file identifiers and load data into dataframe. For traceability purposes the following columns will be added to the dataframe: * meta_file_id: the identifier of the file on Box * meta_file_name: name of the file
Returns:
Type Description DataFrame
Source code in src/koheesio/integrations/box.py
def execute(self):\n \"\"\"\n Loop through the list of provided file identifiers and load data into dataframe.\n For traceability purposes the following columns will be added to the dataframe:\n * meta_file_id: the identifier of the file on Box\n * meta_file_name: name of the file\n\n Returns\n -------\n DataFrame\n \"\"\"\n df = None\n for f in self.file:\n self.log.debug(f\"Reading contents of file with the ID '{f}' into Spark DataFrame\")\n file = self.client.file(file_id=f)\n data = file.content().decode(\"utf-8\").splitlines()\n rdd = self.spark.sparkContext.parallelize(data)\n temp_df = self.spark.read.csv(rdd, header=True, schema=self.schema_, **self.params)\n temp_df = (\n temp_df\n # fmt: off\n .withColumn(\"meta_file_id\", lit(file.object_id))\n .withColumn(\"meta_file_name\", lit(file.get().name))\n .withColumn(\"meta_load_timestamp\", expr(\"to_utc_timestamp(current_timestamp(), current_timezone())\"))\n # fmt: on\n )\n\n df = temp_df if not df else df.union(temp_df)\n\n self.output.df = df\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvPathReader","title":"koheesio.integrations.box.BoxCsvPathReader","text":"BoxCsvPathReader(**data)\n
Read all CSV files from the specified path into the dataframe. Files can be filtered using the regular expression in the 'filter' parameter. The default behavior is to read all CSV / TXT files from the specified path.
Notes The class does not contain archival capability as it is presumed that the user wants to make sure that the full pipeline is successful (for example, the source data was transformed and saved) prior to moving the source files. Use BoxToBoxFileMove class instead and provide the list of IDs from 'file_id' output.
Examples:
from koheesio.steps.integrations.box import BoxCsvPathReader\n\nauth_params = {...}\nb = BoxCsvPathReader(**auth_params, path=\"foo/bar/\").execute()\nb.df # Spark Dataframe\n... # do something with the dataframe\nfrom koheesio.steps.integrations.box import BoxToBoxFileMove\n\nbm = BoxToBoxFileMove(**auth_params, file=b.file_id, path=\"/foo/bar/archive\")\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvPathReader.filter","title":"filter class-attribute
instance-attribute
","text":"filter: Optional[str] = Field(default='.csv|.txt$', description='[Optional] Regexp to filter folder contents')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvPathReader.path","title":"path class-attribute
instance-attribute
","text":"path: str = Field(default=..., description='Box path')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxCsvPathReader.execute","title":"execute","text":"execute()\n
Identify the list of files from the source Box path that match the desired filter and load them into a Dataframe
Source code in src/koheesio/integrations/box.py
def execute(self):\n    \"\"\"\n    Identify the list of files from the source Box path that match the desired filter and load them into a Dataframe\n    \"\"\"\n    folder = BoxFolderGet.from_step(self).execute().folder\n\n    # Identify the list of files that should be processed\n    files = [item for item in folder.get_items() if item.type == \"file\" and re.search(self.filter, item.name)]\n\n    if len(files) > 0:\n        self.log.info(\n            f\"A total of {len(files)} files that match the filter '{self.filter}' have been detected in {self.path}.\"\n            f\" They will be loaded into Spark Dataframe: {files}\"\n        )\n    else:\n        raise BoxPathIsEmptyError(f\"Path '{self.path}' is empty or none of the files match the filter '{self.filter}'\")\n\n    file = [file_id.object_id for file_id in files]\n    self.output.df = BoxCsvFileReader.from_step(self, file=file).read()\n    self.output.file = file  # e.g. if files should be archived after pipeline is successful\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase","title":"koheesio.integrations.box.BoxFileBase","text":"BoxFileBase(**data)\n
Generic class to facilitate interactions with Box folders.
Box SDK provides a File class with various properties and methods to interact with Box files. The object can be obtained in multiple ways: * provide a Box file identifier to the file
parameter (the identifier can be obtained, for example, from the URL) * provide an existing object to the file
parameter (boxsdk.object.file.File)
Notes Refer to BoxFolderBase for more info about folder
and path
parameters
See Also boxsdk.object.file.File
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase.files","title":"files class-attribute
instance-attribute
","text":"files: conlist(Union[File, str], min_length=1) = Field(default=..., alias='file', description='List of Box file objects or identifiers')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase.folder","title":"folder class-attribute
instance-attribute
","text":"folder: Optional[Union[Folder, str]] = Field(default=None, description='Existing folder object or folder identifier')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase.path","title":"path class-attribute
instance-attribute
","text":"path: Optional[str] = Field(default=None, description='Path to the Box folder, for example: `folder/sub-folder/lz')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase.action","title":"action","text":"action(file: File, folder: Folder)\n
Abstract class for File level actions.
Source code in src/koheesio/integrations/box.py
def action(self, file: File, folder: Folder):\n \"\"\"\n Abstract class for File level actions.\n \"\"\"\n raise NotImplementedError\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileBase.execute","title":"execute","text":"execute()\n
Generic execute method for all BoxToBox interactions. Deals with getting the correct folder and file objects from various parameter inputs
Source code in src/koheesio/integrations/box.py
def execute(self):\n \"\"\"\n Generic execute method for all BoxToBox interactions. Deals with getting the correct folder and file objects\n from various parameter inputs\n \"\"\"\n if self.path:\n _folder = BoxFolderGet.from_step(self).execute().folder\n else:\n _folder = self.client.folder(folder_id=self.folder) if isinstance(self.folder, str) else self.folder\n\n for _file in self.files:\n _file = self.client.file(file_id=_file) if isinstance(_file, str) else _file\n self.action(file=_file, folder=_folder)\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter","title":"koheesio.integrations.box.BoxFileWriter","text":"BoxFileWriter(**data)\n
Write file or a file-like object to Box.
Examples:
from koheesio.steps.integrations.box import BoxFileWriter\n\nauth_params = {...}\nf1 = BoxFileWriter(**auth_params, path=\"/foo/bar\", file=\"path/to/my/file.ext\").execute()\n# or\nimport io\n\nb = io.BytesIO(b\"my-sample-data\")\nf2 = BoxFileWriter(**auth_params, path=\"/foo/bar\", file=b, name=\"file.ext\").execute()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.description","title":"description class-attribute
instance-attribute
","text":"description: Optional[str] = Field(None, description='Optional description to add to the file in Box')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.file","title":"file class-attribute
instance-attribute
","text":"file: Union[str, BytesIO] = Field(default=..., description='Path to file or a file-like object')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.file_name","title":"file_name class-attribute
instance-attribute
","text":"file_name: Optional[str] = Field(default=None, description=\"When file path or name is provided to 'file' parameter, this will override the original name.When binary stream is provided, the 'name' should be used to set the desired name for the Box file.\")\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.Output","title":"Output","text":"Output class for BoxFileWriter.
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.Output.file","title":"file class-attribute
instance-attribute
","text":"file: File = Field(default=..., description='File object in Box')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.Output.shared_link","title":"shared_link class-attribute
instance-attribute
","text":"shared_link: str = Field(default=..., description='Shared link for the Box file')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.action","title":"action","text":"action()\n
Source code in src/koheesio/integrations/box.py
def action(self):\n _file = self.file\n _name = self.file_name\n\n if isinstance(_file, str):\n _name = _name if _name else PurePath(_file).name\n with open(_file, \"rb\") as f:\n _file = BytesIO(f.read())\n\n folder: Folder = BoxFolderGet.from_step(self, create_sub_folders=True).execute().folder\n folder.preflight_check(size=0, name=_name)\n\n self.log.info(f\"Uploading file '{_name}' to Box folder '{folder.get().name}'...\")\n _box_file: File = folder.upload_stream(file_stream=_file, file_name=_name, file_description=self.description)\n\n self.output.file = _box_file\n self.output.shared_link = _box_file.get_shared_link()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.execute","title":"execute","text":"execute() -> Output\n
Source code in src/koheesio/integrations/box.py
def execute(self) -> Output:\n self.action()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFileWriter.validate_name_for_binary_data","title":"validate_name_for_binary_data","text":"validate_name_for_binary_data(values)\n
Validate 'file_name' parameter when providing a binary input for 'file'.
Source code in src/koheesio/integrations/box.py
@model_validator(mode=\"before\")\ndef validate_name_for_binary_data(cls, values):\n \"\"\"Validate 'file_name' parameter when providing a binary input for 'file'.\"\"\"\n file, file_name = values.get(\"file\"), values.get(\"file_name\")\n if not isinstance(file, str) and not file_name:\n raise AttributeError(\"The parameter 'file_name' is mandatory when providing a binary input for 'file'.\")\n\n return values\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase","title":"koheesio.integrations.box.BoxFolderBase","text":"BoxFolderBase(**data)\n
Generic class to facilitate interactions with Box folders.
Box SDK provides a Folder class with various properties and methods to interact with Box folders. The object can be obtained in multiple ways: * provide a Box folder identifier to the folder
parameter (the identifier can be obtained, for example, from the URL) * provide an existing object to the folder
parameter (boxsdk.object.folder.Folder) * provide a filesystem-like path to the path
parameter
See Also boxsdk.object.folder.Folder
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.folder","title":"folder class-attribute
instance-attribute
","text":"folder: Optional[Union[Folder, str]] = Field(default=None, description='Existing folder object or folder identifier')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.path","title":"path class-attribute
instance-attribute
","text":"path: Optional[str] = Field(default=None, description='Path to the Box folder, for example: `folder/sub-folder/lz')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.root","title":"root class-attribute
instance-attribute
","text":"root: Optional[Union[Folder, str]] = Field(default='0', description='Folder object or identifier of the folder that should be used as root')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.Output","title":"Output","text":"Define outputs for the BoxFolderBase class
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.Output.folder","title":"folder class-attribute
instance-attribute
","text":"folder: Optional[Folder] = Field(default=None, description='Box folder object')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.action","title":"action","text":"action()\n
Placeholder for 'action' method, that should be implemented in the child classes
Returns:
Type Description Folder or None
Source code in src/koheesio/integrations/box.py
def action(self):\n \"\"\"\n Placeholder for 'action' method, that should be implemented in the child classes\n\n Returns\n -------\n Folder or None\n \"\"\"\n raise NotImplementedError\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.execute","title":"execute","text":"execute() -> Output\n
Source code in src/koheesio/integrations/box.py
def execute(self) -> Output:\n self.output.folder = self.action()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderBase.validate_folder_or_path","title":"validate_folder_or_path","text":"validate_folder_or_path()\n
Validations for 'folder' and 'path' parameter usage
Source code in src/koheesio/integrations/box.py
@model_validator(mode=\"after\")\ndef validate_folder_or_path(self):\n    \"\"\"\n    Validations for 'folder' and 'path' parameter usage\n    \"\"\"\n    folder_value = self.folder\n    path_value = self.path\n\n    if folder_value and path_value:\n        raise AttributeError(\"Cannot use 'folder' and 'path' parameters at the same time\")\n\n    if not folder_value and not path_value:\n        raise AttributeError(\"Neither 'folder' nor 'path' parameters are set\")\n\n    return self\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderCreate","title":"koheesio.integrations.box.BoxFolderCreate","text":"BoxFolderCreate(**data)\n
Explicitly create the new Box folder object and parent directories.
Examples:
from koheesio.steps.integrations.box import BoxFolderCreate\n\nauth_params = {...}\nfolder = BoxFolderCreate(**auth_params, path=\"/foo/bar\").execute()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderCreate.create_sub_folders","title":"create_sub_folders class-attribute
instance-attribute
","text":"create_sub_folders: bool = Field(default=True, description='Create sub-folders recursively if the path does not exist.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderCreate.validate_folder","title":"validate_folder","text":"validate_folder(folder)\n
Validate 'folder' parameter
Source code in src/koheesio/integrations/box.py
@field_validator(\"folder\")\ndef validate_folder(cls, folder):\n \"\"\"\n Validate 'folder' parameter\n \"\"\"\n if folder:\n raise AttributeError(\"Only 'path' parameter is allowed in the context of folder creation.\")\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderDelete","title":"koheesio.integrations.box.BoxFolderDelete","text":"BoxFolderDelete(**data)\n
Delete existing Box folder based on object, identifier or path.
Examples:
from koheesio.steps.integrations.box import BoxFolderDelete\n\nauth_params = {...}\nBoxFolderDelete(**auth_params, path=\"/foo/bar\").execute()\n# or\nBoxFolderDelete(**auth_params, folder=\"1\").execute()\n# or\nfolder = BoxFolderGet(**auth_params, path=\"/foo/bar\").execute().folder\nBoxFolderDelete(**auth_params, folder=folder).execute()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderDelete.action","title":"action","text":"action()\n
Delete folder action
Returns:
Type Description None
Source code in src/koheesio/integrations/box.py
def action(self):\n \"\"\"\n Delete folder action\n\n Returns\n -------\n None\n \"\"\"\n if self.folder:\n folder = self._obj_from_id\n else: # path\n folder = BoxFolderGet.from_step(self).action()\n\n self.log.info(f\"Deleting Box folder '{folder}'...\")\n folder.delete()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderGet","title":"koheesio.integrations.box.BoxFolderGet","text":"BoxFolderGet(**data)\n
Get the Box folder object for an existing folder or create a new folder and parent directories.
Examples:
from koheesio.steps.integrations.box import BoxFolderGet\n\nauth_params = {...}\nfolder = BoxFolderGet(**auth_params, path=\"/foo/bar\").execute().folder\n# or\nfolder = BoxFolderGet(**auth_params, folder=\"1\").execute().folder\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderGet.create_sub_folders","title":"create_sub_folders class-attribute
instance-attribute
","text":"create_sub_folders: Optional[bool] = Field(False, description='Create sub-folders recursively if the path does not exist.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderGet.action","title":"action","text":"action()\n
Get folder action
Returns:
Name Type Description folder
Folder
Box Folder object as specified in Box SDK
Source code in src/koheesio/integrations/box.py
def action(self):\n \"\"\"\n Get folder action\n\n Returns\n -------\n folder: Folder\n Box Folder object as specified in Box SDK\n \"\"\"\n current_folder_object = None\n\n if self.folder:\n current_folder_object = self._obj_from_id\n\n if self.path:\n cleaned_path_parts = [p for p in PurePath(self.path).parts if p.strip() not in [None, \"\", \" \", \"/\"]]\n current_folder_object = self.client.folder(folder_id=self.root) if isinstance(self.root, str) else self.root\n\n for next_folder_name in cleaned_path_parts:\n current_folder_object = self._get_or_create_folder(current_folder_object, next_folder_name)\n\n self.log.info(f\"Folder identified or created: {current_folder_object}\")\n return current_folder_object\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxFolderNotFoundError","title":"koheesio.integrations.box.BoxFolderNotFoundError","text":"Error when a provided box path does not exist.
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxPathIsEmptyError","title":"koheesio.integrations.box.BoxPathIsEmptyError","text":"Exception when provided Box path is empty or no files matched the mask.
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase","title":"koheesio.integrations.box.BoxReaderBase","text":"BoxReaderBase(**data)\n
Base class for Box readers.
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase.params","title":"params class-attribute
instance-attribute
","text":"params: Optional[Dict[str, Any]] = Field(default_factory=dict, description='[Optional] Set of extra parameters that should be passed to the Spark reader.')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase.schema_","title":"schema_ class-attribute
instance-attribute
","text":"schema_: Optional[StructType] = Field(None, alias='schema', description='[Optional] Schema that will be applied during the creation of Spark DataFrame')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase.Output","title":"Output","text":"Make default reader output optional to gracefully handle 'no-files / folder' cases.
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase.Output.df","title":"df class-attribute
instance-attribute
","text":"df: Optional[DataFrame] = Field(default=None, description='The Spark DataFrame')\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxReaderBase.execute","title":"execute abstractmethod
","text":"execute() -> Output\n
Source code in src/koheesio/integrations/box.py
@abstractmethod\ndef execute(self) -> Output:\n raise NotImplementedError\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxToBoxFileCopy","title":"koheesio.integrations.box.BoxToBoxFileCopy","text":"BoxToBoxFileCopy(**data)\n
Copy one or multiple files to the target Box path.
Examples:
from koheesio.steps.integrations.box import BoxToBoxFileCopy\n\nauth_params = {...}\nBoxToBoxFileCopy(**auth_params, file=[\"1\", \"2\"], path=\"/foo/bar\").execute()\n# or\nBoxToBoxFileCopy(**auth_params, file=[\"1\", \"2\"], folder=\"1\").execute()\n# or\nfolder = BoxFolderGet(**auth_params, path=\"/foo/bar\").execute().folder\nBoxToBoxFileCopy(**auth_params, file=[File(), File()], folder=folder).execute()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxToBoxFileCopy.action","title":"action","text":"action(file: File, folder: Folder)\n
Copy file to the desired destination and extend file description with the processing info
Parameters:
Name Type Description Default file
File
File object as specified in Box SDK
required folder
Folder
Folder object as specified in Box SDK
required Source code in src/koheesio/integrations/box.py
def action(self, file: File, folder: Folder):\n \"\"\"\n Copy file to the desired destination and extend file description with the processing info\n\n Parameters\n ----------\n file: File\n File object as specified in Box SDK\n folder: Folder\n Folder object as specified in Box SDK\n \"\"\"\n self.log.info(f\"Copying '{file.get()}' to '{folder.get()}'...\")\n file.copy(parent_folder=folder).update_info(\n data={\"description\": \"\\n\".join([f\"File processed on {datetime.utcnow()}\", file.get()[\"description\"]])}\n )\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxToBoxFileMove","title":"koheesio.integrations.box.BoxToBoxFileMove","text":"BoxToBoxFileMove(**data)\n
Move one or multiple files to the target Box path
Examples:
from koheesio.steps.integrations.box import BoxToBoxFileMove\n\nauth_params = {...}\nBoxToBoxFileMove(**auth_params, file=[\"1\", \"2\"], path=\"/foo/bar\").execute()\n# or\nBoxToBoxFileMove(**auth_params, file=[\"1\", \"2\"], folder=\"1\").execute()\n# or\nfolder = BoxFolderGet(**auth_params, path=\"/foo/bar\").execute().folder\nBoxToBoxFileMove(**auth_params, file=[File(), File()], folder=folder).execute()\n
Source code in src/koheesio/integrations/box.py
def __init__(self, **data):\n super().__init__(**data)\n self.init_client()\n
"},{"location":"api_reference/integrations/box.html#koheesio.integrations.box.BoxToBoxFileMove.action","title":"action","text":"action(file: File, folder: Folder)\n
Move file to the desired destination and extend file description with the processing info
Parameters:
Name Type Description Default file
File
File object as specified in Box SDK
required folder
Folder
Folder object as specified in Box SDK
required Source code in src/koheesio/integrations/box.py
def action(self, file: File, folder: Folder):\n \"\"\"\n Move file to the desired destination and extend file description with the processing info\n\n Parameters\n ----------\n file: File\n File object as specified in Box SDK\n folder: Folder\n Folder object as specified in Box SDK\n \"\"\"\n self.log.info(f\"Moving '{file.get()}' to '{folder.get()}'...\")\n file.move(parent_folder=folder).update_info(\n data={\"description\": \"\\n\".join([f\"File processed on {datetime.utcnow()}\", file.get()[\"description\"]])}\n )\n
"},{"location":"api_reference/integrations/spark/index.html","title":"Spark","text":""},{"location":"api_reference/integrations/spark/sftp.html","title":"Sftp","text":"This module contains the SFTPWriter class and the SFTPWriteMode enum.
The SFTPWriter class is used to write data to a file on an SFTP server. It uses the Paramiko library to establish an SFTP connection and write data to the server. The data to be written is provided by a BufferWriter, which generates the data in a buffer. See the docstring of the SFTPWriter class for more details. Refer to koheesio.spark.writers.buffer for more details on the BufferWriter interface.
The SFTPWriteMode enum defines the different write modes that the SFTPWriter can use. These modes determine how the SFTPWriter behaves when the file it is trying to write to already exists on the server. For more details on each mode, see the docstring of the SFTPWriteMode enum.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode","title":"koheesio.integrations.spark.sftp.SFTPWriteMode","text":"The different write modes for the SFTPWriter.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--overwrite","title":"OVERWRITE:","text":" - If the file exists, it will be overwritten.
- If it does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--append","title":"APPEND:","text":" - If the file exists, the new data will be appended to it.
- If it does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--ignore","title":"IGNORE:","text":" - If the file exists, the method will return without writing anything.
- If it does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--exclusive","title":"EXCLUSIVE:","text":" - If the file exists, an error will be raised.
- If it does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--backup","title":"BACKUP:","text":" - If the file exists and the new data is different from the existing data, a backup will be created and the file will be overwritten.
- If it does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode--update","title":"UPDATE:","text":" - If the file exists and the new data is different from the existing data, the file will be overwritten.
- If the file exists and the new data is the same as the existing data, the method will return without writing anything.
- If the file does not exist, a new file will be created.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.APPEND","title":"APPEND class-attribute
instance-attribute
","text":"APPEND = 'append'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.BACKUP","title":"BACKUP class-attribute
instance-attribute
","text":"BACKUP = 'backup'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.EXCLUSIVE","title":"EXCLUSIVE class-attribute
instance-attribute
","text":"EXCLUSIVE = 'exclusive'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.IGNORE","title":"IGNORE class-attribute
instance-attribute
","text":"IGNORE = 'ignore'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.OVERWRITE","title":"OVERWRITE class-attribute
instance-attribute
","text":"OVERWRITE = 'overwrite'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.UPDATE","title":"UPDATE class-attribute
instance-attribute
","text":"UPDATE = 'update'\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.write_mode","title":"write_mode property
","text":"write_mode\n
Return the write mode for the given SFTPWriteMode.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriteMode.from_string","title":"from_string classmethod
","text":"from_string(mode: str)\n
Return the SFTPWriteMode for the given string.
Source code in src/koheesio/integrations/spark/sftp.py
@classmethod\ndef from_string(cls, mode: str):\n \"\"\"Return the SFTPWriteMode for the given string.\"\"\"\n return cls[mode.upper()]\n
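For example (a small sketch; the string is upper-cased and matched against the member names listed above):
from koheesio.integrations.spark.sftp import SFTPWriteMode

mode = SFTPWriteMode.from_string("append")
assert mode is SFTPWriteMode.APPEND

# write_mode exposes the file-open mode that SFTPWriter passes to Paramiko
print(mode.write_mode)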
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter","title":"koheesio.integrations.spark.sftp.SFTPWriter","text":"Write a Dataframe to SFTP through a BufferWriter
Concept - This class uses Paramiko to connect to an SFTP server and write the contents of a buffer to a file on the server.
- This implementation takes inspiration from https://github.com/springml/spark-sftp
Parameters:
Name Type Description Default path
Union[str, Path]
Path to the folder to write to
required file_name
Optional[str]
Name of the file. If not provided, the file name is expected to be part of the path. Make sure to add the desired file extension.
None
host
str
SFTP Host
required port
int
SFTP Port
required username
SecretStr
SFTP Server Username
None
password
SecretStr
SFTP Server Password
None
buffer_writer
BufferWriter
This is the writer that will generate the body of the file that will be written to the specified file through SFTP. Details on how the DataFrame is written to the buffer should be implemented in the implementation of the BufferWriter class. Any BufferWriter can be used here, as long as it implements the BufferWriter interface.
required mode
Write mode: overwrite, append, ignore, exclusive, backup, or update. See the docstring of SFTPWriteMode for more details.
required"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.buffer_writer","title":"buffer_writer class-attribute
instance-attribute
","text":"buffer_writer: InstanceOf[BufferWriter] = Field(default=..., description='This is the writer that will generate the body of the file that will be written to the specified file through SFTP. Details on how the DataFrame is written to the buffer should be implemented in the implementation of the BufferWriter class. Any BufferWriter can be used here, as long as it implements the BufferWriter interface.')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.client","title":"client property
","text":"client: SFTPClient\n
Return the SFTP client. If it doesn't exist, create it.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.file_name","title":"file_name class-attribute
instance-attribute
","text":"file_name: Optional[str] = Field(default=None, description='Name of the file. If not provided, the file name is expected to be part of the path. Make sure to add the desired file extension!', alias='filename')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.host","title":"host class-attribute
instance-attribute
","text":"host: str = Field(default=..., description='SFTP Host')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.mode","title":"mode class-attribute
instance-attribute
","text":"mode: SFTPWriteMode = Field(default=OVERWRITE, description='Write mode: overwrite, append, ignore, exclusive, backup, or update.' + __doc__)\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.password","title":"password class-attribute
instance-attribute
","text":"password: Optional[SecretStr] = Field(default=None, description='SFTP Server Password')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.path","title":"path class-attribute
instance-attribute
","text":"path: Union[str, Path] = Field(default=..., description='Path to the folder to write to', alias='prefix')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.port","title":"port class-attribute
instance-attribute
","text":"port: int = Field(default=..., description='SFTP Port')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.transport","title":"transport property
","text":"transport\n
Return the transport for the SFTP connection. If it doesn't exist, create it.
If the username and password are provided, use them to connect to the SFTP server.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.username","title":"username class-attribute
instance-attribute
","text":"username: Optional[SecretStr] = Field(default=None, description='SFTP Server Username')\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.write_mode","title":"write_mode property
","text":"write_mode\n
Return the write mode for the given SFTPWriteMode.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.check_file_exists","title":"check_file_exists","text":"check_file_exists(file_path: str) -> bool\n
Check if a file exists on the SFTP server.
Source code in src/koheesio/integrations/spark/sftp.py
def check_file_exists(self, file_path: str) -> bool:\n \"\"\"\n Check if a file exists on the SFTP server.\n \"\"\"\n try:\n self.client.stat(file_path)\n return True\n except IOError:\n return False\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/integrations/spark/sftp.py
def execute(self):\n buffer_output: InstanceOf[BufferWriter.Output] = self.buffer_writer.write(self.df)\n\n # write buffer to the SFTP server\n try:\n self._handle_write_mode(self.path.as_posix(), buffer_output)\n finally:\n self._close_client()\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.validate_path_and_file_name","title":"validate_path_and_file_name","text":"validate_path_and_file_name(data: dict) -> dict\n
Validate the path, make sure path and file_name are Path objects.
Source code in src/koheesio/integrations/spark/sftp.py
@model_validator(mode=\"before\")\ndef validate_path_and_file_name(cls, data: dict) -> dict:\n \"\"\"Validate the path, make sure path and file_name are Path objects.\"\"\"\n path_or_str = data.get(\"path\")\n\n if isinstance(path_or_str, str):\n # make sure the path is a Path object\n path_or_str = Path(path_or_str)\n\n if not isinstance(path_or_str, Path):\n raise ValueError(f\"Invalid path: {path_or_str}\")\n\n if file_name := data.get(\"file_name\", data.get(\"filename\")):\n path_or_str = path_or_str / file_name\n try:\n del data[\"filename\"]\n except KeyError:\n pass\n data[\"file_name\"] = file_name\n\n data[\"path\"] = path_or_str\n return data\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.validate_sftp_host","title":"validate_sftp_host","text":"validate_sftp_host(v) -> str\n
Validate the host
Source code in src/koheesio/integrations/spark/sftp.py
@field_validator(\"host\")\ndef validate_sftp_host(cls, v) -> str:\n \"\"\"Validate the host\"\"\"\n # remove the sftp:// prefix if present\n if v.startswith(\"sftp://\"):\n v = v.replace(\"sftp://\", \"\")\n\n # remove the trailing slash if present\n if v.endswith(\"/\"):\n v = v[:-1]\n\n return v\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SFTPWriter.write_file","title":"write_file","text":"write_file(file_path: str, buffer_output: InstanceOf[Output])\n
Using Paramiko, write the data in the buffer to SFTP.
Source code in src/koheesio/integrations/spark/sftp.py
def write_file(self, file_path: str, buffer_output: InstanceOf[BufferWriter.Output]):\n \"\"\"\n Using Paramiko, write the data in the buffer to SFTP.\n \"\"\"\n with self.client.open(file_path, self.write_mode) as file:\n self.log.debug(f\"Writing file {file_path} to SFTP...\")\n file.write(buffer_output.read())\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendCsvToSftp","title":"koheesio.integrations.spark.sftp.SendCsvToSftp","text":"Write a DataFrame to an SFTP server as a CSV file.
This class uses the PandasCsvBufferWriter to generate the CSV data and the SFTPWriter to write the data to the SFTP server.
Example from koheesio.spark.writers import SendCsvToSftp\n\nwriter = SendCsvToSftp(\n    # SFTP Parameters\n    host=\"sftp.example.com\",\n    port=22,\n    username=\"user\",\n    password=\"password\",\n    path=\"/path/to/folder\",\n    file_name=\"file.tsv.gz\",\n    # CSV Parameters\n    header=True,\n    sep=\"\\t\",\n    quote='\"',\n    timestampFormat=\"%Y-%m-%d\",\n    lineSep=os.linesep,\n    compression=\"gzip\",\n    index=False,\n)\n\nwriter.write(df)\n
In this example, the DataFrame df
is written to the file file.tsv.gz
in the folder /path/to/folder
on the SFTP server. The file is written as a CSV file with a tab delimiter (TSV), double quotes as the quote character, and gzip compression.
Parameters:
Name Type Description Default path
Union[str, Path]
Path to the folder to write to.
required file_name
Optional[str]
Name of the file. If not provided, it's expected to be part of the path.
required host
str
SFTP Host.
required port
int
SFTP Port.
required username
SecretStr
SFTP Server Username.
required password
SecretStr
SFTP Server Password.
required mode
Write mode: overwrite, append, ignore, exclusive, backup, or update.
required header
Whether to write column names as the first line. Default is True.
required sep
Field delimiter for the output file. Default is ','.
required quote
Character used to quote fields. Default is '\"'.
required quoteAll
Whether all values should be enclosed in quotes. Default is False.
required escape
Character used to escape sep and quote when needed. Default is '\\'.
required timestampFormat
Date format for datetime objects. Default is '%Y-%m-%dT%H:%M:%S.%f'.
required lineSep
Character used as line separator. Default is os.linesep.
required compression
Compression to use for the output data. Default is None.
required For more details on the CSV parameters, refer to the PandasCsvBufferWriter class documentation.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendCsvToSftp.buffer_writer","title":"buffer_writer class-attribute
instance-attribute
","text":"buffer_writer: PandasCsvBufferWriter = Field(default=None, validate_default=False)\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendCsvToSftp.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/integrations/spark/sftp.py
def execute(self):\n SFTPWriter.execute(self)\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendCsvToSftp.set_up_buffer_writer","title":"set_up_buffer_writer","text":"set_up_buffer_writer() -> SendCsvToSftp\n
Set up the buffer writer, passing all CSV related options to it.
Source code in src/koheesio/integrations/spark/sftp.py
@model_validator(mode=\"after\")\ndef set_up_buffer_writer(self) -> \"SendCsvToSftp\":\n \"\"\"Set up the buffer writer, passing all CSV related options to it.\"\"\"\n self.buffer_writer = PandasCsvBufferWriter(**self.get_options(options_type=\"kohesio_pandas_buffer_writer\"))\n return self\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendJsonToSftp","title":"koheesio.integrations.spark.sftp.SendJsonToSftp","text":"Write a DataFrame to an SFTP server as a JSON file.
This class uses the PandasJsonBufferWriter to generate the JSON data and the SFTPWriter to write the data to the SFTP server.
Example from koheesio.spark.writers import SendJsonToSftp\n\nwriter = SendJsonToSftp(\n # SFTP Parameters (Inherited from SFTPWriter)\n host=\"sftp.example.com\",\n port=22,\n username=\"user\",\n password=\"password\",\n path=\"/path/to/folder\",\n file_name=\"file.json.gz\",\n # JSON Parameters (Inherited from PandasJsonBufferWriter)\n orient=\"records\",\n date_format=\"iso\",\n double_precision=2,\n date_unit=\"ms\",\n lines=False,\n compression=\"gzip\",\n index=False,\n)\n\nwriter.write(df)\n
In this example, the DataFrame df
is written to the file file.json.gz
in the folder /path/to/folder
on the SFTP server. The file is written as a JSON file with gzip compression.
Parameters:
Name Type Description Default path
Union[str, Path]
Path to the folder on the SFTP server.
required file_name
Optional[str]
Name of the file, including extension. If not provided, expected to be part of the path.
required host
str
SFTP Host.
required port
int
SFTP Port.
required username
SecretStr
SFTP Server Username.
required password
SecretStr
SFTP Server Password.
required mode
Write mode: overwrite, append, ignore, exclusive, backup, or update.
required orient
Format of the JSON string. Default is 'records'.
required lines
If True, output is one JSON object per line. Only used when orient='records'. Default is True.
required date_format
Type of date conversion. Default is 'iso'.
required double_precision
Decimal places for encoding floating point values. Default is 10.
required force_ascii
If True, encoded string is ASCII. Default is True.
required compression
Compression to use for output data. Default is None.
required See Also For more details on the JSON parameters, refer to the PandasJsonBufferWriter class documentation.
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendJsonToSftp.buffer_writer","title":"buffer_writer class-attribute
instance-attribute
","text":"buffer_writer: PandasJsonBufferWriter = Field(default=None, validate_default=False)\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendJsonToSftp.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/integrations/spark/sftp.py
def execute(self):\n SFTPWriter.execute(self)\n
"},{"location":"api_reference/integrations/spark/sftp.html#koheesio.integrations.spark.sftp.SendJsonToSftp.set_up_buffer_writer","title":"set_up_buffer_writer","text":"set_up_buffer_writer() -> SendJsonToSftp\n
Set up the buffer writer, passing all JSON related options to it.
Source code in src/koheesio/integrations/spark/sftp.py
@model_validator(mode=\"after\")\ndef set_up_buffer_writer(self) -> \"SendJsonToSftp\":\n \"\"\"Set up the buffer writer, passing all JSON related options to it.\"\"\"\n self.buffer_writer = PandasJsonBufferWriter(\n **self.get_options(), compression=self.compression, columns=self.columns\n )\n return self\n
"},{"location":"api_reference/integrations/spark/dq/index.html","title":"Dq","text":""},{"location":"api_reference/integrations/spark/dq/spark_expectations.html","title":"Spark expectations","text":"Koheesio step for running data quality rules with Spark Expectations engine.
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation","title":"koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation","text":"Run DQ rules for an input dataframe with Spark Expectations engine.
References Spark Expectations: https://engineering.nike.com/spark-expectations/1.0.0/
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.drop_meta_column","title":"drop_meta_column class-attribute
instance-attribute
","text":"drop_meta_column: bool = Field(default=False, alias='drop_meta_columns', description='Whether to drop meta columns added by spark expectations on the output df')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.enable_debugger","title":"enable_debugger class-attribute
instance-attribute
","text":"enable_debugger: bool = Field(default=False, alias='debugger', description='...')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.error_writer_format","title":"error_writer_format class-attribute
instance-attribute
","text":"error_writer_format: Optional[str] = Field(default='delta', alias='dataframe_writer_format', description='The format used to write to the err table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.error_writer_mode","title":"error_writer_mode class-attribute
instance-attribute
","text":"error_writer_mode: Optional[Union[str, BatchOutputMode]] = Field(default=APPEND, alias='dataframe_writer_mode', description='The write mode that will be used to write to the err table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.error_writing_options","title":"error_writing_options class-attribute
instance-attribute
","text":"error_writing_options: Optional[Dict[str, str]] = Field(default_factory=dict, alias='error_writing_options', description='Options for writing to the error table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.format","title":"format class-attribute
instance-attribute
","text":"format: str = Field(default='delta', alias='dataframe_writer_format', description='The format used to write to the stats and err table. Separate output formats can be specified for each table using the error_writer_format and stats_writer_format params')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.mode","title":"mode class-attribute
instance-attribute
","text":"mode: Union[str, BatchOutputMode] = Field(default=APPEND, alias='dataframe_writer_mode', description='The write mode that will be used to write to the err and stats table. Separate output modes can be specified for each table using the error_writer_mode and stats_writer_mode params')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.product_id","title":"product_id class-attribute
instance-attribute
","text":"product_id: str = Field(default=..., description='Spark Expectations product identifier')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.rules_table","title":"rules_table class-attribute
instance-attribute
","text":"rules_table: str = Field(default=..., alias='product_rules_table', description='DQ rules table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.se_user_conf","title":"se_user_conf class-attribute
instance-attribute
","text":"se_user_conf: Dict[str, Any] = Field(default={se_notifications_enable_email: False, se_notifications_enable_slack: False}, alias='user_conf', description='SE user provided confs', validate_default=False)\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.statistics_streaming","title":"statistics_streaming class-attribute
instance-attribute
","text":"statistics_streaming: Dict[str, Any] = Field(default={se_enable_streaming: False}, alias='stats_streaming_options', description='SE stats streaming options ', validate_default=False)\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.statistics_table","title":"statistics_table class-attribute
instance-attribute
","text":"statistics_table: str = Field(default=..., alias='dq_stats_table_name', description='DQ stats table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.stats_writer_format","title":"stats_writer_format class-attribute
instance-attribute
","text":"stats_writer_format: Optional[str] = Field(default='delta', alias='stats_writer_format', description='The format used to write to the stats table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.stats_writer_mode","title":"stats_writer_mode class-attribute
instance-attribute
","text":"stats_writer_mode: Optional[Union[str, BatchOutputMode]] = Field(default=APPEND, alias='stats_writer_mode', description='The write mode that will be used to write to the stats table')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.target_table","title":"target_table class-attribute
instance-attribute
","text":"target_table: str = Field(default=..., alias='target_table_name', description=\"The table that will contain good records. Won't write to it, but will write to the err table with same name plus _err suffix\")\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.Output","title":"Output","text":"Output of the SparkExpectationsTransformation step.
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.Output.error_table_writer","title":"error_table_writer class-attribute
instance-attribute
","text":"error_table_writer: WrappedDataFrameWriter = Field(default=..., description='Spark Expectations error table writer')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.Output.rules_df","title":"rules_df class-attribute
instance-attribute
","text":"rules_df: DataFrame = Field(default=..., description='Output dataframe')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.Output.se","title":"se class-attribute
instance-attribute
","text":"se: SparkExpectations = Field(default=..., description='Spark Expectations object')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.Output.stats_table_writer","title":"stats_table_writer class-attribute
instance-attribute
","text":"stats_table_writer: WrappedDataFrameWriter = Field(default=..., description='Spark Expectations stats table writer')\n
"},{"location":"api_reference/integrations/spark/dq/spark_expectations.html#koheesio.integrations.spark.dq.spark_expectations.SparkExpectationsTransformation.execute","title":"execute","text":"execute() -> Output\n
Apply data quality rules to a dataframe using the out-of-the-box SE decorator
Source code in src/koheesio/integrations/spark/dq/spark_expectations.py
def execute(self) -> Output:\n \"\"\"\n Apply data quality rules to a dataframe using the out-of-the-box SE decorator\n \"\"\"\n # read rules table\n rules_df = self.spark.read.table(self.rules_table).cache()\n self.output.rules_df = rules_df\n\n @self._se.with_expectations(\n target_table=self.target_table,\n user_conf=self.se_user_conf,\n # Below params are `False` by default, however exposing them here for extra visibility\n # The writes can be handled by downstream Koheesio steps\n write_to_table=False,\n write_to_temp_table=False,\n )\n def inner(df: DataFrame) -> DataFrame:\n \"\"\"Just a wrapper to be able to use Spark Expectations decorator\"\"\"\n return df\n\n output_df = inner(self.df)\n\n if self.drop_meta_column:\n output_df = output_df.drop(\"meta_dq_run_id\", \"meta_dq_run_datetime\")\n\n self.output.df = output_df\n
"},{"location":"api_reference/models/index.html","title":"Models","text":"Models package creates models that can be used to base other classes on.
- Every model should be at least a pydantic BaseModel, but can also be a Step, or a StepOutput.
- Every model is expected to be an ABC (Abstract Base Class)
- Optionally, a model can inherit ExtraParamsMixin, which provides unpacking of extra kwargs into the
extra_params
dict property, removing the need to create a dict before passing kwargs to a model initializer (see the sketch after this list).
A Model class can be exceptionally handy when you need similar Pydantic models in multiple places, for example across Transformation and Reader classes.
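A small sketch of that ExtraParamsMixin behaviour, assuming ExtraParamsMixin is importable from koheesio.models as the package description suggests (the model and its fields are made up for illustration):
from koheesio.models import BaseModel, ExtraParamsMixin


class ReaderOptions(BaseModel, ExtraParamsMixin):
    table: str


opts = ReaderOptions(table="my_table", fetchsize=1000, timeout=30)
print(opts.extra_params)  # the extra kwargs ('fetchsize', 'timeout') are collected here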
"},{"location":"api_reference/models/index.html#koheesio.models.ListOfColumns","title":"koheesio.models.ListOfColumns module-attribute
","text":"ListOfColumns = Annotated[List[str], BeforeValidator(_list_of_columns_validation)]\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel","title":"koheesio.models.BaseModel","text":"Base model for all models.
Extends pydantic BaseModel with some additional configuration. To be used as a base class for all models in Koheesio instead of pydantic.BaseModel.
Additional methods and properties: Different Modes This BaseModel class supports lazy mode. This means that validation of the items stored in the class can be called at will instead of being forced to run it upfront.
-
Normal mode: you need to know the values ahead of time
normal_mode = YourOwnModel(a=\"foo\", b=42)\n
-
Lazy mode: being able to defer the validation until later
lazy_mode = YourOwnModel.lazy()\nlazy_mode.a = \"foo\"\nlazy_mode.b = 42\nlazy_mode.validate_output()\n
The prime advantage of using lazy mode is that you don't have to know all your outputs up front, and can add them as they become available. All while still being able to validate that you have collected all your output at the end. -
With statements: With statements are also allowed. The validate_output
method from the earlier example will run upon exit of the with-statement.
with YourOwnModel.lazy() as with_output:\n with_output.a = \"foo\"\n with_output.b = 42\n
Note: that a lazy mode BaseModel object is required to work with a with-statement.
Examples:
from koheesio.models import BaseModel\n\n\nclass Person(BaseModel):\n name: str\n age: int\n\n\n# Using the lazy method to create an instance without immediate validation\nperson = Person.lazy()\n\n# Setting attributes\nperson.name = \"John Doe\"\nperson.age = 30\n\n# Now we validate the instance\nperson.validate_output()\n\nprint(person)\n
In this example, the Person instance is created without immediate validation. The attributes name and age are set afterward. The validate_output
method is then called to validate the instance.
Koheesio specific configuration: Koheesio models are configured differently from Pydantic defaults. The following configuration is used:
-
extra=\"allow\"
This setting allows for extra fields that are not specified in the model definition. If a field is present in the data but not in the model, it will not raise an error. Pydantic default is \"ignore\", which means that extra attributes are ignored.
-
arbitrary_types_allowed=True
This setting allows for fields in the model to be of any type. This is useful when you want to include fields in your model that are not standard Python types. Pydantic default is False, which means that fields must be of a standard Python type.
-
populate_by_name=True
This setting allows an aliased field to be populated by its name as given by the model attribute, as well as the alias. This was known as allow_population_by_field_name in pydantic v1. Pydantic default is False, which means that fields can only be populated by their alias.
-
validate_assignment=False
This setting determines whether the model should be revalidated when the data is changed. If set to True
, every time a field is assigned a new value, the entire model is validated again.
Pydantic default is (also) False
, which means that the model is not revalidated when the data is changed. The default behavior of Pydantic is to validate the data when the model is created. In case the user changes the data after the model is created, the model is not revalidated.
-
revalidate_instances=\"subclass-instances\"
This setting determines whether to revalidate models during validation if the instance is a subclass of the model. This is important as inheritance is used a lot in Koheesio. Pydantic default is never
, which means that the model and dataclass instances are not revalidated during validation.
-
validate_default=True
This setting determines whether to validate default values during validation. When set to True, default values are checked during the validation process. We opt to set this to True, as we are attempting to make sure that the data is valid prior to running / executing any Step. Pydantic default is False, which means that default values are not validated during validation.
-
frozen=False
This setting determines whether the model is immutable. If set to True, once a model is created, its fields cannot be changed. Pydantic default is also False, which means that the model is mutable.
-
coerce_numbers_to_str=True
This setting determines whether to convert number fields to strings. When set to True, enables automatic coercion of any Number
type to str
. Pydantic doesn't allow number types (int
, float
, Decimal
) to be coerced as type str
by default.
-
use_enum_values=True
This setting determines whether to use the values of Enum fields. If set to True, the actual value of the Enum is used instead of the reference. Pydantic default is False, which means that the reference to the Enum is used.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--fields","title":"Fields","text":"Every Koheesio BaseModel has two fields: name
and description
. These fields are used to provide a name and a description to the model.
-
name
: This is the name of the Model. If not provided, it defaults to the class name.
-
description
: This is the description of the Model. It has several default behaviors:
- If not provided, it defaults to the docstring of the class.
- If the docstring is not provided, it defaults to the name of the class.
- For multi-line descriptions, it has the following behaviors:
- Only the first non-empty line is used.
- Empty lines are removed.
- Only the first 3 lines are considered.
- Only the first 120 characters are considered.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--validators","title":"Validators","text":" _set_name_and_description
: Set the name and description of the Model as per the rules mentioned above.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--properties","title":"Properties","text":" log
: Returns a logger with the name of the class.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--class-methods","title":"Class Methods","text":" from_basemodel
: Returns a new BaseModel instance based on the data of another BaseModel. from_context
: Creates BaseModel instance from a given Context. from_dict
: Creates BaseModel instance from a given dictionary. from_json
: Creates BaseModel instance from a given JSON string. from_toml
: Creates BaseModel object from a given toml file. from_yaml
: Creates BaseModel object from a given yaml file. lazy
: Constructs the model without doing validation.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--dunder-methods","title":"Dunder Methods","text":" __add__
: Allows to add two BaseModel instances together. __enter__
: Allows for using the model in a with-statement. __exit__
: Allows for using the model in a with-statement. __setitem__
: Set Item dunder method for BaseModel. __getitem__
: Get Item dunder method for BaseModel.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel--instance-methods","title":"Instance Methods","text":" hasattr
: Check if given key is present in the model. get
: Get an attribute of the model, but don't fail if not present. merge
: Merge key,value map with self. set
: Allows for subscribing / assigning to class[key]
. to_context
: Converts the BaseModel instance to a Context object. to_dict
: Converts the BaseModel instance to a dictionary. to_json
: Converts the BaseModel instance to a JSON string. to_yaml
: Converts the BaseModel instance to a YAML string.
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.description","title":"description class-attribute
instance-attribute
","text":"description: Optional[str] = Field(default=None, description='Description of the Model')\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.log","title":"log property
","text":"log: Logger\n
Returns a logger with the name of the class
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(extra='allow', arbitrary_types_allowed=True, populate_by_name=True, validate_assignment=False, revalidate_instances='subclass-instances', validate_default=True, frozen=False, coerce_numbers_to_str=True, use_enum_values=True)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.name","title":"name class-attribute
instance-attribute
","text":"name: Optional[str] = Field(default=None, description='Name of the Model')\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_basemodel","title":"from_basemodel classmethod
","text":"from_basemodel(basemodel: BaseModel, **kwargs)\n
Returns a new BaseModel instance based on the data of another BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_basemodel(cls, basemodel: BaseModel, **kwargs):\n \"\"\"Returns a new BaseModel instance based on the data of another BaseModel\"\"\"\n kwargs = {**basemodel.model_dump(), **kwargs}\n return cls(**kwargs)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_context","title":"from_context classmethod
","text":"from_context(context: Context) -> BaseModel\n
Creates BaseModel instance from a given Context
You have to make sure that the Context object has the necessary attributes to create the model.
Examples:
class SomeStep(BaseModel):\n foo: str\n\n\ncontext = Context(foo=\"bar\")\nsome_step = SomeStep.from_context(context)\nprint(some_step.foo) # prints 'bar'\n
Parameters:
Name Type Description Default context
Context
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_context(cls, context: Context) -> BaseModel:\n \"\"\"Creates BaseModel instance from a given Context\n\n You have to make sure that the Context object has the necessary attributes to create the model.\n\n Examples\n --------\n ```python\n class SomeStep(BaseModel):\n foo: str\n\n\n context = Context(foo=\"bar\")\n some_step = SomeStep.from_context(context)\n print(some_step.foo) # prints 'bar'\n ```\n\n Parameters\n ----------\n context: Context\n\n Returns\n -------\n BaseModel\n \"\"\"\n return cls(**context)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_dict","title":"from_dict classmethod
","text":"from_dict(data: Dict[str, Any]) -> BaseModel\n
Creates BaseModel instance from a given dictionary
Parameters:
Name Type Description Default data
Dict[str, Any]
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_dict(cls, data: Dict[str, Any]) -> BaseModel:\n \"\"\"Creates BaseModel instance from a given dictionary\n\n Parameters\n ----------\n data: Dict[str, Any]\n\n Returns\n -------\n BaseModel\n \"\"\"\n return cls(**data)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_json","title":"from_json classmethod
","text":"from_json(json_file_or_str: Union[str, Path]) -> BaseModel\n
Creates BaseModel instance from a given JSON string
BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored in the BaseModel object, which is not possible with the standard json library.
See Also Context.from_json : Deserializes a JSON string to a Context object
Parameters:
Name Type Description Default json_file_or_str
Union[str, Path]
Pathlike string or Path that points to the json file or string containing json
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_json(cls, json_file_or_str: Union[str, Path]) -> BaseModel:\n \"\"\"Creates BaseModel instance from a given JSON string\n\n BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses\n jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored\n in the BaseModel object, which is not possible with the standard json library.\n\n See Also\n --------\n Context.from_json : Deserializes a JSON string to a Context object\n\n Parameters\n ----------\n json_file_or_str : Union[str, Path]\n Pathlike string or Path that points to the json file or string containing json\n\n Returns\n -------\n BaseModel\n \"\"\"\n _context = Context.from_json(json_file_or_str)\n return cls.from_context(_context)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_toml","title":"from_toml classmethod
","text":"from_toml(toml_file_or_str: Union[str, Path]) -> BaseModel\n
Creates BaseModel object from a given toml file
Note: BaseModel offloads the serialization and deserialization of the TOML string to Context class.
Parameters:
Name Type Description Default toml_file_or_str
Union[str, Path]
Pathlike string or Path that points to the toml file, or string containing toml
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_toml(cls, toml_file_or_str: Union[str, Path]) -> BaseModel:\n \"\"\"Creates BaseModel object from a given toml file\n\n Note: BaseModel offloads the serialization and deserialization of the TOML string to Context class.\n\n Parameters\n ----------\n toml_file_or_str: str or Path\n Pathlike string or Path that points to the toml file, or string containing toml\n\n Returns\n -------\n BaseModel\n \"\"\"\n _context = Context.from_toml(toml_file_or_str)\n return cls.from_context(_context)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.from_yaml","title":"from_yaml classmethod
","text":"from_yaml(yaml_file_or_str: str) -> BaseModel\n
Creates BaseModel object from a given yaml file
Note: BaseModel offloads the serialization and deserialization of the YAML string to Context class.
Parameters:
Name Type Description Default yaml_file_or_str
str
Pathlike string or Path that points to the yaml file, or string containing yaml
required Returns:
Type Description BaseModel
Source code in src/koheesio/models/__init__.py
@classmethod\ndef from_yaml(cls, yaml_file_or_str: str) -> BaseModel:\n \"\"\"Creates BaseModel object from a given yaml file\n\n Note: BaseModel offloads the serialization and deserialization of the YAML string to Context class.\n\n Parameters\n ----------\n yaml_file_or_str: str or Path\n Pathlike string or Path that points to the yaml file, or string containing yaml\n\n Returns\n -------\n BaseModel\n \"\"\"\n _context = Context.from_yaml(yaml_file_or_str)\n return cls.from_context(_context)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.get","title":"get","text":"get(key: str, default: Optional[Any] = None)\n
Get an attribute of the model, but don't fail if not present
Similar to dict.get()
Examples:
step_output = StepOutput(foo=\"bar\")\nstep_output.get(\"foo\") # returns 'bar'\nstep_output.get(\"non_existent_key\", \"oops\") # returns 'oops'\n
Parameters:
Name Type Description Default key
str
name of the key to get
required default
Optional[Any]
Default value in case the attribute does not exist
None
Returns:
Type Description Any
The value of the attribute
Source code in src/koheesio/models/__init__.py
def get(self, key: str, default: Optional[Any] = None):\n \"\"\"Get an attribute of the model, but don't fail if not present\n\n Similar to dict.get()\n\n Examples\n --------\n ```python\n step_output = StepOutput(foo=\"bar\")\n step_output.get(\"foo\") # returns 'bar'\n step_output.get(\"non_existent_key\", \"oops\") # returns 'oops'\n ```\n\n Parameters\n ----------\n key: str\n name of the key to get\n default: Optional[Any]\n Default value in case the attribute does not exist\n\n Returns\n -------\n Any\n The value of the attribute\n \"\"\"\n if self.hasattr(key):\n return self.__getitem__(key)\n return default\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.hasattr","title":"hasattr","text":"hasattr(key: str) -> bool\n
Check if given key is present in the model
Parameters:
Name Type Description Default key
str
required Returns:
Type Description bool
Source code in src/koheesio/models/__init__.py
def hasattr(self, key: str) -> bool:\n \"\"\"Check if given key is present in the model\n\n Parameters\n ----------\n key: str\n\n Returns\n -------\n bool\n \"\"\"\n return hasattr(self, key)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.lazy","title":"lazy classmethod
","text":"lazy()\n
Constructs the model without doing validation
Essentially an alias to BaseModel.construct()
Source code in src/koheesio/models/__init__.py
@classmethod\ndef lazy(cls):\n \"\"\"Constructs the model without doing validation\n\n Essentially an alias to BaseModel.construct()\n \"\"\"\n return cls.model_construct()\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.merge","title":"merge","text":"merge(other: Union[Dict, BaseModel])\n
Merge key,value map with self
Functionally similar to adding two dicts together; like running {**dict_a, **dict_b}
.
Examples:
step_output = StepOutput(foo=\"bar\")\nstep_output.merge({\"lorem\": \"ipsum\"}) # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n
Parameters:
Name Type Description Default other
Union[Dict, BaseModel]
Dict or another instance of a BaseModel class that will be added to self
required Source code in src/koheesio/models/__init__.py
def merge(self, other: Union[Dict, BaseModel]):\n \"\"\"Merge key,value map with self\n\n Functionally similar to adding two dicts together; like running `{**dict_a, **dict_b}`.\n\n Examples\n --------\n ```python\n step_output = StepOutput(foo=\"bar\")\n step_output.merge({\"lorem\": \"ipsum\"}) # step_output will now contain {'foo': 'bar', 'lorem': 'ipsum'}\n ```\n\n Parameters\n ----------\n other: Union[Dict, BaseModel]\n Dict or another instance of a BaseModel class that will be added to self\n \"\"\"\n if isinstance(other, BaseModel):\n other = other.model_dump() # ensures we really have a dict\n\n for k, v in other.items():\n self.set(k, v)\n\n return self\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.set","title":"set","text":"set(key: str, value: Any)\n
Allows for subscribing / assigning to class[key]
.
Examples:
step_output = StepOutput(foo=\"bar\")\nstep_output.set(foo\", \"baz\") # overwrites 'foo' to be 'baz'\n
Parameters:
Name Type Description Default key
str
The key of the attribute to assign to
required value
Any
Value that should be assigned to the given key
required Source code in src/koheesio/models/__init__.py
def set(self, key: str, value: Any):\n \"\"\"Allows for subscribing / assigning to `class[key]`.\n\n Examples\n --------\n ```python\n step_output = StepOutput(foo=\"bar\")\n step_output.set(\"foo\", \"baz\") # overwrites 'foo' to be 'baz'\n ```\n\n Parameters\n ----------\n key: str\n The key of the attribute to assign to\n value: Any\n Value that should be assigned to the given key\n \"\"\"\n self.__setitem__(key, value)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.to_context","title":"to_context","text":"to_context() -> Context\n
Converts the BaseModel instance to a Context object
Returns:
Type Description Context
Source code in src/koheesio/models/__init__.py
def to_context(self) -> Context:\n \"\"\"Converts the BaseModel instance to a Context object\n\n Returns\n -------\n Context\n \"\"\"\n return Context(**self.to_dict())\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.to_dict","title":"to_dict","text":"to_dict() -> Dict[str, Any]\n
Converts the BaseModel instance to a dictionary
Returns:
Type Description Dict[str, Any]
Source code in src/koheesio/models/__init__.py
def to_dict(self) -> Dict[str, Any]:\n \"\"\"Converts the BaseModel instance to a dictionary\n\n Returns\n -------\n Dict[str, Any]\n \"\"\"\n return self.model_dump()\n
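A minimal usage sketch (the value shown in the comment assumes foo is the only field set on the model):
step_output = StepOutput(foo=\"bar\")\nstep_output.to_dict() # returns a plain dict, e.g. {'foo': 'bar'}\n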
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.to_json","title":"to_json","text":"to_json(pretty: bool = False)\n
Converts the BaseModel instance to a JSON string
BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored in the BaseModel object, which is not possible with the standard json library.
See Also Context.to_json : Serializes a Context object to a JSON string
Parameters:
Name Type Description Default pretty
bool
Toggles whether to return a pretty json string or not
False
Returns:
Type Description str
containing all parameters of the BaseModel instance
Source code in src/koheesio/models/__init__.py
def to_json(self, pretty: bool = False):\n \"\"\"Converts the BaseModel instance to a JSON string\n\n BaseModel offloads the serialization and deserialization of the JSON string to Context class. Context uses\n jsonpickle library to serialize and deserialize the JSON string. This is done to allow for objects to be stored\n in the BaseModel object, which is not possible with the standard json library.\n\n See Also\n --------\n Context.to_json : Serializes a Context object to a JSON string\n\n Parameters\n ----------\n pretty : bool, optional, default=False\n Toggles whether to return a pretty json string or not\n\n Returns\n -------\n str\n containing all parameters of the BaseModel instance\n \"\"\"\n _context = self.to_context()\n return _context.to_json(pretty=pretty)\n
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.to_yaml","title":"to_yaml","text":"to_yaml(clean: bool = False) -> str\n
Converts the BaseModel instance to a YAML string
BaseModel offloads the serialization and deserialization of the YAML string to Context class.
Parameters:
Name Type Description Default clean
bool
Toggles whether to remove !!python/object:...
from yaml or not. Default: False
False
Returns:
Type Description str
containing all parameters of the BaseModel instance
Source code in src/koheesio/models/__init__.py
def to_yaml(self, clean: bool = False) -> str:\n \"\"\"Converts the BaseModel instance to a YAML string\n\n BaseModel offloads the serialization and deserialization of the YAML string to Context class.\n\n Parameters\n ----------\n clean: bool\n Toggles whether to remove `!!python/object:...` from yaml or not.\n Default: False\n\n Returns\n -------\n str\n containing all parameters of the BaseModel instance\n \"\"\"\n _context = self.to_context()\n return _context.to_yaml(clean=clean)\n
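A minimal usage sketch; clean=True strips the !!python/object:... tags from the generated YAML:
step_output = StepOutput(foo=\"bar\")\nprint(step_output.to_yaml(clean=True)) # YAML string without !!python/object tags\n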
"},{"location":"api_reference/models/index.html#koheesio.models.BaseModel.validate","title":"validate","text":"validate() -> BaseModel\n
Validate the BaseModel instance
This method is used to validate the BaseModel instance. It is used in conjunction with the lazy method to validate the instance after all the attributes have been set.
This method is intended to be used with the lazy
method. The lazy
method is used to create an instance of the BaseModel without immediate validation. The validate
method is then used to validate the instance after.
Note: in the Pydantic BaseModel, the validate
method throws a deprecated warning. This is because Pydantic recommends using the validate_model
method instead. However, we are using the validate
method here in a different context and a slightly different way.
Examples:
class FooModel(BaseModel):\n foo: str\n lorem: str\n\n\nfoo_model = FooModel.lazy()\nfoo_model.foo = \"bar\"\nfoo_model.lorem = \"ipsum\"\nfoo_model.validate()\n
In this example, the foo_model
instance is created without immediate validation. The attributes foo and lorem are set afterward. The validate
method is then called to validate the instance. Returns:
Type Description BaseModel
The BaseModel instance
Source code in src/koheesio/models/__init__.py
def validate(self) -> BaseModel:\n \"\"\"Validate the BaseModel instance\n\n This method is used to validate the BaseModel instance. It is used in conjunction with the lazy method to\n validate the instance after all the attributes have been set.\n\n This method is intended to be used with the `lazy` method. The `lazy` method is used to create an instance of\n the BaseModel without immediate validation. The `validate` method is then used to validate the instance after.\n\n > Note: in the Pydantic BaseModel, the `validate` method throws a deprecated warning. This is because Pydantic\n recommends using the `validate_model` method instead. However, we are using the `validate` method here in a\n different context and a slightly different way.\n\n Examples\n --------\n ```python\n class FooModel(BaseModel):\n foo: str\n lorem: str\n\n\n foo_model = FooModel.lazy()\n foo_model.foo = \"bar\"\n foo_model.lorem = \"ipsum\"\n foo_model.validate()\n ```\n In this example, the `foo_model` instance is created without immediate validation. The attributes foo and lorem\n are set afterward. The `validate` method is then called to validate the instance.\n\n Returns\n -------\n BaseModel\n The BaseModel instance\n \"\"\"\n return self.model_validate(self.model_dump())\n
"},{"location":"api_reference/models/index.html#koheesio.models.ExtraParamsMixin","title":"koheesio.models.ExtraParamsMixin","text":"Mixin class that adds support for arbitrary keyword arguments to Pydantic models.
The keyword arguments are extracted from the model's values
and moved to a params
dictionary.
"},{"location":"api_reference/models/index.html#koheesio.models.ExtraParamsMixin.extra_params","title":"extra_params cached
property
","text":"extra_params: Dict[str, Any]\n
Extract params (passed as arbitrary kwargs) from values and move them to params dict
"},{"location":"api_reference/models/index.html#koheesio.models.ExtraParamsMixin.params","title":"params class-attribute
instance-attribute
","text":"params: Dict[str, Any] = Field(default_factory=dict)\n
"},{"location":"api_reference/models/sql.html","title":"Sql","text":"This module contains the base class for SQL steps.
"},{"location":"api_reference/models/sql.html#koheesio.models.sql.SqlBaseStep","title":"koheesio.models.sql.SqlBaseStep","text":"Base class for SQL steps
params
are used as placeholders for templating. These are identified with ${placeholder} in the SQL script.
Parameters:
Name Type Description Default sql_path
Path to a SQL file
required sql
SQL script to apply
required params
Placeholders (parameters) for templating. These are identified with ${placeholder}
in the SQL script.
Note: any arbitrary kwargs passed to the class will be added to params.
required"},{"location":"api_reference/models/sql.html#koheesio.models.sql.SqlBaseStep.params","title":"params class-attribute
instance-attribute
","text":"params: Dict[str, Any] = Field(default_factory=dict, description='Placeholders (parameters) for templating. These are identified with ${placeholder} in the SQL script. Note: any arbitrary kwargs passed to the class will be added to params.')\n
"},{"location":"api_reference/models/sql.html#koheesio.models.sql.SqlBaseStep.query","title":"query property
","text":"query\n
Returns the query while performing params replacement
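A sketch of the intended templating behaviour, assuming a concrete subclass; MySqlStep is hypothetical and the resolved query in the comment is the expected result, not verified output:
class MySqlStep(SqlBaseStep):\n    ... # a real subclass would also implement execute()\n\n\nstep = MySqlStep(\n    sql=\"SELECT * FROM ${table_name}\",\n    table_name=\"my_db.my_table\", # arbitrary kwargs are added to params\n)\nstep.query # expected: \"SELECT * FROM my_db.my_table\"\n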
"},{"location":"api_reference/models/sql.html#koheesio.models.sql.SqlBaseStep.sql","title":"sql class-attribute
instance-attribute
","text":"sql: Optional[str] = Field(default=None, description='SQL script to apply')\n
"},{"location":"api_reference/models/sql.html#koheesio.models.sql.SqlBaseStep.sql_path","title":"sql_path class-attribute
instance-attribute
","text":"sql_path: Optional[Union[Path, str]] = Field(default=None, description='Path to a SQL file')\n
"},{"location":"api_reference/notifications/index.html","title":"Notifications","text":"Notification module for sending messages to notification services (e.g. Slack, Email, etc.)
"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity","title":"koheesio.notifications.NotificationSeverity","text":"Enumeration of allowed message severities
"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity.ERROR","title":"ERROR class-attribute
instance-attribute
","text":"ERROR = 'error'\n
"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity.INFO","title":"INFO class-attribute
instance-attribute
","text":"INFO = 'info'\n
"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity.SUCCESS","title":"SUCCESS class-attribute
instance-attribute
","text":"SUCCESS = 'success'\n
"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity.WARN","title":"WARN class-attribute
instance-attribute
","text":"WARN = 'warn'\n
"},{"location":"api_reference/notifications/index.html#koheesio.notifications.NotificationSeverity.alert_icon","title":"alert_icon property
","text":"alert_icon: str\n
Return a colored circle in slack markup
"},{"location":"api_reference/notifications/slack.html","title":"Slack","text":"Classes to ease interaction with Slack
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification","title":"koheesio.notifications.slack.SlackNotification","text":"Generic Slack notification class via the Blocks
API
NOTE: channel
parameter is used only with the Slack Web API (https://api.slack.com/messaging/sending); if a webhook is used, the channel specification is not required
Example:
s = SlackNotification(\n url=\"slack-webhook-url\",\n channel=\"channel\",\n message=\"Some *markdown* compatible text\",\n)\ns.execute()\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification.channel","title":"channel class-attribute
instance-attribute
","text":"channel: Optional[str] = Field(default=None, description='Slack channel id')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification.headers","title":"headers class-attribute
instance-attribute
","text":"headers: Optional[Dict[str, Any]] = {'Content-type': 'application/json'}\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification.message","title":"message class-attribute
instance-attribute
","text":"message: str = Field(default=..., description='The message that gets posted to Slack')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification.execute","title":"execute","text":"execute()\n
Generate payload and send post request
Source code in src/koheesio/notifications/slack.py
def execute(self):\n \"\"\"\n Generate payload and send post request\n \"\"\"\n self.data = self.get_payload()\n HttpPostStep.execute(self)\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotification.get_payload","title":"get_payload","text":"get_payload()\n
Generate payload with Block Kit
. More details: https://api.slack.com/block-kit
Source code in src/koheesio/notifications/slack.py
def get_payload(self):\n \"\"\"\n Generate payload with `Block Kit`.\n More details: https://api.slack.com/block-kit\n \"\"\"\n payload = {\n \"attachments\": [\n {\n \"blocks\": [\n {\n \"type\": \"section\",\n \"text\": {\n \"type\": \"mrkdwn\",\n \"text\": self.message,\n },\n }\n ],\n }\n ]\n }\n\n if self.channel:\n payload[\"channel\"] = self.channel\n\n return json.dumps(payload)\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity","title":"koheesio.notifications.slack.SlackNotificationWithSeverity","text":"Slack notification class via the Blocks
API with etra severity information and predefined extra fields
Example: from koheesio.steps.integrations.notifications import NotificationSeverity
s = SlackNotificationWithSeverity(\n url=\"slack-webhook-url\",\n channel=\"channel\",\n message=\"Some *markdown* compatible text\"\n severity=NotificationSeverity.ERROR,\n title=\"Title\",\n environment=\"dev\",\n application=\"Application\"\n)\ns.execute()\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.application","title":"application class-attribute
instance-attribute
","text":"application: str = Field(default=..., description='Pipeline or application name')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.environment","title":"environment class-attribute
instance-attribute
","text":"environment: str = Field(default=..., description='Environment description, e.g. dev / qa /prod')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(use_enum_values=False)\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.severity","title":"severity class-attribute
instance-attribute
","text":"severity: NotificationSeverity = Field(default=..., description='Severity of the message')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.timestamp","title":"timestamp class-attribute
instance-attribute
","text":"timestamp: datetime = Field(default=utcnow(), alias='execution_timestamp', description='Pipeline or application execution timestamp')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.title","title":"title class-attribute
instance-attribute
","text":"title: str = Field(default=..., description='Title of your message')\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.execute","title":"execute","text":"execute()\n
Generate payload and send post request
Source code in src/koheesio/notifications/slack.py
def execute(self):\n \"\"\"\n Generate payload and send post request\n \"\"\"\n self.message = self.get_payload_message()\n self.data = self.get_payload()\n HttpPostStep.execute(self)\n
"},{"location":"api_reference/notifications/slack.html#koheesio.notifications.slack.SlackNotificationWithSeverity.get_payload_message","title":"get_payload_message","text":"get_payload_message()\n
Generate payload message based on the predefined set of parameters
Source code in src/koheesio/notifications/slack.py
def get_payload_message(self):\n \"\"\"\n Generate payload message based on the predefined set of parameters\n \"\"\"\n return dedent(\n f\"\"\"\n {self.severity.alert_icon} *{self.severity.name}:* {self.title}\n *Environment:* {self.environment}\n *Application:* {self.application}\n *Message:* {self.message}\n *Timestamp:* {self.timestamp}\n \"\"\"\n )\n
"},{"location":"api_reference/secrets/index.html","title":"Secrets","text":"Module for secret integrations.
Contains the abstract class for various secret integrations, also known as SecretContext.
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret","title":"koheesio.secrets.Secret","text":"Abstract class for various secret integrations. All secrets are wrapped into Context class for easy access. Either existing context can be provided, or new context will be created and returned at runtime.
Secrets are wrapped into the pydantic.SecretStr.
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.context","title":"context class-attribute
instance-attribute
","text":"context: Optional[Context] = Field(Context({}), description='Existing `Context` instance can be used for secrets, otherwise new empty context will be created.')\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.parent","title":"parent class-attribute
instance-attribute
","text":"parent: Optional[str] = Field(default=..., description='Group secrets from one secure path under this friendly name', pattern='^[a-zA-Z0-9_]+$')\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.root","title":"root class-attribute
instance-attribute
","text":"root: Optional[str] = Field(default='secrets', description='All secrets will be grouped under this root.')\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.Output","title":"Output","text":"Output class for Secret.
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.Output.context","title":"context class-attribute
instance-attribute
","text":"context: Context = Field(default=..., description='Koheesio context')\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.encode_secret_values","title":"encode_secret_values classmethod
","text":"encode_secret_values(data: dict)\n
Encode secret values in the dictionary.
Ensures that all values in the dictionary are wrapped in SecretStr.
Source code in src/koheesio/secrets/__init__.py
@classmethod\ndef encode_secret_values(cls, data: dict):\n \"\"\"Encode secret values in the dictionary.\n\n Ensures that all values in the dictionary are wrapped in SecretStr.\n \"\"\"\n encoded_dict = {}\n for key, value in data.items():\n if isinstance(value, dict):\n encoded_dict[key] = cls.encode_secret_values(value)\n else:\n encoded_dict[key] = SecretStr(value)\n return encoded_dict\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.execute","title":"execute","text":"execute()\n
Main method to handle secrets protection and context creation with \"root-parent-secrets\" structure.
Source code in src/koheesio/secrets/__init__.py
def execute(self):\n \"\"\"\n Main method to handle secrets protection and context creation with \"root-parent-secrets\" structure.\n \"\"\"\n context = Context(self.encode_secret_values(data={self.root: {self.parent: self._get_secrets()}}))\n self.output.context = self.context.merge(context=context)\n
"},{"location":"api_reference/secrets/index.html#koheesio.secrets.Secret.get","title":"get","text":"get() -> Context\n
Convenience method to return context with secrets.
Source code in src/koheesio/secrets/__init__.py
def get(self) -> Context:\n \"\"\"\n Convenience method to return context with secrets.\n \"\"\"\n self.execute()\n return self.output.context\n
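A minimal sketch of a custom secret backend; EnvSecret and MY_API_KEY are hypothetical, and the only assumption made is that _get_secrets() is the hook called by execute(), as shown in the source above:
import os\n\nfrom koheesio.secrets import Secret\n\n\nclass EnvSecret(Secret):\n    \"\"\"Illustrative backend that reads a single secret from an environment variable.\"\"\"\n\n    def _get_secrets(self) -> dict:\n        # hook used by Secret.execute(), see the source above\n        return {\"api_key\": os.environ.get(\"MY_API_KEY\", \"\")}\n\n\ncontext = EnvSecret(parent=\"my_service\").get()\ncontext.secrets.my_service.api_key.get_secret_value()\n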
"},{"location":"api_reference/secrets/cerberus.html","title":"Cerberus","text":"Module for retrieving secrets from Cerberus.
Secrets are stored as SecretContext and can be accessed accordingly.
See CerberusSecret for more information.
"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret","title":"koheesio.secrets.cerberus.CerberusSecret","text":"Retrieve secrets from Cerberus and wrap them into Context class for easy access. All secrets are stored under the \"secret\" root and \"parent\". \"Parent\" either derived from the secure data path by replacing \"/\" and \"-\", or manually provided by the user. Secrets are wrapped into the pydantic.SecretStr.
Example:
context = {\n \"secrets\": {\n \"parent\": {\n \"webhook\": SecretStr(\"**********\"),\n \"description\": SecretStr(\"**********\"),\n }\n }\n}\n
Values can be decoded like this:
context.secrets.parent.webhook.get_secret_value()\n
or if working with a dictionary is preferable: for key, value in context.get_all().items():\n value.get_secret_value()\n
"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret.aws_session","title":"aws_session class-attribute
instance-attribute
","text":"aws_session: Optional[Session] = Field(default=None, description='AWS Session to pass to Cerberus client, can be used for local execution.')\n
"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret.path","title":"path class-attribute
instance-attribute
","text":"path: str = Field(default=..., description=\"Secure data path, eg. 'app/my-sdb/my-secrets'\")\n
"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret.token","title":"token class-attribute
instance-attribute
","text":"token: Optional[SecretStr] = Field(default=get('CERBERUS_TOKEN', None), description='Cerberus token, can be used for local development without AWS auth mechanism.Note: Token has priority over AWS session.')\n
"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret.url","title":"url class-attribute
instance-attribute
","text":"url: str = Field(default=..., description='Cerberus URL, eg. https://cerberus.domain.com')\n
"},{"location":"api_reference/secrets/cerberus.html#koheesio.secrets.cerberus.CerberusSecret.verbose","title":"verbose class-attribute
instance-attribute
","text":"verbose: bool = Field(default=False, description='Enable verbose for Cerberus client')\n
"},{"location":"api_reference/spark/index.html","title":"Spark","text":"Spark step module
"},{"location":"api_reference/spark/index.html#koheesio.spark.AnalysisException","title":"koheesio.spark.AnalysisException module-attribute
","text":"AnalysisException = AnalysisException\n
"},{"location":"api_reference/spark/index.html#koheesio.spark.DataFrame","title":"koheesio.spark.DataFrame module-attribute
","text":"DataFrame = DataFrame\n
"},{"location":"api_reference/spark/index.html#koheesio.spark.SparkSession","title":"koheesio.spark.SparkSession module-attribute
","text":"SparkSession = SparkSession\n
"},{"location":"api_reference/spark/index.html#koheesio.spark.SparkStep","title":"koheesio.spark.SparkStep","text":"Base class for a Spark step
Extends the Step class with SparkSession support. The following: - Spark steps are expected to return a Spark DataFrame as output. - spark property is available to access the active SparkSession instance.
"},{"location":"api_reference/spark/index.html#koheesio.spark.SparkStep.spark","title":"spark property
","text":"spark: Optional[SparkSession]\n
Get active SparkSession instance
"},{"location":"api_reference/spark/index.html#koheesio.spark.SparkStep.Output","title":"Output","text":"Output class for SparkStep
"},{"location":"api_reference/spark/index.html#koheesio.spark.SparkStep.Output.df","title":"df class-attribute
instance-attribute
","text":"df: Optional[DataFrame] = Field(default=None, description='The Spark DataFrame')\n
"},{"location":"api_reference/spark/index.html#koheesio.spark.current_timestamp_utc","title":"koheesio.spark.current_timestamp_utc","text":"current_timestamp_utc(spark: SparkSession) -> Column\n
Get the current timestamp in UTC
Source code in src/koheesio/spark/__init__.py
def current_timestamp_utc(spark: SparkSession) -> Column:\n \"\"\"Get the current timestamp in UTC\"\"\"\n return F.to_utc_timestamp(F.current_timestamp(), spark.conf.get(\"spark.sql.session.timeZone\"))\n
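A minimal usage sketch (spark and df are assumed to already exist):
from koheesio.spark import current_timestamp_utc\n\ndf = df.withColumn(\"created_at_utc\", current_timestamp_utc(spark))\n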
"},{"location":"api_reference/spark/delta.html","title":"Delta","text":"Module for creating and managing Delta tables.
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep","title":"koheesio.spark.delta.DeltaTableStep","text":"Class for creating and managing Delta tables.
DeltaTable aims to provide a simple interface to create and manage Delta tables. It is a wrapper around the Spark SQL API for Delta tables.
Example from koheesio.steps import DeltaTableStep\n\nDeltaTableStep(\n table=\"my_table\",\n database=\"my_database\",\n catalog=\"my_catalog\",\n create_if_not_exists=True,\n default_create_properties={\n \"delta.randomizeFilePrefixes\": \"true\",\n \"delta.checkpoint.writeStatsAsStruct\": \"true\",\n \"delta.minReaderVersion\": \"2\",\n \"delta.minWriterVersion\": \"5\",\n },\n)\n
Methods:
Name Description get_persisted_properties
Get persisted properties of table.
add_property
Alter table and set table property.
add_properties
Alter table and add properties.
execute
Nothing to execute on a Table.
max_version_ts_of_last_execution
Max version timestamp of last execution. If no timestamp is found, returns 1900-01-01 00:00:00. Note: will raise an error if column VERSION_TIMESTAMP
does not exist.
Properties - name -> str Deprecated. Use
.table_name
instead. - table_name -> str Table name.
- dataframe -> DataFrame Returns a DataFrame to be able to interact with this table.
- columns -> Optional[List[str]] Returns all column names as a list.
- has_change_type -> bool Checks if a column named
_change_type
is present in the table. - exists -> bool Check if table exists.
Parameters:
Name Type Description Default table
str
Table name.
required database
str
Database or Schema name.
None
catalog
str
Catalog name.
None
create_if_not_exists
bool
Force table creation if it doesn't exist. Note: Default properties will be applied to the table during CREATION.
False
default_create_properties
Dict[str, str]
Default table properties to be applied during CREATION if force_creation
True.
{\"delta.randomizeFilePrefixes\": \"true\", \"delta.checkpoint.writeStatsAsStruct\": \"true\", \"delta.minReaderVersion\": \"2\", \"delta.minWriterVersion\": \"5\"}
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.catalog","title":"catalog class-attribute
instance-attribute
","text":"catalog: Optional[str] = Field(default=None, description='Catalog name. Note: Can be ignored if using a SparkCatalog that does not support catalog notation (e.g. Hive)')\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.columns","title":"columns property
","text":"columns: Optional[List[str]]\n
Returns all column names as a list.
Example DeltaTableStep(...).columns\n
Would for example return ['age', 'name']
if the table has columns age
and name
."},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.create_if_not_exists","title":"create_if_not_exists class-attribute
instance-attribute
","text":"create_if_not_exists: bool = Field(default=False, alias='force_creation', description=\"Force table creation if it doesn't exist.Note: Default properties will be applied to the table during CREATION.\")\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.database","title":"database class-attribute
instance-attribute
","text":"database: Optional[str] = Field(default=None, description='Database or Schema name.')\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.dataframe","title":"dataframe property
","text":"dataframe: DataFrame\n
Returns a DataFrame to be able to interact with this table
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.default_create_properties","title":"default_create_properties class-attribute
instance-attribute
","text":"default_create_properties: Dict[str, Union[str, bool, int]] = Field(default={'delta.randomizeFilePrefixes': 'true', 'delta.checkpoint.writeStatsAsStruct': 'true', 'delta.minReaderVersion': '2', 'delta.minWriterVersion': '5'}, description='Default table properties to be applied during CREATION if `create_if_not_exists` True')\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.exists","title":"exists property
","text":"exists: bool\n
Check if table exists
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.has_change_type","title":"has_change_type property
","text":"has_change_type: bool\n
Checks if a column named _change_type
is present in the table
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.is_cdf_active","title":"is_cdf_active property
","text":"is_cdf_active: bool\n
Check if CDF property is set and activated
Returns:
Type Description bool
delta.enableChangeDataFeed property is set to 'true'
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.table","title":"table instance-attribute
","text":"table: str\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.table_name","title":"table_name property
","text":"table_name: str\n
Fully qualified table name in the form of catalog.database.table
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.add_properties","title":"add_properties","text":"add_properties(properties: Dict[str, Union[str, bool, int]], override: bool = False)\n
Alter table and add properties.
Parameters:
Name Type Description Default properties
Dict[str, Union[str, int, bool]]
Properties to be added to table.
required override
bool
Enable override of existing value for property in table.
False
Source code in src/koheesio/spark/delta.py
def add_properties(self, properties: Dict[str, Union[str, bool, int]], override: bool = False):\n \"\"\"Alter table and add properties.\n\n Parameters\n ----------\n properties : Dict[str, Union[str, int, bool]]\n Properties to be added to table.\n override : bool, optional, default=False\n Enable override of existing value for property in table.\n\n \"\"\"\n for k, v in properties.items():\n v_str = str(v) if not isinstance(v, bool) else str(v).lower()\n self.add_property(key=k, value=v_str, override=override)\n
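A short usage sketch (table, database and the property value are illustrative):
from koheesio.spark.delta import DeltaTableStep\n\ndt = DeltaTableStep(table=\"my_table\", database=\"my_database\")\ndt.add_properties({\"delta.enableChangeDataFeed\": True}, override=True)\n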
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.add_property","title":"add_property","text":"add_property(key: str, value: Union[str, int, bool], override: bool = False)\n
Alter table and set table property.
Parameters:
Name Type Description Default key
str
Property key(name).
required value
Union[str, int, bool]
Property value.
required override
bool
Enable override of existing value for property in table.
False
Source code in src/koheesio/spark/delta.py
def add_property(self, key: str, value: Union[str, int, bool], override: bool = False):\n \"\"\"Alter table and set table property.\n\n Parameters\n ----------\n key: str\n Property key(name).\n value: Union[str, int, bool]\n Property value.\n override: bool\n Enable override of existing value for property in table.\n\n \"\"\"\n persisted_properties = self.get_persisted_properties()\n v_str = str(value) if not isinstance(value, bool) else str(value).lower()\n\n def _alter_table() -> None:\n property_pair = f\"'{key}'='{v_str}'\"\n\n try:\n # noinspection SqlNoDataSourceInspection\n self.spark.sql(f\"ALTER TABLE {self.table_name} SET TBLPROPERTIES ({property_pair})\")\n self.log.debug(f\"Table `{self.table_name}` has been altered. Property `{property_pair}` added.\")\n except Py4JJavaError as e:\n msg = f\"Property `{key}` can not be applied to table `{self.table_name}`. Exception: {e}\"\n self.log.warning(msg)\n warnings.warn(msg)\n\n if self.exists:\n if key in persisted_properties and persisted_properties[key] != v_str:\n if override:\n self.log.debug(\n f\"Property `{key}` presents in `{self.table_name}` and has value `{persisted_properties[key]}`.\"\n f\"Override is enabled.The value will be changed to `{v_str}`.\"\n )\n _alter_table()\n else:\n self.log.debug(\n f\"Skipping adding property `{key}`, because it is already set \"\n f\"for table `{self.table_name}` to `{v_str}`. To override it, provide override=True\"\n )\n else:\n _alter_table()\n else:\n self.default_create_properties[key] = v_str\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.execute","title":"execute","text":"execute()\n
Nothing to execute on a Table
Source code in src/koheesio/spark/delta.py
def execute(self):\n \"\"\"Nothing to execute on a Table\"\"\"\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.get_column_type","title":"get_column_type","text":"get_column_type(column: str) -> Optional[DataType]\n
Get the type of a column in the table.
Parameters:
Name Type Description Default column
str
Column name.
required Returns:
Type Description Optional[DataType]
Column type.
Source code in src/koheesio/spark/delta.py
def get_column_type(self, column: str) -> Optional[DataType]:\n \"\"\"Get the type of a column in the table.\n\n Parameters\n ----------\n column : str\n Column name.\n\n Returns\n -------\n Optional[DataType]\n Column type.\n \"\"\"\n return self.dataframe.schema[column].dataType if self.columns and column in self.columns else None\n
"},{"location":"api_reference/spark/delta.html#koheesio.spark.delta.DeltaTableStep.get_persisted_properties","title":"get_persisted_properties","text":"get_persisted_properties() -> Dict[str, str]\n
Get persisted properties of table.
Returns:
Type Description Dict[str, str]
Persisted properties as a dictionary.
Source code in src/koheesio/spark/delta.py
def get_persisted_properties(self) -> Dict[str, str]:\n \"\"\"Get persisted properties of table.\n\n Returns\n -------\n Dict[str, str]\n Persisted properties as a dictionary.\n \"\"\"\n persisted_properties = {}\n raw_options = self.spark.sql(f\"SHOW TBLPROPERTIES {self.table_name}\").collect()\n\n for ro in raw_options:\n key, value = ro.asDict().values()\n persisted_properties[key] = value\n\n return persisted_properties\n
"},{"location":"api_reference/spark/etl_task.html","title":"Etl task","text":"ETL Task
Extract -> Transform -> Load
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask","title":"koheesio.spark.etl_task.EtlTask","text":"ETL Task
Etl stands for: Extract -> Transform -> Load
This task is a composition of a Reader (extract), a series of Transformations (transform) and a Writer (load). In other words, it reads data from a source, applies a series of transformations, and writes the result to a target.
Parameters:
Name Type Description Default name
str
Name of the task
required description
str
Description of the task
required source
Reader
Source to read from [extract]
required transformations
list[Transformation]
Series of transformations [transform]. The order of the transformations is important!
required target
Writer
Target to write to [load]
required Example from koheesio.tasks import EtlTask\n\nfrom koheesio.steps.readers import CsvReader\nfrom koheesio.steps.transformations.repartition import Repartition\nfrom koheesio.steps.writers import CsvWriter\n\netl_task = EtlTask(\n name=\"My ETL Task\",\n description=\"This is an example ETL task\",\n source=CsvReader(path=\"path/to/source.csv\"),\n transformations=[Repartition(num_partitions=2)],\n target=DummyWriter(),\n)\n\netl_task.execute()\n
This code will read from a CSV file, repartition the DataFrame to 2 partitions, and write the result to the console.
Extending the EtlTask The EtlTask is designed to be a simple and flexible way to define ETL processes. It is not designed to be a one-size-fits-all solution, but rather a starting point for building more complex ETL processes. If you need more complex functionality, you can extend the EtlTask class and override the extract
, transform
and load
methods. You can also implement your own execute
method to define the entire ETL process from scratch should you need more flexibility.
Advantages of using the EtlTask - It is a simple way to define ETL processes
- It is easy to understand and extend
- It is easy to test and debug
- It is easy to maintain and refactor
- It is easy to integrate with other tools and libraries
- It is easy to use in a production environment
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.etl_date","title":"etl_date class-attribute
instance-attribute
","text":"etl_date: datetime = Field(default=utcnow(), description=\"Date time when this object was created as iso format. Example: '2023-01-24T09:39:23.632374'\")\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.source","title":"source class-attribute
instance-attribute
","text":"source: InstanceOf[Reader] = Field(default=..., description='Source to read from [extract]')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.target","title":"target class-attribute
instance-attribute
","text":"target: InstanceOf[Writer] = Field(default=..., description='Target to write to [load]')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.transformations","title":"transformations class-attribute
instance-attribute
","text":"transformations: conlist(min_length=0, item_type=InstanceOf[Transformation]) = Field(default_factory=list, description='Series of transformations', alias='transforms')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.Output","title":"Output","text":"Output class for EtlTask
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.Output.source_df","title":"source_df class-attribute
instance-attribute
","text":"source_df: DataFrame = Field(default=..., description='The Spark DataFrame produced by .extract() method')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.Output.target_df","title":"target_df class-attribute
instance-attribute
","text":"target_df: DataFrame = Field(default=..., description='The Spark DataFrame used by .load() method')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.Output.transform_df","title":"transform_df class-attribute
instance-attribute
","text":"transform_df: DataFrame = Field(default=..., description='The Spark DataFrame produced by .transform() method')\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.execute","title":"execute","text":"execute()\n
Run the ETL process
Source code in src/koheesio/spark/etl_task.py
def execute(self):\n \"\"\"Run the ETL process\"\"\"\n self.log.info(f\"Task started at {self.etl_date}\")\n\n # extract from source\n self.output.source_df = self.extract()\n\n # transform\n self.output.transform_df = self.transform(self.output.source_df)\n\n # load to target\n self.output.target_df = self.load(self.output.transform_df)\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.extract","title":"extract","text":"extract() -> DataFrame\n
Read from Source
logging is handled by the Reader.execute()-method's @do_execute decorator
Source code in src/koheesio/spark/etl_task.py
def extract(self) -> DataFrame:\n \"\"\"Read from Source\n\n logging is handled by the Reader.execute()-method's @do_execute decorator\n \"\"\"\n reader: Reader = self.source\n return reader.read()\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.load","title":"load","text":"load(df: DataFrame) -> DataFrame\n
Write to Target
logging is handled by the Writer.execute()-method's @do_execute decorator
Source code in src/koheesio/spark/etl_task.py
def load(self, df: DataFrame) -> DataFrame:\n \"\"\"Write to Target\n\n logging is handled by the Writer.execute()-method's @do_execute decorator\n \"\"\"\n writer: Writer = self.target\n writer.write(df)\n return df\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.run","title":"run","text":"run()\n
alias of execute
Source code in src/koheesio/spark/etl_task.py
def run(self):\n \"\"\"alias of execute\"\"\"\n return self.execute()\n
"},{"location":"api_reference/spark/etl_task.html#koheesio.spark.etl_task.EtlTask.transform","title":"transform","text":"transform(df: DataFrame) -> DataFrame\n
Transform recursively
logging is handled by the Transformation.execute()-method's @do_execute decorator
Source code in src/koheesio/spark/etl_task.py
def transform(self, df: DataFrame) -> DataFrame:\n \"\"\"Transform recursively\n\n logging is handled by the Transformation.execute()-method's @do_execute decorator\n \"\"\"\n for t in self.transformations:\n df = t.transform(df)\n return df\n
"},{"location":"api_reference/spark/snowflake.html","title":"Snowflake","text":"Snowflake steps and tasks for Koheesio
Every class in this module is a subclass of Step
or Task
and is used to perform operations on Snowflake.
Notes Every Step in this module is based on SnowflakeBaseModel. The following parameters are available for every Step.
Parameters:
Name Type Description Default url
str
Hostname for the Snowflake account, e.g. .snowflakecomputing.com. Alias for sfURL
. required user
str
Login name for the Snowflake user. Alias for sfUser
.
required password
SecretStr
Password for the Snowflake user. Alias for sfPassword
.
required database
str
The database to use for the session after connecting. Alias for sfDatabase
.
required sfSchema
str
The schema to use for the session after connecting. Alias for schema
(\"schema\" is a reserved name in Pydantic, so we use sfSchema
as main name instead).
required role
str
The default security role to use for the session after connecting. Alias for sfRole
.
required warehouse
str
The default virtual warehouse to use for the session after connecting. Alias for sfWarehouse
.
required authenticator
Optional[str]
Authenticator for the Snowflake user. Example: \"okta.com\".
None
options
Optional[Dict[str, Any]]
Extra options to pass to the Snowflake connector.
{\"sfCompress\": \"on\", \"continue_on_error\": \"off\"}
format
str
The default snowflake
format can be used natively in Databricks, use net.snowflake.spark.snowflake
in other environments and make sure to install required JARs.
\"snowflake\"
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn","title":"koheesio.spark.snowflake.AddColumn","text":"Add an empty column to a Snowflake table with given name and DataType
Example AddColumn(\n database=\"MY_DB\",\n schema_=\"MY_SCHEMA\",\n warehouse=\"MY_WH\",\n user=\"gid.account@nike.com\",\n password=Secret(\"super-secret-password\"),\n role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n table=\"MY_TABLE\",\n col=\"MY_COL\",\n dataType=StringType(),\n).execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.column","title":"column class-attribute
instance-attribute
","text":"column: str = Field(default=..., description='The name of the new column')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.table","title":"table class-attribute
instance-attribute
","text":"table: str = Field(default=..., description='The name of the Snowflake table')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.type","title":"type class-attribute
instance-attribute
","text":"type: DataType = Field(default=..., description='The DataType represented as a Spark DataType')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.Output","title":"Output","text":"Output class for AddColumn
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.Output.query","title":"query class-attribute
instance-attribute
","text":"query: str = Field(default=..., description='Query that was executed to add the column')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.AddColumn.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n query = f\"ALTER TABLE {self.table} ADD COLUMN {self.column} {map_spark_type(self.type)}\".upper()\n self.output.query = query\n RunQuery(**self.get_options(), query=query).execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame","title":"koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame","text":"Create (or Replace) a Snowflake table which has the same schema as a Spark DataFrame
Can be used as any Transformation. The DataFrame is however left unchanged, and only used for determining the schema of the Snowflake Table that is to be created (or replaced).
Example CreateOrReplaceTableFromDataFrame(\n database=\"MY_DB\",\n schema=\"MY_SCHEMA\",\n warehouse=\"MY_WH\",\n user=\"gid.account@nike.com\",\n password=\"super-secret-password\",\n role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n table=\"MY_TABLE\",\n df=df,\n).execute()\n
Or, as a Transformation:
CreateOrReplaceTableFromDataFrame(\n ...\n table=\"MY_TABLE\",\n).transform(df)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.table","title":"table class-attribute
instance-attribute
","text":"table: str = Field(default=..., alias='table_name', description='The name of the (new) table')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.Output","title":"Output","text":"Output class for CreateOrReplaceTableFromDataFrame
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.Output.input_schema","title":"input_schema class-attribute
instance-attribute
","text":"input_schema: StructType = Field(default=..., description='The original schema from the input DataFrame')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.Output.query","title":"query class-attribute
instance-attribute
","text":"query: str = Field(default=..., description='Query that was executed to create the table')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.Output.snowflake_schema","title":"snowflake_schema class-attribute
instance-attribute
","text":"snowflake_schema: str = Field(default=..., description='Derived Snowflake table schema based on the input DataFrame')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.CreateOrReplaceTableFromDataFrame.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n self.output.df = self.df\n\n input_schema = self.df.schema\n self.output.input_schema = input_schema\n\n snowflake_schema = \", \".join([f\"{c.name} {map_spark_type(c.dataType)}\" for c in input_schema])\n self.output.snowflake_schema = snowflake_schema\n\n table_name = f\"{self.database}.{self.sfSchema}.{self.table}\"\n query = f\"CREATE OR REPLACE TABLE {table_name} ({snowflake_schema})\"\n self.output.query = query\n\n RunQuery(**self.get_options(), query=query).execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.DbTableQuery","title":"koheesio.spark.snowflake.DbTableQuery","text":"Read table from Snowflake using the dbtable
option instead of query
Example DbTableQuery(\n database=\"MY_DB\",\n schema_=\"MY_SCHEMA\",\n warehouse=\"MY_WH\",\n user=\"user\",\n password=Secret(\"super-secret-password\"),\n role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n table=\"db.schema.table\",\n).execute().df\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.DbTableQuery.dbtable","title":"dbtable class-attribute
instance-attribute
","text":"dbtable: str = Field(default=..., alias='table', description='The name of the table')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GetTableSchema","title":"koheesio.spark.snowflake.GetTableSchema","text":"Get the schema from a Snowflake table as a Spark Schema
Notes - This Step will execute a
SELECT * FROM <table> LIMIT 1
query to get the schema of the table. - The schema will be stored in the
table_schema
attribute of the output. table_schema
is used as the attribute name to avoid conflicts with the schema
attribute of Pydantic's BaseModel.
Example schema = (\n GetTableSchema(\n database=\"MY_DB\",\n schema_=\"MY_SCHEMA\",\n warehouse=\"MY_WH\",\n user=\"gid.account@nike.com\",\n password=\"super-secret-password\",\n role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n table=\"MY_TABLE\",\n )\n .execute()\n .table_schema\n)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GetTableSchema.table","title":"table class-attribute
instance-attribute
","text":"table: str = Field(default=..., description='The Snowflake table name')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GetTableSchema.Output","title":"Output","text":"Output class for GetTableSchema
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GetTableSchema.Output.table_schema","title":"table_schema class-attribute
instance-attribute
","text":"table_schema: StructType = Field(default=..., serialization_alias='schema', description='The Spark Schema')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GetTableSchema.execute","title":"execute","text":"execute() -> Output\n
Source code in src/koheesio/spark/snowflake.py
def execute(self) -> Output:\n query = f\"SELECT * FROM {self.table} LIMIT 1\" # nosec B608: hardcoded_sql_expressions\n df = Query(**self.get_options(), query=query).execute().df\n self.output.table_schema = df.schema\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnFullyQualifiedObject","title":"koheesio.spark.snowflake.GrantPrivilegesOnFullyQualifiedObject","text":"Grant Snowflake privileges to a set of roles on a fully qualified object, i.e. database.schema.object_name
This class is a subclass of GrantPrivilegesOnObject
and is used to grant privileges on a fully qualified object. The advantage of using this class is that it sets the object name to be fully qualified, i.e. database.schema.object_name
.
Meaning, you can set the database
, schema
and object
separately and the object name will be set to be fully qualified, i.e. database.schema.object_name
.
Example GrantPrivilegesOnFullyQualifiedObject(\n database=\"MY_DB\",\n schema=\"MY_SCHEMA\",\n warehouse=\"MY_WH\",\n ...\n object=\"MY_TABLE\",\n type=\"TABLE\",\n ...\n)\n
In this example, the object name will be set to be fully qualified, i.e. MY_DB.MY_SCHEMA.MY_TABLE
. If you were to use GrantPrivilegesOnObject
instead, you would have to set the object name to be fully qualified yourself.
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnFullyQualifiedObject.set_object_name","title":"set_object_name","text":"set_object_name()\n
Set the object name to be fully qualified, i.e. database.schema.object_name
Source code in src/koheesio/spark/snowflake.py
@model_validator(mode=\"after\")\ndef set_object_name(self):\n \"\"\"Set the object name to be fully qualified, i.e. database.schema.object_name\"\"\"\n # database, schema, obj_name\n db = self.database\n schema = self.model_dump()[\"sfSchema\"] # since \"schema\" is a reserved name\n obj_name = self.object\n\n self.object = f\"{db}.{schema}.{obj_name}\"\n\n return self\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject","title":"koheesio.spark.snowflake.GrantPrivilegesOnObject","text":"A wrapper on Snowflake GRANT privileges
With this Step, you can grant Snowflake privileges to a set of roles on a table, a view, or an object
See Also https://docs.snowflake.com/en/sql-reference/sql/grant-privilege.html
Parameters:
Name Type Description Default warehouse
str
The name of the warehouse. Alias for sfWarehouse
required user
str
The username. Alias for sfUser
required password
SecretStr
The password. Alias for sfPassword
required role
str
The role name
required object
str
The name of the object to grant privileges on
required type
str
The type of object to grant privileges on, e.g. TABLE, VIEW
required privileges
Union[conlist(str, min_length=1), str]
The Privilege/Permission or list of Privileges/Permissions to grant on the given object.
required roles
Union[conlist(str, min_length=1), str]
The Role or list of Roles to grant the privileges to
required Example GrantPermissionsOnTable(\n object=\"MY_TABLE\",\n type=\"TABLE\",\n warehouse=\"MY_WH\",\n user=\"gid.account@nike.com\",\n password=Secret(\"super-secret-password\"),\n role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n permissions=[\"SELECT\", \"INSERT\"],\n).execute()\n
In this example, the APPLICATION.SNOWFLAKE.ADMIN
role will be granted SELECT
and INSERT
privileges on the MY_TABLE
table using the MY_WH
warehouse.
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.object","title":"object class-attribute
instance-attribute
","text":"object: str = Field(default=..., description='The name of the object to grant privileges on')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.privileges","title":"privileges class-attribute
instance-attribute
","text":"privileges: Union[conlist(str, min_length=1), str] = Field(default=..., alias='permissions', description='The Privilege/Permission or list of Privileges/Permissions to grant on the given object. See https://docs.snowflake.com/en/sql-reference/sql/grant-privilege.html')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.roles","title":"roles class-attribute
instance-attribute
","text":"roles: Union[conlist(str, min_length=1), str] = Field(default=..., alias='role', validation_alias='roles', description='The Role or list of Roles to grant the privileges to')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.type","title":"type class-attribute
instance-attribute
","text":"type: str = Field(default=..., description='The type of object to grant privileges on, e.g. TABLE, VIEW')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.Output","title":"Output","text":"Output class for GrantPrivilegesOnObject
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.Output.query","title":"query class-attribute
instance-attribute
","text":"query: conlist(str, min_length=1) = Field(default=..., description='Query that was executed to grant privileges', validate_default=False)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n self.output.query = []\n roles = self.roles\n\n for role in roles:\n query = self.get_query(role)\n self.output.query.append(query)\n RunQuery(**self.get_options(), query=query).execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.get_query","title":"get_query","text":"get_query(role: str)\n
Build the GRANT query
Parameters:
Name Type Description Default role
str
The role name
required Returns:
Name Type Description query
str
The Query that performs the grant
Source code in src/koheesio/spark/snowflake.py
def get_query(self, role: str):\n \"\"\"Build the GRANT query\n\n Parameters\n ----------\n role: str\n The role name\n\n Returns\n -------\n query : str\n The Query that performs the grant\n \"\"\"\n query = f\"GRANT {','.join(self.privileges)} ON {self.type} {self.object} TO ROLE {role}\".upper()\n return query\n
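For illustration, a minimal sketch of the string this method builds, using the same values as the example above (an assumption for illustration, not output captured from a real run):

# assumed values for illustration; mirrors the f-string in get_query above
privileges, object_type, obj, role = ["SELECT", "INSERT"], "TABLE", "MY_TABLE", "APPLICATION.SNOWFLAKE.ADMIN"
query = f"GRANT {','.join(privileges)} ON {object_type} {obj} TO ROLE {role}".upper()
# query == "GRANT SELECT,INSERT ON TABLE MY_TABLE TO ROLE APPLICATION.SNOWFLAKE.ADMIN"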
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.set_roles_privileges","title":"set_roles_privileges","text":"set_roles_privileges(values)\n
Coerce roles and privileges to be lists if they are not already.
Source code in src/koheesio/spark/snowflake.py
@model_validator(mode=\"before\")\ndef set_roles_privileges(cls, values):\n \"\"\"Coerce roles and privileges to be lists if they are not already.\"\"\"\n roles_value = values.get(\"roles\") or values.get(\"role\")\n privileges_value = values.get(\"privileges\")\n\n if not (roles_value and privileges_value):\n raise ValueError(\"You have to specify roles AND privileges when using 'GrantPrivilegesOnObject'.\")\n\n # coerce values to be lists\n values[\"roles\"] = [roles_value] if isinstance(roles_value, str) else roles_value\n values[\"role\"] = values[\"roles\"][0] # hack to keep the validator happy\n values[\"privileges\"] = [privileges_value] if isinstance(privileges_value, str) else privileges_value\n\n return values\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnObject.validate_object_and_object_type","title":"validate_object_and_object_type","text":"validate_object_and_object_type()\n
Validate that the object and type are set.
Source code in src/koheesio/spark/snowflake.py
@model_validator(mode=\"after\")\ndef validate_object_and_object_type(self):\n \"\"\"Validate that the object and type are set.\"\"\"\n object_value = self.object\n if not object_value:\n raise ValueError(\"You must provide an `object`, this should be the name of the object. \")\n\n object_type = self.type\n if not object_type:\n raise ValueError(\n \"You must provide a `type`, e.g. TABLE, VIEW, DATABASE. \"\n \"See https://docs.snowflake.com/en/sql-reference/sql/grant-privilege.html\"\n )\n\n return self\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnTable","title":"koheesio.spark.snowflake.GrantPrivilegesOnTable","text":"Grant Snowflake privileges to a set of roles on a table
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnTable.object","title":"object class-attribute
instance-attribute
","text":"object: str = Field(default=..., alias='table', description='The name of the Table to grant Privileges on. This should be just the name of the table; so without Database and Schema, use sfDatabase/database and sfSchema/schema to set those instead.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnTable.type","title":"type class-attribute
instance-attribute
","text":"type: str = 'TABLE'\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnView","title":"koheesio.spark.snowflake.GrantPrivilegesOnView","text":"Grant Snowflake privileges to a set of roles on a view
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnView.object","title":"object class-attribute
instance-attribute
","text":"object: str = Field(default=..., alias='view', description='The name of the View to grant Privileges on. This should be just the name of the view; so without Database and Schema, use sfDatabase/database and sfSchema/schema to set those instead.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.GrantPrivilegesOnView.type","title":"type class-attribute
instance-attribute
","text":"type: str = 'VIEW'\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.Query","title":"koheesio.spark.snowflake.Query","text":"Query data from Snowflake and return the result as a DataFrame
Example Query(\n database=\"MY_DB\",\n schema_=\"MY_SCHEMA\",\n warehouse=\"MY_WH\",\n user=\"gid.account@nike.com\",\n password=Secret(\"super-secret-password\"),\n role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n query=\"SELECT * FROM MY_TABLE\",\n).execute().df\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.Query.query","title":"query class-attribute
instance-attribute
","text":"query: str = Field(default=..., description='The query to run')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.Query.get_options","title":"get_options","text":"get_options()\n
add query to options
Source code in src/koheesio/spark/snowflake.py
def get_options(self):\n \"\"\"add query to options\"\"\"\n options = super().get_options()\n options[\"query\"] = self.query\n return options\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.Query.validate_query","title":"validate_query","text":"validate_query(query)\n
Replace escape characters
Source code in src/koheesio/spark/snowflake.py
@field_validator(\"query\")\ndef validate_query(cls, query):\n \"\"\"Replace escape characters\"\"\"\n query = query.replace(\"\\\\n\", \"\\n\").replace(\"\\\\t\", \"\\t\").strip()\n return query\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.RunQuery","title":"koheesio.spark.snowflake.RunQuery","text":"Run a query on Snowflake that does not return a result, e.g. create table statement
This is a wrapper around 'net.snowflake.spark.snowflake.Utils.runQuery' on the JVM
Example RunQuery(\n database=\"MY_DB\",\n schema=\"MY_SCHEMA\",\n warehouse=\"MY_WH\",\n user=\"account\",\n password=\"***\",\n role=\"APPLICATION.SNOWFLAKE.ADMIN\",\n query=\"CREATE TABLE test (col1 string)\",\n).execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.RunQuery.query","title":"query class-attribute
instance-attribute
","text":"query: str = Field(default=..., description='The query to run', alias='sql')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.RunQuery.execute","title":"execute","text":"execute() -> None\n
Source code in src/koheesio/spark/snowflake.py
def execute(self) -> None:\n if not self.query:\n self.log.warning(\"Empty string given as query input, skipping execution\")\n return\n # noinspection PyProtectedMember\n self.spark._jvm.net.snowflake.spark.snowflake.Utils.runQuery(self.get_options(), self.query)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.RunQuery.get_options","title":"get_options","text":"get_options()\n
Source code in src/koheesio/spark/snowflake.py
def get_options(self):\n # Executing the RunQuery without `host` option in Databricks throws:\n # An error occurred while calling z:net.snowflake.spark.snowflake.Utils.runQuery.\n # : java.util.NoSuchElementException: key not found: host\n options = super().get_options()\n options[\"host\"] = options[\"sfURL\"]\n return options\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.RunQuery.validate_query","title":"validate_query","text":"validate_query(query)\n
Replace escape characters
Source code in src/koheesio/spark/snowflake.py
@field_validator(\"query\")\ndef validate_query(cls, query):\n \"\"\"Replace escape characters\"\"\"\n return query.replace(\"\\\\n\", \"\\n\").replace(\"\\\\t\", \"\\t\").strip()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel","title":"koheesio.spark.snowflake.SnowflakeBaseModel","text":"BaseModel for setting up Snowflake Driver options.
Notes - Snowflake is supported natively in Databricks 4.2 and newer: https://docs.snowflake.com/en/user-guide/spark-connector-databricks
- Refer to Snowflake docs for the installation instructions for non-Databricks environments: https://docs.snowflake.com/en/user-guide/spark-connector-install
- Refer to Snowflake docs for connection options: https://docs.snowflake.com/en/user-guide/spark-connector-use#setting-configuration-options-for-the-connector
Parameters:
Name Type Description Default url
str
Hostname for the Snowflake account, e.g. <account>.snowflakecomputing.com. Alias for sfURL
. required user
str
Login name for the Snowflake user. Alias for sfUser
.
required password
SecretStr
Password for the Snowflake user. Alias for sfPassword
.
required database
str
The database to use for the session after connecting. Alias for sfDatabase
.
required sfSchema
str
The schema to use for the session after connecting. Alias for schema
(\"schema\" is a reserved name in Pydantic, so we use sfSchema
as the main name instead).
required role
str
The default security role to use for the session after connecting. Alias for sfRole
.
required warehouse
str
The default virtual warehouse to use for the session after connecting. Alias for sfWarehouse
.
required authenticator
Optional[str]
Authenticator for the Snowflake user. Example: \"okta.com\".
None
options
Optional[Dict[str, Any]]
Extra options to pass to the Snowflake connector.
{\"sfCompress\": \"on\", \"continue_on_error\": \"off\"}
format
str
The default snowflake
format can be used natively in Databricks, use net.snowflake.spark.snowflake
in other environments and make sure to install required JARs.
\"snowflake\"
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.authenticator","title":"authenticator class-attribute
instance-attribute
","text":"authenticator: Optional[str] = Field(default=None, description='Authenticator for the Snowflake user', examples=['okta.com'])\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.database","title":"database class-attribute
instance-attribute
","text":"database: str = Field(default=..., alias='sfDatabase', description='The database to use for the session after connecting')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.format","title":"format class-attribute
instance-attribute
","text":"format: str = Field(default='snowflake', description='The default `snowflake` format can be used natively in Databricks, use `net.snowflake.spark.snowflake` in other environments and make sure to install required JARs.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.options","title":"options class-attribute
instance-attribute
","text":"options: Optional[Dict[str, Any]] = Field(default={'sfCompress': 'on', 'continue_on_error': 'off'}, description='Extra options to pass to the Snowflake connector')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.password","title":"password class-attribute
instance-attribute
","text":"password: SecretStr = Field(default=..., alias='sfPassword', description='Password for the Snowflake user')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.role","title":"role class-attribute
instance-attribute
","text":"role: str = Field(default=..., alias='sfRole', description='The default security role to use for the session after connecting')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.sfSchema","title":"sfSchema class-attribute
instance-attribute
","text":"sfSchema: str = Field(default=..., alias='schema', description='The schema to use for the session after connecting')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.url","title":"url class-attribute
instance-attribute
","text":"url: str = Field(default=..., alias='sfURL', description='Hostname for the Snowflake account, e.g. <account>.snowflakecomputing.com', examples=['example.snowflakecomputing.com'])\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.user","title":"user class-attribute
instance-attribute
","text":"user: str = Field(default=..., alias='sfUser', description='Login name for the Snowflake user')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.warehouse","title":"warehouse class-attribute
instance-attribute
","text":"warehouse: str = Field(default=..., alias='sfWarehouse', description='The default virtual warehouse to use for the session after connecting')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeBaseModel.get_options","title":"get_options","text":"get_options()\n
Get the sfOptions as a dictionary.
Source code in src/koheesio/spark/snowflake.py
def get_options(self):\n \"\"\"Get the sfOptions as a dictionary.\"\"\"\n return {\n key: value\n for key, value in {\n \"sfURL\": self.url,\n \"sfUser\": self.user,\n \"sfPassword\": self.password.get_secret_value(),\n \"authenticator\": self.authenticator,\n \"sfDatabase\": self.database,\n \"sfSchema\": self.sfSchema,\n \"sfRole\": self.role,\n \"sfWarehouse\": self.warehouse,\n **self.options,\n }.items()\n if value is not None\n }\n
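As an illustration, the dictionary built by get_options would look roughly as follows; the connection values are placeholders, and authenticator is omitted because it defaults to None and None values are dropped:

# placeholder values, for illustration only; keys mirror the get_options implementation above
options = {
    "sfURL": "example.snowflakecomputing.com",
    "sfUser": "YOUR_USERNAME",
    "sfPassword": "***",
    "sfDatabase": "db",
    "sfSchema": "schema",
    "sfRole": "ADMIN",
    "sfWarehouse": "MY_WH",
    "sfCompress": "on",          # from the default `options` field
    "continue_on_error": "off",  # from the default `options` field
}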
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeReader","title":"koheesio.spark.snowflake.SnowflakeReader","text":"Wrapper around JdbcReader for Snowflake.
Example sr = SnowflakeReader(\n url=\"foo.snowflakecomputing.com\",\n user=\"YOUR_USERNAME\",\n password=\"***\",\n database=\"db\",\n schema=\"schema\",\n)\ndf = sr.read()\n
Notes - Snowflake is supported natively in Databricks 4.2 and newer: https://docs.snowflake.com/en/user-guide/spark-connector-databricks
- Refer to Snowflake docs for the installation instructions for non-Databricks environments: https://docs.snowflake.com/en/user-guide/spark-connector-install
- Refer to Snowflake docs for connection options: https://docs.snowflake.com/en/user-guide/spark-connector-use#setting-configuration-options-for-the-connector
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeReader.driver","title":"driver class-attribute
instance-attribute
","text":"driver: Optional[str] = None\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeStep","title":"koheesio.spark.snowflake.SnowflakeStep","text":"Expands the SnowflakeBaseModel so that it can be used as a Step
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeTableStep","title":"koheesio.spark.snowflake.SnowflakeTableStep","text":"Expands the SnowflakeStep, adding a 'table' parameter
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeTableStep.table","title":"table class-attribute
instance-attribute
","text":"table: str = Field(default=..., description='The name of the table')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeTableStep.get_options","title":"get_options","text":"get_options()\n
Source code in src/koheesio/spark/snowflake.py
def get_options(self):\n options = super().get_options()\n options[\"table\"] = self.table\n return options\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeTransformation","title":"koheesio.spark.snowflake.SnowflakeTransformation","text":"Adds Snowflake parameters to the Transformation class
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeWriter","title":"koheesio.spark.snowflake.SnowflakeWriter","text":"Class for writing to Snowflake
See Also - koheesio.steps.writers.Writer
- koheesio.steps.writers.BatchOutputMode
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeWriter.insert_type","title":"insert_type class-attribute
instance-attribute
","text":"insert_type: Optional[BatchOutputMode] = Field(APPEND, alias='mode', description='The insertion type, append or overwrite')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeWriter.table","title":"table class-attribute
instance-attribute
","text":"table: str = Field(default=..., description='Target table name')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SnowflakeWriter.execute","title":"execute","text":"execute()\n
Write to Snowflake
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n \"\"\"Write to Snowflake\"\"\"\n self.log.debug(f\"writing to {self.table} with mode {self.insert_type}\")\n self.df.write.format(self.format).options(**self.get_options()).option(\"dbtable\", self.table).mode(\n self.insert_type\n ).save()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema","title":"koheesio.spark.snowflake.SyncTableAndDataFrameSchema","text":"Sync the schema's of a Snowflake table and a DataFrame. This will add NULL columns for the columns that are not in both and perform type casts where needed.
The Snowflake table will take priority in case of type conflicts.
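A minimal usage sketch; the connection values are placeholders, `spark` is assumed to be an active SparkSession, and the output is accessed via the same .execute().<field> pattern used in the other examples on this page:

from koheesio.spark.snowflake import SyncTableAndDataFrameSchema

# assumes an active SparkSession `spark`; the data is a placeholder
my_df = spark.createDataFrame([(1, "Ohio")], ["id", "state"])

synced = SyncTableAndDataFrameSchema(
    url="example.snowflakecomputing.com",
    user="YOUR_USERNAME",
    password="***",
    database="db",
    schema="schema",
    role="ADMIN",
    warehouse="MY_WH",
    df=my_df,
    table="my_table",  # the Snowflake table to align with
    dry_run=True,      # only report schema differences, do not alter anything
).execute()
synced.df  # with dry_run=True this is the unchanged input DataFrame; differences are only logged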
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.df","title":"df class-attribute
instance-attribute
","text":"df: DataFrame = Field(default=..., description='The Spark DataFrame')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.dry_run","title":"dry_run class-attribute
instance-attribute
","text":"dry_run: Optional[bool] = Field(default=False, description='Only show schema differences, do not apply changes')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.table","title":"table class-attribute
instance-attribute
","text":"table: str = Field(default=..., description='The table name')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output","title":"Output","text":"Output class for SyncTableAndDataFrameSchema
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output.new_df_schema","title":"new_df_schema class-attribute
instance-attribute
","text":"new_df_schema: StructType = Field(default=..., description='New DataFrame schema')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output.new_sf_schema","title":"new_sf_schema class-attribute
instance-attribute
","text":"new_sf_schema: StructType = Field(default=..., description='New Snowflake schema')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output.original_df_schema","title":"original_df_schema class-attribute
instance-attribute
","text":"original_df_schema: StructType = Field(default=..., description='Original DataFrame schema')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output.original_sf_schema","title":"original_sf_schema class-attribute
instance-attribute
","text":"original_sf_schema: StructType = Field(default=..., description='Original Snowflake schema')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.Output.sf_table_altered","title":"sf_table_altered class-attribute
instance-attribute
","text":"sf_table_altered: bool = Field(default=False, description='Flag to indicate whether Snowflake schema has been altered')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SyncTableAndDataFrameSchema.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n self.log.warning(\"Snowflake table will always take a priority in case of data type conflicts!\")\n\n # spark side\n df_schema = self.df.schema\n self.output.original_df_schema = deepcopy(df_schema) # using deepcopy to avoid storing in place changes\n df_cols = [c.name.lower() for c in df_schema]\n\n # snowflake side\n sf_schema = GetTableSchema(**self.get_options(), table=self.table).execute().table_schema\n self.output.original_sf_schema = sf_schema\n sf_cols = [c.name.lower() for c in sf_schema]\n\n if self.dry_run:\n # Display differences between Spark DataFrame and Snowflake schemas\n # and provide dummy values that are expected as class outputs.\n self.log.warning(f\"Columns to be added to Snowflake table: {set(df_cols) - set(sf_cols)}\")\n self.log.warning(f\"Columns to be added to Spark DataFrame: {set(sf_cols) - set(df_cols)}\")\n\n self.output.new_df_schema = t.StructType()\n self.output.new_sf_schema = t.StructType()\n self.output.df = self.df\n self.output.sf_table_altered = False\n\n else:\n # Add columns to SnowFlake table that exist in DataFrame\n for df_column in df_schema:\n if df_column.name.lower() not in sf_cols:\n AddColumn(\n **self.get_options(),\n table=self.table,\n column=df_column.name,\n type=df_column.dataType,\n ).execute()\n self.output.sf_table_altered = True\n\n if self.output.sf_table_altered:\n sf_schema = GetTableSchema(**self.get_options(), table=self.table).execute().table_schema\n sf_cols = [c.name.lower() for c in sf_schema]\n\n self.output.new_sf_schema = sf_schema\n\n # Add NULL columns to the DataFrame if they exist in SnowFlake but not in the df\n df = self.df\n for sf_col in self.output.original_sf_schema:\n sf_col_name = sf_col.name.lower()\n if sf_col_name not in df_cols:\n sf_col_type = sf_col.dataType\n df = df.withColumn(sf_col_name, f.lit(None).cast(sf_col_type))\n\n # Put DataFrame columns in the same order as the Snowflake table\n df = df.select(*sf_cols)\n\n self.output.df = df\n self.output.new_df_schema = df.schema\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask","title":"koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask","text":"Synchronize a Delta table to a Snowflake table
- Overwrite - only in batch mode
- Append - supports batch and streaming mode
- Merge - only in streaming mode
Example SynchronizeDeltaToSnowflakeTask(\n url=\"acme.snowflakecomputing.com\",\n user=\"admin\",\n role=\"ADMIN\",\n warehouse=\"SF_WAREHOUSE\",\n database=\"SF_DATABASE\",\n schema=\"SF_SCHEMA\",\n source_table=DeltaTableStep(...),\n target_table=\"my_sf_table\",\n key_columns=[\n \"id\",\n ],\n streaming=False,\n).run()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.checkpoint_location","title":"checkpoint_location class-attribute
instance-attribute
","text":"checkpoint_location: Optional[str] = Field(default=None, description='Checkpoint location to use')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.enable_deletion","title":"enable_deletion class-attribute
instance-attribute
","text":"enable_deletion: Optional[bool] = Field(default=False, description='In case of merge synchronisation_mode add deletion statement in merge query.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.key_columns","title":"key_columns class-attribute
instance-attribute
","text":"key_columns: Optional[List[str]] = Field(default_factory=list, description='Key columns on which merge statements will be MERGE statement will be applied.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.non_key_columns","title":"non_key_columns property
","text":"non_key_columns: List[str]\n
Columns of source table that aren't part of the (composite) primary key
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.persist_staging","title":"persist_staging class-attribute
instance-attribute
","text":"persist_staging: Optional[bool] = Field(default=False, description='In case of debugging, set `persist_staging` to True to retain the staging table for inspection after synchronization.')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.reader","title":"reader property
","text":"reader\n
DeltaTable reader
Returns: DeltaTableReader that will yield the source delta table\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.schema_tracking_location","title":"schema_tracking_location class-attribute
instance-attribute
","text":"schema_tracking_location: Optional[str] = Field(default=None, description='Schema tracking location to use. Info: https://docs.delta.io/latest/delta-streaming.html#-schema-tracking')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.source_table","title":"source_table class-attribute
instance-attribute
","text":"source_table: DeltaTableStep = Field(default=..., description='Source delta table to synchronize')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.staging_table","title":"staging_table property
","text":"staging_table\n
Intermediate table on Snowflake where staging results are stored
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.staging_table_name","title":"staging_table_name class-attribute
instance-attribute
","text":"staging_table_name: Optional[str] = Field(default=None, alias='staging_table', description='Optional snowflake staging name', validate_default=False)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.streaming","title":"streaming class-attribute
instance-attribute
","text":"streaming: Optional[bool] = Field(default=False, description=\"Should synchronisation happen in streaming or in batch mode. Streaming is supported in 'APPEND' and 'MERGE' mode. Batch is supported in 'OVERWRITE' and 'APPEND' mode.\")\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.synchronisation_mode","title":"synchronisation_mode class-attribute
instance-attribute
","text":"synchronisation_mode: BatchOutputMode = Field(default=MERGE, description=\"Determines if synchronisation will 'overwrite' any existing table, 'append' new rows or 'merge' with existing rows.\")\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.target_table","title":"target_table class-attribute
instance-attribute
","text":"target_table: str = Field(default=..., description='Target table in snowflake to synchronize to')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.writer","title":"writer property
","text":"writer: Union[ForEachBatchStreamWriter, SnowflakeWriter]\n
Writer to persist to snowflake
Depending on the configured options, this returns a SnowflakeWriter or a ForEachBatchStreamWriter: - OVERWRITE/APPEND mode yields SnowflakeWriter - MERGE mode yields ForEachBatchStreamWriter
Returns:
Type Description Union[ForEachBatchStreamWriter, SnowflakeWriter]
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.writer_","title":"writer_ class-attribute
instance-attribute
","text":"writer_: Optional[Union[ForEachBatchStreamWriter, SnowflakeWriter]] = None\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.drop_table","title":"drop_table","text":"drop_table(snowflake_table)\n
Drop a given snowflake table
Source code in src/koheesio/spark/snowflake.py
def drop_table(self, snowflake_table):\n \"\"\"Drop a given snowflake table\"\"\"\n self.log.warning(f\"Dropping table {snowflake_table} from snowflake\")\n drop_table_query = f\"\"\"DROP TABLE IF EXISTS {snowflake_table}\"\"\"\n query_executor = RunQuery(**self.get_options(), query=drop_table_query)\n query_executor.execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.execute","title":"execute","text":"execute() -> None\n
Source code in src/koheesio/spark/snowflake.py
def execute(self) -> None:\n # extract\n df = self.extract()\n self.output.source_df = df\n\n # synchronize\n self.output.target_df = df\n self.load(df)\n if not self.persist_staging:\n # If it's a streaming job, await for termination before dropping staging table\n if self.streaming:\n self.writer.await_termination()\n self.drop_table(self.staging_table)\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.extract","title":"extract","text":"extract() -> DataFrame\n
Extract source table
Source code in src/koheesio/spark/snowflake.py
def extract(self) -> DataFrame:\n \"\"\"\n Extract source table\n \"\"\"\n if self.synchronisation_mode == BatchOutputMode.MERGE:\n if not self.source_table.is_cdf_active:\n raise RuntimeError(\n f\"Source table {self.source_table.table_name} does not have CDF enabled. \"\n f\"Set TBLPROPERTIES ('delta.enableChangeDataFeed' = true) to enable. \"\n f\"Current properties = {self.source_table_properties}\"\n )\n\n df = self.reader.read()\n self.output.source_df = df\n return df\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.load","title":"load","text":"load(df) -> DataFrame\n
Load source table into snowflake
Source code in src/koheesio/spark/snowflake.py
def load(self, df) -> DataFrame:\n \"\"\"Load source table into snowflake\"\"\"\n if self.synchronisation_mode == BatchOutputMode.MERGE:\n self.log.info(f\"Truncating staging table {self.staging_table}\")\n self.truncate_table(self.staging_table)\n self.writer.write(df)\n self.output.target_df = df\n return df\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.run","title":"run","text":"run()\n
alias of execute
Source code in src/koheesio/spark/snowflake.py
def run(self):\n \"\"\"alias of execute\"\"\"\n return self.execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.SynchronizeDeltaToSnowflakeTask.truncate_table","title":"truncate_table","text":"truncate_table(snowflake_table)\n
Truncate a given snowflake table
Source code in src/koheesio/spark/snowflake.py
def truncate_table(self, snowflake_table):\n \"\"\"Truncate a given snowflake table\"\"\"\n truncate_query = f\"\"\"TRUNCATE TABLE IF EXISTS {snowflake_table}\"\"\"\n query_executor = RunQuery(\n **self.get_options(),\n query=truncate_query,\n )\n query_executor.execute()\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TableExists","title":"koheesio.spark.snowflake.TableExists","text":"Check if the table exists in Snowflake by using INFORMATION_SCHEMA.
Example k = TableExists(\n url=\"foo.snowflakecomputing.com\",\n user=\"YOUR_USERNAME\",\n password=\"***\",\n database=\"db\",\n schema=\"schema\",\n table=\"table\",\n)\n
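The result can then be read from the step's Output, following the same .execute().<output field> pattern as the other examples on this page (a sketch continuing the example above):

exists = k.execute().exists  # True if the table was found in INFORMATION_SCHEMA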
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TableExists.Output","title":"Output","text":"Output class for TableExists
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TableExists.Output.exists","title":"exists class-attribute
instance-attribute
","text":"exists: bool = Field(default=..., description='Whether or not the table exists')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TableExists.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n query = (\n dedent(\n # Force upper case, due to case-sensitivity of where clause\n f\"\"\"\n SELECT *\n FROM INFORMATION_SCHEMA.TABLES\n WHERE TABLE_CATALOG = '{self.database}'\n AND TABLE_SCHEMA = '{self.sfSchema}'\n AND TABLE_TYPE = 'BASE TABLE'\n AND upper(TABLE_NAME) = '{self.table.upper()}'\n \"\"\" # nosec B608: hardcoded_sql_expressions\n )\n .upper()\n .strip()\n )\n\n self.log.debug(f\"Query that was executed to check if the table exists:\\n{query}\")\n\n df = Query(**self.get_options(), query=query).read()\n\n exists = df.count() > 0\n self.log.info(f\"Table {self.table} {'exists' if exists else 'does not exist'}\")\n self.output.exists = exists\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TagSnowflakeQuery","title":"koheesio.spark.snowflake.TagSnowflakeQuery","text":"Provides Snowflake query tag pre-action that can be used to easily find queries through SF history search and further group them for debugging and cost tracking purposes.
Takes in query tag attributes as kwargs and additional Snowflake options dict that can optionally contain other set of pre-actions to be applied to a query, in that case existing pre-action aren't dropped, query tag pre-action will be added to them.
Passed Snowflake options dictionary is not modified in-place, instead anew dictionary containing updated pre-actions is returned.
Notes See this article for explanation: https://select.dev/posts/snowflake-query-tags
Arbitrary tags can be applied, such as team, dataset names, business capability, etc.
Example query_tag = TagSnowflakeQuery(\n    options={\"preactions\": ...},\n    task_name=\"cleanse_task\",\n    pipeline_name=\"ingestion-pipeline\",\n    etl_date=\"2022-01-01\",\n    pipeline_execution_time=\"2022-01-01T00:00:00\",\n    task_execution_time=\"2022-01-01T01:00:00\",\n    environment=\"dev\",\n    trace_id=\"e0fdec43-a045-46e5-9705-acd4f3f96045\",\n    span_id=\"cb89abea-1c12-471f-8b12-546d2d66f6cb\",\n).execute().options\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TagSnowflakeQuery.options","title":"options class-attribute
instance-attribute
","text":"options: Dict = Field(default_factory=dict, description='Additional Snowflake options, optionally containing additional preactions')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TagSnowflakeQuery.Output","title":"Output","text":"Output class for AddQueryTag
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TagSnowflakeQuery.Output.options","title":"options class-attribute
instance-attribute
","text":"options: Dict = Field(default=..., description='Copy of provided SF options, with added query tag preaction')\n
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.TagSnowflakeQuery.execute","title":"execute","text":"execute()\n
Add query tag preaction to Snowflake options
Source code in src/koheesio/spark/snowflake.py
def execute(self):\n \"\"\"Add query tag preaction to Snowflake options\"\"\"\n tag_json = json.dumps(self.extra_params, indent=4, sort_keys=True)\n tag_preaction = f\"ALTER SESSION SET QUERY_TAG = '{tag_json}';\"\n preactions = self.options.get(\"preactions\", \"\")\n preactions = f\"{preactions}\\n{tag_preaction}\".strip()\n updated_options = dict(self.options)\n updated_options[\"preactions\"] = preactions\n self.output.options = updated_options\n
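For illustration, a small sketch of the pre-action string this builds; the tag keys and values below are assumptions, not required fields:

import json

# assumed tag values, for illustration only; mirrors the execute() implementation above
tag_json = json.dumps({"pipeline_name": "ingestion-pipeline", "team": "data-eng"}, indent=4, sort_keys=True)
tag_preaction = f"ALTER SESSION SET QUERY_TAG = '{tag_json}';"
# tag_preaction is appended to whatever is already present in options["preactions"]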
"},{"location":"api_reference/spark/snowflake.html#koheesio.spark.snowflake.map_spark_type","title":"koheesio.spark.snowflake.map_spark_type","text":"map_spark_type(spark_type: DataType)\n
Translates Spark DataFrame Schema type to SnowFlake type
Basic Types Snowflake Type StringType STRING NullType STRING BooleanType BOOLEAN Numeric Types Snowflake Type LongType BIGINT IntegerType INT ShortType SMALLINT DoubleType DOUBLE FloatType FLOAT NumericType FLOAT ByteType BINARY Date / Time Types Snowflake Type DateType DATE TimestampType TIMESTAMP Advanced Types Snowflake Type DecimalType DECIMAL MapType VARIANT ArrayType VARIANT StructType VARIANT References - Spark SQL DataTypes: https://spark.apache.org/docs/latest/sql-ref-datatypes.html
- Snowflake DataTypes: https://docs.snowflake.com/en/sql-reference/data-types.html
Parameters:
Name Type Description Default spark_type
DataType
DataType taken out of the StructField
required Returns:
Type Description str
The Snowflake data type
Source code in src/koheesio/spark/snowflake.py
def map_spark_type(spark_type: t.DataType):\n \"\"\"\n Translates Spark DataFrame Schema type to SnowFlake type\n\n | Basic Types | Snowflake Type |\n |-------------------|----------------|\n | StringType | STRING |\n | NullType | STRING |\n | BooleanType | BOOLEAN |\n\n | Numeric Types | Snowflake Type |\n |-------------------|----------------|\n | LongType | BIGINT |\n | IntegerType | INT |\n | ShortType | SMALLINT |\n | DoubleType | DOUBLE |\n | FloatType | FLOAT |\n | NumericType | FLOAT |\n | ByteType | BINARY |\n\n | Date / Time Types | Snowflake Type |\n |-------------------|----------------|\n | DateType | DATE |\n | TimestampType | TIMESTAMP |\n\n | Advanced Types | Snowflake Type |\n |-------------------|----------------|\n | DecimalType | DECIMAL |\n | MapType | VARIANT |\n | ArrayType | VARIANT |\n | StructType | VARIANT |\n\n References\n ----------\n - Spark SQL DataTypes: https://spark.apache.org/docs/latest/sql-ref-datatypes.html\n - Snowflake DataTypes: https://docs.snowflake.com/en/sql-reference/data-types.html\n\n Parameters\n ----------\n spark_type : pyspark.sql.types.DataType\n DataType taken out of the StructField\n\n Returns\n -------\n str\n The Snowflake data type\n \"\"\"\n # StructField means that the entire Field was passed, we need to extract just the dataType before continuing\n if isinstance(spark_type, t.StructField):\n spark_type = spark_type.dataType\n\n # Check if the type is DayTimeIntervalType\n if isinstance(spark_type, t.DayTimeIntervalType):\n warn(\n \"DayTimeIntervalType is being converted to STRING. \"\n \"Consider converting to a more supported date/time/timestamp type in Snowflake.\"\n )\n\n # fmt: off\n # noinspection PyUnresolvedReferences\n data_type_map = {\n # Basic Types\n t.StringType: \"STRING\",\n t.NullType: \"STRING\",\n t.BooleanType: \"BOOLEAN\",\n\n # Numeric Types\n t.LongType: \"BIGINT\",\n t.IntegerType: \"INT\",\n t.ShortType: \"SMALLINT\",\n t.DoubleType: \"DOUBLE\",\n t.FloatType: \"FLOAT\",\n t.NumericType: \"FLOAT\",\n t.ByteType: \"BINARY\",\n t.BinaryType: \"VARBINARY\",\n\n # Date / Time Types\n t.DateType: \"DATE\",\n t.TimestampType: \"TIMESTAMP\",\n t.DayTimeIntervalType: \"STRING\",\n\n # Advanced Types\n t.DecimalType:\n f\"DECIMAL({spark_type.precision},{spark_type.scale})\" # pylint: disable=no-member\n if isinstance(spark_type, t.DecimalType) else \"DECIMAL(38,0)\",\n t.MapType: \"VARIANT\",\n t.ArrayType: \"VARIANT\",\n t.StructType: \"VARIANT\",\n }\n return data_type_map.get(type(spark_type), 'STRING')\n
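A short usage sketch based on the mapping above; map_spark_type is imported from the module documented on this page:

import pyspark.sql.types as t
from koheesio.spark.snowflake import map_spark_type

map_spark_type(t.StringType())                # -> "STRING"
map_spark_type(t.DecimalType(38, 18))         # -> "DECIMAL(38,18)"
map_spark_type(t.ArrayType(t.IntegerType()))  # -> "VARIANT"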
"},{"location":"api_reference/spark/utils.html","title":"Utils","text":"Spark Utility functions
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.spark_minor_version","title":"koheesio.spark.utils.spark_minor_version module-attribute
","text":"spark_minor_version: float = get_spark_minor_version()\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype","title":"koheesio.spark.utils.SparkDatatype","text":"Allowed spark datatypes
The following table lists the data types that are supported by Spark SQL.
Data type SQL name ByteType BYTE, TINYINT ShortType SHORT, SMALLINT IntegerType INT, INTEGER LongType LONG, BIGINT FloatType FLOAT, REAL DoubleType DOUBLE DecimalType DECIMAL, DEC, NUMERIC StringType STRING BinaryType BINARY BooleanType BOOLEAN TimestampType TIMESTAMP, TIMESTAMP_LTZ DateType DATE ArrayType ARRAY MapType MAP NullType VOID Not supported yet - TimestampNTZType TIMESTAMP_NTZ
- YearMonthIntervalType INTERVAL YEAR, INTERVAL YEAR TO MONTH, INTERVAL MONTH
- DayTimeIntervalType INTERVAL DAY, INTERVAL DAY TO HOUR, INTERVAL DAY TO MINUTE, INTERVAL DAY TO SECOND, INTERVAL HOUR, INTERVAL HOUR TO MINUTE, INTERVAL HOUR TO SECOND, INTERVAL MINUTE, INTERVAL MINUTE TO SECOND, INTERVAL SECOND
See Also https://spark.apache.org/docs/latest/sql-ref-datatypes.html#supported-data-types
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.ARRAY","title":"ARRAY class-attribute
instance-attribute
","text":"ARRAY = 'array'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.BIGINT","title":"BIGINT class-attribute
instance-attribute
","text":"BIGINT = 'long'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.BINARY","title":"BINARY class-attribute
instance-attribute
","text":"BINARY = 'binary'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.BOOLEAN","title":"BOOLEAN class-attribute
instance-attribute
","text":"BOOLEAN = 'boolean'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.BYTE","title":"BYTE class-attribute
instance-attribute
","text":"BYTE = 'byte'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.DATE","title":"DATE class-attribute
instance-attribute
","text":"DATE = 'date'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.DEC","title":"DEC class-attribute
instance-attribute
","text":"DEC = 'decimal'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.DECIMAL","title":"DECIMAL class-attribute
instance-attribute
","text":"DECIMAL = 'decimal'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.DOUBLE","title":"DOUBLE class-attribute
instance-attribute
","text":"DOUBLE = 'double'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.FLOAT","title":"FLOAT class-attribute
instance-attribute
","text":"FLOAT = 'float'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.INT","title":"INT class-attribute
instance-attribute
","text":"INT = 'integer'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.INTEGER","title":"INTEGER class-attribute
instance-attribute
","text":"INTEGER = 'integer'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.LONG","title":"LONG class-attribute
instance-attribute
","text":"LONG = 'long'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.MAP","title":"MAP class-attribute
instance-attribute
","text":"MAP = 'map'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.NUMERIC","title":"NUMERIC class-attribute
instance-attribute
","text":"NUMERIC = 'decimal'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.REAL","title":"REAL class-attribute
instance-attribute
","text":"REAL = 'float'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.SHORT","title":"SHORT class-attribute
instance-attribute
","text":"SHORT = 'short'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.SMALLINT","title":"SMALLINT class-attribute
instance-attribute
","text":"SMALLINT = 'short'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.STRING","title":"STRING class-attribute
instance-attribute
","text":"STRING = 'string'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.TIMESTAMP","title":"TIMESTAMP class-attribute
instance-attribute
","text":"TIMESTAMP = 'timestamp'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.TIMESTAMP_LTZ","title":"TIMESTAMP_LTZ class-attribute
instance-attribute
","text":"TIMESTAMP_LTZ = 'timestamp'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.TINYINT","title":"TINYINT class-attribute
instance-attribute
","text":"TINYINT = 'byte'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.VOID","title":"VOID class-attribute
instance-attribute
","text":"VOID = 'void'\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.spark_type","title":"spark_type property
","text":"spark_type: DataType\n
Returns the spark type for the given enum value
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.SparkDatatype.from_string","title":"from_string classmethod
","text":"from_string(value: str) -> SparkDatatype\n
Allows for getting the right Enum value by simply passing a string value. This method is not case-sensitive.
Source code in src/koheesio/spark/utils.py
@classmethod\ndef from_string(cls, value: str) -> \"SparkDatatype\":\n \"\"\"Allows for getting the right Enum value by simply passing a string value\n This method is not case-sensitive\n \"\"\"\n return getattr(cls, value.upper())\n
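A short usage sketch; the import path follows the module documented on this page:

from koheesio.spark.utils import SparkDatatype

SparkDatatype.from_string("bigint")  # -> SparkDatatype.BIGINT
SparkDatatype.from_string("BigInt")  # case-insensitive, same result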
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.get_spark_minor_version","title":"koheesio.spark.utils.get_spark_minor_version","text":"get_spark_minor_version() -> float\n
Returns the minor version of the spark instance.
For example, if the spark version is 3.3.2, this function would return 3.3
Source code in src/koheesio/spark/utils.py
def get_spark_minor_version() -> float:\n \"\"\"Returns the minor version of the spark instance.\n\n For example, if the spark version is 3.3.2, this function would return 3.3\n \"\"\"\n return float(\".\".join(spark_version.split(\".\")[:2]))\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.on_databricks","title":"koheesio.spark.utils.on_databricks","text":"on_databricks() -> bool\n
Retrieve if we're running on databricks or elsewhere
Source code in src/koheesio/spark/utils.py
def on_databricks() -> bool:\n \"\"\"Retrieve if we're running on databricks or elsewhere\"\"\"\n dbr_version = os.getenv(\"DATABRICKS_RUNTIME_VERSION\", None)\n return dbr_version is not None and dbr_version != \"\"\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.schema_struct_to_schema_str","title":"koheesio.spark.utils.schema_struct_to_schema_str","text":"schema_struct_to_schema_str(schema: StructType) -> str\n
Converts a StructType to a schema str
Source code in src/koheesio/spark/utils.py
def schema_struct_to_schema_str(schema: StructType) -> str:\n \"\"\"Converts a StructType to a schema str\"\"\"\n if not schema:\n return \"\"\n return \",\\n\".join([f\"{field.name} {field.dataType.typeName().upper()}\" for field in schema.fields])\n
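A short usage sketch showing the string form produced for a small schema:

from pyspark.sql.types import IntegerType, StringType, StructField, StructType
from koheesio.spark.utils import schema_struct_to_schema_str

schema = StructType([StructField("id", IntegerType()), StructField("name", StringType())])
schema_struct_to_schema_str(schema)  # -> "id INTEGER,\nname STRING"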
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.spark_data_type_is_array","title":"koheesio.spark.utils.spark_data_type_is_array","text":"spark_data_type_is_array(data_type: DataType) -> bool\n
Check if the column's dataType is of type ArrayType
Source code in src/koheesio/spark/utils.py
def spark_data_type_is_array(data_type: DataType) -> bool:\n \"\"\"Check if the column's dataType is of type ArrayType\"\"\"\n return isinstance(data_type, ArrayType)\n
"},{"location":"api_reference/spark/utils.html#koheesio.spark.utils.spark_data_type_is_numeric","title":"koheesio.spark.utils.spark_data_type_is_numeric","text":"spark_data_type_is_numeric(data_type: DataType) -> bool\n
Check if the column's dataType is of a numeric type
Source code in src/koheesio/spark/utils.py
def spark_data_type_is_numeric(data_type: DataType) -> bool:\n    \"\"\"Check if the column's dataType is of a numeric type\"\"\"\n    return isinstance(data_type, (IntegerType, LongType, FloatType, DoubleType, DecimalType))\n
"},{"location":"api_reference/spark/readers/index.html","title":"Readers","text":"Readers are a type of Step that read data from a source based on the input parameters and stores the result in self.output.df.
For a comprehensive guide on the usage, examples, and additional features of Reader classes, please refer to the reference/concepts/steps/readers section of the Koheesio documentation.
"},{"location":"api_reference/spark/readers/index.html#koheesio.spark.readers.Reader","title":"koheesio.spark.readers.Reader","text":"Base class for all Readers
A Reader is a Step that reads data from a source based on the input parameters and stores the result in self.output.df (DataFrame).
When implementing a Reader, the execute() method should be implemented. The execute() method should read from the source and store the result in self.output.df.
The Reader class implements a standard read() method that calls the execute() method and returns the result. This method can be used to read data from a Reader without having to call the execute() method directly. The read() method does not need to be implemented in the child class.
Every Reader has a SparkSession available as self.spark. This is the currently active SparkSession.
The Reader class also implements a shorthand for accessing the output DataFrame through the df property. If the output.df is None, .execute() will be run first.
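A minimal sketch of a custom Reader following the contract described above; the class, its data, and the active SparkSession are assumptions for illustration, and the import path follows the source reference below:

from koheesio.spark.readers import Reader

class GreetingReader(Reader):
    """Hypothetical Reader that produces a one-row DataFrame."""

    def execute(self):
        # a Reader must store its result in self.output.df;
        # self.spark is the currently active SparkSession, as described above
        self.output.df = self.spark.createDataFrame([("hello",)], ["greeting"])

df = GreetingReader().read()  # read() runs execute() and returns self.output.df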
"},{"location":"api_reference/spark/readers/index.html#koheesio.spark.readers.Reader.df","title":"df property
","text":"df: Optional[DataFrame]\n
Shorthand for accessing self.output.df. If the output.df is None, .execute() will be run first
"},{"location":"api_reference/spark/readers/index.html#koheesio.spark.readers.Reader.execute","title":"execute abstractmethod
","text":"execute()\n
Execute on a Reader should handle self.output.df (output) as a minimum. Read from whichever source -> store the result in self.output.df
Source code in src/koheesio/spark/readers/__init__.py
@abstractmethod\ndef execute(self):\n \"\"\"Execute on a Reader should handle self.output.df (output) as a minimum\n Read from whichever source -> store result in self.output.df\n \"\"\"\n # self.output.df # output dataframe\n ...\n
"},{"location":"api_reference/spark/readers/index.html#koheesio.spark.readers.Reader.read","title":"read","text":"read() -> Optional[DataFrame]\n
Read from a Reader without having to call the execute() method directly
Source code in src/koheesio/spark/readers/__init__.py
def read(self) -> Optional[DataFrame]:\n \"\"\"Read from a Reader without having to call the execute() method directly\"\"\"\n self.execute()\n return self.output.df\n
"},{"location":"api_reference/spark/readers/delta.html","title":"Delta","text":"Read data from a Delta table and return a DataFrame or DataStream
Classes:
Name Description DeltaTableReader
Reads data from a Delta table and returns a DataFrame
DeltaTableStreamReader
Reads data from a Delta table and returns a DataStream
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.STREAMING_ONLY_OPTIONS","title":"koheesio.spark.readers.delta.STREAMING_ONLY_OPTIONS module-attribute
","text":"STREAMING_ONLY_OPTIONS = ['ignore_deletes', 'ignore_changes', 'starting_version', 'starting_timestamp', 'schema_tracking_location']\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.STREAMING_SCHEMA_WARNING","title":"koheesio.spark.readers.delta.STREAMING_SCHEMA_WARNING module-attribute
","text":"STREAMING_SCHEMA_WARNING = '\\nImportant!\\nAlthough you can start the streaming source from a specified version or timestamp, the schema of the streaming source is always the latest schema of the Delta table. You must ensure there is no incompatible schema change to the Delta table after the specified version or timestamp. Otherwise, the streaming source may return incorrect results when reading the data with an incorrect schema.'\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader","title":"koheesio.spark.readers.delta.DeltaTableReader","text":"Reads data from a Delta table and returns a DataFrame Delta Table can be read in batch or streaming mode It also supports reading change data feed (CDF) in both batch mode and streaming mode
Parameters:
Name Type Description Default table
Union[DeltaTableStep, str]
The table to read
required filter_cond
Optional[Union[Column, str]]
Filter condition to apply to the dataframe. Filters can be provided by using Column or string expressions. For example: f.col('state') == 'Ohio'
, state = 'Ohio'
or (col('col1') > 3) & (col('col2') < 9)
required columns
Columns to select from the table. One or many columns can be provided as strings. For example: ['col1', 'col2']
, ['col1']
or 'col1'
required streaming
Optional[bool]
Whether to read the table as a Stream or not
required read_change_feed
bool
readChangeFeed: Change Data Feed (CDF) feature allows Delta tables to track row-level changes between versions of a Delta table. When enabled on a Delta table, the runtime records 'change events' for all the data written into the table. This includes the row data along with metadata indicating whether the specified row was inserted, deleted, or updated. See: https://docs.databricks.com/delta/delta-change-data-feed.html
required starting_version
str
startingVersion: The Delta Lake version to start from. All table changes starting from this version (inclusive) will be read by the streaming source. You can obtain the commit versions from the version column of the DESCRIBE HISTORY command output.
required starting_timestamp
str
startingTimestamp: The timestamp to start from. All table changes committed at or after the timestamp (inclusive) will be read by the streaming source. Either provide a timestamp string (e.g. 2019-01-01T00:00:00.000Z) or a date string (e.g. 2019-01-01)
required ignore_deletes
bool
ignoreDeletes: Ignore transactions that delete data at partition boundaries. Note: Only supported for streaming tables. For more info see https://docs.databricks.com/structured-streaming/delta-lake.html#ignore-updates-and-deletes
required ignore_changes
bool
ignoreChanges: re-process updates if files had to be rewritten in the source table due to a data changing operation such as UPDATE, MERGE INTO, DELETE (within partitions), or OVERWRITE. Unchanged rows may still be emitted, therefore your downstream consumers should be able to handle duplicates. Deletes are not propagated downstream. ignoreChanges subsumes ignoreDeletes. Therefore if you use ignoreChanges, your stream will not be disrupted by either deletions or updates to the source table.
required"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.columns","title":"columns class-attribute
instance-attribute
","text":"columns: Optional[ListOfColumns] = Field(default=None, description=\"Columns to select from the table. One or many columns can be provided as strings. For example: `['col1', 'col2']`, `['col1']` or `'col1'` \")\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.filter_cond","title":"filter_cond class-attribute
instance-attribute
","text":"filter_cond: Optional[Union[Column, str]] = Field(default=None, alias='filterCondition', description=\"Filter condition to apply to the dataframe. Filters can be provided by using Column or string expressions For example: `f.col('state') == 'Ohio'`, `state = 'Ohio'` or `(col('col1') > 3) & (col('col2') < 9)`\")\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.ignore_changes","title":"ignore_changes class-attribute
instance-attribute
","text":"ignore_changes: bool = Field(default=False, alias='ignoreChanges', description='ignoreChanges: re-process updates if files had to be rewritten in the source table due to a data changing operation such as UPDATE, MERGE INTO, DELETE (within partitions), or OVERWRITE. Unchanged rows may still be emitted, therefore your downstream consumers should be able to handle duplicates. Deletes are not propagated downstream. ignoreChanges subsumes ignoreDeletes. Therefore if you use ignoreChanges, your stream will not be disrupted by either deletions or updates to the source table.')\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.ignore_deletes","title":"ignore_deletes class-attribute
instance-attribute
","text":"ignore_deletes: bool = Field(default=False, alias='ignoreDeletes', description='ignoreDeletes: Ignore transactions that delete data at partition boundaries. Note: Only supported for streaming tables. For more info see https://docs.databricks.com/structured-streaming/delta-lake.html#ignore-updates-and-deletes')\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.read_change_feed","title":"read_change_feed class-attribute
instance-attribute
","text":"read_change_feed: bool = Field(default=False, alias='readChangeFeed', description=\"Change Data Feed (CDF) feature allows Delta tables to track row-level changes between versions of a Delta table. When enabled on a Delta table, the runtime records 'change events' for all the data written into the table. This includes the row data along with metadata indicating whether the specified row was inserted, deleted, or updated. See: https://docs.databricks.com/delta/delta-change-data-feed.html\")\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.reader","title":"reader property
","text":"reader: Union[DataStreamReader, DataFrameReader]\n
Return the reader for the DeltaTableReader based on the streaming
attribute
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.schema_tracking_location","title":"schema_tracking_location class-attribute
instance-attribute
","text":"schema_tracking_location: Optional[str] = Field(default=None, alias='schemaTrackingLocation', description='schemaTrackingLocation: Track the location of source schema. Note: Recommend to enable Delta reader version: 3 and writer version: 7 for this option. For more info see https://docs.delta.io/latest/delta-column-mapping.html' + STREAMING_SCHEMA_WARNING)\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.skip_change_commits","title":"skip_change_commits class-attribute
instance-attribute
","text":"skip_change_commits: bool = Field(default=False, alias='skipChangeCommits', description='skipChangeCommits: Skip processing of change commits. Note: Only supported for streaming tables. (not supported in Open Source Delta Implementation). Prefer using skipChangeCommits over ignoreDeletes and ignoreChanges starting DBR12.1 and above. For more info see https://docs.databricks.com/structured-streaming/delta-lake.html#skip-change-commits')\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.starting_timestamp","title":"starting_timestamp class-attribute
instance-attribute
","text":"starting_timestamp: Optional[str] = Field(default=None, alias='startingTimestamp', description='startingTimestamp: The timestamp to start from. All table changes committed at or after the timestamp (inclusive) will be read by the streaming source. Either provide a timestamp string (e.g. 2019-01-01T00:00:00.000Z) or a date string (e.g. 2019-01-01)' + STREAMING_SCHEMA_WARNING)\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.starting_version","title":"starting_version class-attribute
instance-attribute
","text":"starting_version: Optional[str] = Field(default=None, alias='startingVersion', description='startingVersion: The Delta Lake version to start from. All table changes starting from this version (inclusive) will be read by the streaming source. You can obtain the commit versions from the version column of the DESCRIBE HISTORY command output.' + STREAMING_SCHEMA_WARNING)\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.streaming","title":"streaming class-attribute
instance-attribute
","text":"streaming: Optional[bool] = Field(default=False, description='Whether to read the table as a Stream or not')\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.table","title":"table class-attribute
instance-attribute
","text":"table: Union[DeltaTableStep, str] = Field(default=..., description='The table to read')\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.temp_view_name","title":"temp_view_name property
","text":"temp_view_name\n
Get the temporary view name for the dataframe for SQL queries
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.view","title":"view property
","text":"view\n
Create a temporary view of the dataframe for SQL queries
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/readers/delta.py
def execute(self):\n df = self.reader.table(self.table.table_name)\n if self.filter_cond is not None:\n df = df.filter(f.expr(self.filter_cond) if isinstance(self.filter_cond, str) else self.filter_cond)\n if self.columns is not None:\n df = df.select(*self.columns)\n self.output.df = df\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.get_options","title":"get_options","text":"get_options() -> Dict[str, Any]\n
Get the options for the DeltaTableReader based on the streaming
attribute
Source code in src/koheesio/spark/readers/delta.py
def get_options(self) -> Dict[str, Any]:\n \"\"\"Get the options for the DeltaTableReader based on the `streaming` attribute\"\"\"\n options = {\n # Enable Change Data Feed (CDF) feature\n \"readChangeFeed\": self.read_change_feed,\n # Initial position, one of:\n \"startingVersion\": self.starting_version,\n \"startingTimestamp\": self.starting_timestamp,\n }\n\n # Streaming only options\n if self.streaming:\n options = {\n **options,\n # Ignore updates and deletes, one of:\n \"ignoreDeletes\": self.ignore_deletes,\n \"ignoreChanges\": self.ignore_changes,\n \"skipChangeCommits\": self.skip_change_commits,\n \"schemaTrackingLocation\": self.schema_tracking_location,\n }\n # Batch only options\n else:\n pass # there are none... for now :)\n\n def normalize(v: Union[str, bool]):\n \"\"\"normalize values\"\"\"\n # True becomes \"true\", False becomes \"false\"\n v = str(v).lower() if isinstance(v, bool) else v\n return v\n\n # Any options with `value == None` are filtered out\n return {k: normalize(v) for k, v in options.items() if v is not None}\n
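To illustrate how these pieces fit together, here is a minimal usage sketch (not part of the original reference; the table name is hypothetical):
from koheesio.spark.readers.delta import DeltaTableReader\n\n# Batch read with an optional filter and column selection (table name is hypothetical)\nreader = DeltaTableReader(table=\"my_schema.my_table\", filter_cond=\"state = 'Ohio'\", columns=[\"id\", \"state\"])\ndf = reader.read()\n\n# Streaming read: streaming-only options such as ignoreDeletes are picked up by get_options()\nstream_reader = DeltaTableReader(table=\"my_schema.my_table\", streaming=True, ignore_deletes=True)\nprint(stream_reader.get_options())  # booleans are normalized to 'true'/'false'; None values are dropped\n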
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableReader.set_temp_view_name","title":"set_temp_view_name","text":"set_temp_view_name()\n
Set a temporary view name for the dataframe for SQL queries
Source code in src/koheesio/spark/readers/delta.py
@model_validator(mode=\"after\")\ndef set_temp_view_name(self):\n \"\"\"Set a temporary view name for the dataframe for SQL queries\"\"\"\n table_name = self.table.table\n vw_name = get_random_string(prefix=f\"tmp_{table_name}\")\n self.__temp_view_name__ = vw_name\n return self\n
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableStreamReader","title":"koheesio.spark.readers.delta.DeltaTableStreamReader","text":"Reads data from a Delta table and returns a DataStream
"},{"location":"api_reference/spark/readers/delta.html#koheesio.spark.readers.delta.DeltaTableStreamReader.streaming","title":"streaming class-attribute
instance-attribute
","text":"streaming: bool = True\n
"},{"location":"api_reference/spark/readers/dummy.html","title":"Dummy","text":"A simple DummyReader that returns a DataFrame with an id-column of the given range
"},{"location":"api_reference/spark/readers/dummy.html#koheesio.spark.readers.dummy.DummyReader","title":"koheesio.spark.readers.dummy.DummyReader","text":"A simple DummyReader that returns a DataFrame with an id-column of the given range
Can be used in place of any Reader without having to read from a real source.
Wraps SparkSession.range(). The output DataFrame will have a single column named \"id\" of type Long, with a length equal to the given range.
Parameters:
Name Type Description Default range
int
How large to make the Dataframe
required Example from koheesio.spark.readers.dummy import DummyReader\n\noutput_df = DummyReader(range=100).read()\n
output_df: Output DataFrame will have a single column named \"id\" of type Long
containing 100 rows (0-99).
id 0 1 ... 99"},{"location":"api_reference/spark/readers/dummy.html#koheesio.spark.readers.dummy.DummyReader.range","title":"range class-attribute
instance-attribute
","text":"range: int = Field(default=100, description='How large to make the Dataframe')\n
"},{"location":"api_reference/spark/readers/dummy.html#koheesio.spark.readers.dummy.DummyReader.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/readers/dummy.py
def execute(self):\n self.output.df = self.spark.range(self.range)\n
"},{"location":"api_reference/spark/readers/file_loader.html","title":"File loader","text":"Generic file Readers for different file formats.
Supported file formats: - CSV - Parquet - Avro - JSON - ORC - Text
Examples:
from koheesio.spark.readers import (\n CsvReader,\n ParquetReader,\n AvroReader,\n JsonReader,\n OrcReader,\n)\n\ncsv_reader = CsvReader(path=\"path/to/file.csv\", header=True)\nparquet_reader = ParquetReader(path=\"path/to/file.parquet\")\navro_reader = AvroReader(path=\"path/to/file.avro\")\njson_reader = JsonReader(path=\"path/to/file.json\")\norc_reader = OrcReader(path=\"path/to/file.orc\")\n
For more information about the available options, see Spark's official documentation.
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.AvroReader","title":"koheesio.spark.readers.file_loader.AvroReader","text":"Reads an Avro file.
This class is a convenience class that sets the format
field to FileFormat.avro
.
Extra parameters can be passed to the reader using the extra_params
attribute or as keyword arguments.
Example:
reader = AvroReader(path=\"path/to/file.avro\", mergeSchema=True)\n
Make sure to have the spark-avro
package installed in your environment.
For more information about the available options, see the official documentation.
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.AvroReader.format","title":"format class-attribute
instance-attribute
","text":"format: FileFormat = avro\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.CsvReader","title":"koheesio.spark.readers.file_loader.CsvReader","text":"Reads a CSV file.
This class is a convenience class that sets the format
field to FileFormat.csv
.
Extra parameters can be passed to the reader using the extra_params
attribute or as keyword arguments.
Example:
reader = CsvReader(path=\"path/to/file.csv\", header=True)\n
For more information about the available options, see the official pyspark documentation and read about CSV data source.
Also see the data sources generic options.
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.CsvReader.format","title":"format class-attribute
instance-attribute
","text":"format: FileFormat = csv\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat","title":"koheesio.spark.readers.file_loader.FileFormat","text":"Supported file formats.
This enum represents the supported file formats that can be used with the FileLoader class. The available file formats are: - csv: Comma-separated values format - parquet: Apache Parquet format - avro: Apache Avro format - json: JavaScript Object Notation format - orc: Apache ORC format - text: Plain text format
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.avro","title":"avro class-attribute
instance-attribute
","text":"avro = 'avro'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.csv","title":"csv class-attribute
instance-attribute
","text":"csv = 'csv'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.json","title":"json class-attribute
instance-attribute
","text":"json = 'json'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.orc","title":"orc class-attribute
instance-attribute
","text":"orc = 'orc'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.parquet","title":"parquet class-attribute
instance-attribute
","text":"parquet = 'parquet'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileFormat.text","title":"text class-attribute
instance-attribute
","text":"text = 'text'\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader","title":"koheesio.spark.readers.file_loader.FileLoader","text":"Generic file reader.
Available file formats:\n- CSV\n- Parquet\n- Avro\n- JSON\n- ORC\n- Text (default)\n\nExtra parameters can be passed to the reader using the `extra_params` attribute or as keyword arguments.\n\nExample:\n```python\nreader = FileLoader(path=\"path/to/textfile.txt\", format=\"text\", header=True, lineSep=\"\\n\")\n```\n
For more information about the available options, see Spark's\n[official pyspark documentation](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.text.html)\nand [read about text data source](https://spark.apache.org/docs/latest/sql-data-sources-text.html).\n\nAlso see the [data sources generic options](https://spark.apache.org/docs/3.5.0/sql-data-sources-generic-options.html).\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader.format","title":"format class-attribute
instance-attribute
","text":"format: FileFormat = Field(default=text, description='File format to read')\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader.path","title":"path class-attribute
instance-attribute
","text":"path: Union[Path, str] = Field(default=..., description='Path to the file to read')\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader.schema_","title":"schema_ class-attribute
instance-attribute
","text":"schema_: Optional[Union[StructType, str]] = Field(default=None, description='Schema to use when reading the file', validate_default=False, alias='schema')\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader.ensure_path_is_str","title":"ensure_path_is_str","text":"ensure_path_is_str(v)\n
Ensure that the path is a string as required by Spark.
Source code in src/koheesio/spark/readers/file_loader.py
@field_validator(\"path\")\ndef ensure_path_is_str(cls, v):\n \"\"\"Ensure that the path is a string as required by Spark.\"\"\"\n if isinstance(v, Path):\n return str(v.absolute().as_posix())\n return v\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.FileLoader.execute","title":"execute","text":"execute()\n
Reads the file using the specified format, schema, while applying any extra parameters.
Source code in src/koheesio/spark/readers/file_loader.py
def execute(self):\n \"\"\"Reads the file using the specified format, schema, while applying any extra parameters.\"\"\"\n reader = self.spark.read.format(self.format)\n\n if self.schema_:\n reader.schema(self.schema_)\n\n if self.extra_params:\n reader = reader.options(**self.extra_params)\n\n self.output.df = reader.load(self.path)\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.JsonReader","title":"koheesio.spark.readers.file_loader.JsonReader","text":"Reads a JSON file.
This class is a convenience class that sets the format
field to FileFormat.json
.
Extra parameters can be passed to the reader using the extra_params
attribute or as keyword arguments.
Example:
reader = JsonReader(path=\"path/to/file.json\", allowComments=True)\n
For more information about the available options, see the official pyspark documentation and read about JSON data source.
Also see the data sources generic options.
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.JsonReader.format","title":"format class-attribute
instance-attribute
","text":"format: FileFormat = json\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.OrcReader","title":"koheesio.spark.readers.file_loader.OrcReader","text":"Reads an ORC file.
This class is a convenience class that sets the format
field to FileFormat.orc
.
Extra parameters can be passed to the reader using the extra_params
attribute or as keyword arguments.
Example:
reader = OrcReader(path=\"path/to/file.orc\", mergeSchema=True)\n
For more information about the available options, see the official documentation and read about ORC data source.
Also see the data sources generic options.
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.OrcReader.format","title":"format class-attribute
instance-attribute
","text":"format: FileFormat = orc\n
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.ParquetReader","title":"koheesio.spark.readers.file_loader.ParquetReader","text":"Reads a Parquet file.
This class is a convenience class that sets the format
field to FileFormat.parquet
.
Extra parameters can be passed to the reader using the extra_params
attribute or as keyword arguments.
Example:
reader = ParquetReader(path=\"path/to/file.parquet\", mergeSchema=True)\n
For more information about the available options, see the official pyspark documentation and read about Parquet data source.
Also see the data sources generic options.
"},{"location":"api_reference/spark/readers/file_loader.html#koheesio.spark.readers.file_loader.ParquetReader.format","title":"format class-attribute
instance-attribute
","text":"format: FileFormat = parquet\n
"},{"location":"api_reference/spark/readers/hana.html","title":"Hana","text":"HANA reader.
"},{"location":"api_reference/spark/readers/hana.html#koheesio.spark.readers.hana.HanaReader","title":"koheesio.spark.readers.hana.HanaReader","text":"Wrapper around JdbcReader for SAP HANA
Notes - Refer to JdbcReader for the list of all available parameters.
- Refer to SAP HANA Client Interface Programming Reference docs for the list of all available connection string parameters: https://help.sap.com/docs/SAP_HANA_CLIENT/f1b440ded6144a54ada97ff95dac7adf/109397c2206a4ab2a5386d494f4cf75e.html
Example Note: jars should be added to the Spark session manually. This class does not take care of that.
This example depends on the SAP HANA ngdbc
JAR. e.g. ngdbc-2.5.49.
from koheesio.spark.readers.hana import HanaReader\njdbc_hana = HanaReader(\n    url=\"jdbc:sap://<domain_or_ip>:<port>/?<options>\",\n    user=\"YOUR_USERNAME\",\n    password=\"***\",\n    dbtable=\"schemaname.tablename\"\n)\ndf = jdbc_hana.read()\n
Parameters:
Name Type Description Default url
str
JDBC connection string. Refer to SAP HANA docs for the list of all available connection string parameters. Example: jdbc:sap://<domain_or_ip>:<port>[/?<options>] required user
str
required password
SecretStr
required dbtable
str
Database table name, also include schema name
required options
Optional[Dict[str, Any]]
Extra options to pass to the SAP HANA JDBC driver. Refer to SAP HANA docs for the list of all available connection string parameters. Example: {\"fetchsize\": 2000, \"numPartitions\": 10}
required query
Optional[str]
Query
required format
str
The type of format to load. Defaults to 'jdbc'. Should not be changed.
required driver
str
Driver name. Be aware that the driver jar needs to be passed to the task. Should not be changed.
required"},{"location":"api_reference/spark/readers/hana.html#koheesio.spark.readers.hana.HanaReader.driver","title":"driver class-attribute
instance-attribute
","text":"driver: str = Field(default='com.sap.db.jdbc.Driver', description='Make sure that the necessary JARs are available in the cluster: ngdbc-2-x.x.x.x')\n
"},{"location":"api_reference/spark/readers/hana.html#koheesio.spark.readers.hana.HanaReader.options","title":"options class-attribute
instance-attribute
","text":"options: Optional[Dict[str, Any]] = Field(default={'fetchsize': 2000, 'numPartitions': 10}, description='Extra options to pass to the SAP HANA JDBC driver')\n
"},{"location":"api_reference/spark/readers/jdbc.html","title":"Jdbc","text":"Module for reading data from JDBC sources.
Classes:
Name Description JdbcReader
Reader for JDBC tables.
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader","title":"koheesio.spark.readers.jdbc.JdbcReader","text":"Reader for JDBC tables.
Wrapper around Spark's jdbc read format
Notes - Query has precedence over dbtable. If query and dbtable both are filled in, dbtable will be ignored!
- Extra options to the spark reader can be passed through the
options
input. Refer to Spark documentation for details: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html - Consider using
fetchsize
as one of the options, as it greatly increases the performance of the reader - Consider using
numPartitions
, partitionColumn
, lowerBound
, upperBound
together with a real or synthetic partitioning column, as this will improve the reader performance
When implementing a JDBC reader, the get_options()
method should be implemented. The method should return a dict of options required for the specific JDBC driver. The get_options()
method can be overridden in the child class. Additionally, the driver
parameter should be set to the name of the JDBC driver. Be aware that the driver jar needs to be included in the Spark session; this class does not (and can not) take care of that!
Example Note: jars should be added to the Spark session manually. This class does not take care of that.
This example depends on the jar for MS SQL: https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/9.2.1.jre8/mssql-jdbc-9.2.1.jre8.jar
from koheesio.spark.readers.jdbc import JdbcReader\n\njdbc_mssql = JdbcReader(\n driver=\"com.microsoft.sqlserver.jdbc.SQLServerDriver\",\n url=\"jdbc:sqlserver://10.xxx.xxx.xxx:1433;databaseName=YOUR_DATABASE\",\n user=\"YOUR_USERNAME\",\n password=\"***\",\n dbtable=\"schemaname.tablename\",\n options={\"fetchsize\": 100},\n)\ndf = jdbc_mssql.read()\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.dbtable","title":"dbtable class-attribute
instance-attribute
","text":"dbtable: Optional[str] = Field(default=None, description='Database table name, also include schema name')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.driver","title":"driver class-attribute
instance-attribute
","text":"driver: str = Field(default=..., description='Driver name. Be aware that the driver jar needs to be passed to the task')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.format","title":"format class-attribute
instance-attribute
","text":"format: str = Field(default='jdbc', description=\"The type of format to load. Defaults to 'jdbc'.\")\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.options","title":"options class-attribute
instance-attribute
","text":"options: Optional[Dict[str, Any]] = Field(default_factory=dict, description='Extra options to pass to spark reader')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.password","title":"password class-attribute
instance-attribute
","text":"password: SecretStr = Field(default=..., description='Password belonging to the username')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.query","title":"query class-attribute
instance-attribute
","text":"query: Optional[str] = Field(default=None, description='Query')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.url","title":"url class-attribute
instance-attribute
","text":"url: str = Field(default=..., description='URL for the JDBC driver. Note, in some environments you need to use the IP Address instead of the hostname of the server.')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.user","title":"user class-attribute
instance-attribute
","text":"user: str = Field(default=..., description='User to authenticate to the server')\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.execute","title":"execute","text":"execute()\n
Wrapper around Spark's jdbc read format
Source code in src/koheesio/spark/readers/jdbc.py
def execute(self):\n \"\"\"Wrapper around Spark's jdbc read format\"\"\"\n\n # Can't have both dbtable and query empty\n if not self.dbtable and not self.query:\n raise ValueError(\"Please do not leave dbtable and query both empty!\")\n\n if self.query and self.dbtable:\n self.log.info(\"Both 'query' and 'dbtable' are filled in, 'dbtable' will be ignored!\")\n\n options = self.get_options()\n\n if pw := self.password:\n options[\"password\"] = pw.get_secret_value()\n\n if query := self.query:\n options[\"query\"] = query\n self.log.info(f\"Executing query: {self.query}\")\n else:\n options[\"dbtable\"] = self.dbtable\n\n self.output.df = self.spark.read.format(self.format).options(**options).load()\n
"},{"location":"api_reference/spark/readers/jdbc.html#koheesio.spark.readers.jdbc.JdbcReader.get_options","title":"get_options","text":"get_options()\n
Dictionary of options required for the specific JDBC driver.
Note: override this method if driver requires custom names, e.g. Snowflake: sfUrl
, sfUser
, etc.
Source code in src/koheesio/spark/readers/jdbc.py
def get_options(self):\n \"\"\"\n Dictionary of options required for the specific JDBC driver.\n\n Note: override this method if driver requires custom names, e.g. Snowflake: `sfUrl`, `sfUser`, etc.\n \"\"\"\n return {\n \"driver\": self.driver,\n \"url\": self.url,\n \"user\": self.user,\n \"password\": self.password,\n **self.options,\n }\n
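As a hedged sketch of the note above (the class and option names are illustrative, not an existing Koheesio reader), a child class could override get_options() to rename the keys that its driver expects:
from koheesio.spark.readers.jdbc import JdbcReader\n\n\nclass MyCustomJdbcReader(JdbcReader):\n    \"\"\"Hypothetical reader for a driver that expects custom option names.\"\"\"\n\n    driver: str = \"com.example.jdbc.Driver\"  # illustrative driver class name\n\n    def get_options(self):\n        # rename the generic keys to whatever this driver expects\n        return {\n            \"customUrl\": self.url,\n            \"customUser\": self.user,\n            \"password\": self.password,\n            \"driver\": self.driver,\n            **self.options,\n        }\n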
"},{"location":"api_reference/spark/readers/kafka.html","title":"Kafka","text":"Module for KafkaReader and KafkaStreamReader.
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader","title":"koheesio.spark.readers.kafka.KafkaReader","text":"Reader for Kafka topics.
Wrapper around Spark's kafka read format. Supports both batch and streaming reads.
Parameters:
Name Type Description Default read_broker
str
Kafka brokers to read from. Should be passed as a single string with multiple brokers passed in a comma separated list
required topic
str
Kafka topic to consume.
required streaming
Optional[bool]
Whether to read the kafka topic as a stream or not.
required params
Optional[Dict[str, str]]
Arbitrary options to be applied when creating NSP Reader. If a user provides values for subscribe
or kafka.bootstrap.servers
, they will be ignored in favor of configuration passed through topic
and read_broker
respectively. Defaults to an empty dictionary.
required Notes - The
read_broker
and topic
parameters are required. - The
streaming
parameter defaults to False
. - The
params
parameter defaults to an empty dictionary. This parameter is also aliased as kafka_options
. - Any extra kafka options can also be passed as keyword arguments; these will be merged with the
params
parameter
Example from koheesio.spark.readers.kafka import KafkaReader\n\nkafka_reader = KafkaReader(\n read_broker=\"kafka-broker-1:9092,kafka-broker-2:9092\",\n topic=\"my-topic\",\n streaming=True,\n # extra kafka options can be passed as key-word arguments\n startingOffsets=\"earliest\",\n)\n
In the example above, the KafkaReader
will read from the my-topic
Kafka topic, using the brokers kafka-broker-1:9092
and kafka-broker-2:9092
. The reader will read the topic as a stream and will start reading from the earliest available offset.
The stream can be started by calling the read
or execute
method on the kafka_reader
object.
Note: The KafkaStreamReader
could be used in the example above to achieve the same result. streaming
would default to True
in that case and could be omitted from the parameters.
See Also - Official Spark Documentation: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.batch_reader","title":"batch_reader property
","text":"batch_reader\n
Returns the Spark read object for batch processing.
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.logged_option_keys","title":"logged_option_keys property
","text":"logged_option_keys\n
Keys that are allowed to be logged for the options.
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.options","title":"options property
","text":"options\n
Merge fixed parameters with arbitrary options provided by user.
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.params","title":"params class-attribute
instance-attribute
","text":"params: Optional[Dict[str, str]] = Field(default_factory=dict, alias='kafka_options', description=\"Arbitrary options to be applied when creating NSP Reader. If a user provides values for 'subscribe' or 'kafka.bootstrap.servers', they will be ignored in favor of configuration passed through 'topic' and 'read_broker' respectively.\")\n
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.read_broker","title":"read_broker class-attribute
instance-attribute
","text":"read_broker: str = Field(..., description='Kafka brokers to read from, should be passed as a single string with multiple brokers passed in a comma separated list')\n
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.reader","title":"reader property
","text":"reader\n
Returns the appropriate reader based on the streaming flag.
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.stream_reader","title":"stream_reader property
","text":"stream_reader\n
Returns the Spark readStream object.
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.streaming","title":"streaming class-attribute
instance-attribute
","text":"streaming: Optional[bool] = Field(default=False, description='Whether to read the kafka topic as a stream or not. Defaults to False.')\n
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.topic","title":"topic class-attribute
instance-attribute
","text":"topic: str = Field(default=..., description='Kafka topic to consume.')\n
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaReader.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/readers/kafka.py
def execute(self):\n applied_options = {k: v for k, v in self.options.items() if k in self.logged_option_keys}\n self.log.debug(f\"Applying options {applied_options}\")\n\n self.output.df = self.reader.format(\"kafka\").options(**self.options).load()\n
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaStreamReader","title":"koheesio.spark.readers.kafka.KafkaStreamReader","text":"KafkaStreamReader is a KafkaReader that reads data as a stream
This class is identical to KafkaReader, with the streaming
parameter defaulting to True
.
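A minimal sketch (broker and topic are placeholders):
from koheesio.spark.readers.kafka import KafkaStreamReader\n\n# extra kafka options can still be passed as keyword arguments\nstream_reader = KafkaStreamReader(read_broker=\"kafka-broker-1:9092\", topic=\"my-topic\", startingOffsets=\"earliest\")\ndf = stream_reader.read()\n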
"},{"location":"api_reference/spark/readers/kafka.html#koheesio.spark.readers.kafka.KafkaStreamReader.streaming","title":"streaming class-attribute
instance-attribute
","text":"streaming: bool = True\n
"},{"location":"api_reference/spark/readers/memory.html","title":"Memory","text":"Create Spark DataFrame directly from the data stored in a Python variable
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.DataFormat","title":"koheesio.spark.readers.memory.DataFormat","text":"Data formats supported by the InMemoryDataReader
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.DataFormat.CSV","title":"CSV class-attribute
instance-attribute
","text":"CSV = 'csv'\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.DataFormat.JSON","title":"JSON class-attribute
instance-attribute
","text":"JSON = 'json'\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader","title":"koheesio.spark.readers.memory.InMemoryDataReader","text":"Directly read data from a Python variable and convert it to a Spark DataFrame.
Read data that is stored in one of the supported formats (see DataFormat
) directly from a Python variable and convert it to a Spark DataFrame. Use cases include converting the JSON output of an API into a DataFrame, or reading CSV data received through an API (e.g. the Box API).
The advantage of using this reader is that it reads the data directly from the Python variable, without the need to store it on disk first. This can be useful when the data is small and does not need to be stored permanently.
Parameters:
Name Type Description Default data
Union[str, list, dict, bytes]
Source data
required format
DataFormat
File / data format
required schema_
Optional[StructType]
Schema that will be applied during the creation of Spark DataFrame
None
params
Optional[Dict[str, Any]]
Set of extra parameters that should be passed to the appropriate reader (csv / json). Optionally, the user can pass the parameters that are specific to the reader (e.g. multiLine
for JSON reader) as keyword arguments. These will be merged with the params
parameter.
dict
Example # Read CSV data from a string\ndf1 = InMemoryDataReader(format=DataFormat.CSV, data='foo,bar\nA,1\nB,2')\n\n# Read JSON data from a string\ndf2 = InMemoryDataReader(format=DataFormat.JSON, data='{\"foo\": \"A\", \"bar\": 1}')\n\n# Read JSON data from a list of JSON strings\ndf3 = InMemoryDataReader(format=DataFormat.JSON, data=['{\"foo\": \"A\", \"bar\": 1}', '{\"foo\": \"B\", \"bar\": 2}'])\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader.data","title":"data class-attribute
instance-attribute
","text":"data: Union[str, list, dict, bytes] = Field(default=..., description='Source data')\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader.format","title":"format class-attribute
instance-attribute
","text":"format: DataFormat = Field(default=..., description='File / data format')\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader.params","title":"params class-attribute
instance-attribute
","text":"params: Optional[Dict[str, Any]] = Field(default_factory=dict, description='[Optional] Set of extra parameters that should be passed to the appropriate reader (csv / json)')\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader.schema_","title":"schema_ class-attribute
instance-attribute
","text":"schema_: Optional[StructType] = Field(default=None, alias='schema', description='[Optional] Schema that will be applied during the creation of Spark DataFrame')\n
"},{"location":"api_reference/spark/readers/memory.html#koheesio.spark.readers.memory.InMemoryDataReader.execute","title":"execute","text":"execute()\n
Execute method appropriate to the specific data format
Source code in src/koheesio/spark/readers/memory.py
def execute(self):\n \"\"\"\n Execute method appropriate to the specific data format\n \"\"\"\n _func = getattr(InMemoryDataReader, f\"_{self.format}\")\n _df = partial(_func, self, self._rdd)()\n self.output.df = _df\n
"},{"location":"api_reference/spark/readers/metastore.html","title":"Metastore","text":"Create Spark DataFrame from table in Metastore
"},{"location":"api_reference/spark/readers/metastore.html#koheesio.spark.readers.metastore.MetastoreReader","title":"koheesio.spark.readers.metastore.MetastoreReader","text":"Reader for tables/views from Spark Metastore
Parameters:
Name Type Description Default table
str
Table name in spark metastore
required"},{"location":"api_reference/spark/readers/metastore.html#koheesio.spark.readers.metastore.MetastoreReader.table","title":"table class-attribute
instance-attribute
","text":"table: str = Field(default=..., description='Table name in spark metastore')\n
"},{"location":"api_reference/spark/readers/metastore.html#koheesio.spark.readers.metastore.MetastoreReader.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/readers/metastore.py
def execute(self):\n self.output.df = self.spark.table(self.table)\n
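For illustration (the table name is hypothetical), reading a metastore table is a one-liner:
from koheesio.spark.readers.metastore import MetastoreReader\n\ndf = MetastoreReader(table=\"my_database.my_table\").read()\n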
"},{"location":"api_reference/spark/readers/rest_api.html","title":"Rest api","text":"This module provides the RestApiReader class for interacting with RESTful APIs.
The RestApiReader class is designed to fetch data from RESTful APIs and store the response in a DataFrame. It supports different transports, e.g. Paginated Http or Async HTTP. The main entry point is the execute
method, which performs the transport.execute() call and provides the data from the API calls.
For more details on how to use this class and its methods, refer to the class docstring.
"},{"location":"api_reference/spark/readers/rest_api.html#koheesio.spark.readers.rest_api.RestApiReader","title":"koheesio.spark.readers.rest_api.RestApiReader","text":"A reader class that executes an API call and stores the response in a DataFrame.
Parameters:
Name Type Description Default transport
Union[InstanceOf[AsyncHttpGetStep], InstanceOf[HttpGetStep]]
The HTTP transport step.
required spark_schema
Union[str, StructType, List[str], Tuple[str, ...], AtomicType]
The pyspark schema of the response.
required Attributes:
Name Type Description transport
Union[InstanceOf[AsyncHttpGetStep], InstanceOf[HttpGetStep]]
The HTTP transport step.
spark_schema
Union[str, StructType, List[str], Tuple[str, ...], AtomicType]
The pyspark schema of the response.
Returns:
Type Description Output
The output of the reader, which includes the DataFrame.
Examples:
Here are some examples of how to use this class:
Example 1: Paginated Transport
import requests\nfrom requests.adapters import HTTPAdapter\nfrom urllib3 import Retry\n\nfrom koheesio.steps.http import HttpGetStep, PaginatedHtppGetStep\nfrom koheesio.spark.readers.rest_api import RestApiReader\n\nmax_retries = 3\nsession = requests.Session()\nretry_logic = Retry(total=max_retries, status_forcelist=[503])\nsession.mount(\"https://\", HTTPAdapter(max_retries=retry_logic))\nsession.mount(\"http://\", HTTPAdapter(max_retries=retry_logic))\n\ntransport = PaginatedHtppGetStep(\n    url=\"https://api.example.com/data?page={page}\",\n    paginate=True,\n    pages=3,\n    session=session,\n)\ntask = RestApiReader(transport=transport, spark_schema=\"id: int, page:int, value: string\")\ntask.execute()\nall_data = [row.asDict() for row in task.output.df.collect()]\n
Example 2: Async Transport
from aiohttp import ClientSession, TCPConnector\nfrom aiohttp_retry import ExponentialRetry\nfrom yarl import URL\n\nfrom koheesio.steps.asyncio.http import AsyncHttpGetStep\nfrom koheesio.spark.readers.rest_api import RestApiReader\n\nsession = ClientSession()\nurls = [URL(\"http://httpbin.org/get\"), URL(\"http://httpbin.org/get\")]\nretry_options = ExponentialRetry()\nconnector = TCPConnector(limit=10)\ntransport = AsyncHttpGetStep(\n client_session=session,\n url=urls,\n retry_options=retry_options,\n connector=connector,\n)\n\ntask = RestApiReader(transport=transport, spark_schema=\"id: int, page:int, value: string\")\ntask.execute()\nall_data = [row.asDict() for row in task.output.df.collect()]\n
"},{"location":"api_reference/spark/readers/rest_api.html#koheesio.spark.readers.rest_api.RestApiReader.spark_schema","title":"spark_schema class-attribute
instance-attribute
","text":"spark_schema: Union[str, StructType, List[str], Tuple[str, ...], AtomicType] = Field(..., description='The pyspark schema of the response')\n
"},{"location":"api_reference/spark/readers/rest_api.html#koheesio.spark.readers.rest_api.RestApiReader.transport","title":"transport class-attribute
instance-attribute
","text":"transport: Union[InstanceOf[AsyncHttpGetStep], InstanceOf[HttpGetStep]] = Field(..., description='HTTP transport step', exclude=True)\n
"},{"location":"api_reference/spark/readers/rest_api.html#koheesio.spark.readers.rest_api.RestApiReader.execute","title":"execute","text":"execute() -> Output\n
Executes the API call and stores the response in a DataFrame.
Returns:
Type Description Output
The output of the reader, which includes the DataFrame.
Source code in src/koheesio/spark/readers/rest_api.py
def execute(self) -> Reader.Output:\n \"\"\"\n Executes the API call and stores the response in a DataFrame.\n\n Returns\n -------\n Reader.Output\n The output of the reader, which includes the DataFrame.\n \"\"\"\n raw_data = self.transport.execute()\n\n if isinstance(raw_data, HttpGetStep.Output):\n data = raw_data.response_json\n elif isinstance(raw_data, AsyncHttpGetStep.Output):\n data = [d for d, _ in raw_data.responses_urls] # type: ignore\n\n if data:\n self.output.df = self.spark.createDataFrame(data=data, schema=self.spark_schema) # type: ignore\n
"},{"location":"api_reference/spark/readers/snowflake.html","title":"Snowflake","text":"Module containing Snowflake reader classes.
This module contains classes for reading data from Snowflake. The classes are used to create a Spark DataFrame from a Snowflake table or a query.
Classes:
Name Description SnowflakeReader
Reader for Snowflake tables.
Query
Reader for Snowflake queries.
DbTableQuery
Reader for Snowflake queries that return a single row.
Notes The classes are defined in the koheesio.steps.integrations.snowflake module; this module simply inherits from the classes defined there.
See Also - koheesio.spark.readers.Reader Base class for all Readers.
- koheesio.steps.integrations.snowflake Module containing Snowflake classes.
More detailed class descriptions can be found in the class docstrings.
"},{"location":"api_reference/spark/readers/spark_sql_reader.html","title":"Spark sql reader","text":"This module contains the SparkSqlReader class which reads the SparkSQL compliant query and returns the dataframe.
"},{"location":"api_reference/spark/readers/spark_sql_reader.html#koheesio.spark.readers.spark_sql_reader.SparkSqlReader","title":"koheesio.spark.readers.spark_sql_reader.SparkSqlReader","text":"SparkSqlReader reads the SparkSQL compliant query and returns the dataframe.
This SQL can originate from a string or a file and may contain placeholders (parameters) for templating. - Placeholders are identified with ${placeholder}. - Placeholders can be passed as explicit params (params) or as implicit params (kwargs).
Example SQL script (example.sql):
SELECT id, id + 1 AS incremented_id, ${dynamic_column} AS extra_column\nFROM ${table_name}\n
Python code:
from koheesio.spark.readers import SparkSqlReader\n\nreader = SparkSqlReader(\n    sql_path=\"example.sql\",\n    # params can also be passed as kwargs\n    dynamic_column=\"name\",\n    table_name=\"my_table\",\n)\nreader.execute()\n
In this example, the SQL script is read from a file and the placeholders are replaced with the given params. The resulting SQL query is:
SELECT id, id + 1 AS incremented_id, name AS extra_column\nFROM my_table\n
The query is then executed and the resulting DataFrame is stored in the output.df
attribute.
Parameters:
Name Type Description Default sql_path
str or Path
Path to a SQL file
required sql
str
SQL query to execute
required params
dict
Placeholders (parameters) for templating. These are identified with ${placeholder} in the SQL script.
required Notes Any arbitrary kwargs passed to the class will be added to params.
"},{"location":"api_reference/spark/readers/spark_sql_reader.html#koheesio.spark.readers.spark_sql_reader.SparkSqlReader.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/readers/spark_sql_reader.py
def execute(self):\n self.output.df = self.spark.sql(self.query)\n
"},{"location":"api_reference/spark/readers/teradata.html","title":"Teradata","text":"Teradata reader.
"},{"location":"api_reference/spark/readers/teradata.html#koheesio.spark.readers.teradata.TeradataReader","title":"koheesio.spark.readers.teradata.TeradataReader","text":"Wrapper around JdbcReader for Teradata.
Notes - Consider using synthetic partitioning column when using partitioned read:
MOD(HASHBUCKET(HASHROW(<TABLE>.<COLUMN>)), <NUM_PARTITIONS>)
- Relevant jars should be added to the Spark session manually. This class does not take care of that.
See Also - Refer to JdbcReader for the list of all available parameters.
- Refer to Teradata docs for the list of all available connection string parameters: https://teradata-docs.s3.amazonaws.com/doc/connectivity/jdbc/reference/current/jdbcug_chapter_2.html#BABJIHBJ
Example This example depends on the Teradata terajdbc4
JAR. e.g. terajdbc4-17.20.00.15. Keep in mind that older versions of terajdbc4
drivers also require tdgssconfig
JAR.
from koheesio.spark.readers.teradata import TeradataReader\n\ntd = TeradataReader(\n url=\"jdbc:teradata://<domain_or_ip>/logmech=ldap,charset=utf8,database=<db>,type=fastexport, maybenull=on\",\n user=\"YOUR_USERNAME\",\n password=\"***\",\n dbtable=\"schemaname.tablename\",\n)\n
Parameters:
Name Type Description Default url
str
JDBC connection string. Refer to Teradata docs for the list of all available connection string parameters. Example: jdbc:teradata://<domain_or_ip>/logmech=ldap,charset=utf8,database=<db>,type=fastexport, maybenull=on
required user
str
Username
required password
SecretStr
Password
required dbtable
str
Database table name, also include schema name
required options
Optional[Dict[str, Any]]
Extra options to pass to the Teradata JDBC driver. Refer to Teradata docs for the list of all available connection string parameters.
{\"fetchsize\": 2000, \"numPartitions\": 10}
query
Optional[str]
Query
None
format
str
The type of format to load. Defaults to 'jdbc'. Should not be changed.
required driver
str
Driver name. Be aware that the driver jar needs to be passed to the task. Should not be changed.
required"},{"location":"api_reference/spark/readers/teradata.html#koheesio.spark.readers.teradata.TeradataReader.driver","title":"driver class-attribute
instance-attribute
","text":"driver: str = Field('com.teradata.jdbc.TeraDriver', description='Make sure that the necessary JARs are available in the cluster: terajdbc4-x.x.x.x')\n
"},{"location":"api_reference/spark/readers/teradata.html#koheesio.spark.readers.teradata.TeradataReader.options","title":"options class-attribute
instance-attribute
","text":"options: Optional[Dict[str, Any]] = Field({'fetchsize': 2000, 'numPartitions': 10}, description='Extra options to pass to the Teradata JDBC driver')\n
"},{"location":"api_reference/spark/readers/databricks/index.html","title":"Databricks","text":""},{"location":"api_reference/spark/readers/databricks/autoloader.html","title":"Autoloader","text":"Read from a location using Databricks' autoloader
Autoloader can ingest JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE file formats.
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader","title":"koheesio.spark.readers.databricks.autoloader.AutoLoader","text":"Read from a location using Databricks' autoloader
Autoloader can ingest JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE file formats.
Notes autoloader
is a Spark Structured Streaming
function!
Although most transformations are compatible with Spark Structured Streaming
, not all of them are. As a result, be mindful with your downstream transformations.
Parameters:
Name Type Description Default format
Union[str, AutoLoaderFormat]
The file format, used in cloudFiles.format
. Autoloader supports JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE file formats.
required location
str
The location where the files are located, used in cloudFiles.location
required schema_location
str
The location for storing inferred schema and supporting schema evolution, used in cloudFiles.schemaLocation
.
required options
Optional[Dict[str, str]]
Extra inputs to provide to the autoloader. For a full list of inputs, see https://docs.databricks.com/ingestion/auto-loader/options.html
{}
Example from koheesio.spark.readers.databricks import AutoLoader, AutoLoaderFormat\n\nresult_df = AutoLoader(\n format=AutoLoaderFormat.JSON,\n location=\"some_s3_path\",\n schema_location=\"other_s3_path\",\n options={\"multiLine\": \"true\"},\n).read()\n
See Also Some other useful documentation:
- autoloader: https://docs.databricks.com/ingestion/auto-loader/index.html
- Spark Structured Streaming: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.format","title":"format class-attribute
instance-attribute
","text":"format: Union[str, AutoLoaderFormat] = Field(default=..., description=__doc__)\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.location","title":"location class-attribute
instance-attribute
","text":"location: str = Field(default=..., description='The location where the files are located, used in `cloudFiles.location`')\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.options","title":"options class-attribute
instance-attribute
","text":"options: Optional[Dict[str, str]] = Field(default_factory=dict, description='Extra inputs to provide to the autoloader. For a full list of inputs, see https://docs.databricks.com/ingestion/auto-loader/options.html')\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.schema_location","title":"schema_location class-attribute
instance-attribute
","text":"schema_location: str = Field(default=..., alias='schemaLocation', description='The location for storing inferred schema and supporting schema evolution, used in `cloudFiles.schemaLocation`.')\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.execute","title":"execute","text":"execute()\n
Reads from the given location with the given options using Autoloader
Source code in src/koheesio/spark/readers/databricks/autoloader.py
def execute(self):\n \"\"\"Reads from the given location with the given options using Autoloader\"\"\"\n self.output.df = self.reader().load(self.location)\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.get_options","title":"get_options","text":"get_options()\n
Get the options for the autoloader
Source code in src/koheesio/spark/readers/databricks/autoloader.py
def get_options(self):\n \"\"\"Get the options for the autoloader\"\"\"\n self.options.update(\n {\n \"cloudFiles.format\": self.format,\n \"cloudFiles.schemaLocation\": self.schema_location,\n }\n )\n return self.options\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.reader","title":"reader","text":"reader()\n
Return the reader for the autoloader
Source code in src/koheesio/spark/readers/databricks/autoloader.py
def reader(self):\n \"\"\"Return the reader for the autoloader\"\"\"\n return self.spark.readStream.format(\"cloudFiles\").options(**self.get_options())\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoader.validate_format","title":"validate_format","text":"validate_format(format_specified)\n
Validate format
value
Source code in src/koheesio/spark/readers/databricks/autoloader.py
@field_validator(\"format\")\ndef validate_format(cls, format_specified):\n \"\"\"Validate `format` value\"\"\"\n if isinstance(format_specified, str):\n if format_specified.upper() in [f.value.upper() for f in AutoLoaderFormat]:\n format_specified = getattr(AutoLoaderFormat, format_specified.upper())\n return str(format_specified.value)\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat","title":"koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat","text":"The file format, used in cloudFiles.format
Autoloader supports JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE file formats.
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.AVRO","title":"AVRO class-attribute
instance-attribute
","text":"AVRO = 'avro'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.BINARYFILE","title":"BINARYFILE class-attribute
instance-attribute
","text":"BINARYFILE = 'binaryfile'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.CSV","title":"CSV class-attribute
instance-attribute
","text":"CSV = 'csv'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.JSON","title":"JSON class-attribute
instance-attribute
","text":"JSON = 'json'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.ORC","title":"ORC class-attribute
instance-attribute
","text":"ORC = 'orc'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.PARQUET","title":"PARQUET class-attribute
instance-attribute
","text":"PARQUET = 'parquet'\n
"},{"location":"api_reference/spark/readers/databricks/autoloader.html#koheesio.spark.readers.databricks.autoloader.AutoLoaderFormat.TEXT","title":"TEXT class-attribute
instance-attribute
","text":"TEXT = 'text'\n
"},{"location":"api_reference/spark/transformations/index.html","title":"Transformations","text":"This module contains the base classes for all transformations.
See class docstrings for more information.
References For a comprehensive guide on the usage, examples, and additional features of Transformation classes, please refer to the reference/concepts/steps/transformations section of the Koheesio documentation.
Classes:
Name Description Transformation
Base class for all transformations
ColumnsTransformation
Extended Transformation class with a preset validator for handling column(s) data
ColumnsTransformationWithTarget
Extended ColumnsTransformation class with an additional target_column
field
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation","title":"koheesio.spark.transformations.ColumnsTransformation","text":"Extended Transformation class with a preset validator for handling column(s) data with a standardized input for a single column or multiple columns.
Concept A ColumnsTransformation is a Transformation with a standardized input for column or columns. The columns
are stored as a list. Either a single string, or a list of strings can be passed to enter the columns
. column
and columns
are aliases to one another - internally the name columns
should be used though.
columns
are stored as a list - either a single string, or a list of strings can be passed to enter the
columns
column
and columns
are aliases to one another - internally the name columns
should be used though.
If more than one column is passed, the behavior of the Class changes this way: - the transformation will be run in a loop against all the given columns
Configuring the ColumnsTransformation The ColumnsTransformation class has a ColumnConfig
class that can be used to configure the behavior of the class. This class has the following fields: - run_for_all_data_type
allows to run the transformation for all columns of a given type.
-
limit_data_type
allows to limit the transformation to a specific data type.
-
data_type_strict_mode
Toggles strict mode for data type validation. Will only work if limit_data_type
is set.
Note that Data types need to be specified as a SparkDatatype enum.
See the docstrings of the ColumnConfig
class for more information. See the SparkDatatype enum for a list of available data types.
Users should not have to interact with the ColumnConfig
class directly.
Parameters:
Name Type Description Default columns
The column (or list of columns) to apply the transformation to. Alias: column
required Example from pyspark.sql import functions as f\nfrom koheesio.steps.transformations import ColumnsTransformation\n\n\nclass AddOne(ColumnsTransformation):\n def execute(self):\n for column in self.get_columns():\n self.output.df = self.df.withColumn(column, f.col(column) + 1)\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.columns","title":"columns class-attribute
instance-attribute
","text":"columns: ListOfColumns = Field(default='', alias='column', description='The column (or list of columns) to apply the transformation to. Alias: column')\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.data_type_strict_mode_is_set","title":"data_type_strict_mode_is_set property
","text":"data_type_strict_mode_is_set: bool\n
Returns True if data_type_strict_mode is set
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.limit_data_type_is_set","title":"limit_data_type_is_set property
","text":"limit_data_type_is_set: bool\n
Returns True if limit_data_type is set
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.run_for_all_is_set","title":"run_for_all_is_set property
","text":"run_for_all_is_set: bool\n
Returns True if the transformation should be run for all columns of a given type
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.ColumnConfig","title":"ColumnConfig","text":"Koheesio ColumnsTransformation specific Config
Parameters:
Name Type Description Default run_for_all_data_type
allows running the transformation for all columns of a given type. A user can trigger this behavior by either omitting the columns
parameter or by passing a single *
as a column name. In both cases, the run_for_all_data_type
will be used to determine the data type. Value should be passed as a SparkDatatype enum. (default: [None])
required limit_data_type
allows limiting the transformation to a specific data type. Value should be passed as a SparkDatatype enum. (default: [None])
required data_type_strict_mode
Toggles strict mode for data type validation. Will only work if limit_data_type
is set. - when True, a ValueError will be raised if any column does not adhere to the limit_data_type
- when False, a warning will be logged and the column will be skipped instead (default: False)
required"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.ColumnConfig.data_type_strict_mode","title":"data_type_strict_mode class-attribute
instance-attribute
","text":"data_type_strict_mode: bool = False\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type: Optional[List[SparkDatatype]] = [None]\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type: Optional[List[SparkDatatype]] = [None]\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.column_type_of_col","title":"column_type_of_col","text":"column_type_of_col(col: Union[str, Column], df: Optional[DataFrame] = None, simple_return_mode: bool = True) -> Union[DataType, str]\n
Returns the dataType of a Column object as a string.
The Column object does not have a type attribute, so we have to ask the DataFrame for its schema and find the type based on the column name. We retrieve the name of the column from the Column object by calling toString() via the JVM.
Examples:
input_df:
| str_column | int_column |
|------------|------------|
| hello      | 1          |
| world      | 2          |
# using the AddOne transformation from the example above\nadd_one = AddOne(\n columns=[\"str_column\", \"int_column\"],\n df=input_df,\n)\nadd_one.column_type_of_col(\"str_column\") # returns \"string\"\nadd_one.column_type_of_col(\"int_column\") # returns \"integer\"\n# returns IntegerType\nadd_one.column_type_of_col(\"int_column\", simple_return_mode=False)\n
Parameters:
Name Type Description Default col
Union[str, Column]
The column to check the type of
required df
Optional[DataFrame]
The DataFrame belonging to the column. If not provided, the DataFrame passed to the constructor will be used.
None
simple_return_mode
bool
If True, the return value will be a simple string. If False, the return value will be a Spark DataType object.
True
Returns:
Name Type Description datatype
str
The type of the column as a string (or as a DataType object when simple_return_mode is False)
Source code in src/koheesio/spark/transformations/__init__.py
def column_type_of_col(\n self, col: Union[str, Column], df: Optional[DataFrame] = None, simple_return_mode: bool = True\n) -> Union[DataType, str]:\n \"\"\"\n Returns the dataType of a Column object as a string.\n\n The Column object does not have a type attribute, so we have to ask the DataFrame its schema and find the type\n based on the column name. We retrieve the name of the column from the Column object by calling toString() from\n the JVM.\n\n Examples\n --------\n __input_df:__\n | str_column | int_column |\n |------------|------------|\n | hello | 1 |\n | world | 2 |\n\n ```python\n # using the AddOne transformation from the example above\n add_one = AddOne(\n columns=[\"str_column\", \"int_column\"],\n df=input_df,\n )\n add_one.column_type_of_col(\"str_column\") # returns \"string\"\n add_one.column_type_of_col(\"int_column\") # returns \"integer\"\n # returns IntegerType\n add_one.column_type_of_col(\"int_column\", simple_return_mode=False)\n ```\n\n Parameters\n ----------\n col: Union[str, Column]\n The column to check the type of\n\n df: Optional[DataFrame]\n The DataFrame belonging to the column. If not provided, the DataFrame passed to the constructor\n will be used.\n\n simple_return_mode: bool\n If True, the return value will be a simple string. If False, the return value will be a SparkDatatype enum.\n\n Returns\n -------\n datatype: str\n The type of the column as a string\n \"\"\"\n df = df or self.df\n if not df:\n raise RuntimeError(\"No valid Dataframe was passed\")\n\n if not isinstance(col, Column):\n col = f.col(col)\n\n # ask the JVM for the name of the column\n # noinspection PyProtectedMember\n col_name = col._jc.toString()\n\n # In order to check the datatype of the column, we have to ask the DataFrame its schema\n df_col = [c for c in df.schema if c.name == col_name][0]\n\n if simple_return_mode:\n return SparkDatatype(df_col.dataType.typeName()).value\n\n return df_col.dataType\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.get_all_columns_of_specific_type","title":"get_all_columns_of_specific_type","text":"get_all_columns_of_specific_type(data_type: Union[str, SparkDatatype]) -> List[str]\n
Get all columns from the dataframe of a given type
A DataFrame needs to be available in order to get the columns. If no DataFrame is available, a ValueError will be raised.
Note: only one data type can be passed to this method. If you want to get columns of multiple data types, you have to call this method multiple times.
Parameters:
Name Type Description Default data_type
Union[str, SparkDatatype]
The data type to get the columns for
required Returns:
Type Description List[str]
A list of column names of the given data type
Source code in src/koheesio/spark/transformations/__init__.py
def get_all_columns_of_specific_type(self, data_type: Union[str, SparkDatatype]) -> List[str]:\n \"\"\"Get all columns from the dataframe of a given type\n\n A DataFrame needs to be available in order to get the columns. If no DataFrame is available, a ValueError will\n be raised.\n\n Note: only one data type can be passed to this method. If you want to get columns of multiple data types, you\n have to call this method multiple times.\n\n Parameters\n ----------\n data_type: Union[str, SparkDatatype]\n The data type to get the columns for\n\n Returns\n -------\n List[str]\n A list of column names of the given data type\n \"\"\"\n if not self.df:\n raise ValueError(\"No dataframe available - cannot get columns\")\n\n expected_data_type = (SparkDatatype.from_string(data_type) if isinstance(data_type, str) else data_type).value\n\n columns_of_given_type: List[str] = [\n col for col in self.df.columns if self.df.schema[col].dataType.typeName() == expected_data_type\n ]\n return columns_of_given_type\n
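As noted above, only one data type can be passed per call; reusing the AddOne transformation and input_df from the earlier example, a usage sketch could look like this:
add_one = AddOne(columns=[\"str_column\", \"int_column\"], df=input_df)\nadd_one.get_all_columns_of_specific_type(\"string\")  # [\"str_column\"]\nadd_one.get_all_columns_of_specific_type(\"integer\")  # [\"int_column\"]\n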
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.get_columns","title":"get_columns","text":"get_columns() -> iter\n
Return an iterator of the columns
Source code in src/koheesio/spark/transformations/__init__.py
def get_columns(self) -> iter:\n \"\"\"Return an iterator of the columns\"\"\"\n # If `run_for_all_is_set` is True, we want to run the transformation for all columns of a given type\n if self.run_for_all_is_set:\n columns = []\n for data_type in self.ColumnConfig.run_for_all_data_type:\n columns += self.get_all_columns_of_specific_type(data_type)\n else:\n columns = self.columns\n\n for column in columns:\n if self.is_column_type_correct(column):\n yield column\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.get_limit_data_types","title":"get_limit_data_types","text":"get_limit_data_types()\n
Get the limit_data_type as a list of strings
Source code in src/koheesio/spark/transformations/__init__.py
def get_limit_data_types(self):\n \"\"\"Get the limit_data_type as a list of strings\"\"\"\n return [dt.value for dt in self.ColumnConfig.limit_data_type]\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.is_column_type_correct","title":"is_column_type_correct","text":"is_column_type_correct(column)\n
Check if column type is correct and handle it if not, when limit_data_type is set
Source code in src/koheesio/spark/transformations/__init__.py
def is_column_type_correct(self, column):\n \"\"\"Check if column type is correct and handle it if not, when limit_data_type is set\"\"\"\n if not self.limit_data_type_is_set:\n return True\n\n if self.column_type_of_col(column) in (limit_data_types := self.get_limit_data_types()):\n return True\n\n # Raises a ValueError if the Column object is not of a given type and data_type_strict_mode is set\n if self.data_type_strict_mode_is_set:\n raise ValueError(\n f\"Critical error: {column} is not of type {limit_data_types}. Exception is raised because \"\n f\"`data_type_strict_mode` is set to True for {self.name}.\"\n )\n\n # Otherwise, throws a warning that the Column object is not of a given type\n self.log.warning(f\"Column `{column}` is not of type `{limit_data_types}` and will be skipped.\")\n return False\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformation.set_columns","title":"set_columns","text":"set_columns(columns_value)\n
Validate columns through the columns configuration provided
Source code in src/koheesio/spark/transformations/__init__.py
@field_validator(\"columns\", mode=\"before\")\ndef set_columns(cls, columns_value):\n \"\"\"Validate columns through the columns configuration provided\"\"\"\n columns = columns_value\n run_for_all_data_type = cls.ColumnConfig.run_for_all_data_type\n\n if run_for_all_data_type and len(columns) == 0:\n columns = [\"*\"]\n\n if columns[0] == \"*\" and not run_for_all_data_type:\n raise ValueError(\"Cannot use '*' as a column name when no run_for_all_data_type is set\")\n\n return columns\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformationWithTarget","title":"koheesio.spark.transformations.ColumnsTransformationWithTarget","text":"Extended ColumnsTransformation class with an additional target_column
field
Using this class makes implementing Transformations significantly easier.
Concept A ColumnsTransformationWithTarget
is a ColumnsTransformation
with an additional target_column
field. This field can be used to store the result of the transformation in a new column.
If the target_column
is not provided, the result will be stored in the source column.
If more than one column is passed, the behavior of the class changes as follows:
- the transformation will be run in a loop against all the given columns
- the renaming of the columns is handled automatically
- the
target_column
will be used as a suffix. Leaving this blank will result in the original columns being renamed
The func
method should be implemented in the child class. This method should return the transformation that will be applied to the column(s). The execute method (already preset) will use the get_columns_with_target
method to loop over all the columns and apply this function to transform the DataFrame.
Parameters:
Name Type Description Default columns
ListOfColumns
The column (or list of columns) to apply the transformation to. Alias: column. If not provided, the run_for_all_data_type from the ColumnConfig will be used to determine which columns the transformation is run for (i.e. all columns of the configured data type).
*
target_column
Optional[str]
The name of the column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this input will be used as a suffix instead.
None
Example Writing your own transformation using the ColumnsTransformationWithTarget
class:
from pyspark.sql import Column\nfrom koheesio.steps.transformations import ColumnsTransformationWithTarget\n\n\nclass AddOneWithTarget(ColumnsTransformationWithTarget):\n def func(self, col: Column):\n return col + 1\n
In the above example, the func
method is implemented to add 1 to the values of a given column.
In order to use this transformation, we can call the transform
method:
from pyspark.sql import SparkSession\n\n# create a DataFrame with 3 rows\ndf = SparkSession.builder.getOrCreate().range(3)\n\noutput_df = AddOneWithTarget(column=\"id\", target_column=\"new_id\").transform(df)\n
The output_df
will now contain the original DataFrame with an additional column called new_id
with the values of id
+ 1.
output_df:
id new_id 0 1 1 2 2 3 Note: The target_column
will be used as a suffix when more than one column is given as source. Leaving this blank will result in the original columns being renamed.
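To illustrate the suffix behavior, a small sketch reusing the AddOneWithTarget class defined above (the id_copy column is made up for this example):
from pyspark.sql import SparkSession, functions as f\n\ndf = SparkSession.builder.getOrCreate().range(3).withColumn(\"id_copy\", f.col(\"id\"))\n\n# with multiple source columns, target_column acts as a suffix\noutput_df = AddOneWithTarget(columns=[\"id\", \"id_copy\"], target_column=\"plus_one\").transform(df)\n# resulting columns: id, id_copy, id_plus_one, id_copy_plus_one\n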
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformationWithTarget.target_column","title":"target_column class-attribute
instance-attribute
","text":"target_column: Optional[str] = Field(default=None, alias='target_suffix', description='The column to store the result in. If not provided, the result will be stored in the sourcecolumn. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix')\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformationWithTarget.execute","title":"execute","text":"execute()\n
Execute on a ColumnsTransformationWithTarget handles self.df (input) and set self.output.df (output) This can be left unchanged, and hence should not be implemented in the child class.
Source code in src/koheesio/spark/transformations/__init__.py
def execute(self):\n \"\"\"Execute on a ColumnsTransformationWithTarget handles self.df (input) and set self.output.df (output)\n This can be left unchanged, and hence should not be implemented in the child class.\n \"\"\"\n df = self.df\n\n for target_column, column in self.get_columns_with_target():\n func = self.func # select the applicable function\n df = df.withColumn(\n target_column,\n func(f.col(column)),\n )\n\n self.output.df = df\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformationWithTarget.func","title":"func abstractmethod
","text":"func(column: Column) -> Column\n
The function that will be run on a single Column of the DataFrame
The func
method should be implemented in the child class. This method should return the transformation that will be applied to the column(s). The execute method (already preset) will use the get_columns_with_target
method to loop over all the columns and apply this function to transform the DataFrame.
Parameters:
Name Type Description Default column
Column
The column to apply the transformation to
required Returns:
Type Description Column
The transformed column
Source code in src/koheesio/spark/transformations/__init__.py
@abstractmethod\ndef func(self, column: Column) -> Column:\n \"\"\"The function that will be run on a single Column of the DataFrame\n\n The `func` method should be implemented in the child class. This method should return the transformation that\n will be applied to the column(s). The execute method (already preset) will use the `get_columns_with_target`\n method to loop over all the columns and apply this function to transform the DataFrame.\n\n Parameters\n ----------\n column: Column\n The column to apply the transformation to\n\n Returns\n -------\n Column\n The transformed column\n \"\"\"\n raise NotImplementedError\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.ColumnsTransformationWithTarget.get_columns_with_target","title":"get_columns_with_target","text":"get_columns_with_target() -> iter\n
Return an iterator of the columns
Works just like in get_columns from the ColumnsTransformation class except that it handles the target_column
as well.
If more than one column is passed, the behavior of the class changes as follows: - the transformation will be run in a loop against all the given columns - the target_column will be used as a suffix. Leaving this blank will result in the original columns being renamed.
Returns:
Type Description iter
An iterator of tuples containing the target column name and the original column name
Source code in src/koheesio/spark/transformations/__init__.py
def get_columns_with_target(self) -> iter:\n \"\"\"Return an iterator of the columns\n\n Works just like in get_columns from the ColumnsTransformation class except that it handles the `target_column`\n as well.\n\n If more than one column is passed, the behavior of the Class changes this way:\n - the transformation will be run in a loop against all the given columns\n - the target_column will be used as a suffix. Leaving this blank will result in the original columns being\n renamed.\n\n Returns\n -------\n iter\n An iterator of tuples containing the target column name and the original column name\n \"\"\"\n columns = [*self.get_columns()]\n\n for column in columns:\n # ensures that we at least use the original column name\n target_column = self.target_column or column\n\n if len(columns) > 1: # target_column becomes a suffix when more than 1 column is given\n # dict.fromkeys is used to avoid duplicates in the name while maintaining order\n _cols = [column, target_column]\n target_column = \"_\".join(list(dict.fromkeys(_cols)))\n\n yield target_column, column\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.Transformation","title":"koheesio.spark.transformations.Transformation","text":"Base class for all transformations
Concept A Transformation is a Step that takes a DataFrame as input and returns a DataFrame as output. The DataFrame is transformed based on the logic implemented in the execute
method. Any additional parameters that are needed for the transformation can be passed to the constructor.
Parameters:
Name Type Description Default df
The DataFrame to apply the transformation to. If not provided, the DataFrame has to be passed to the transform method.
required Example from koheesio.steps.transformations import Transformation\nfrom pyspark.sql import functions as f\n\n\nclass AddOne(Transformation):\n def execute(self):\n self.output.df = self.df.withColumn(\"new_column\", f.col(\"old_column\") + 1)\n
In the example above, the execute
method is implemented to add 1 to the values of the old_column
and store the result in a new column called new_column
.
In order to use this transformation, we can call the transform
method:
from pyspark.sql import SparkSession\n\n# create a DataFrame with 3 rows\ndf = SparkSession.builder.getOrCreate().range(3)\n\noutput_df = AddOne().transform(df)\n
The output_df
will now contain the original DataFrame with an additional column called new_column
with the values of old_column
+ 1.
output_df:
id new_column 0 1 1 2 2 3 ... Alternatively, we can pass the DataFrame to the constructor and call the execute
or transform
method without any arguments:
output_df = AddOne(df).transform()\n# or\noutput_df = AddOne(df).execute().output.df\n
Note that the transform method was not implemented explicitly in the AddOne class. This is because the transform
method is already implemented in the Transformation
class. This means that all classes that inherit from the Transformation class will have the transform
method available. Only the execute method needs to be implemented.
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.Transformation.df","title":"df class-attribute
instance-attribute
","text":"df: Optional[DataFrame] = Field(default=None, description='The Spark DataFrame')\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.Transformation.execute","title":"execute abstractmethod
","text":"execute() -> Output\n
Execute on a Transformation should handle self.df (input) and set self.output.df (output)
This method should be implemented in the child class. The input DataFrame is available as self.df
and the output DataFrame should be stored in self.output.df
.
For example:
def execute(self):\n self.output.df = self.df.withColumn(\"new_column\", f.col(\"old_column\") + 1)\n
The transform method will call this method and return the output DataFrame.
Source code in src/koheesio/spark/transformations/__init__.py
@abstractmethod\ndef execute(self) -> SparkStep.Output:\n \"\"\"Execute on a Transformation should handle self.df (input) and set self.output.df (output)\n\n This method should be implemented in the child class. The input DataFrame is available as `self.df` and the\n output DataFrame should be stored in `self.output.df`.\n\n For example:\n ```python\n def execute(self):\n self.output.df = self.df.withColumn(\"new_column\", f.col(\"old_column\") + 1)\n ```\n\n The transform method will call this method and return the output DataFrame.\n \"\"\"\n # self.df # input dataframe\n # self.output.df # output dataframe\n self.output.df = ... # implement the transformation logic\n raise NotImplementedError\n
"},{"location":"api_reference/spark/transformations/index.html#koheesio.spark.transformations.Transformation.transform","title":"transform","text":"transform(df: Optional[DataFrame] = None) -> DataFrame\n
Execute the transformation and return the output DataFrame
Note: when creating a child from this, don't implement this transform method. Instead, implement execute!
See Also Transformation.execute
Parameters:
Name Type Description Default df
Optional[DataFrame]
The DataFrame to apply the transformation to. If not provided, the DataFrame passed to the constructor will be used.
None
Returns:
Type Description DataFrame
The transformed DataFrame
Source code in src/koheesio/spark/transformations/__init__.py
def transform(self, df: Optional[DataFrame] = None) -> DataFrame:\n \"\"\"Execute the transformation and return the output DataFrame\n\n Note: when creating a child from this, don't implement this transform method. Instead, implement execute!\n\n See Also\n --------\n `Transformation.execute`\n\n Parameters\n ----------\n df: Optional[DataFrame]\n The DataFrame to apply the transformation to. If not provided, the DataFrame passed to the constructor\n will be used.\n\n Returns\n -------\n DataFrame\n The transformed DataFrame\n \"\"\"\n self.df = df or self.df\n if not self.df:\n raise RuntimeError(\"No valid Dataframe was passed\")\n self.execute()\n return self.output.df\n
"},{"location":"api_reference/spark/transformations/arrays.html","title":"Arrays","text":"A collection of classes for performing various transformations on arrays in PySpark.
These transformations include operations such as removing duplicates, exploding arrays into separate rows, reversing the order of elements, sorting elements, removing certain values, and calculating aggregate statistics like minimum, maximum, sum, mean, and median.
Concept - Every transformation in this module is implemented as a class that inherits from the
ArrayTransformation
class. - The
ArrayTransformation
class is a subclass of ColumnsTransformationWithTarget
- The
ArrayTransformation
class implements the func
method, which is used to define the transformation logic. - The
func
method takes a column
as input and returns a Column
object. - The
Column
object is a PySpark column that can be used to perform transformations on a DataFrame column. - The
ArrayTransformation
limits the data type of the transformation to array by setting the ColumnConfig
class to run_for_all_data_type = [SparkDatatype.ARRAY]
and limit_data_type = [SparkDatatype.ARRAY]
.
See Also - koheesio.spark.transformations Module containing all transformation classes.
- koheesio.spark.transformations.ColumnsTransformationWithTarget Base class for all transformations that operate on columns and have a target column.
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySortAsc","title":"koheesio.spark.transformations.arrays.ArraySortAsc module-attribute
","text":"ArraySortAsc = ArraySort\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayDistinct","title":"koheesio.spark.transformations.arrays.ArrayDistinct","text":"Remove duplicates from array
Example ArrayDistinct(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayDistinct.filter_empty","title":"filter_empty class-attribute
instance-attribute
","text":"filter_empty: bool = Field(default=True, description='Remove null, nan, and empty values from array. Default is True.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayDistinct.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n _fn = F.array_distinct(column)\n\n # noinspection PyUnresolvedReferences\n element_type = self.column_type_of_col(column, None, False).elementType\n is_numeric = spark_data_type_is_numeric(element_type)\n\n if self.filter_empty:\n # Remove null values from array\n if spark_minor_version >= 3.4:\n # Run array_compact if spark version is 3.4 or higher\n # https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.array_compact.html\n # pylint: disable=E0611\n from pyspark.sql.functions import array_compact as _array_compact\n\n _fn = _array_compact(_fn)\n # pylint: enable=E0611\n else:\n # Otherwise, remove null from array using array_except\n _fn = F.array_except(_fn, F.array(F.lit(None)))\n\n # Remove nan or empty values from array (depends on the type of the elements in array)\n if is_numeric:\n # Remove nan from array (float/int/numbers)\n _fn = F.array_except(_fn, F.array(F.lit(float(\"nan\")).cast(element_type)))\n else:\n # Remove empty values from array (string/text)\n _fn = F.array_except(_fn, F.array(F.lit(\"\"), F.lit(\" \")))\n\n return _fn\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMax","title":"koheesio.spark.transformations.arrays.ArrayMax","text":"Return the maximum value in the array
Example ArrayMax(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMax.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n # Call for processing of nan values\n column = super().func(column)\n\n return F.array_max(column)\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMean","title":"koheesio.spark.transformations.arrays.ArrayMean","text":"Return the mean of the values in the array.
Note: Only numeric values are supported for calculating the mean.
Example ArrayMean(column=\"array_column\", target_column=\"average\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMean.func","title":"func","text":"func(column: Column) -> Column\n
Calculate the mean of the values in the array
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n \"\"\"Calculate the mean of the values in the array\"\"\"\n # raise an error if the array contains non-numeric elements\n element_type = self.column_type_of_col(col=column, df=None, simple_return_mode=False).elementType\n\n if not spark_data_type_is_numeric(element_type):\n raise ValueError(\n f\"{column = } contains non-numeric values. The array type is {element_type}. \"\n f\"Only numeric values are supported for calculating a mean.\"\n )\n\n _sum = ArraySum.from_step(self).func(column)\n # Call for processing of nan values\n column = super().func(column)\n _size = F.size(column)\n # return 0 if the size of the array is 0 to avoid division by zero\n return F.when(_size == 0, F.lit(0)).otherwise(_sum / _size)\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMedian","title":"koheesio.spark.transformations.arrays.ArrayMedian","text":"Return the median of the values in the array.
The median is the middle value in a sorted, ascending or descending, list of numbers.
- If the size of the array is even, the median is the average of the two middle numbers.
- If the size of the array is odd, the median is the middle number.
Note: Only numeric values are supported for calculating the median.
Example ArrayMedian(column=\"array_column\", target_column=\"median\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMedian.func","title":"func","text":"func(column: Column) -> Column\n
Calculate the median of the values in the array
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n \"\"\"Calculate the median of the values in the array\"\"\"\n # Call for processing of nan values\n column = super().func(column)\n\n sorted_array = ArraySort.from_step(self).func(column)\n _size: Column = F.size(sorted_array)\n\n # Calculate the middle index. If the size is odd, PySpark discards the fractional part.\n # Use floor function to ensure the result is an integer\n middle: Column = F.floor((_size + 1) / 2).cast(\"int\")\n\n # Define conditions\n is_size_zero: Column = _size == 0\n is_column_null: Column = column.isNull()\n is_size_even: Column = _size % 2 == 0\n\n # Define actions / responses\n # For even-sized arrays, calculate the average of the two middle elements\n average_of_middle_elements = (F.element_at(sorted_array, middle) + F.element_at(sorted_array, middle + 1)) / 2\n # For odd-sized arrays, select the middle element\n middle_element = F.element_at(sorted_array, middle)\n # In case the array is empty, return either None or 0\n none_value = F.lit(None)\n zero_value = F.lit(0)\n\n median = (\n # Check if the size of the array is 0\n F.when(\n is_size_zero,\n # If the size of the array is 0 and the column is null, return None\n # If the size of the array is 0 and the column is not null, return 0\n F.when(is_column_null, none_value).otherwise(zero_value),\n ).otherwise(\n # If the size of the array is not 0, calculate the median\n F.when(is_size_even, average_of_middle_elements).otherwise(middle_element)\n )\n )\n\n return median\n
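To make the even/odd behavior described above concrete, a usage sketch with made-up data:
from pyspark.sql import SparkSession\nfrom koheesio.spark.transformations.arrays import ArrayMedian\n\nspark = SparkSession.builder.getOrCreate()\ndf = spark.createDataFrame([(1, [3.0, 1.0, 2.0]), (2, [1.0, 2.0, 3.0, 4.0])], [\"id\", \"array_column\"])\n\noutput_df = ArrayMedian(column=\"array_column\", target_column=\"median\").transform(df)\n# row 1 (odd-sized array)  -> median 2.0 (the middle element after sorting)\n# row 2 (even-sized array) -> median 2.5 (average of the two middle elements)\n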
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMin","title":"koheesio.spark.transformations.arrays.ArrayMin","text":"Return the minimum value in the array
Example ArrayMin(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayMin.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n return F.array_min(column)\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayNullNanProcess","title":"koheesio.spark.transformations.arrays.ArrayNullNanProcess","text":"Process an array by removing NaN and/or NULL values from elements.
Parameters:
Name Type Description Default keep_nan
bool
Whether to keep NaN values in the array. If set to True, the NaN values will be kept in the array.
False
keep_null
bool
Whether to keep NULL values in the array. If set to True, the NULL values will be kept in the array.
False
Returns:
Name Type Description column
Column
The processed column with NaN and/or NULL values removed from elements.
Examples:
>>> input_data = [(1, [1.1, 2.1, 4.1, float(\"nan\")])]\n>>> input_schema = StructType([StructField(\"id\", IntegerType(), True),\n StructField(\"array_float\", ArrayType(FloatType()), True),\n])\n>>> spark = SparkSession.builder.getOrCreate()\n>>> df = spark.createDataFrame(input_data, schema=input_schema)\n>>> transformer = ArrayNullNanProcess(column=\"array_float\", keep_nan=False)\n>>> transformer.transform(df)\n>>> print(transformer.output.df.collect()[0].asDict()[\"array_float\"])\n[1.1, 2.1, 4.1]\n\n>>> input_data = [(1, [1.1, 2.2, 4.1, float(\"nan\")])]\n>>> input_schema = StructType([StructField(\"id\", IntegerType(), True),\n StructField(\"array_float\", ArrayType(FloatType()), True),\n])\n>>> spark = SparkSession.builder.getOrCreate()\n>>> df = spark.createDataFrame(input_data, schema=input_schema)\n>>> transformer = ArrayNullNanProcess(column=\"array_float\", keep_nan=True)\n>>> transformer.transform(df)\n>>> print(transformer.output.df.collect()[0].asDict()[\"array_float\"])\n[1.1, 2.2, 4.1, nan]\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayNullNanProcess.keep_nan","title":"keep_nan class-attribute
instance-attribute
","text":"keep_nan: bool = Field(False, description='Whether to keep nan values in the array. Default is False. If set to True, the nan values will be kept in the array.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayNullNanProcess.keep_null","title":"keep_null class-attribute
instance-attribute
","text":"keep_null: bool = Field(False, description='Whether to keep null values in the array. Default is False. If set to True, the null values will be kept in the array.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayNullNanProcess.func","title":"func","text":"func(column: Column) -> Column\n
Process the given column by removing NaN and/or NULL values from elements.
Parameters: column : Column The column to be processed.
Returns: column : Column The processed column with NaN and/or NULL values removed from elements.
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n \"\"\"\n Process the given column by removing NaN and/or NULL values from elements.\n\n Parameters:\n -----------\n column : Column\n The column to be processed.\n\n Returns:\n --------\n column : Column\n The processed column with NaN and/or NULL values removed from elements.\n \"\"\"\n\n def apply_logic(x: Column):\n if self.keep_nan is False and self.keep_null is False:\n logic = x.isNotNull() & ~F.isnan(x)\n elif self.keep_nan is False:\n logic = ~F.isnan(x)\n elif self.keep_null is False:\n logic = x.isNotNull()\n\n return logic\n\n if self.keep_nan is False or self.keep_null is False:\n column = F.filter(column, apply_logic)\n\n return column\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayRemove","title":"koheesio.spark.transformations.arrays.ArrayRemove","text":"Remove a certain value from the array
Parameters:
Name Type Description Default keep_nan
bool
Whether to keep NaN values in the array. If set to True, the NaN values will be kept in the array.
False
keep_null
bool
Whether to keep NULL values in the array. If set to True, the NULL values will be kept in the array.
False
Example ArrayRemove(column=\"array_column\", value=\"value_to_remove\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayRemove.make_distinct","title":"make_distinct class-attribute
instance-attribute
","text":"make_distinct: bool = Field(default=False, description='Whether to remove duplicates from the array.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayRemove.value","title":"value class-attribute
instance-attribute
","text":"value: Any = Field(default=None, description='The value to remove from the array.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayRemove.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n value = self.value\n\n column = super().func(column)\n\n def filter_logic(x: Column, _val: Any):\n if self.keep_null and self.keep_nan:\n logic = (x != F.lit(_val)) | x.isNull() | F.isnan(x)\n elif self.keep_null:\n logic = (x != F.lit(_val)) | x.isNull()\n elif self.keep_nan:\n logic = (x != F.lit(_val)) | F.isnan(x)\n else:\n logic = x != F.lit(_val)\n\n return logic\n\n # Check if the value is iterable (i.e., a list, tuple, or set)\n if isinstance(value, (list, tuple, set)):\n result = reduce(lambda res, val: F.filter(res, lambda x: filter_logic(x, val)), value, column)\n else:\n # If the value is not iterable, simply remove the value from the array\n result = F.filter(column, lambda x: filter_logic(x, value))\n\n if self.make_distinct:\n result = F.array_distinct(result)\n\n return result\n
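As the filter logic above shows, value may also be an iterable (list, tuple, or set); a short sketch with made-up values:
ArrayRemove(column=\"array_column\", value=[1, 2])  # removes both 1 and 2 from the array\nArrayRemove(column=\"array_column\", value=\"n/a\", make_distinct=True)  # also deduplicates the result\n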
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayReverse","title":"koheesio.spark.transformations.arrays.ArrayReverse","text":"Reverse the order of elements in the array
Example ArrayReverse(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayReverse.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n return F.reverse(column)\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySort","title":"koheesio.spark.transformations.arrays.ArraySort","text":"Sort the elements in the array
By default, the elements are sorted in ascending order. To sort the elements in descending order, set the reverse
parameter to True.
Example ArraySort(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySort.reverse","title":"reverse class-attribute
instance-attribute
","text":"reverse: bool = Field(default=False, description='Sort the elements in the array in a descending order. Default is False.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySort.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n column = F.array_sort(column)\n if self.reverse:\n # Reverse the order of elements in the array\n column = ArrayReverse.from_step(self).func(column)\n return column\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySortDesc","title":"koheesio.spark.transformations.arrays.ArraySortDesc","text":"Sort the elements in the array in descending order
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySortDesc.reverse","title":"reverse class-attribute
instance-attribute
","text":"reverse: bool = True\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySum","title":"koheesio.spark.transformations.arrays.ArraySum","text":"Return the sum of the values in the array
Parameters:
Name Type Description Default keep_nan
bool
Whether to keep NaN values in the array. If set to True, the NaN values will be kept in the array.
False
keep_null
bool
Whether to keep NULL values in the array. If set to True, the NULL values will be kept in the array.
False
Example ArraySum(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArraySum.func","title":"func","text":"func(column: Column) -> Column\n
Using the aggregate
function to sum the values in the array
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n \"\"\"Using the `aggregate` function to sum the values in the array\"\"\"\n # raise an error if the array contains non-numeric elements\n element_type = self.column_type_of_col(column, None, False).elementType\n if not spark_data_type_is_numeric(element_type):\n raise ValueError(\n f\"{column = } contains non-numeric values. The array type is {element_type}. \"\n f\"Only numeric values are supported for summing.\"\n )\n\n # remove na values from array.\n column = super().func(column)\n\n # Using the `aggregate` function to sum the values in the array by providing the initial value as 0.0 and the\n # lambda function to add the elements together. Pyspark will automatically infer the type of the initial value\n # making 0.0 valid for both integer and float types.\n initial_value = F.lit(0.0)\n return F.aggregate(column, initial_value, lambda accumulator, x: accumulator + x)\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayTransformation","title":"koheesio.spark.transformations.arrays.ArrayTransformation","text":"Base class for array transformations
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayTransformation.ColumnConfig","title":"ColumnConfig","text":"Set the data type of the Transformation to array
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayTransformation.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [ARRAY]\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayTransformation.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [ARRAY]\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ArrayTransformation.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n raise NotImplementedError(\"This is an abstract class\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.Explode","title":"koheesio.spark.transformations.arrays.Explode","text":"Explode the array into separate rows
Example Explode(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.Explode.distinct","title":"distinct class-attribute
instance-attribute
","text":"distinct: bool = Field(False, description='Remove duplicates from the exploded array. Default is False.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.Explode.preserve_nulls","title":"preserve_nulls class-attribute
instance-attribute
","text":"preserve_nulls: bool = Field(True, description='Preserve rows with null values in the exploded array by using explode_outer instead of explode.Default is True.')\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.Explode.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/arrays.py
def func(self, column: Column) -> Column:\n if self.distinct:\n column = ArrayDistinct.from_step(self).func(column)\n return F.explode_outer(column) if self.preserve_nulls else F.explode(column)\n
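A short sketch of the two parameters handled in func above:
Explode(column=\"array_column\")  # uses explode_outer, keeping rows where the array is null\nExplode(column=\"array_column\", distinct=True, preserve_nulls=False)  # deduplicate first, then use plain explode\n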
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ExplodeDistinct","title":"koheesio.spark.transformations.arrays.ExplodeDistinct","text":"Explode the array into separate rows while removing duplicates and empty values
Example ExplodeDistinct(column=\"array_column\")\n
"},{"location":"api_reference/spark/transformations/arrays.html#koheesio.spark.transformations.arrays.ExplodeDistinct.distinct","title":"distinct class-attribute
instance-attribute
","text":"distinct: bool = True\n
"},{"location":"api_reference/spark/transformations/camel_to_snake.html","title":"Camel to snake","text":"Class for converting DataFrame column names from camel case to snake case.
"},{"location":"api_reference/spark/transformations/camel_to_snake.html#koheesio.spark.transformations.camel_to_snake.camel_to_snake_re","title":"koheesio.spark.transformations.camel_to_snake.camel_to_snake_re module-attribute
","text":"camel_to_snake_re = compile('([a-z0-9])([A-Z])')\n
"},{"location":"api_reference/spark/transformations/camel_to_snake.html#koheesio.spark.transformations.camel_to_snake.CamelToSnakeTransformation","title":"koheesio.spark.transformations.camel_to_snake.CamelToSnakeTransformation","text":"Converts column names from camel case to snake cases
Parameters:
Name Type Description Default columns
Optional[ListOfColumns]
The column or columns to convert. If no columns are specified, all columns will be converted. A list of columns or a single column can be specified. For example: [\"column1\", \"column2\"]
or \"column1\"
None
Example input_df:
camelCaseColumn snake_case_column ... ... output_df = CamelToSnakeTransformation(column=\"camelCaseColumn\").transform(input_df)\n
output_df:
camel_case_column snake_case_column ... ... In this example, the column camelCaseColumn
is converted to camel_case_column
.
Note: the data in the columns is not changed, only the column names.
"},{"location":"api_reference/spark/transformations/camel_to_snake.html#koheesio.spark.transformations.camel_to_snake.CamelToSnakeTransformation.columns","title":"columns class-attribute
instance-attribute
","text":"columns: Optional[ListOfColumns] = Field(default='', alias='column', description=\"The column or columns to convert. If no columns are specified, all columns will be converted. A list of columns or a single column can be specified. For example: `['column1', 'column2']` or `'column1'` \")\n
"},{"location":"api_reference/spark/transformations/camel_to_snake.html#koheesio.spark.transformations.camel_to_snake.CamelToSnakeTransformation.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/transformations/camel_to_snake.py
def execute(self):\n _df = self.df\n\n # Prepare columns input:\n columns = self.df.columns if self.columns == [\"*\"] else self.columns\n\n for column in columns:\n _df = _df.withColumnRenamed(column, convert_camel_to_snake(column))\n\n self.output.df = _df\n
"},{"location":"api_reference/spark/transformations/camel_to_snake.html#koheesio.spark.transformations.camel_to_snake.convert_camel_to_snake","title":"koheesio.spark.transformations.camel_to_snake.convert_camel_to_snake","text":"convert_camel_to_snake(name: str)\n
Converts a string from camelCase to snake_case.
Parameters: name : str The string to be converted.
Returns: str The converted string in snake_case.
Source code in src/koheesio/spark/transformations/camel_to_snake.py
def convert_camel_to_snake(name: str):\n \"\"\"\n Converts a string from camelCase to snake_case.\n\n Parameters:\n ----------\n name : str\n The string to be converted.\n\n Returns:\n --------\n str\n The converted string in snake_case.\n \"\"\"\n return camel_to_snake_re.sub(r\"\\1_\\2\", name).lower()\n
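A quick illustration of the regex-based conversion defined above:
from koheesio.spark.transformations.camel_to_snake import convert_camel_to_snake\n\nconvert_camel_to_snake(\"camelCaseColumn\")  # 'camel_case_column'\nconvert_camel_to_snake(\"myColumn2Name\")  # 'my_column2_name'\n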
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html","title":"Cast to datatype","text":"Transformations to cast a column or set of columns to a given datatype.
Each one of these have been vetted to throw warnings when wrong datatypes are passed (to skip erroring any job or pipeline).
Furthermore, detailed tests have been added to ensure that types are actually compatible as prescribed.
Concept - One can use the CastToDataType class directly, or use one of the more specific subclasses.
- Each subclass is compatible with 'run_for_all' - meaning that if no column or columns are specified, it will be run for all compatible data types.
- Compatible Data types are specified in the ColumnConfig class of each subclass, and are documented in the docstring of each subclass.
See class docstrings for more information
Note Dates, Arrays and Maps are not supported by this module.
- for dates, use the koheesio.spark.transformations.date_time module
- for arrays, use the koheesio.spark.transformations.arrays module
Classes:
Name Description CastToDatatype:
Cast a column or set of columns to a given datatype
CastToByte
Cast to Byte (a.k.a. tinyint)
CastToShort
Cast to Short (a.k.a. smallint)
CastToInteger
Cast to Integer (a.k.a. int)
CastToLong
Cast to Long (a.k.a. bigint)
CastToFloat
Cast to Float (a.k.a. real)
CastToDouble
Cast to Double
CastToDecimal
Cast to Decimal (a.k.a. decimal, numeric, dec, BigDecimal)
CastToString
Cast to String
CastToBinary
Cast to Binary (a.k.a. byte array)
CastToBoolean
Cast to Boolean
CastToTimestamp
Cast to Timestamp
Note The following parameters are common to all classes in this module:
Parameters:
Name Type Description Default columns
ListOfColumns
Name of the source column(s). Alias: column
required target_column
str
Name of the target column or alias if more than one column is specified. Alias: target_alias
required datatype
str or SparkDatatype
Datatype to cast to. Choose from SparkDatatype Enum (only needs to be specified in CastToDatatype, all other classes have a fixed datatype)
required"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBinary","title":"koheesio.spark.transformations.cast_to_datatype.CastToBinary","text":"Cast to Binary (a.k.a. byte array)
Unsupported datatypes: Following casts are not supported
will raise an error in Spark:
- float
- double
- decimal
- boolean
- timestamp
- date
- array<...>
- map<...,...>
Supported datatypes: Following casts are supported:
- byte
- short
- integer
- long
- string
Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBinary.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = BINARY\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBinary.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToBinary class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBinary.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBinary.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, STRING]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBoolean","title":"koheesio.spark.transformations.cast_to_datatype.CastToBoolean","text":"Cast to Boolean
Unsupported datatypes: Following casts are not supported
- binary
- array<...>
- map<...,...>
Supported datatypes: Following casts are supported:
- byte
- short
- integer
- long
- float
- double
- decimal
- timestamp
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string converts to null
- date converts to null
- void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBoolean.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = BOOLEAN\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBoolean.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToBoolean class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBoolean.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToBoolean.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, TIMESTAMP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToByte","title":"koheesio.spark.transformations.cast_to_datatype.CastToByte","text":"Cast to Byte (a.k.a. tinyint)
Represents 1-byte signed integer numbers. The range of numbers is from -128 to 127.
Unsupported datatypes: Following casts are not supported
will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes: Following casts are supported:
- boolean
- timestamp
- decimal
- double
- float
- long
- integer
- short
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string converts to null
- timestamp range of values too small for timestamp to have any meaning
- date converts to null
- void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToByte.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = BYTE\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToByte.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToByte class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToByte.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, STRING, TIMESTAMP, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToByte.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [SHORT, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, BOOLEAN]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDatatype","title":"koheesio.spark.transformations.cast_to_datatype.CastToDatatype","text":"Cast a column or set of columns to a given datatype
Wrapper around pyspark.sql.Column.cast
Concept This class acts as the base class for all the other CastTo* classes. It is compatible with 'run_for_all' - meaning that if no column or columns are specified, it will be run for all compatible data types.
Example input_df:
c1 c2 1 2 3 4 output_df = CastToDatatype(\n column=\"c1\",\n datatype=\"string\",\n target_alias=\"c1\",\n).transform(input_df)\n
output_df:
c1 c2 \"1\" 2 \"3\" 4 In the example above, the column c1
is cast to a string datatype. The column c2
is not affected.
Parameters:
Name Type Description Default columns
ListOfColumns
Name of the source column(s). Alias: column
required datatype
str or SparkDatatype
Datatype to cast to. Choose from SparkDatatype Enum
required target_column
str
Name of the target column or alias if more than one column is specified. Alias: target_alias
required"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDatatype.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = Field(default=..., description='Datatype. Choose from SparkDatatype Enum')\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDatatype.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/cast_to_datatype.py
def func(self, column: Column) -> Column:\n # This is to let the IDE explicitly know that the datatype is not a string, but a `SparkDatatype` Enum\n datatype: SparkDatatype = self.datatype\n return column.cast(datatype.spark_type())\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDatatype.validate_datatype","title":"validate_datatype","text":"validate_datatype(datatype_value) -> SparkDatatype\n
Validate the datatype.
Source code in src/koheesio/spark/transformations/cast_to_datatype.py
@field_validator(\"datatype\")\ndef validate_datatype(cls, datatype_value) -> SparkDatatype:\n \"\"\"Validate the datatype.\"\"\"\n # handle string input\n try:\n if isinstance(datatype_value, str):\n datatype_value: SparkDatatype = SparkDatatype.from_string(datatype_value)\n return datatype_value\n\n # and let SparkDatatype handle the rest\n datatype_value: SparkDatatype = SparkDatatype.from_string(datatype_value.value)\n\n except AttributeError as e:\n raise AttributeError(f\"Invalid datatype: {datatype_value}\") from e\n\n return datatype_value\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal","title":"koheesio.spark.transformations.cast_to_datatype.CastToDecimal","text":"Cast to Decimal (a.k.a. decimal, numeric, dec, BigDecimal)
Represents arbitrary-precision signed decimal numbers. Backed internally by java.math.BigDecimal
. A BigDecimal consists of an arbitrary precision integer unscaled value and a 32-bit integer scale.
The DecimalType must have a fixed precision (the maximum total number of digits) and scale (the number of digits to the right of the decimal point). For example, (5, 2) can support values from -999.99 to 999.99.
The precision can be up to 38; the scale must be less than or equal to the precision.
Spark sets the default precision and scale to (10, 0) when creating a DecimalType. However, when inferring schema from decimal.Decimal objects, it will be DecimalType(38, 18).
For compatibility reasons, Koheesio sets the default precision and scale to (38, 18) when creating a DecimalType.
Unsupported datatypes: The following casts are not supported and will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes: The following casts are supported:
- byte
- short
- integer
- long
- float
- double
- boolean
- timestamp
- date
- string
- void
- decimal Spark will convert existing decimals to null if the precision and scale don't fit the data
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string converts to null
- date converts to null
- void skipped by default
Parameters:
Name Type Description Default columns
ListOfColumns
Name of the source column(s). Alias: column
*
target_column
str
Name of the target column or alias if more than one column is specified. Alias: target_alias
required precision
conint(gt=0, le=38)
the maximum (i.e. total) number of digits (default: 38). Must be > 0.
38
scale
conint(ge=0, le=18)
the number of digits to the right of the decimal point (default: 18). Must be >= 0.
18
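For illustration, a minimal sketch showing explicit precision and scale (the column name amount and the input DataFrame input_df are assumptions):
from koheesio.spark.transformations.cast_to_datatype import CastToDecimal

output_df = CastToDecimal(
    column="amount",
    precision=12,  # maximum total number of digits
    scale=2,       # number of digits to the right of the decimal point
).transform(input_df)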
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = DECIMAL\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.precision","title":"precision class-attribute
instance-attribute
","text":"precision: conint(gt=0, le=38) = Field(default=38, description='The maximum total number of digits (precision) of the decimal. Must be > 0. Default is 38')\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.scale","title":"scale class-attribute
instance-attribute
","text":"scale: conint(ge=0, le=18) = Field(default=18, description='The number of digits to the right of the decimal point (scale). Must be >= 0. Default is 18')\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToDecimal class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, BOOLEAN, TIMESTAMP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/cast_to_datatype.py
def func(self, column: Column) -> Column:\n return column.cast(self.datatype.spark_type(precision=self.precision, scale=self.scale))\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDecimal.validate_scale_and_precisions","title":"validate_scale_and_precisions","text":"validate_scale_and_precisions()\n
Validate the precision and scale values.
Source code in src/koheesio/spark/transformations/cast_to_datatype.py
@model_validator(mode=\"after\")\ndef validate_scale_and_precisions(self):\n \"\"\"Validate the precision and scale values.\"\"\"\n precision_value = self.precision\n scale_value = self.scale\n\n if scale_value == precision_value:\n self.log.warning(\"scale and precision are equal, this will result in a null value\")\n if scale_value > precision_value:\n raise ValueError(\"scale must be < precision\")\n\n return self\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDouble","title":"koheesio.spark.transformations.cast_to_datatype.CastToDouble","text":"Cast to Double
Represents 8-byte double-precision floating point numbers. The range of numbers is from -1.7976931348623157E308 to 1.7976931348623157E308.
Unsupported datatypes: The following casts are not supported and will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes: The following casts are supported:
- byte
- short
- integer
- long
- float
- decimal
- boolean
- timestamp
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string converts to null
- date converts to null
- void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDouble.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = DOUBLE\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDouble.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToDouble class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDouble.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToDouble.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, FLOAT, DECIMAL, BOOLEAN, TIMESTAMP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToFloat","title":"koheesio.spark.transformations.cast_to_datatype.CastToFloat","text":"Cast to Float (a.k.a. real)
Represents 4-byte single-precision floating point numbers. The range of numbers is from -3.402823E38 to 3.402823E38.
Unsupported datatypes: The following casts are not supported and will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes: The following casts are supported:
- byte
- short
- integer
- long
- double
- decimal
- boolean
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- timestamp precision is lost (use CastToDouble instead)
- string converts to null
- date converts to null
- void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToFloat.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = FLOAT\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToFloat.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToFloat class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToFloat.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToFloat.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, DOUBLE, DECIMAL, BOOLEAN]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToInteger","title":"koheesio.spark.transformations.cast_to_datatype.CastToInteger","text":"Cast to Integer (a.k.a. int)
Represents 4-byte signed integer numbers. The range of numbers is from -2147483648 to 2147483647.
Unsupported datatypes: The following casts are not supported and will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes: The following casts are supported:
- byte
- short
- long
- float
- double
- decimal
- boolean
- timestamp
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string converts to null
- date converts to null
- void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToInteger.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = INTEGER\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToInteger.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToInteger class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToInteger.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToInteger.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, LONG, FLOAT, DOUBLE, DECIMAL, BOOLEAN, TIMESTAMP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToLong","title":"koheesio.spark.transformations.cast_to_datatype.CastToLong","text":"Cast to Long (a.k.a. bigint)
Represents 8-byte signed integer numbers. The range of numbers is from -9223372036854775808 to 9223372036854775807.
Unsupported datatypes: The following casts are not supported and will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes: The following casts are supported:
- byte
- short
- integer
- float
- double
- decimal
- boolean
- timestamp
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string converts to null
- date converts to null
- void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToLong.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = LONG\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToLong.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToLong class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToLong.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, STRING, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToLong.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, INTEGER, FLOAT, DOUBLE, DECIMAL, BOOLEAN, TIMESTAMP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToShort","title":"koheesio.spark.transformations.cast_to_datatype.CastToShort","text":"Cast to Short (a.k.a. smallint)
Represents 2-byte signed integer numbers. The range of numbers is from -32768 to 32767.
Unsupported datatypes: The following casts are not supported and will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes: The following casts are supported:
- byte
- integer
- long
- float
- double
- decimal
- string
- boolean
- timestamp
- date
- void
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string converts to null
- timestamp range of values too small for timestamp to have any meaning
- date converts to null
- void skipped by default
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToShort.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = SHORT\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToShort.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToShort class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToShort.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, STRING, TIMESTAMP, DATE, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToShort.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, BOOLEAN]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToString","title":"koheesio.spark.transformations.cast_to_datatype.CastToString","text":"Cast to String
Supported datatypes: The following casts are supported:
- byte
- short
- integer
- long
- float
- double
- decimal
- binary
- boolean
- timestamp
- date
- array
- map
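For illustration, a minimal sketch (the array column tags and the input DataFrame input_df are assumptions): cast an array column to its string representation, writing the result to a new column.
from koheesio.spark.transformations.cast_to_datatype import CastToString

output_df = CastToString(column="tags", target_alias="tags_str").transform(input_df)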
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToString.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = STRING\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToString.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToString class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToString.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToString.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, BINARY, BOOLEAN, TIMESTAMP, DATE, ARRAY, MAP]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToTimestamp","title":"koheesio.spark.transformations.cast_to_datatype.CastToTimestamp","text":"Cast to Timestamp
A numeric timestamp is the number of seconds since January 1, 1970 00:00:00.000000000 UTC. It is not advised to use this cast on small integers, as the range of values is too small for the timestamp to have any meaning.
For more fine-grained control over the timestamp format, use the date_time
module. This allows for parsing strings to timestamps and vice versa.
See Also - koheesio.spark.transformations.date_time
- https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#timestamp-pattern
Unsupported datatypes: The following casts are not supported:
- binary
- array<...>
- map<...,...>
Supported datatypes: The following casts are supported:
- integer
- long
- float
- double
- decimal
- date
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- boolean: range of values too small for timestamp to have any meaning
- byte: range of values too small for timestamp to have any meaning
- string: converts to null in most cases, use
date_time
module instead - short: range of values too small for timestamp to have any meaning
- void: skipped by default
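For illustration, a minimal sketch (the column event_time, assumed to hold epoch seconds as a long, and the input DataFrame input_df are assumptions):
from koheesio.spark.transformations.cast_to_datatype import CastToTimestamp

output_df = CastToTimestamp(
    column="event_time",      # seconds since 1970-01-01 00:00:00 UTC
    target_alias="event_ts",  # write the resulting timestamp to a new column
).transform(input_df)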
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToTimestamp.datatype","title":"datatype class-attribute
instance-attribute
","text":"datatype: Union[str, SparkDatatype] = TIMESTAMP\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToTimestamp.ColumnConfig","title":"ColumnConfig","text":"Set the data types that are compatible with the CastToTimestamp class.
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToTimestamp.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, BOOLEAN, BYTE, SHORT, STRING, VOID]\n
"},{"location":"api_reference/spark/transformations/cast_to_datatype.html#koheesio.spark.transformations.cast_to_datatype.CastToTimestamp.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, DATE]\n
"},{"location":"api_reference/spark/transformations/drop_column.html","title":"Drop column","text":"This module defines the DropColumn class, a subclass of ColumnsTransformation.
"},{"location":"api_reference/spark/transformations/drop_column.html#koheesio.spark.transformations.drop_column.DropColumn","title":"koheesio.spark.transformations.drop_column.DropColumn","text":"Drop one or more columns
The DropColumn class is used to drop one or more columns from a PySpark DataFrame. It wraps the pyspark.sql.DataFrame.drop
function and can handle either a single string or a list of strings as input.
If a specified column does not exist in the DataFrame, no error or warning is thrown, and all existing columns will remain.
Expected behavior - When the
column
does not exist, all columns will remain (no error or warning is thrown) - Either a single string, or a list of strings can be specified
Example df:
product amount country Banana lemon orange 1000 USA Carrots Blueberries 1500 USA Beans 1600 USA output_df = DropColumn(column=\"product\").transform(df)\n
output_df:
amount country 1000 USA 1500 USA 1600 USA In this example, the product
column is dropped from the DataFrame df
.
"},{"location":"api_reference/spark/transformations/drop_column.html#koheesio.spark.transformations.drop_column.DropColumn.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/transformations/drop_column.py
def execute(self):\n self.log.info(f\"{self.column=}\")\n self.output.df = self.df.drop(*self.columns)\n
"},{"location":"api_reference/spark/transformations/dummy.html","title":"Dummy","text":"Dummy transformation for testing purposes.
"},{"location":"api_reference/spark/transformations/dummy.html#koheesio.spark.transformations.dummy.DummyTransformation","title":"koheesio.spark.transformations.dummy.DummyTransformation","text":"Dummy transformation for testing purposes.
This transformation adds a new column hello
to the DataFrame with the value world
.
It is intended for testing purposes or for use in examples or reference documentation.
Example input_df:
id 1 output_df = DummyTransformation().transform(input_df)\n
output_df:
id hello 1 world In this example, the hello
column is added to the DataFrame input_df
.
"},{"location":"api_reference/spark/transformations/dummy.html#koheesio.spark.transformations.dummy.DummyTransformation.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/transformations/dummy.py
def execute(self):\n self.output.df = self.df.withColumn(\"hello\", lit(\"world\"))\n
"},{"location":"api_reference/spark/transformations/get_item.html","title":"Get item","text":"Transformation to wrap around the pyspark getItem function
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem","title":"koheesio.spark.transformations.get_item.GetItem","text":"Get item from list or map (dictionary)
Wrapper around pyspark.sql.functions.getItem
GetItem
is strict about the data type of the column. If the column is not a list or a map, an error will be raised.
Note Only MapType and ArrayType are supported.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to get the item from. Alias: column
required target_column
Optional[str]
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
None
key
Union[int, str]
The key (or index) to get from the list or map. If the column is a list (ArrayType), this should be an integer. If the column is a dict (MapType), this should be a string. Alias: index
required Example"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem--example-with-list-arraytype","title":"Example with list (ArrayType)","text":"By specifying an integer for the parameter \"key\", getItem knows to get the element at index n of a list (index starts at 0).
input_df:
id content 1 [1, 2, 3] 2 [4, 5] 3 [6] 4 [] output_df = GetItem(\n column=\"content\",\n index=1, # get the second element of the list\n target_column=\"item\",\n).transform(input_df)\n
output_df:
id content item 1 [1, 2, 3] 2 2 [4, 5] 5 3 [6] null 4 [] null"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem--example-with-a-dict-maptype","title":"Example with a dict (MapType)","text":"input_df:
id content 1 {key1 -> value1} 2 {key1 -> value2} 3 {key2 -> hello} 4 {key2 -> world} output_df = GetItem(\n    column=\"content\",\n    key=\"key2\",\n    target_column=\"item\",\n).transform(input_df)\n
As we request the key to be \"key2\", the first 2 rows will be null, because they do not contain \"key2\". output_df:
id content item 1 {key1 -> value1} null 2 {key1 -> value2} null 3 {key2 -> hello} hello 4 {key2 -> world} world"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.key","title":"key class-attribute
instance-attribute
","text":"key: Union[int, str] = Field(default=..., alias='index', description='The key (or index) to get from the list or map. If the column is a list (ArrayType), this should be an integer. If the column is a dict (MapType), this should be a string. Alias: index')\n
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.ColumnConfig","title":"ColumnConfig","text":"Limit the data types to ArrayType and MapType.
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.ColumnConfig.data_type_strict_mode","title":"data_type_strict_mode class-attribute
instance-attribute
","text":"data_type_strict_mode = True\n
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = run_for_all_data_type\n
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [ARRAY, MAP]\n
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.GetItem.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/get_item.py
def func(self, column: Column) -> Column:\n return get_item(column, self.key)\n
"},{"location":"api_reference/spark/transformations/get_item.html#koheesio.spark.transformations.get_item.get_item","title":"koheesio.spark.transformations.get_item.get_item","text":"get_item(column: Column, key: Union[str, int])\n
Wrapper around pyspark.sql.functions.getItem
Parameters:
Name Type Description Default column
Column
The column to get the item from
required key
Union[str, int]
The key (or index) to get from the list or map. If the column is a list (ArrayType), this should be an integer. If the column is a dict (MapType), this should be a string.
required Returns:
Type Description Column
The column with the item
Source code in src/koheesio/spark/transformations/get_item.py
def get_item(column: Column, key: Union[str, int]):\n \"\"\"\n Wrapper around pyspark.sql.functions.getItem\n\n Parameters\n ----------\n column : Column\n The column to get the item from\n key : Union[str, int]\n The key (or index) to get from the list or map. If the column is a list (ArrayType), this should be an integer.\n If the column is a dict (MapType), this should be a string.\n\n Returns\n -------\n Column\n The column with the item\n \"\"\"\n return column.getItem(key)\n
"},{"location":"api_reference/spark/transformations/hash.html","title":"Hash","text":"Module for hashing data using SHA-2 family of hash functions
See the docstring of the Sha2Hash class for more information.
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.HASH_ALGORITHM","title":"koheesio.spark.transformations.hash.HASH_ALGORITHM module-attribute
","text":"HASH_ALGORITHM = Literal[224, 256, 384, 512]\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.STRING","title":"koheesio.spark.transformations.hash.STRING module-attribute
","text":"STRING = STRING\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.Sha2Hash","title":"koheesio.spark.transformations.hash.Sha2Hash","text":"hash the value of 1 or more columns using SHA-2 family of hash functions
Mild wrapper around pyspark.sql.functions.sha2
- https://spark.apache.org/docs/3.3.2/api/python/reference/api/pyspark.sql.functions.sha2.html
Returns the hex string result of SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512).
Note This function allows concatenating the values of multiple columns together prior to hashing.
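For illustration, a minimal usage sketch (the column names and the input DataFrame input_df are assumptions); the parameters are described below:
from koheesio.spark.transformations.hash import Sha2Hash

output_df = Sha2Hash(
    columns=["first_name", "last_name"],  # values are concatenated with the delimiter before hashing
    delimiter="|",
    num_bits=256,
    target_column="name_sha256",
).transform(input_df)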
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to hash. Alias: column
required delimiter
Optional[str]
Optional separator for the string that will eventually be hashed. Defaults to '|'
|
num_bits
Optional[HASH_ALGORITHM]
Algorithm to use for sha2 hash. Defaults to 256. Should be one of 224, 256, 384, 512
256
target_column
str
The generated hash will be written to the column name specified here
required"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.Sha2Hash.delimiter","title":"delimiter class-attribute
instance-attribute
","text":"delimiter: Optional[str] = Field(default='|', description=\"Optional separator for the string that will eventually be hashed. Defaults to '|'\")\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.Sha2Hash.num_bits","title":"num_bits class-attribute
instance-attribute
","text":"num_bits: Optional[HASH_ALGORITHM] = Field(default=256, description='Algorithm to use for sha2 hash. Defaults to 256. Should be one of 224, 256, 384, 512')\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.Sha2Hash.target_column","title":"target_column class-attribute
instance-attribute
","text":"target_column: str = Field(default=..., description='The generated hash will be written to the column name specified here')\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.Sha2Hash.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/transformations/hash.py
def execute(self):\n columns = list(self.get_columns())\n self.output.df = (\n self.df.withColumn(\n self.target_column, sha2_hash(columns=columns, delimiter=self.delimiter, num_bits=self.num_bits)\n )\n if columns\n else self.df\n )\n
"},{"location":"api_reference/spark/transformations/hash.html#koheesio.spark.transformations.hash.sha2_hash","title":"koheesio.spark.transformations.hash.sha2_hash","text":"sha2_hash(columns: List[str], delimiter: Optional[str] = '|', num_bits: Optional[HASH_ALGORITHM] = 256)\n
hash the value of 1 or more columns using SHA-2 family of hash functions
Mild wrapper around pyspark.sql.functions.sha2
- https://spark.apache.org/docs/3.3.2/api/python/reference/api/pyspark.sql.functions.sha2.html
Returns the hex string result of SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512). This function allows concatenating the values of multiple columns together prior to hashing.
If a null is passed, the result will also be null.
Parameters:
Name Type Description Default columns
List[str]
The columns to hash
required delimiter
Optional[str]
Optional separator for the string that will eventually be hashed. Defaults to '|'
|
num_bits
Optional[HASH_ALGORITHM]
Algorithm to use for sha2 hash. Defaults to 256. Should be one of 224, 256, 384, 512
256
Source code in src/koheesio/spark/transformations/hash.py
def sha2_hash(columns: List[str], delimiter: Optional[str] = \"|\", num_bits: Optional[HASH_ALGORITHM] = 256):\n \"\"\"\n hash the value of 1 or more columns using SHA-2 family of hash functions\n\n Mild wrapper around pyspark.sql.functions.sha2\n\n - https://spark.apache.org/docs/3.3.2/api/python/reference/api/pyspark.sql.functions.sha2.html\n\n Returns the hex string result of SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512).\n This function allows concatenating the values of multiple columns together prior to hashing.\n\n If a null is passed, the result will also be null.\n\n Parameters\n ----------\n columns : List[str]\n The columns to hash\n delimiter : Optional[str], optional, default=|\n Optional separator for the string that will eventually be hashed. Defaults to '|'\n num_bits : Optional[HASH_ALGORITHM], optional, default=256\n Algorithm to use for sha2 hash. Defaults to 256. Should be one of 224, 256, 384, 512\n \"\"\"\n # make sure all columns are of type pyspark.sql.Column and cast to string\n _columns = []\n for c in columns:\n if isinstance(c, str):\n c: Column = col(c)\n _columns.append(c.cast(STRING.spark_type()))\n\n # concatenate columns if more than 1 column is provided\n if len(_columns) > 1:\n column = concat_ws(delimiter, *_columns)\n else:\n column = _columns[0]\n\n return sha2(column, num_bits)\n
"},{"location":"api_reference/spark/transformations/lookup.html","title":"Lookup","text":"Lookup transformation for joining two dataframes together
Classes:
Name Description JoinMapping
TargetColumn
JoinType
JoinHint
DataframeLookup
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup","title":"koheesio.spark.transformations.lookup.DataframeLookup","text":"Lookup transformation for joining two dataframes together
Parameters:
Name Type Description Default df
DataFrame
The left Spark DataFrame
required other
DataFrame
The right Spark DataFrame
required on
List[JoinMapping] | JoinMapping
List of join mappings. If only one mapping is passed, it can be passed as a single object.
required targets
List[TargetColumn] | TargetColumn
List of target columns. If only one target is passed, it can be passed as a single object.
required how
JoinType
What type of join to perform. Defaults to left. See JoinType for more information.
required hint
JoinHint
What type of join hint to use. Defaults to None. See JoinHint for more information.
required Example from pyspark.sql import SparkSession\nfrom koheesio.spark.transformations.lookup import (\n DataframeLookup,\n JoinMapping,\n TargetColumn,\n JoinType,\n)\n\nspark = SparkSession.builder.getOrCreate()\n\n# create the dataframes\nleft_df = spark.createDataFrame([(1, \"A\"), (2, \"B\")], [\"id\", \"value\"])\nright_df = spark.createDataFrame([(1, \"A\"), (3, \"C\")], [\"id\", \"value\"])\n\n# perform the lookup\nlookup = DataframeLookup(\n df=left_df,\n other=right_df,\n on=JoinMapping(source_column=\"id\", joined_column=\"id\"),\n targets=TargetColumn(target_column=\"value\", target_column_alias=\"right_value\"),\n how=JoinType.LEFT,\n)\n\noutput_df = lookup.transform()\n
output_df:
id value right_value 1 A A 2 B null In this example, the left_df
and right_df
dataframes are joined together using the id
column. The value
column from the right_df
is aliased as right_value
in the output dataframe.
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.df","title":"df class-attribute
instance-attribute
","text":"df: DataFrame = Field(default=None, description='The left Spark DataFrame')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.hint","title":"hint class-attribute
instance-attribute
","text":"hint: Optional[JoinHint] = Field(default=None, description='What type of join hint to use. Defaults to None. ' + __doc__)\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.how","title":"how class-attribute
instance-attribute
","text":"how: Optional[JoinType] = Field(default=LEFT, description='What type of join to perform. Defaults to left. ' + __doc__)\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.on","title":"on class-attribute
instance-attribute
","text":"on: Union[List[JoinMapping], JoinMapping] = Field(default=..., alias='join_mapping', description='List of join mappings. If only one mapping is passed, it can be passed as a single object.')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.other","title":"other class-attribute
instance-attribute
","text":"other: DataFrame = Field(default=None, description='The right Spark DataFrame')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.targets","title":"targets class-attribute
instance-attribute
","text":"targets: Union[List[TargetColumn], TargetColumn] = Field(default=..., alias='target_columns', description='List of target columns. If only one target is passed, it can be passed as a single object.')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.Output","title":"Output","text":"Output for the lookup transformation
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.Output.left_df","title":"left_df class-attribute
instance-attribute
","text":"left_df: DataFrame = Field(default=..., description='The left Spark DataFrame')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.Output.right_df","title":"right_df class-attribute
instance-attribute
","text":"right_df: DataFrame = Field(default=..., description='The right Spark DataFrame')\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.execute","title":"execute","text":"execute() -> Output\n
Execute the lookup transformation
Source code in src/koheesio/spark/transformations/lookup.py
def execute(self) -> Output:\n \"\"\"Execute the lookup transformation\"\"\"\n # prepare the right dataframe\n prepared_right_df = self.get_right_df().select(\n *[join_mapping.column for join_mapping in self.on],\n *[target.column for target in self.targets],\n )\n if self.hint:\n prepared_right_df = prepared_right_df.hint(self.hint)\n\n # generate the output\n self.output.left_df = self.df\n self.output.right_df = prepared_right_df\n self.output.df = self.df.join(\n prepared_right_df,\n on=[join_mapping.source_column for join_mapping in self.on],\n how=self.how,\n )\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.get_right_df","title":"get_right_df","text":"get_right_df() -> DataFrame\n
Get the right side dataframe
Source code in src/koheesio/spark/transformations/lookup.py
def get_right_df(self) -> DataFrame:\n \"\"\"Get the right side dataframe\"\"\"\n return self.other\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.DataframeLookup.set_list","title":"set_list","text":"set_list(value)\n
Ensure that we can pass either a single object, or a list of objects
Source code in src/koheesio/spark/transformations/lookup.py
@field_validator(\"on\", \"targets\")\ndef set_list(cls, value):\n \"\"\"Ensure that we can pass either a single object, or a list of objects\"\"\"\n return [value] if not isinstance(value, list) else value\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinHint","title":"koheesio.spark.transformations.lookup.JoinHint","text":"Supported join hints
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinHint.BROADCAST","title":"BROADCAST class-attribute
instance-attribute
","text":"BROADCAST = 'broadcast'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinHint.MERGE","title":"MERGE class-attribute
instance-attribute
","text":"MERGE = 'merge'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinMapping","title":"koheesio.spark.transformations.lookup.JoinMapping","text":"Mapping for joining two dataframes together
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinMapping.column","title":"column property
","text":"column: Column\n
Get the join mapping as a pyspark.sql.Column object
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinMapping.other_column","title":"other_column instance-attribute
","text":"other_column: str\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinMapping.source_column","title":"source_column instance-attribute
","text":"source_column: str\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType","title":"koheesio.spark.transformations.lookup.JoinType","text":"Supported join types
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.ANTI","title":"ANTI class-attribute
instance-attribute
","text":"ANTI = 'anti'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.CROSS","title":"CROSS class-attribute
instance-attribute
","text":"CROSS = 'cross'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.FULL","title":"FULL class-attribute
instance-attribute
","text":"FULL = 'full'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.INNER","title":"INNER class-attribute
instance-attribute
","text":"INNER = 'inner'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.LEFT","title":"LEFT class-attribute
instance-attribute
","text":"LEFT = 'left'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.RIGHT","title":"RIGHT class-attribute
instance-attribute
","text":"RIGHT = 'right'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.JoinType.SEMI","title":"SEMI class-attribute
instance-attribute
","text":"SEMI = 'semi'\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.TargetColumn","title":"koheesio.spark.transformations.lookup.TargetColumn","text":"Target column for the joined dataframe
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.TargetColumn.column","title":"column property
","text":"column: Column\n
Get the target column as a pyspark.sql.Column object
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.TargetColumn.target_column","title":"target_column instance-attribute
","text":"target_column: str\n
"},{"location":"api_reference/spark/transformations/lookup.html#koheesio.spark.transformations.lookup.TargetColumn.target_column_alias","title":"target_column_alias instance-attribute
","text":"target_column_alias: str\n
"},{"location":"api_reference/spark/transformations/repartition.html","title":"Repartition","text":"Repartition Transformation
"},{"location":"api_reference/spark/transformations/repartition.html#koheesio.spark.transformations.repartition.Repartition","title":"koheesio.spark.transformations.repartition.Repartition","text":"Wrapper around DataFrame.repartition
With repartition, the number of partitions can be given as an optional value. If this is not provided, a default value is used. The default number of partitions is defined by the spark config 'spark.sql.shuffle.partitions', for which the default value is 200 and will never exceed the number or rows in the DataFrame (whichever is value is lower).
If columns are omitted, the entire DataFrame is repartitioned without considering the particular values in the columns.
Parameters:
Name Type Description Default column
Optional[Union[str, List[str]]]
Name of the source column(s). If omitted, the entire DataFrame is repartitioned without considering the particular values in the columns. Alias: columns
None
num_partitions
Optional[int]
The number of partitions to repartition to. If omitted, the default number of partitions is used as defined by the spark config 'spark.sql.shuffle.partitions'.
None
Example Repartition(column=[\"c1\", \"c2\"], num_partitions=3) # results in 3 partitions\nRepartition(column=\"c1\", num_partitions=2) # results in 2 partitions\nRepartition(column=[\"c1\", \"c2\"]) # results in <= 200 partitions\nRepartition(num_partitions=5) # results in 5 partitions\n
"},{"location":"api_reference/spark/transformations/repartition.html#koheesio.spark.transformations.repartition.Repartition.columns","title":"columns class-attribute
instance-attribute
","text":"columns: Optional[ListOfColumns] = Field(default='', alias='column', description='Name of the source column(s)')\n
"},{"location":"api_reference/spark/transformations/repartition.html#koheesio.spark.transformations.repartition.Repartition.numPartitions","title":"numPartitions class-attribute
instance-attribute
","text":"numPartitions: Optional[int] = Field(default=None, alias='num_partitions', description=\"The number of partitions to repartition to. If omitted, the default number of partitions is used as defined by the spark config 'spark.sql.shuffle.partitions'.\")\n
"},{"location":"api_reference/spark/transformations/repartition.html#koheesio.spark.transformations.repartition.Repartition.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/transformations/repartition.py
def execute(self):\n # Prepare columns input:\n columns = self.df.columns if self.columns == [\"*\"] else self.columns\n # Prepare repartition input:\n # num_partitions comes first, but if it is not provided it should not be included as None.\n repartition_inputs = [i for i in [self.numPartitions, *columns] if i]\n self.output.df = self.df.repartition(*repartition_inputs)\n
"},{"location":"api_reference/spark/transformations/replace.html","title":"Replace","text":"Transformation to replace a particular value in a column with another one
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace","title":"koheesio.spark.transformations.replace.Replace","text":"Replace a particular value in a column with another one
Can handle empty strings (\"\") as well as NULL / None values.
Unsupported datatypes: The following casts are not supported and will raise an error in Spark:
- binary
- boolean
- array<...>
- map<...,...>
Supported datatypes: The following casts are supported:
- byte
- short
- integer
- long
- float
- double
- decimal
- timestamp
- date
- string
- void skipped by default
Any supported non-string datatype will be cast to string before the replacement is done.
Example input_df:
id string 1 hello 2 world 3 output_df = Replace(\n column=\"string\",\n from_value=\"hello\",\n to_value=\"programmer\",\n).transform(input_df)\n
output_df:
id string 1 programmer 2 world 3 In this example, the value \"hello\" in the column \"string\" is replaced with \"programmer\".
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.from_value","title":"from_value class-attribute
instance-attribute
","text":"from_value: Optional[str] = Field(default=None, alias='from', description=\"The original value that needs to be replaced. If no value is given, all 'null' values will be replaced with the to_value\")\n
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.to_value","title":"to_value class-attribute
instance-attribute
","text":"to_value: str = Field(default=..., alias='to', description='The new value to replace this with')\n
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.ColumnConfig","title":"ColumnConfig","text":"Column type configurations for the column to be replaced
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [*run_for_all_data_type, VOID]\n
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [BYTE, SHORT, INTEGER, LONG, FLOAT, DOUBLE, DECIMAL, STRING, TIMESTAMP, DATE]\n
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.Replace.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/replace.py
def func(self, column: Column) -> Column:\n return replace(column=column, from_value=self.from_value, to_value=self.to_value)\n
"},{"location":"api_reference/spark/transformations/replace.html#koheesio.spark.transformations.replace.replace","title":"koheesio.spark.transformations.replace.replace","text":"replace(column: Union[Column, str], to_value: str, from_value: Optional[str] = None)\n
Function to replace a particular value in a column with another one
Source code in src/koheesio/spark/transformations/replace.py
def replace(column: Union[Column, str], to_value: str, from_value: Optional[str] = None):\n \"\"\"Function to replace a particular value in a column with another one\"\"\"\n # make sure we have a Column object\n if isinstance(column, str):\n column = col(column)\n\n if not from_value:\n condition = column.isNull()\n else:\n condition = column == from_value\n\n return when(condition, lit(to_value)).otherwise(column)\n
"},{"location":"api_reference/spark/transformations/row_number_dedup.html","title":"Row number dedup","text":"This module contains the RowNumberDedup class, which performs a row_number deduplication operation on a DataFrame.
See the docstring of the RowNumberDedup class for more information.
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup","title":"koheesio.spark.transformations.row_number_dedup.RowNumberDedup","text":"A class used to perform a row_number deduplication operation on a DataFrame.
This class is a specialized transformation that extends the ColumnsTransformation class. It sorts the DataFrame based on the provided sort columns and assigns a row_number to each row. It then filters the DataFrame to keep only the first row (row_number 1) for each group of duplicates. The row_number of each row can be stored in a specified target column or a default column named \"meta_row_number_column\". The class also provides an option to preserve meta columns (like the row_number column) in the output DataFrame.
Attributes:
Name Type Description columns
list
List of columns to apply the transformation to. If a single '*' is passed as a column name or if the columns parameter is omitted, the transformation will be applied to all columns of the data types specified in run_for_all_data_type
of the ColumnConfig. (inherited from ColumnsTransformation)
sort_columns
list
List of columns that the DataFrame will be sorted by.
target_column
(str, optional)
Column where the row_number of each row will be stored.
preserve_meta
(bool, optional)
Flag that determines whether the meta columns should be kept in the output DataFrame.
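For illustration, a minimal sketch (the column names and the input DataFrame input_df are assumptions): keep one row per customer_id, preferring the highest updated_at.
from koheesio.spark.transformations.row_number_dedup import RowNumberDedup

output_df = RowNumberDedup(
    columns=["customer_id"],      # window partition: what defines a group of duplicates
    sort_columns=["updated_at"],  # ranking within each group (string columns are ordered DESC, see window_spec)
    preserve_meta=False,          # drop the helper row_number column from the result
).transform(input_df)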
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.preserve_meta","title":"preserve_meta class-attribute
instance-attribute
","text":"preserve_meta: bool = Field(default=False, description=\"If true, meta columns are kept in output dataframe. Defaults to 'False'\")\n
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.sort_columns","title":"sort_columns class-attribute
instance-attribute
","text":"sort_columns: conlist(Union[str, Column], min_length=0) = Field(default_factory=list, alias='sort_column', description='List of orderBy columns. If only one column is passed, it can be passed as a single object.')\n
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.target_column","title":"target_column class-attribute
instance-attribute
","text":"target_column: Optional[Union[str, Column]] = Field(default='meta_row_number_column', alias='target_suffix', description='The column to store the result in. If not provided, the result will be stored in the sourcecolumn. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix')\n
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.window_spec","title":"window_spec property
","text":"window_spec: WindowSpec\n
Builds a WindowSpec object based on the columns defined in the configuration.
The WindowSpec object is used to define a window frame over which functions are applied in Spark. This method partitions the data by the columns returned by the get_columns
method and then orders the partitions by the columns specified in sort_columns
.
Notes The order of the columns in the WindowSpec object is preserved. If a column is passed as a string, it is converted to a Column object with DESC ordering.
Returns:
Type Description WindowSpec
A WindowSpec object that can be used to define a window frame in Spark.
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.execute","title":"execute","text":"execute() -> Output\n
Performs the row_number deduplication operation on the DataFrame.
This method sorts the DataFrame based on the provided sort columns, assigns a row_number to each row, and then filters the DataFrame to keep only the top-row_number row for each group of duplicates. The row_number of each row is stored in the target column. If preserve_meta is False, the method also drops the target column from the DataFrame.
Source code in src/koheesio/spark/transformations/row_number_dedup.py
def execute(self) -> RowNumberDedup.Output:\n \"\"\"\n Performs the row_number deduplication operation on the DataFrame.\n\n This method sorts the DataFrame based on the provided sort columns, assigns a row_number to each row,\n and then filters the DataFrame to keep only the top-row_number row for each group of duplicates.\n The row_number of each row is stored in the target column. If preserve_meta is False,\n the method also drops the target column from the DataFrame.\n \"\"\"\n df = self.df\n window_spec = self.window_spec\n\n # if target_column is a string, convert it to a Column object\n if isinstance((target_column := self.target_column), str):\n target_column = col(target_column)\n\n # dedup the dataframe based on the window spec\n df = df.withColumn(self.target_column, row_number().over(window_spec)).filter(target_column == 1).select(\"*\")\n\n if not self.preserve_meta:\n df = df.drop(target_column)\n\n self.output.df = df\n
"},{"location":"api_reference/spark/transformations/row_number_dedup.html#koheesio.spark.transformations.row_number_dedup.RowNumberDedup.set_sort_columns","title":"set_sort_columns","text":"set_sort_columns(columns_value)\n
Validates and optimizes the sort_columns parameter.
This method ensures that sort_columns is a list (or single object) of unique strings or Column objects. It removes any empty strings or None values from the list and deduplicates the columns.
Parameters:
Name Type Description Default columns_value
Union[str, Column, List[Union[str, Column]]]
The value of the sort_columns parameter.
required Returns:
Type Description List[Union[str, Column]]
The optimized and deduplicated list of sort columns.
Source code in src/koheesio/spark/transformations/row_number_dedup.py
@field_validator(\"sort_columns\", mode=\"before\")\ndef set_sort_columns(cls, columns_value):\n \"\"\"\n Validates and optimizes the sort_columns parameter.\n\n This method ensures that sort_columns is a list (or single object) of unique strings or Column objects.\n It removes any empty strings or None values from the list and deduplicates the columns.\n\n Parameters\n ----------\n columns_value : Union[str, Column, List[Union[str, Column]]]\n The value of the sort_columns parameter.\n\n Returns\n -------\n List[Union[str, Column]]\n The optimized and deduplicated list of sort columns.\n \"\"\"\n # Convert single string or Column object to a list\n columns = [columns_value] if isinstance(columns_value, (str, Column)) else [*columns_value]\n\n # Remove empty strings, None, etc.\n columns = [c for c in columns if (isinstance(c, Column) and c is not None) or (isinstance(c, str) and c)]\n\n dedup_columns = []\n seen = set()\n\n # Deduplicate the columns while preserving the order\n for column in columns:\n if str(column) not in seen:\n dedup_columns.append(column)\n seen.add(str(column))\n\n return dedup_columns\n
"},{"location":"api_reference/spark/transformations/sql_transform.html","title":"Sql transform","text":"SQL Transform module
SQL Transform module provides an easy interface to transform a dataframe using SQL. This SQL can originate from a string or a file and may contain placeholders for templating.
"},{"location":"api_reference/spark/transformations/sql_transform.html#koheesio.spark.transformations.sql_transform.SqlTransform","title":"koheesio.spark.transformations.sql_transform.SqlTransform","text":"SQL Transform module provides an easy interface to transform a dataframe using SQL.
This SQL can originate from a string or a file and may contain placeholders (parameters) for templating.
- Placeholders are identified with
${placeholder}
. - Placeholders can be passed as explicit params (params) or as implicit params (kwargs).
Example sql script:
SELECT id, id + 1 AS incremented_id, ${dynamic_column} AS extra_column\nFROM ${table_name}\n
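A usage sketch, assuming the query is supplied through a sql field (not shown in this section) and an input DataFrame named input_df; ${table_name} is injected by the step itself, while ${dynamic_column} is resolved from the keyword argument.
from koheesio.spark.transformations.sql_transform import SqlTransform\n\noutput_df = SqlTransform(\n    sql=\"SELECT id, ${dynamic_column} AS extra_column FROM ${table_name}\",\n    dynamic_column=\"'foo'\",  # implicit param passed as a kwarg\n).transform(input_df)\n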
"},{"location":"api_reference/spark/transformations/sql_transform.html#koheesio.spark.transformations.sql_transform.SqlTransform.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/transformations/sql_transform.py
def execute(self):\n table_name = get_random_string(prefix=\"sql_transform\")\n self.params = {**self.params, \"table_name\": table_name}\n\n df = self.df\n df.createOrReplaceTempView(table_name)\n query = self.query\n\n self.output.df = self.spark.sql(query)\n
"},{"location":"api_reference/spark/transformations/transform.html","title":"Transform","text":"Transform module
Transform aims to provide an easy interface for calling transformations on a Spark DataFrame, where the transformation is a function that accepts a DataFrame (df) and any number of keyword args.
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform","title":"koheesio.spark.transformations.transform.Transform","text":"Transform(func: Callable, params: Dict = None, df: DataFrame = None, **kwargs)\n
Transform aims to provide an easy interface for calling transformations on a Spark DataFrame, where the transformation is a function that accepts a DataFrame (df) and any number of keyword args.
The implementation is inspired by and based upon: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.transform.html
Parameters:
Name Type Description Default func
Callable
The function to be called on the DataFrame.
required params
Dict
The keyword arguments to be passed to the function. Defaults to None. Alternatively, keyword arguments can be passed directly as keyword arguments - they will be merged with the params
dictionary.
None
Example Source code in src/koheesio/spark/transformations/transform.py
def __init__(self, func: Callable, params: Dict = None, df: DataFrame = None, **kwargs):\n params = {**(params or {}), **kwargs}\n super().__init__(func=func, params=params, df=df)\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform--a-function-compatible-with-transform","title":"a function compatible with Transform:","text":"def some_func(df, a: str, b: str):\n return df.withColumn(a, f.lit(b))\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform--verbose-style-input-in-transform","title":"verbose style input in Transform","text":"Transform(func=some_func, params={\"a\": \"foo\", \"b\": \"bar\"})\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform--shortened-style-notation-easier-to-read","title":"shortened style notation (easier to read)","text":"Transform(some_func, a=\"foo\", b=\"bar\")\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform--when-too-much-input-is-given-transform-will-ignore-extra-input","title":"when too much input is given, Transform will ignore extra input","text":"Transform(\n some_func,\n a=\"foo\",\n # ignored input\n c=\"baz\",\n title=42,\n author=\"Adams\",\n # order of params input should not matter\n b=\"bar\",\n)\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform--using-the-from_func-classmethod","title":"using the from_func classmethod","text":"SomeFunc = Transform.from_func(some_func, a=\"foo\")\nsome_func = SomeFunc(b=\"bar\")\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform.func","title":"func class-attribute
instance-attribute
","text":"func: Callable = Field(default=None, description='The function to be called on the DataFrame.')\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform.execute","title":"execute","text":"execute()\n
Call the function on the DataFrame with the given keyword arguments.
Source code in src/koheesio/spark/transformations/transform.py
def execute(self):\n \"\"\"Call the function on the DataFrame with the given keyword arguments.\"\"\"\n func, kwargs = get_args_for_func(self.func, self.params)\n self.output.df = self.df.transform(func=func, **kwargs)\n
"},{"location":"api_reference/spark/transformations/transform.html#koheesio.spark.transformations.transform.Transform.from_func","title":"from_func classmethod
","text":"from_func(func: Callable, **kwargs) -> Callable[..., Transform]\n
Create a Transform class from a function. Useful for creating a new class with a different name.
This method uses the functools.partial
function to create a new class with the given function and keyword arguments. This way you can pre-define some of the keyword arguments for the function that might be needed for the specific use case.
Example CustomTransform = Transform.from_func(some_func, a=\"foo\")\nsome_func = CustomTransform(b=\"bar\")\n
In this example, CustomTransform
is a Transform class with the function some_func
and the keyword argument a
set to \"foo\". When calling some_func(b=\"bar\")
, the function some_func
will be called with the keyword arguments a=\"foo\"
and b=\"bar\"
.
Source code in src/koheesio/spark/transformations/transform.py
@classmethod\ndef from_func(cls, func: Callable, **kwargs) -> Callable[..., Transform]:\n \"\"\"Create a Transform class from a function. Useful for creating a new class with a different name.\n\n This method uses the `functools.partial` function to create a new class with the given function and keyword\n arguments. This way you can pre-define some of the keyword arguments for the function that might be needed for\n the specific use case.\n\n Example\n -------\n ```python\n CustomTransform = Transform.from_func(some_func, a=\"foo\")\n some_func = CustomTransform(b=\"bar\")\n ```\n\n In this example, `CustomTransform` is a Transform class with the function `some_func` and the keyword argument\n `a` set to \"foo\". When calling `some_func(b=\"bar\")`, the function `some_func` will be called with the keyword\n arguments `a=\"foo\"` and `b=\"bar\"`.\n \"\"\"\n return partial(cls, func=func, **kwargs)\n
"},{"location":"api_reference/spark/transformations/uuid5.html","title":"Uuid5","text":"Ability to generate UUID5 using native pyspark (no udf)
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5","title":"koheesio.spark.transformations.uuid5.HashUUID5","text":"Generate a UUID with the UUID5 algorithm
Spark does not provide an inbuilt API to generate version 5 UUIDs, hence we have to use a custom implementation to provide this capability.
Prerequisites: this function has no side effects. Be aware, though, that in most cases your data is expected to be clean (e.g. trimmed of leading and trailing spaces)
Concept UUID5 is based on the SHA-1 hash of a namespace identifier (which is a UUID) and a name (which is a string). https://docs.python.org/3/library/uuid.html#uuid.uuid5
Based on https://github.com/MrPowers/quinn/pull/96 with the difference that since Spark 3.0.0 an OVERLAY function from ANSI SQL 2016 is available which saves coding space and string allocation(s) in place of CONCAT + SUBSTRING.
For more info on OVERLAY, see: https://docs.databricks.com/sql/language-manual/functions/overlay.html
Example Input is a DataFrame with two columns:
id string 1 hello 2 world 3 Input parameters:
- source_columns = [\"id\", \"string\"]
- target_column = \"uuid5\"
Result:
id string uuid5 1 hello f3e99bbd-85ae-5dc3-bf6e-cd0022a0ebe6 2 world b48e880f-c289-5c94-b51f-b9d21f9616c0 3 2193a99d-222e-5a0c-a7d6-48fbe78d2708 In code:
HashUUID5(source_columns=[\"id\", \"string\"], target_column=\"uuid5\").transform(input_df)\n
In this example, the id
and string
columns are concatenated and hashed using the UUID5 algorithm. The result is stored in the uuid5
column.
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.delimiter","title":"delimiter class-attribute
instance-attribute
","text":"delimiter: Optional[str] = Field(default='|', description='Separator for the string that will eventually be hashed')\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.description","title":"description class-attribute
instance-attribute
","text":"description: str = 'Generate a UUID with the UUID5 algorithm'\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.extra_string","title":"extra_string class-attribute
instance-attribute
","text":"extra_string: Optional[str] = Field(default='', description='In case of collisions, one can pass an extra string to hash on.')\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.namespace","title":"namespace class-attribute
instance-attribute
","text":"namespace: Optional[Union[str, UUID]] = Field(default='', description='Namespace DNS')\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.source_columns","title":"source_columns class-attribute
instance-attribute
","text":"source_columns: ListOfColumns = Field(default=..., description=\"List of columns that should be hashed. Should contain the name of at least 1 column. A list of columns or a single column can be specified. For example: `['column1', 'column2']` or `'column1'`\")\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.target_column","title":"target_column class-attribute
instance-attribute
","text":"target_column: str = Field(default=..., description='The generated UUID will be written to the column name specified here')\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.HashUUID5.execute","title":"execute","text":"execute() -> None\n
Source code in src/koheesio/spark/transformations/uuid5.py
def execute(self) -> None:\n ns = f.lit(uuid5_namespace(self.namespace).bytes)\n self.log.info(f\"UUID5 namespace '{ns}' derived from '{self.namespace}'\")\n cols_to_hash = f.concat_ws(self.delimiter, *self.source_columns)\n cols_to_hash = f.concat(f.lit(self.extra_string), cols_to_hash)\n cols_to_hash = f.encode(cols_to_hash, \"utf-8\")\n cols_to_hash = f.concat(ns, cols_to_hash)\n source_columns_sha1 = f.sha1(cols_to_hash)\n variant_part = f.substring(source_columns_sha1, 17, 4)\n variant_part = f.conv(variant_part, 16, 2)\n variant_part = f.lpad(variant_part, 16, \"0\")\n variant_part = f.overlay(variant_part, f.lit(\"10\"), 1, 2) # RFC 4122 variant.\n variant_part = f.lower(f.conv(variant_part, 2, 16))\n target_col_uuid = f.concat_ws(\n \"-\",\n f.substring(source_columns_sha1, 1, 8),\n f.substring(source_columns_sha1, 9, 4),\n f.concat(f.lit(\"5\"), f.substring(source_columns_sha1, 14, 3)), # Set version.\n variant_part,\n f.substring(source_columns_sha1, 21, 12),\n )\n # Applying the transformation to the input df, storing the result in the column specified in `target_column`.\n self.output.df = self.df.withColumn(self.target_column, target_col_uuid)\n
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.hash_uuid5","title":"koheesio.spark.transformations.uuid5.hash_uuid5","text":"hash_uuid5(input_value: str, namespace: Optional[Union[str, UUID]] = '', extra_string: Optional[str] = '')\n
pure python implementation of HashUUID5
See: https://docs.python.org/3/library/uuid.html#uuid.uuid5
Parameters:
Name Type Description Default input_value
str
value that will be hashed
required namespace
Optional[str | UUID]
namespace DNS
''
extra_string
Optional[str]
optional extra string that will be prepended to the input_value
''
Returns:
Type Description str
uuid.UUID (uuid5) cast to string
Source code in src/koheesio/spark/transformations/uuid5.py
def hash_uuid5(\n input_value: str,\n namespace: Optional[Union[str, uuid.UUID]] = \"\",\n extra_string: Optional[str] = \"\",\n):\n \"\"\"pure python implementation of HashUUID5\n\n See: https://docs.python.org/3/library/uuid.html#uuid.uuid5\n\n Parameters\n ----------\n input_value : str\n value that will be hashed\n namespace : Optional[str | uuid.UUID]\n namespace DNS\n extra_string : Optional[str]\n optional extra string that will be prepended to the input_value\n\n Returns\n -------\n str\n uuid.UUID (uuid5) cast to string\n \"\"\"\n if not isinstance(namespace, uuid.UUID):\n hashed_namespace = uuid5_namespace(namespace)\n else:\n hashed_namespace = namespace\n return str(uuid.uuid5(hashed_namespace, (extra_string + input_value)))\n
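A small illustrative call; the input value is arbitrary and no Spark session is needed for this helper.
from koheesio.spark.transformations.uuid5 import hash_uuid5\n\n# hashes against the default DNS namespace and returns the UUID5 as a string\nprint(hash_uuid5(\"hello|world\"))\n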
"},{"location":"api_reference/spark/transformations/uuid5.html#koheesio.spark.transformations.uuid5.uuid5_namespace","title":"koheesio.spark.transformations.uuid5.uuid5_namespace","text":"uuid5_namespace(ns: Optional[Union[str, UUID]]) -> UUID\n
Helper function used to provide a UUID5 hashed namespace based on the passed str
Parameters:
Name Type Description Default ns
Optional[Union[str, UUID]]
A str, an empty string (or None), or an existing UUID can be passed
required Returns:
Type Description UUID
UUID5 hashed namespace
Source code in src/koheesio/spark/transformations/uuid5.py
def uuid5_namespace(ns: Optional[Union[str, uuid.UUID]]) -> uuid.UUID:\n \"\"\"Helper function used to provide a UUID5 hashed namespace based on the passed str\n\n Parameters\n ----------\n ns : Optional[Union[str, uuid.UUID]]\n A str, an empty string (or None), or an existing UUID can be passed\n\n Returns\n -------\n uuid.UUID\n UUID5 hashed namespace\n \"\"\"\n # if we already have a UUID, we just return it\n if isinstance(ns, uuid.UUID):\n return ns\n\n # if ns is empty or none, we simply return the default NAMESPACE_DNS\n if not ns:\n ns = uuid.NAMESPACE_DNS\n return ns\n\n # else we hash the string against the NAMESPACE_DNS\n ns = uuid.uuid5(uuid.NAMESPACE_DNS, ns)\n return ns\n
"},{"location":"api_reference/spark/transformations/date_time/index.html","title":"Date time","text":"Module that holds the transformations that can be used for date and time related operations.
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone","title":"koheesio.spark.transformations.date_time.ChangeTimeZone","text":"Allows for the value of a column to be changed from one timezone to another
Adding useful metadata When add_target_timezone
is enabled (default), an additional column is created documenting which timezone a field has been converted to. Additionally, the suffix added to this column can be customized (default value is _timezone
).
Example Input:
target_column = \"some_column_name\"\ntarget_timezone = \"EST\"\nadd_target_timezone = True # default value\ntimezone_column_suffix = \"_timezone\" # default value\n
Output:
column name = \"some_column_name_timezone\" # notice the suffix\ncolumn value = \"EST\"\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.add_target_timezone","title":"add_target_timezone class-attribute
instance-attribute
","text":"add_target_timezone: bool = Field(default=True, description='Toggles whether the target timezone is added as a column. True by default.')\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.from_timezone","title":"from_timezone class-attribute
instance-attribute
","text":"from_timezone: str = Field(default=..., alias='source_timezone', description='Timezone of the source_column value. Timezone fields are validated against the `TZ database name` column in this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones')\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.target_timezone_column_suffix","title":"target_timezone_column_suffix class-attribute
instance-attribute
","text":"target_timezone_column_suffix: Optional[str] = Field(default='_timezone', alias='suffix', description=\"Allows to customize the suffix that is added to the target_timezone column. Defaults to '_timezone'. Note: this will be ignored if 'add_target_timezone' is set to False\")\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.to_timezone","title":"to_timezone class-attribute
instance-attribute
","text":"to_timezone: str = Field(default=..., alias='target_timezone', description='Target timezone. Timezone fields are validated against the `TZ database name` column in this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones')\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/transformations/date_time/__init__.py
def execute(self):\n df = self.df\n\n for target_column, column in self.get_columns_with_target():\n func = self.func # select the applicable function\n df = df.withColumn(\n target_column,\n func(f.col(column)),\n )\n\n # document which timezone a field has been converted to\n if self.add_target_timezone:\n df = df.withColumn(f\"{target_column}{self.target_timezone_column_suffix}\", f.lit(self.to_timezone))\n\n self.output.df = df\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/date_time/__init__.py
def func(self, column: Column) -> Column:\n return change_timezone(column=column, source_timezone=self.from_timezone, target_timezone=self.to_timezone)\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.validate_no_duplicate_timezones","title":"validate_no_duplicate_timezones","text":"validate_no_duplicate_timezones(values)\n
Validate that source and target timezone are not the same
Source code in src/koheesio/spark/transformations/date_time/__init__.py
@model_validator(mode=\"before\")\ndef validate_no_duplicate_timezones(cls, values):\n \"\"\"Validate that source and target timezone are not the same\"\"\"\n from_timezone_value = values.get(\"from_timezone\")\n to_timezone_value = values.get(\"o_timezone\")\n\n if from_timezone_value == to_timezone_value:\n raise ValueError(\"Timezone conversions from and to the same timezones are not valid.\")\n\n return values\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ChangeTimeZone.validate_timezone","title":"validate_timezone","text":"validate_timezone(timezone_value)\n
Validate that the timezone is a valid timezone.
Source code in src/koheesio/spark/transformations/date_time/__init__.py
@field_validator(\"from_timezone\", \"to_timezone\")\ndef validate_timezone(cls, timezone_value):\n \"\"\"Validate that the timezone is a valid timezone.\"\"\"\n if timezone_value not in all_timezones_set:\n raise ValueError(\n \"Not a valid timezone. Refer to the `TZ database name` column here: \"\n \"https://en.wikipedia.org/wiki/List_of_tz_database_time_zones\"\n )\n return timezone_value\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.DateFormat","title":"koheesio.spark.transformations.date_time.DateFormat","text":"wrapper around pyspark.sql.functions.date_format
See Also - https://spark.apache.org/docs/3.3.2/api/python/reference/pyspark.sql/api/pyspark.sql.functions.date_format.html
- https://spark.apache.org/docs/3.3.2/sql-ref-datetime-pattern.html
Concept This Transformation converts a date/timestamp/string to a string in the format specified by the given date format.
A pattern could be for instance dd.MM.yyyy
and could return a string like \u201818.03.1993\u2019. All pattern letters of datetime pattern can be used, see: https://spark.apache.org/docs/3.3.2/sql-ref-datetime-pattern.html
How to use If more than one column is passed, the behavior of the class changes as follows
- the transformation will be run in a loop against all the given columns
- the target_column will be used as a suffix. Leaving this blank will result in the original columns being renamed.
Example source_column value: datetime.date(2020, 1, 1)\ntarget: \"yyyyMMdd HH:mm\"\noutput: \"20200101 00:00\"\n
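A usage sketch based on the example above, with assumed column names order_date and ship_date and an input DataFrame named input_df; the second call illustrates the multi-column behaviour where target_column acts as a suffix.
from koheesio.spark.transformations.date_time import DateFormat\n\n# single column -> result stored in a new target column\noutput_df = DateFormat(\n    column=\"order_date\",\n    target_column=\"order_date_formatted\",\n    format=\"yyyyMMdd HH:mm\",\n).transform(input_df)\n\n# multiple columns -> target_column acts as a suffix (order_date_fmt, ship_date_fmt)\noutput_df = DateFormat(\n    columns=[\"order_date\", \"ship_date\"],\n    target_column=\"fmt\",\n    format=\"yyyyMMdd HH:mm\",\n).transform(input_df)\n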
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.DateFormat.format","title":"format class-attribute
instance-attribute
","text":"format: str = Field(..., description='The format for the resulting string. See https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html')\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.DateFormat.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/date_time/__init__.py
def func(self, column: Column) -> Column:\n return date_format(column, self.format)\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ToTimestamp","title":"koheesio.spark.transformations.date_time.ToTimestamp","text":"wrapper around pyspark.sql.functions.to_timestamp
Converts a Column (or set of Columns) into pyspark.sql.types.TimestampType
using the specified format. Specify formats according to the datetime pattern: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
.
Functionally equivalent to col.cast(\"timestamp\").
See Also Related Koheesio classes:
- koheesio.spark.transformations.ColumnsTransformation : Base class for ColumnsTransformation. Defines column / columns field + recursive logic
- koheesio.spark.transformations.ColumnsTransformationWithTarget : Defines target_column / target_suffix field
pyspark.sql.functions:
- datetime pattern : https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
Example"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ToTimestamp--basic-usage-example","title":"Basic usage example:","text":"input_df:
t \"1997-02-28 10:30:00\" t
is a string
tts = ToTimestamp(\n # since the source column is the same as the target in this example, 't' will be overwritten\n column=\"t\",\n target_column=\"t\",\n format=\"yyyy-MM-dd HH:mm:ss\",\n)\noutput_df = tts.transform(input_df)\n
output_df:
t datetime.datetime(1997, 2, 28, 10, 30) Now t
is a timestamp
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ToTimestamp--multiple-columns-at-once","title":"Multiple columns at once:","text":"input_df:
t1 t2 \"1997-02-28 10:30:00\" \"2007-03-31 11:40:10\" t1
and t2
are strings
tts = ToTimestamp(\n columns=[\"t1\", \"t2\"],\n # 'target_suffix' is synonymous with 'target_column'\n target_suffix=\"new\",\n format=\"yyyy-MM-dd HH:mm:ss\",\n)\noutput_df = tts.transform(input_df).select(\"t1_new\", \"t2_new\")\n
output_df:
t1_new t2_new datetime.datetime(1997, 2, 28, 10, 30) datetime.datetime(2007, 3, 31, 11, 40) Now t1_new
and t2_new
are both timestamps
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ToTimestamp.format","title":"format class-attribute
instance-attribute
","text":"format: str = Field(default=..., description='The date format for of the timestamp field. See https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html')\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.ToTimestamp.func","title":"func","text":"func(column: Column) -> Column\n
Source code in src/koheesio/spark/transformations/date_time/__init__.py
def func(self, column: Column) -> Column:\n # convert string to timestamp\n converted_col = to_timestamp(column, self.format)\n return when(column.isNull(), lit(None)).otherwise(converted_col)\n
"},{"location":"api_reference/spark/transformations/date_time/index.html#koheesio.spark.transformations.date_time.change_timezone","title":"koheesio.spark.transformations.date_time.change_timezone","text":"change_timezone(column: Union[str, Column], source_timezone: str, target_timezone: str)\n
Helper function to change from one timezone to another
wrapper around pyspark.sql.functions.from_utc_timestamp
and to_utc_timestamp
Parameters:
Name Type Description Default column
Union[str, Column]
The column to change the timezone of
required source_timezone
str
The timezone of the source_column value. Timezone fields are validated against the TZ database name
column in this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones
required target_timezone
str
The target timezone. Timezone fields are validated against the TZ database name
column in this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones
required Source code in src/koheesio/spark/transformations/date_time/__init__.py
def change_timezone(column: Union[str, Column], source_timezone: str, target_timezone: str):\n \"\"\"Helper function to change from one timezone to another\n\n wrapper around `pyspark.sql.functions.from_utc_timestamp` and `to_utc_timestamp`\n\n Parameters\n ----------\n column : Union[str, Column]\n The column to change the timezone of\n source_timezone : str\n The timezone of the source_column value. Timezone fields are validated against the `TZ database name` column in\n this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones\n target_timezone : str\n The target timezone. Timezone fields are validated against the `TZ database name` column in this list:\n https://en.wikipedia.org/wiki/List_of_tz_database_time_zones\n\n \"\"\"\n column = col(column) if isinstance(column, str) else column\n return from_utc_timestamp((to_utc_timestamp(column, source_timezone)), target_timezone)\n
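A sketch of calling the helper directly, assuming a DataFrame named df with a timestamp column event_time (the name is illustrative).
from pyspark.sql.functions import col\n\nfrom koheesio.spark.transformations.date_time import change_timezone\n\n# convert event_time from UTC to EST and store it in a new column\ndf = df.withColumn(\"event_time_est\", change_timezone(col(\"event_time\"), \"UTC\", \"EST\"))\n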
"},{"location":"api_reference/spark/transformations/date_time/interval.html","title":"Interval","text":"This module provides a DateTimeColumn
class that extends the Column
class from PySpark. It allows for adding or subtracting an interval value from a datetime column.
This can be used to reflect a change in a given date / time column in a more human-readable way.
Please refer to the Spark SQL documentation for a list of valid interval values: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#interval-literal
Background The aim is to easily add or subtract an 'interval' value to a datetime column. An interval value is a string that represents a time interval. For example, '1 day', '1 month', '5 years', '1 minute 30 seconds', '10 milliseconds', etc. These can be used to reflect a change in a given date / time column in a more human-readable way.
Typically, this can be done using the date_add()
and date_sub()
functions in Spark SQL. However, these functions only support adding or subtracting a single unit of time measured in days. Using an interval gives us much more flexibility; however, Spark SQL does not provide a function to add or subtract an interval value from a datetime column through the python API directly, so we have to use the expr()
function so that we can use SQL directly.
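For reference, a sketch of the raw approach this module wraps, assuming a DataFrame named df with a timestamp column my_column.
from pyspark.sql.functions import expr\n\n# add one day via an interval literal, going through SQL with expr()\ndf = df.withColumn(\"one_day_later\", expr(\"try_add(my_column, interval '1 day')\"))\n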
This module provides a DateTimeColumn
class that extends the Column
class from PySpark. It allows for adding or subtracting an interval value from a datetime column using the +
and -
operators.
Additionally, this module provides two transformation classes that can be used as a transformation step in a pipeline:
DateTimeAddInterval
: adds an interval value to a datetime column DateTimeSubtractInterval
: subtracts an interval value from a datetime column
These classes are subclasses of ColumnsTransformationWithTarget
and hence can be used to perform transformations on multiple columns at once.
The above transformations both use the provided adjust_time()
function to perform the actual transformation.
See also: Related Koheesio classes:
- koheesio.spark.transformations.ColumnsTransformation : Base class for ColumnsTransformation. Defines column / columns field + recursive logic
- koheesio.spark.transformations.ColumnsTransformationWithTarget : Defines target_column / target_suffix field
pyspark.sql.functions:
- https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#interval-literal
- https://spark.apache.org/docs/latest/api/sql/index.html
- https://spark.apache.org/docs/latest/api/sql/#try_add
- https://spark.apache.org/docs/latest/api/sql/#try_subtract
Classes:
Name Description DateTimeColumn
A datetime column that can be adjusted by adding or subtracting an interval value using the +
and -
operators.
DateTimeAddInterval
A transformation that adds an interval value to a datetime column. This class is a subclass of ColumnsTransformationWithTarget
and hence can be used as a transformation step in a pipeline. See ColumnsTransformationWithTarget
for more information.
DateTimeSubtractInterval
A transformation that subtracts an interval value from a datetime column. This class is a subclass of ColumnsTransformationWithTarget
and hence can be used as a transformation step in a pipeline. See ColumnsTransformationWithTarget
for more information.
Note the DateTimeAddInterval
and DateTimeSubtractInterval
classes are very similar. The only difference is that one adds an interval value to a datetime column, while the other subtracts an interval value from a datetime column.
Functions:
Name Description dt_column
Converts a column to a DateTimeColumn
. This function aims to be a drop-in replacement for pyspark.sql.functions.col
that returns a DateTimeColumn
instead of a Column
.
adjust_time
Adjusts a datetime column by adding or subtracting an interval value.
validate_interval
Validates a given interval string.
Example"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval--various-ways-to-create-and-interact-with-datetimecolumn","title":"Various ways to create and interact with DateTimeColumn
:","text":" - Create a
DateTimeColumn
from a string: dt_column(\"my_column\")
- Create a
DateTimeColumn
from a Column
: dt_column(df.my_column)
- Use the
+
and -
operators to add or subtract an interval value from a DateTimeColumn
: dt_column(\"my_column\") + \"1 day\"
dt_column(\"my_column\") - \"1 month\"
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval--functional-examples-using-adjust_time","title":"Functional examples using adjust_time()
:","text":" - Add 1 day to a column:
adjust_time(\"my_column\", operation=\"add\", interval=\"1 day\")
- Subtract 1 month from a column:
adjust_time(\"my_column\", operation=\"subtract\", interval=\"1 month\")
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval--as-a-transformation-step","title":"As a transformation step:","text":"from koheesio.spark.transformations.date_time.interval import (\n DateTimeAddInterval,\n)\n\ninput_df = spark.createDataFrame([(1, \"2022-01-01 00:00:00\")], [\"id\", \"my_column\"])\n\n# add 1 day to my_column and store the result in a new column called 'one_day_later'\noutput_df = DateTimeAddInterval(column=\"my_column\", target_column=\"one_day_later\", interval=\"1 day\").transform(input_df)\n
output_df: id my_column one_day_later 1 2022-01-01 00:00:00 2022-01-02 00:00:00 DateTimeSubtractInterval
works in a similar way, but subtracts an interval value from a datetime column.
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.Operations","title":"koheesio.spark.transformations.date_time.interval.Operations module-attribute
","text":"Operations = Literal['add', 'subtract']\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval","title":"koheesio.spark.transformations.date_time.interval.DateTimeAddInterval","text":"A transformation that adds or subtracts a specified interval from a datetime column.
See also: pyspark.sql.functions:
- https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#interval-literal
- https://spark.apache.org/docs/latest/api/sql/index.html#interval
Parameters:
Name Type Description Default interval
str
The interval to add to the datetime column.
required operation
Operations
The operation to perform. Must be either 'add' or 'subtract'.
add
Example"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval--add-1-day-to-a-column","title":"add 1 day to a column","text":"DateTimeAddInterval(\n column=\"my_column\",\n interval=\"1 day\",\n).transform(df)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval--subtract-1-month-from-my_column-and-store-the-result-in-a-new-column-called-one_month_earlier","title":"subtract 1 month from my_column
and store the result in a new column called one_month_earlier
","text":"DateTimeSubtractInterval(\n column=\"my_column\",\n target_column=\"one_month_earlier\",\n interval=\"1 month\",\n)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval.interval","title":"interval class-attribute
instance-attribute
","text":"interval: str = Field(default=..., description='The interval to add to the datetime column.', examples=['1 day', '5 years', '3 months'])\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval.operation","title":"operation class-attribute
instance-attribute
","text":"operation: Operations = Field(default='add', description=\"The operation to perform. Must be either 'add' or 'subtract'.\")\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval.validate_interval","title":"validate_interval class-attribute
instance-attribute
","text":"validate_interval = field_validator('interval')(validate_interval)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeAddInterval.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/date_time/interval.py
def func(self, column: Column):\n return adjust_time(column, operation=self.operation, interval=self.interval)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeColumn","title":"koheesio.spark.transformations.date_time.interval.DateTimeColumn","text":"A datetime column that can be adjusted by adding or subtracting an interval value using the +
and -
operators.
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeColumn.from_column","title":"from_column classmethod
","text":"from_column(column: Column)\n
Create a DateTimeColumn from an existing Column
Source code in src/koheesio/spark/transformations/date_time/interval.py
@classmethod\ndef from_column(cls, column: Column):\n \"\"\"Create a DateTimeColumn from an existing Column\"\"\"\n return cls(column._jc)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeSubtractInterval","title":"koheesio.spark.transformations.date_time.interval.DateTimeSubtractInterval","text":"Subtracts a specified interval from a datetime column.
Works in the same way as DateTimeAddInterval
, but subtracts the specified interval from the datetime column. See DateTimeAddInterval
for more information.
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.DateTimeSubtractInterval.operation","title":"operation class-attribute
instance-attribute
","text":"operation: Operations = Field(default='subtract', description=\"The operation to perform. Must be either 'add' or 'subtract'.\")\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.adjust_time","title":"koheesio.spark.transformations.date_time.interval.adjust_time","text":"adjust_time(column: Column, operation: Operations, interval: str) -> Column\n
Adjusts a datetime column by adding or subtracting an interval value.
This can be used to reflect a change in a given date / time column in a more human-readable way.
See also Please refer to the Spark SQL documentation for a list of valid interval values: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#interval-literal
Example Parameters:
Name Type Description Default column
Column
The datetime column to adjust.
required operation
Operations
The operation to perform. Must be either 'add' or 'subtract'.
required interval
str
The value to add or subtract. Must be a valid interval string.
required Returns:
Type Description Column
The adjusted datetime column.
Source code in src/koheesio/spark/transformations/date_time/interval.py
def adjust_time(column: Column, operation: Operations, interval: str) -> Column:\n \"\"\"\n Adjusts a datetime column by adding or subtracting an interval value.\n\n This can be used to reflect a change in a given date / time column in a more human-readable way.\n\n\n See also\n --------\n Please refer to the Spark SQL documentation for a list of valid interval values:\n https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#interval-literal\n\n ### pyspark.sql.functions:\n\n * https://spark.apache.org/docs/latest/api/sql/index.html#interval\n * https://spark.apache.org/docs/latest/api/sql/#try_add\n * https://spark.apache.org/docs/latest/api/sql/#try_subtract\n\n Example\n --------\n ### add 1 day to a column\n ```python\n adjust_time(\"my_column\", operation=\"add\", interval=\"1 day\")\n ```\n\n ### subtract 1 month from a column\n ```python\n adjust_time(\"my_column\", operation=\"subtract\", interval=\"1 month\")\n ```\n\n ### or, a much more complicated example\n\n In this example, we add 5 days, 3 hours, 7 minutes, 30 seconds, and 1 millisecond to a column called `my_column`.\n ```python\n adjust_time(\n \"my_column\",\n operation=\"add\",\n interval=\"5 days 3 hours 7 minutes 30 seconds 1 millisecond\",\n )\n ```\n\n Parameters\n ----------\n column : Column\n The datetime column to adjust.\n operation : Operations\n The operation to perform. Must be either 'add' or 'subtract'.\n interval : str\n The value to add or subtract. Must be a valid interval string.\n\n Returns\n -------\n Column\n The adjusted datetime column.\n \"\"\"\n\n # check that value is a valid interval\n interval = validate_interval(interval)\n\n column_name = column._jc.toString()\n\n # determine the operation to perform\n try:\n operation = {\n \"add\": \"try_add\",\n \"subtract\": \"try_subtract\",\n }[operation]\n except KeyError as e:\n raise ValueError(f\"Operation '{operation}' is not valid. Must be either 'add' or 'subtract'.\") from e\n\n # perform the operation\n _expression = f\"{operation}({column_name}, interval '{interval}')\"\n column = expr(_expression)\n\n return column\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.adjust_time--pysparksqlfunctions","title":"pyspark.sql.functions:","text":" - https://spark.apache.org/docs/latest/api/sql/index.html#interval
- https://spark.apache.org/docs/latest/api/sql/#try_add
- https://spark.apache.org/docs/latest/api/sql/#try_subtract
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.adjust_time--add-1-day-to-a-column","title":"add 1 day to a column","text":"adjust_time(\"my_column\", operation=\"add\", interval=\"1 day\")\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.adjust_time--subtract-1-month-from-a-column","title":"subtract 1 month from a column","text":"adjust_time(\"my_column\", operation=\"subtract\", interval=\"1 month\")\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.adjust_time--or-a-much-more-complicated-example","title":"or, a much more complicated example","text":"In this example, we add 5 days, 3 hours, 7 minutes, 30 seconds, and 1 millisecond to a column called my_column
.
adjust_time(\n \"my_column\",\n operation=\"add\",\n interval=\"5 days 3 hours 7 minutes 30 seconds 1 millisecond\",\n)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.dt_column","title":"koheesio.spark.transformations.date_time.interval.dt_column","text":"dt_column(column: Union[str, Column]) -> DateTimeColumn\n
Convert a column to a DateTimeColumn
Aims to be a drop-in replacement for pyspark.sql.functions.col
that returns a DateTimeColumn instead of a Column.
Example Parameters:
Name Type Description Default column
Union[str, Column]
The column (or name of the column) to convert to a DateTimeColumn
required Source code in src/koheesio/spark/transformations/date_time/interval.py
def dt_column(column: Union[str, Column]) -> DateTimeColumn:\n \"\"\"Convert a column to a DateTimeColumn\n\n Aims to be a drop-in replacement for `pyspark.sql.functions.col` that returns a DateTimeColumn instead of a Column.\n\n Example\n --------\n ### create a DateTimeColumn from a string\n ```python\n dt_column(\"my_column\")\n ```\n\n ### create a DateTimeColumn from a Column\n ```python\n dt_column(df.my_column)\n ```\n\n Parameters\n ----------\n column : Union[str, Column]\n The column (or name of the column) to convert to a DateTimeColumn\n \"\"\"\n if isinstance(column, str):\n column = col(column)\n elif not isinstance(column, Column):\n raise TypeError(f\"Expected column to be of type str or Column, got {type(column)} instead.\")\n return DateTimeColumn.from_column(column)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.dt_column--create-a-datetimecolumn-from-a-string","title":"create a DateTimeColumn from a string","text":"dt_column(\"my_column\")\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.dt_column--create-a-datetimecolumn-from-a-column","title":"create a DateTimeColumn from a Column","text":"dt_column(df.my_column)\n
"},{"location":"api_reference/spark/transformations/date_time/interval.html#koheesio.spark.transformations.date_time.interval.validate_interval","title":"koheesio.spark.transformations.date_time.interval.validate_interval","text":"validate_interval(interval: str)\n
Validate an interval string
Parameters:
Name Type Description Default interval
str
The interval string to validate
required Raises:
Type Description ValueError
If the interval string is invalid
Source code in src/koheesio/spark/transformations/date_time/interval.py
def validate_interval(interval: str):\n \"\"\"Validate an interval string\n\n Parameters\n ----------\n interval : str\n The interval string to validate\n\n Raises\n ------\n ValueError\n If the interval string is invalid\n \"\"\"\n try:\n expr(f\"interval '{interval}'\")\n except ParseException as e:\n raise ValueError(f\"Value '{interval}' is not a valid interval.\") from e\n return interval\n
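For illustration (an active Spark session is assumed, since validation goes through expr()):
from koheesio.spark.transformations.date_time.interval import validate_interval\n\nvalidate_interval(\"1 day 2 hours\")  # returns the string unchanged\n# validate_interval(\"not an interval\")  # would raise ValueError\n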
"},{"location":"api_reference/spark/transformations/strings/index.html","title":"Strings","text":"Adds a number of Transformations that are intended to be used with StringType column input. Some will work with other types however, but will output StringType or an array of StringType.
These Transformations take full advantage of Koheesio's ColumnsTransformationWithTarget class, allowing a user to apply column transformations to multiple columns at once. See the class docstrings for more information.
The following Transformations are included:
change_case:
Lower
Converts a string column to lower case. Upper
Converts a string column to upper case. TitleCase
or InitCap
Converts a string column to title case, where each word starts with a capital letter.
concat:
Concat
Concatenates multiple input columns together into a single column, optionally using the given separator.
pad:
Pad
Pads the values of source_column
with the character
up until it reaches length
of characters LPad
Pad with a character on the left side of the string. RPad
Pad with a character on the right side of the string.
regexp:
RegexpExtract
Extract a specific group matched by a Java regexp from the specified string column. RegexpReplace
Searches for the given regexp and replaces all instances with what is in 'replacement'.
replace:
Replace
Replace all instances of a string in a column with another string.
split:
SplitAll
Splits the contents of a column on the basis of a split_pattern. SplitAtFirstMatch
Like SplitAll, but only splits the string once. You can specify whether you want the first or second part.
substring:
Substring
Extracts a substring from a string column starting at the given position.
trim:
Trim
Trim whitespace from the beginning and/or end of a string. LTrim
Trim whitespace from the beginning of a string. RTrim
Trim whitespace from the end of a string.
"},{"location":"api_reference/spark/transformations/strings/change_case.html","title":"Change case","text":"Convert the case of a string column to upper case, lower case, or title case
Classes:
Name Description `Lower`
Converts a string column to lower case.
`Upper`
Converts a string column to upper case.
`TitleCase` or `InitCap`
Converts a string column to title case, where each word starts with a capital letter.
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.InitCap","title":"koheesio.spark.transformations.strings.change_case.InitCap module-attribute
","text":"InitCap = TitleCase\n
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.LowerCase","title":"koheesio.spark.transformations.strings.change_case.LowerCase","text":"This function makes the contents of a column lower case.
Wraps the pyspark.sql.functions.lower
function.
Warnings If the type of the column is not string, LowerCase
will not be run. A Warning will be thrown indicating this.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The name of the column or columns to convert to lower case. Alias: column. Lower case will be applied to all columns in the list. Column is required to be of string type.
required target_column
The name of the column to store the result in. If None, the result will be stored in the same column as the input.
required Example input_df:
product amount country Banana lemon orange 1000 USA Carrots Blueberries 1500 USA Beans 1600 USA output_df = LowerCase(column=\"product\", target_column=\"product_lower\").transform(df)\n
output_df:
product amount country product_lower Banana lemon orange 1000 USA banana lemon orange Carrots Blueberries 1500 USA carrots blueberries Beans 1600 USA beans In this example, the column product
is converted to product_lower
and the contents of this column are converted to lower case.
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.LowerCase.ColumnConfig","title":"ColumnConfig","text":"Limit data type to string
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.LowerCase.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [STRING]\n
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.LowerCase.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [STRING]\n
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.LowerCase.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/change_case.py
def func(self, column: Column):\n return lower(column)\n
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.TitleCase","title":"koheesio.spark.transformations.strings.change_case.TitleCase","text":"This function makes the contents of a column title case. This means that every word starts with an upper case.
Wraps the pyspark.sql.functions.initcap
function.
Warnings If the type of the column is not string, TitleCase will not be run. A Warning will be thrown indicating this.
Parameters:
Name Type Description Default columns
The name of the column or columns to convert to title case. Alias: column. Title case will be applied to all columns in the list. Column is required to be of string type.
required target_column
The name of the column to store the result in. If None, the result will be stored in the same column as the input.
required Example input_df:
product amount country Banana lemon orange 1000 USA Carrots blueberries 1500 USA Beans 1600 USA output_df = TitleCase(column=\"product\", target_column=\"product_title\").transform(df)\n
output_df:
product amount country product_title Banana lemon orange 1000 USA Banana Lemon Orange Carrots blueberries 1500 USA Carrots Blueberries Beans 1600 USA Beans In this example, the column product
is converted to product_title
and the contents of this column are converted to title case (each word now starts with an upper case).
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.TitleCase.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/change_case.py
def func(self, column: Column):\n return initcap(column)\n
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.UpperCase","title":"koheesio.spark.transformations.strings.change_case.UpperCase","text":"This function makes the contents of a column upper case.
Wraps the pyspark.sql.functions.upper
function.
Warnings If the type of the column is not string, UpperCase
will not be run. A Warning will be thrown indicating this.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The name of the column or columns to convert to upper case. Alias: column. Upper case will be applied to all columns in the list. Column is required to be of string type.
required target_column
The name of the column to store the result in. If None, the result will be stored in the same column as the input.
required Examples:
input_df:
product amount country Banana lemon orange 1000 USA Carrots Blueberries 1500 USA Beans 1600 USA output_df = UpperCase(column=\"product\", target_column=\"product_upper\").transform(df)\n
output_df:
product amount country product_upper Banana lemon orange 1000 USA BANANA LEMON ORANGE Carrots Blueberries 1500 USA CARROTS BLUEBERRIES Beans 1600 USA BEANS In this example, the column product
is converted to product_upper
and the contents of this column are converted to upper case.
"},{"location":"api_reference/spark/transformations/strings/change_case.html#koheesio.spark.transformations.strings.change_case.UpperCase.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/change_case.py
def func(self, column: Column):\n return upper(column)\n
"},{"location":"api_reference/spark/transformations/strings/concat.html","title":"Concat","text":"Concatenates multiple input columns together into a single column, optionally using a given separator.
"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat","title":"koheesio.spark.transformations.strings.concat.Concat","text":"This is a wrapper around PySpark concat() and concat_ws() functions
Concatenates multiple input columns together into a single column, optionally using a given separator. The function works with strings, date/timestamps, binary, and compatible array columns.
Concept When working with arrays, the function will return the result of the concatenation of the elements in the array.
- If a spacer is used, the resulting output will be a string with the elements of the array separated by the spacer.
- If no spacer is used, the resulting output will be an array with the elements of the array concatenated together.
When working with date/timestamps, the function will return the result of the concatenation of the values. The timestamp is converted to a string using the default format of yyyy-MM-dd HH:mm:ss
.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to concatenate. Alias: column. If at least one of the values is None or null, the resulting string will also be None/null (except for when using arrays). Columns can be of any type, but must ideally be of the same type. Different types can be used, but the function will convert them to string values first.
required target_column
Optional[str]
Target column name. When this is left blank, a name will be generated by concatenating the names of the source columns with an '_'.
None
spacer
Optional[str]
Optional spacer / separator symbol. Defaults to None. When left blank, no spacer is used
None
Example"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat--example-using-a-string-column-and-a-timestamp-column","title":"Example using a string column and a timestamp column","text":"input_df:
column_a column_b text 1997-02-28 10:30:00 output_df = Concat(\n columns=[\"column_a\", \"column_b\"],\n target_column=\"concatenated_column\",\n spacer=\"--\",\n).transform(input_df)\n
output_df:
column_a column_b concatenated_column text 1997-02-28 10:30:00 text--1997-02-28 10:30:00 In the example above, the resulting column is a string column.
If we had left out the spacer, the resulting column would have had the value of text1997-02-28 10:30:00
(a string). Note that the timestamp is converted to a string using the default format of yyyy-MM-dd HH:mm:ss
.
"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat--example-using-two-array-columns","title":"Example using two array columns","text":"input_df:
array_col_1 array_col_2 [text1, text2] [text3, None] output_df = Concat(\n    columns=[\"array_col_1\", \"array_col_2\"],\n    target_column=\"concatenated_column\",\n    spacer=\"--\",\n).transform(input_df)\n
output_df:
array_col_1 array_col_2 concatenated_column [text1, text2] [text3, None] \"text1--text2--text3\" Note that the null value in the second array is ignored. If we had left out the spacer, the resulting column would have been an array with the values of [\"text1\", \"text2\", \"text3\"]
.
Array columns can only be concatenated with another array column. If you want to concatenate an array column with a non-array value, you will have to convert said column to an array first.
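As a complement to the examples above, a minimal sketch (same hypothetical input_df) of the no-spacer variant: when the spacer is omitted, the two array columns are merged into a single array instead of a string.
output_df = Concat(\n    columns=[\"array_col_1\", \"array_col_2\"],\n    target_column=\"concatenated_column\",\n).transform(input_df)\n# concatenated_column is now an ArrayType column, e.g. [\"text1\", \"text2\", \"text3\"]\n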
"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat.spacer","title":"spacer class-attribute
instance-attribute
","text":"spacer: Optional[str] = Field(default=None, description='Optional spacer / separator symbol. Defaults to None. When left blank, no spacer is used', alias='sep')\n
"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat.target_column","title":"target_column class-attribute
instance-attribute
","text":"target_column: Optional[str] = Field(default=None, description=\"Target column name. When this is left blank, a name will be generated by concatenating the names of the source columns with an '_'.\")\n
"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat.execute","title":"execute","text":"execute() -> DataFrame\n
Source code in src/koheesio/spark/transformations/strings/concat.py
def execute(self) -> DataFrame:\n columns = [col(s) for s in self.get_columns()]\n self.output.df = self.df.withColumn(\n self.target_column, concat_ws(self.spacer, *columns) if self.spacer else concat(*columns)\n )\n
"},{"location":"api_reference/spark/transformations/strings/concat.html#koheesio.spark.transformations.strings.concat.Concat.get_target_column","title":"get_target_column","text":"get_target_column(target_column_value, values)\n
Get the target column name if it is not provided.
If not provided, a name will be generated by concatenating the names of the source columns with an '_'.
Source code in src/koheesio/spark/transformations/strings/concat.py
@field_validator(\"target_column\")\ndef get_target_column(cls, target_column_value, values):\n \"\"\"Get the target column name if it is not provided.\n\n If not provided, a name will be generated by concatenating the names of the source columns with an '_'.\"\"\"\n if not target_column_value:\n columns_value: List = values[\"columns\"]\n columns = list(dict.fromkeys(columns_value)) # dict.fromkeys is used to dedup while maintaining order\n return \"_\".join(columns)\n\n return target_column_value\n
"},{"location":"api_reference/spark/transformations/strings/pad.html","title":"Pad","text":"Pad the values of a column with a character up until it reaches a certain length.
Classes:
Name Description Pad
Pads the values of source_column
with the character
up until it reaches length
of characters
LPad
Pad with a character on the left side of the string.
RPad
Pad with a character on the right side of the string.
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.LPad","title":"koheesio.spark.transformations.strings.pad.LPad module-attribute
","text":"LPad = Pad\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.pad_directions","title":"koheesio.spark.transformations.strings.pad.pad_directions module-attribute
","text":"pad_directions = Literal['left', 'right']\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.Pad","title":"koheesio.spark.transformations.strings.pad.Pad","text":"Pads the values of source_column
with the character
up until it reaches length
of characters The direction
param can be changed to apply either a left or a right pad. Defaults to left pad.
Wraps the lpad
and rpad
functions from PySpark.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to pad. Alias: column
required target_column
Optional[str]
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
None
character
constr(min_length=1)
The character to use for padding
required length
PositiveInt
Positive integer to indicate the intended length
required direction
Optional[pad_directions]
On which side to add the characters. Either \"left\" or \"right\". Defaults to \"left\"
left
Example input_df:
column hello world output_df = Pad(\n column=\"column\",\n target_column=\"padded_column\",\n character=\"*\",\n length=10,\n direction=\"right\",\n).transform(input_df)\n
output_df:
column padded_column hello hello***** world world***** Note: in the example above, we could have used the RPad class instead of Pad with direction=\"right\" to achieve the same result.
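For completeness, a sketch of the equivalent call using the RPad subclass mentioned in the note above (same hypothetical input_df):
output_df = RPad(\n    column=\"column\",\n    target_column=\"padded_column\",\n    character=\"*\",\n    length=10,\n).transform(input_df)\n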
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.Pad.character","title":"character class-attribute
instance-attribute
","text":"character: constr(min_length=1) = Field(default=..., description='The character to use for padding')\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.Pad.direction","title":"direction class-attribute
instance-attribute
","text":"direction: Optional[pad_directions] = Field(default='left', description='On which side to add the characters . Either \"left\" or \"right\". Defaults to \"left\"')\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.Pad.length","title":"length class-attribute
instance-attribute
","text":"length: PositiveInt = Field(default=..., description='Positive integer to indicate the intended length')\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.Pad.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/pad.py
def func(self, column: Column):\n func = lpad if self.direction == \"left\" else rpad\n return func(column, self.length, self.character)\n
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.RPad","title":"koheesio.spark.transformations.strings.pad.RPad","text":"Pad with a character on the right side of the string.
See Pad class docstring for more information.
"},{"location":"api_reference/spark/transformations/strings/pad.html#koheesio.spark.transformations.strings.pad.RPad.direction","title":"direction class-attribute
instance-attribute
","text":"direction: Optional[pad_directions] = 'right'\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html","title":"Regexp","text":"String transformations using regular expressions.
This module contains transformations that use regular expressions to transform strings.
Classes:
Name Description RegexpExtract
Extract a specific group matched by a Java regexp from the specified string column.
RegexpReplace
Searches for the given regexp and replaces all instances with what is in 'replacement'.
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract","title":"koheesio.spark.transformations.strings.regexp.RegexpExtract","text":"Extract a specific group matched by a Java regexp from the specified string column. If the regexp did not match, or the specified group did not match, an empty string is returned.
A wrapper around the pyspark regexp_extract function
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to extract from. Alias: column
required target_column
Optional[str]
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
None
regexp
str
The Java regular expression to extract
required index
Optional[int]
When there are more groups in the match, you can indicate which one you want. 0 means the whole match. 1 and above are groups within that match.
0
Example"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract--extracting-the-year-and-week-number-from-a-string","title":"Extracting the year and week number from a string","text":"Let's say we have a column containing the year and week in a format like Y## W#
and we would like to extract the week numbers.
input_df:
YWK 2020 W1 2021 WK2 output_df = RegexpExtract(\n column=\"YWK\",\n target_column=\"week_number\",\n regexp=\"Y([0-9]+) ?WK?([0-9]+)\",\n index=2, # remember that this is 1-indexed! So 2 will get the week number in this example.\n).transform(input_df)\n
output_df:
YWK week_number 2020 W1 1 2021 WK2 2"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract--using-the-same-example-but-extracting-the-year-instead","title":"Using the same example, but extracting the year instead","text":"If you want to extract the year, you can use index=1.
output_df = RegexpExtract(\n column=\"YWK\",\n target_column=\"year\",\n regexp=\"Y([0-9]+) ?WK?([0-9]+)\",\n index=1, # remember that this is 1-indexed! So 1 will get the year in this example.\n).transform(input_df)\n
output_df:
YWK year 2020 W1 2020 2021 WK2 2021"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract.index","title":"index class-attribute
instance-attribute
","text":"index: Optional[int] = Field(default=0, description='When there are more groups in the match, you can indicate which one you want. 0 means the whole match. 1 and above are groups within that match.')\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract.regexp","title":"regexp class-attribute
instance-attribute
","text":"regexp: str = Field(default=..., description='The Java regular expression to extract')\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpExtract.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/regexp.py
def func(self, column: Column):\n return regexp_extract(column, self.regexp, self.index)\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpReplace","title":"koheesio.spark.transformations.strings.regexp.RegexpReplace","text":"Searches for the given regexp and replaces all instances with what is in 'replacement'.
A wrapper around the pyspark regexp_replace function
Parameters:
Name Type Description Default columns
The column (or list of columns) to replace in. Alias: column
required target_column
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
required regexp
The regular expression to replace
required replacement
String to replace matched pattern with.
required Examples:
input_df: | content | |------------| | hello world|
Let's say you want to replace 'hello'.
output_df = RegexpReplace(\n column=\"content\",\n target_column=\"replaced\",\n regexp=\"hello\",\n replacement=\"gutentag\",\n).transform(input_df)\n
output_df: | content | replaced | |------------|---------------| | hello world| gutentag world|
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpReplace.regexp","title":"regexp class-attribute
instance-attribute
","text":"regexp: str = Field(default=..., description='The regular expression to replace')\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpReplace.replacement","title":"replacement class-attribute
instance-attribute
","text":"replacement: str = Field(default=..., description='String to replace matched pattern with.')\n
"},{"location":"api_reference/spark/transformations/strings/regexp.html#koheesio.spark.transformations.strings.regexp.RegexpReplace.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/regexp.py
def func(self, column: Column):\n return regexp_replace(column, self.regexp, self.replacement)\n
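Because regexp is a full regular expression, character classes work as well; a small sketch assuming a hypothetical input_df with a phone column:
output_df = RegexpReplace(\n    column=\"phone\",\n    regexp=\"[^0-9]\",\n    replacement=\"\",\n).transform(input_df)\n# strips every non-digit character, e.g. \"(+31) 6-1234\" becomes \"3161234\"\n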
"},{"location":"api_reference/spark/transformations/strings/replace.html","title":"Replace","text":"String replacements without using regular expressions.
"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace","title":"koheesio.spark.transformations.strings.replace.Replace","text":"Replace all instances of a string in a column with another string.
This transformation uses PySpark when().otherwise() functions.
Notes - If original_value is not set, the transformation will replace all null values with new_value
- If original_value is set, the transformation will replace all values matching original_value with new_value
- Numeric values are supported, but will be cast to string in the process
- Replace is meant for simple string replacements. If more advanced replacements are needed, use the
RegexpReplace
transformation instead.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to replace values in. Alias: column
required target_column
Optional[str]
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
None
original_value
Optional[str]
The original value that needs to be replaced. Alias: from
None
new_value
str
The new value to replace this with. Alias: to
required Examples:
input_df:
column hello world None"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace--replace-all-null-values-with-a-new-value","title":"Replace all null values with a new value","text":"output_df = Replace(\n column=\"column\",\n target_column=\"replaced_column\",\n original_value=None, # This is the default value, so it can be omitted\n new_value=\"programmer\",\n).transform(input_df)\n
output_df:
column replaced_column hello hello world world None programmer"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace--replace-all-instances-of-a-string-in-a-column-with-another-string","title":"Replace all instances of a string in a column with another string","text":"output_df = Replace(\n column=\"column\",\n target_column=\"replaced_column\",\n original_value=\"world\",\n new_value=\"programmer\",\n).transform(input_df)\n
output_df:
column replaced_column hello hello world programmer None None"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace.new_value","title":"new_value class-attribute
instance-attribute
","text":"new_value: str = Field(default=..., alias='to', description='The new value to replace this with')\n
"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace.original_value","title":"original_value class-attribute
instance-attribute
","text":"original_value: Optional[str] = Field(default=None, alias='from', description='The original value that needs to be replaced')\n
"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace.cast_values_to_str","title":"cast_values_to_str","text":"cast_values_to_str(value)\n
Cast values to string if they are not None
Source code in src/koheesio/spark/transformations/strings/replace.py
@field_validator(\"original_value\", \"new_value\", mode=\"before\")\ndef cast_values_to_str(cls, value):\n \"\"\"Cast values to string if they are not None\"\"\"\n if value:\n return str(value)\n
"},{"location":"api_reference/spark/transformations/strings/replace.html#koheesio.spark.transformations.strings.replace.Replace.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/replace.py
def func(self, column: Column):\n when_statement = (\n when(column.isNull(), lit(self.new_value))\n if not self.original_value\n else when(\n column == self.original_value,\n lit(self.new_value),\n )\n )\n return when_statement.otherwise(column)\n
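As the notes mention, numeric values are accepted and cast to string by the cast_values_to_str validator shown above; a minimal sketch, assuming a hypothetical input_df with a numeric amount column:
output_df = Replace(\n    column=\"amount\",\n    original_value=1500,  # cast to \"1500\" by the validator\n    new_value=2000,  # cast to \"2000\" by the validator\n).transform(input_df)\n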
"},{"location":"api_reference/spark/transformations/strings/split.html","title":"Split","text":"Splits the contents of a column on basis of a split_pattern
Classes:
Name Description SplitAll
Splits the contents of a column on the basis of a split_pattern.
SplitAtFirstMatch
Like SplitAll, but only splits the string once. You can specify whether you want the first or second part.
"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAll","title":"koheesio.spark.transformations.strings.split.SplitAll","text":"This function splits the contents of a column on basis of a split_pattern.
It splits at al the locations the pattern is found. The new column will be of ArrayType.
Wraps the pyspark.sql.functions.split function.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to split. Alias: column
required target_column
Optional[str]
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
None
split_pattern
str
This is the pattern that will be used to split the column contents.
required Example"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAll--splitting-with-a-space-as-a-pattern","title":"Splitting with a space as a pattern:","text":"input_df:
product amount country Banana lemon orange 1000 USA Carrots Blueberries 1500 USA Beans 1600 USA output_df = SplitAll(column=\"product\", target_column=\"split\", split_pattern=\" \").transform(input_df)\n
output_df:
product amount country split Banana lemon orange 1000 USA [\"Banana\", \"lemon\", \"orange\"] Carrots Blueberries 1500 USA [\"Carrots\", \"Blueberries\"] Beans 1600 USA [\"Beans\"]
instance-attribute
","text":"split_pattern: str = Field(default=..., description='The pattern to split the column contents.')\n
"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAll.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/split.py
def func(self, column: Column):\n return split(column, pattern=self.split_pattern)\n
"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAtFirstMatch","title":"koheesio.spark.transformations.strings.split.SplitAtFirstMatch","text":"Like SplitAll, but only splits the string once. You can specify whether you want the first or second part..
Note - SplitAtFirstMatch splits the string only once. To specify whether you want the first part or the second you can toggle the parameter retrieve_first_part.
- The new column will be of StringType.
- If you want to split a column more than once, you should call this function multiple times.
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to split. Alias: column
required target_column
Optional[str]
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
None
split_pattern
str
This is the pattern that will be used to split the column contents.
required retrieve_first_part
Optional[bool]
Takes the first part of the split when true, the second part when False. Other parts are ignored. Defaults to True.
True
Example"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAtFirstMatch--splitting-with-a-space-as-a-pattern","title":"Splitting with a space as a pattern:","text":"input_df:
product amount country Banana lemon orange 1000 USA Carrots Blueberries 1500 USA Beans 1600 USA output_df = SplitAtFirstMatch(column=\"product\", target_column=\"split_first\", split_pattern=\"an\").transform(input_df)\n
output_df:
product amount country split_first Banana lemon orange 1000 USA B Carrots Blueberries 1500 USA Carrots Blueberries Beans 1600 USA Be"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAtFirstMatch.retrieve_first_part","title":"retrieve_first_part class-attribute
instance-attribute
","text":"retrieve_first_part: Optional[bool] = Field(default=True, description='Takes the first part of the split when true, the second part when False. Other parts are ignored.')\n
"},{"location":"api_reference/spark/transformations/strings/split.html#koheesio.spark.transformations.strings.split.SplitAtFirstMatch.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/split.py
def func(self, column: Column):\n split_func = split(column, pattern=self.split_pattern)\n\n # first part\n if self.retrieve_first_part:\n return split_func.getItem(0)\n\n # or, second part\n return coalesce(split_func.getItem(1), lit(\"\"))\n
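A short sketch of the retrieve_first_part=False branch shown above, which returns the part after the first match and falls back to an empty string when there is no second part. The greeting column and its value \"hello world\" are hypothetical:
output_df = SplitAtFirstMatch(\n    column=\"greeting\",\n    target_column=\"second_part\",\n    split_pattern=\" \",\n    retrieve_first_part=False,\n).transform(input_df)\n# for the value \"hello world\", second_part becomes \"world\"\n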
"},{"location":"api_reference/spark/transformations/strings/substring.html","title":"Substring","text":"Extracts a substring from a string column starting at the given position.
"},{"location":"api_reference/spark/transformations/strings/substring.html#koheesio.spark.transformations.strings.substring.Substring","title":"koheesio.spark.transformations.strings.substring.Substring","text":"Extracts a substring from a string column starting at the given position.
This is a wrapper around PySpark substring() function
Notes - Numeric columns will be cast to string
- start is 1-indexed, not 0-indexed!
Parameters:
Name Type Description Default columns
Union[str, List[str]]
The column (or list of columns) to substring. Alias: column
required target_column
Optional[str]
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
None
start
PositiveInt
Positive int. Defines where to begin the substring from. The first character of the field has index 1!
required length
Optional[int]
Optional. If not provided, the substring will go until end of string.
-1
Example"},{"location":"api_reference/spark/transformations/strings/substring.html#koheesio.spark.transformations.strings.substring.Substring--extract-a-substring-from-a-string-column-starting-at-the-given-position","title":"Extract a substring from a string column starting at the given position.","text":"input_df:
column skyscraper output_df = Substring(\n column=\"column\",\n target_column=\"substring_column\",\n start=3, # 1-indexed! So this will start at the 3rd character\n length=4,\n).transform(input_df)\n
output_df:
column substring_column skyscraper yscr"},{"location":"api_reference/spark/transformations/strings/substring.html#koheesio.spark.transformations.strings.substring.Substring.length","title":"length class-attribute
instance-attribute
","text":"length: Optional[int] = Field(default=-1, description='The target length for the string. use -1 to perform until end')\n
"},{"location":"api_reference/spark/transformations/strings/substring.html#koheesio.spark.transformations.strings.substring.Substring.start","title":"start class-attribute
instance-attribute
","text":"start: PositiveInt = Field(default=..., description='The starting position')\n
"},{"location":"api_reference/spark/transformations/strings/substring.html#koheesio.spark.transformations.strings.substring.Substring.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/substring.py
def func(self, column: Column):\n return when(column.isNull(), None).otherwise(substring(column, self.start, self.length)).cast(StringType())\n
"},{"location":"api_reference/spark/transformations/strings/trim.html","title":"Trim","text":"Trim whitespace from the beginning and/or end of a string.
Classes:
Name Description - `Trim`
Trim whitespace from the beginning and/or end of a string.
- `LTrim`
Trim whitespace from the beginning of a string.
- `RTrim`
Trim whitespace from the end of a string.
See class docstrings for more information.
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.trim_type","title":"koheesio.spark.transformations.strings.trim.trim_type module-attribute
","text":"trim_type = Literal['left', 'right', 'left-right']\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.LTrim","title":"koheesio.spark.transformations.strings.trim.LTrim","text":"Trim whitespace from the beginning of a string. Alias: LeftTrim
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.LTrim.direction","title":"direction class-attribute
instance-attribute
","text":"direction: trim_type = 'left'\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.RTrim","title":"koheesio.spark.transformations.strings.trim.RTrim","text":"Trim whitespace from the end of a string. Alias: RightTrim
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.RTrim.direction","title":"direction class-attribute
instance-attribute
","text":"direction: trim_type = 'right'\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim","title":"koheesio.spark.transformations.strings.trim.Trim","text":"Trim whitespace from the beginning and/or end of a string.
This is a wrapper around PySpark ltrim() and rtrim() functions
The direction
parameter can be changed to apply either a left or a right trim. Defaults to left AND right trim.
Note: If the type of the column is not string, Trim will not be run. A Warning will be thrown indicating this
Parameters:
Name Type Description Default columns
The column (or list of columns) to trim. Alias: column If no columns are provided, all string columns will be trimmed.
required target_column
The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.
required direction
On which side to remove the spaces. Either \"left\", \"right\" or \"left-right\". Defaults to \"left-right\"
required Examples:
input_df: | column | |-----------| | \" hello \" |
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim--trim-whitespace-from-the-beginning-of-a-string","title":"Trim whitespace from the beginning of a string","text":"output_df = Trim(column=\"column\", target_column=\"trimmed_column\", direction=\"left\").transform(input_df)\n
output_df: | column | trimmed_column | |-----------|----------------| | \" hello \" | \"hello \" |
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim--trim-whitespace-from-both-sides-of-a-string","title":"Trim whitespace from both sides of a string","text":"output_df = Trim(\n column=\"column\",\n target_column=\"trimmed_column\",\n direction=\"left-right\", # default value\n).transform(input_df)\n
output_df: | column | trimmed_column | |-----------|----------------| | \" hello \" | \"hello\" |
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim--trim-whitespace-from-the-end-of-a-string","title":"Trim whitespace from the end of a string","text":"output_df = Trim(column=\"column\", target_column=\"trimmed_column\", direction=\"right\").transform(input_df)\n
output_df: | column | trimmed_column | |-----------|----------------| | \" hello \" | \" hello\" |
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.columns","title":"columns class-attribute
instance-attribute
","text":"columns: ListOfColumns = Field(default='*', alias='column', description='The column (or list of columns) to trim. Alias: column. If no columns are provided, all stringcolumns will be trimmed.')\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.direction","title":"direction class-attribute
instance-attribute
","text":"direction: trim_type = Field(default='left-right', description=\"On which side to remove the spaces. Either 'left', 'right' or 'left-right'\")\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.ColumnConfig","title":"ColumnConfig","text":"Limit data types to string only.
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.ColumnConfig.limit_data_type","title":"limit_data_type class-attribute
instance-attribute
","text":"limit_data_type = [STRING]\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.ColumnConfig.run_for_all_data_type","title":"run_for_all_data_type class-attribute
instance-attribute
","text":"run_for_all_data_type = [STRING]\n
"},{"location":"api_reference/spark/transformations/strings/trim.html#koheesio.spark.transformations.strings.trim.Trim.func","title":"func","text":"func(column: Column)\n
Source code in src/koheesio/spark/transformations/strings/trim.py
def func(self, column: Column):\n if self.direction == \"left\":\n return f.ltrim(column)\n\n if self.direction == \"right\":\n return f.rtrim(column)\n\n # both (left-right)\n return f.rtrim(f.ltrim(column))\n
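Since LTrim and RTrim only override the direction default, they can be used as drop-in shortcuts for Trim; a small sketch using the same hypothetical input_df as in the examples above:
from koheesio.spark.transformations.strings.trim import LTrim\n\noutput_df = LTrim(column=\"column\", target_column=\"trimmed_column\").transform(input_df)\n# equivalent to Trim(column=\"column\", target_column=\"trimmed_column\", direction=\"left\")\n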
"},{"location":"api_reference/spark/writers/index.html","title":"Writers","text":"The Writer class is used to write the DataFrame to a target.
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode","title":"koheesio.spark.writers.BatchOutputMode","text":"For Batch:
- append: Append the contents of the DataFrame to the output table, default option in Koheesio.
- overwrite: overwrite the existing data.
- ignore: ignore the operation (i.e. no-op).
- error or errorifexists: throw an exception at runtime.
- merge: update matching data in the table and insert rows that do not exist.
- merge_all: update matching data in the table and insert rows that do not exist.
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.APPEND","title":"APPEND class-attribute
instance-attribute
","text":"APPEND = 'append'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.ERROR","title":"ERROR class-attribute
instance-attribute
","text":"ERROR = 'error'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.ERRORIFEXISTS","title":"ERRORIFEXISTS class-attribute
instance-attribute
","text":"ERRORIFEXISTS = 'error'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.IGNORE","title":"IGNORE class-attribute
instance-attribute
","text":"IGNORE = 'ignore'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.MERGE","title":"MERGE class-attribute
instance-attribute
","text":"MERGE = 'merge'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.MERGEALL","title":"MERGEALL class-attribute
instance-attribute
","text":"MERGEALL = 'merge_all'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.MERGE_ALL","title":"MERGE_ALL class-attribute
instance-attribute
","text":"MERGE_ALL = 'merge_all'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.BatchOutputMode.OVERWRITE","title":"OVERWRITE class-attribute
instance-attribute
","text":"OVERWRITE = 'overwrite'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.StreamingOutputMode","title":"koheesio.spark.writers.StreamingOutputMode","text":"For Streaming:
- append: only the new rows in the streaming DataFrame will be written to the sink.
- complete: all the rows in the streaming DataFrame/Dataset will be written to the sink every time there are some updates.
- update: only the rows that were updated in the streaming DataFrame/Dataset will be written to the sink every time there are some updates. If the query doesn't contain aggregations, it will be equivalent to append mode.
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.StreamingOutputMode.APPEND","title":"APPEND class-attribute
instance-attribute
","text":"APPEND = 'append'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.StreamingOutputMode.COMPLETE","title":"COMPLETE class-attribute
instance-attribute
","text":"COMPLETE = 'complete'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.StreamingOutputMode.UPDATE","title":"UPDATE class-attribute
instance-attribute
","text":"UPDATE = 'update'\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer","title":"koheesio.spark.writers.Writer","text":"The Writer class is used to write the DataFrame to a target.
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer.df","title":"df class-attribute
instance-attribute
","text":"df: Optional[DataFrame] = Field(default=None, description='The Spark DataFrame')\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer.format","title":"format class-attribute
instance-attribute
","text":"format: str = Field(default='delta', description='The format of the output')\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer.streaming","title":"streaming property
","text":"streaming: bool\n
Check if the DataFrame is a streaming DataFrame or not.
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer.execute","title":"execute abstractmethod
","text":"execute()\n
Execute on a Writer should handle writing of the self.df (input) as a minimum
Source code in src/koheesio/spark/writers/__init__.py
@abstractmethod\ndef execute(self):\n \"\"\"Execute on a Writer should handle writing of the self.df (input) as a minimum\"\"\"\n # self.df # input dataframe\n ...\n
"},{"location":"api_reference/spark/writers/index.html#koheesio.spark.writers.Writer.write","title":"write","text":"write(df: Optional[DataFrame] = None) -> Output\n
Write the DataFrame to the output using execute() and return the output.
If no DataFrame is passed, the self.df will be used. If no self.df is set, a RuntimeError will be thrown.
Source code in src/koheesio/spark/writers/__init__.py
def write(self, df: Optional[DataFrame] = None) -> SparkStep.Output:\n \"\"\"Write the DataFrame to the output using execute() and return the output.\n\n If no DataFrame is passed, the self.df will be used.\n If no self.df is set, a RuntimeError will be thrown.\n \"\"\"\n self.df = df or self.df\n if not self.df:\n raise RuntimeError(\"No valid Dataframe was passed\")\n self.execute()\n return self.output\n
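A minimal sketch of what a concrete Writer subclass could look like, based only on the interface shown above; the ShowWriter name and its body are illustrative and not part of Koheesio:
from koheesio.spark.writers import Writer\n\n\nclass ShowWriter(Writer):\n    \"\"\"Illustrative writer that simply displays the DataFrame.\"\"\"\n\n    def execute(self):\n        # write() guarantees self.df is set before execute() is called\n        self.df.show()\n\n\nShowWriter().write(df=input_df)  # input_df is a hypothetical Spark DataFrame\n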
"},{"location":"api_reference/spark/writers/buffer.html","title":"Buffer","text":"This module contains classes for writing data to a buffer before writing to the final destination.
The BufferWriter
class is a base class for writers that write to a buffer first. It provides methods for writing, reading, and resetting the buffer, as well as checking if the buffer is compressed and compressing the buffer.
The PandasCsvBufferWriter
class is a subclass of BufferWriter
that writes a Spark DataFrame to CSV file(s) using Pandas. It is not meant to be used for writing huge amounts of data, but rather for writing smaller amounts of data to more arbitrary file systems (e.g., SFTP).
The PandasJsonBufferWriter
class is a subclass of BufferWriter
that writes a Spark DataFrame to JSON file(s) using Pandas. It is not meant to be used for writing huge amounts of data, but rather for writing smaller amounts of data to more arbitrary file systems (e.g., SFTP).
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter","title":"koheesio.spark.writers.buffer.BufferWriter","text":"Base class for writers that write to a buffer first, before writing to the final destination.
execute()
method should implement how the incoming DataFrame is written to the buffer object (e.g. BytesIO) in the output.
The default implementation uses a SpooledTemporaryFile
as the buffer. This is a file-like object that starts off stored in memory and automatically rolls over to a temporary file on disk if it exceeds a certain size. A SpooledTemporaryFile
behaves similar to BytesIO
, but with the added benefit of being able to handle larger amounts of data.
This approach provides a balance between speed and memory usage, allowing for fast in-memory operations for smaller amounts of data while still being able to handle larger amounts of data that would not otherwise fit in memory.
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output","title":"Output","text":"Output class for BufferWriter
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.buffer","title":"buffer class-attribute
instance-attribute
","text":"buffer: InstanceOf[SpooledTemporaryFile] = Field(default_factory=partial(SpooledTemporaryFile, mode='w+b', max_size=0), exclude=True)\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.compress","title":"compress","text":"compress()\n
Compress the file_buffer in place using GZIP
Source code in src/koheesio/spark/writers/buffer.py
def compress(self):\n \"\"\"Compress the file_buffer in place using GZIP\"\"\"\n # check if the buffer is already compressed\n if self.is_compressed():\n self.logger.warn(\"Buffer is already compressed. Nothing to compress...\")\n return self\n\n # compress the file_buffer\n file_buffer = self.buffer\n compressed = gzip.compress(file_buffer.read())\n\n # write the compressed content back to the buffer\n self.reset_buffer()\n self.buffer.write(compressed)\n\n return self # to allow for chaining\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.is_compressed","title":"is_compressed","text":"is_compressed()\n
Check if the buffer is compressed.
Source code in src/koheesio/spark/writers/buffer.py
def is_compressed(self):\n \"\"\"Check if the buffer is compressed.\"\"\"\n self.rewind_buffer()\n magic_number_present = self.buffer.read(2) == b\"\\x1f\\x8b\"\n self.rewind_buffer()\n return magic_number_present\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.read","title":"read","text":"read()\n
Read the buffer
Source code in src/koheesio/spark/writers/buffer.py
def read(self):\n \"\"\"Read the buffer\"\"\"\n self.rewind_buffer()\n data = self.buffer.read()\n self.rewind_buffer()\n return data\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.reset_buffer","title":"reset_buffer","text":"reset_buffer()\n
Reset the buffer
Source code in src/koheesio/spark/writers/buffer.py
def reset_buffer(self):\n \"\"\"Reset the buffer\"\"\"\n self.buffer.truncate(0)\n self.rewind_buffer()\n return self\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.Output.rewind_buffer","title":"rewind_buffer","text":"rewind_buffer()\n
Rewind the buffer
Source code in src/koheesio/spark/writers/buffer.py
def rewind_buffer(self):\n \"\"\"Rewind the buffer\"\"\"\n self.buffer.seek(0)\n return self\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.BufferWriter.write","title":"write","text":"write(df=None) -> Output\n
Write the DataFrame to the buffer
Source code in src/koheesio/spark/writers/buffer.py
def write(self, df=None) -> Output:\n \"\"\"Write the DataFrame to the buffer\"\"\"\n self.df = df or self.df\n if not self.df:\n raise RuntimeError(\"No valid Dataframe was passed\")\n self.output.reset_buffer()\n self.execute()\n return self.output\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter","title":"koheesio.spark.writers.buffer.PandasCsvBufferWriter","text":"Write a Spark DataFrame to CSV file(s) using Pandas.
Takes inspiration from https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
See also: https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option
Note This class is not meant to be used for writing huge amounts of data. It is meant to be used for writing smaller amounts of data to more arbitrary file systems (e.g. SFTP).
Pyspark vs Pandas The following table shows the mapping between Pyspark, Pandas, and Koheesio properties. Note that the default values are mostly the same as Pyspark's DataFrameWriter
implementation, with some exceptions (see below).
This class implements the most commonly used properties. If a property is not explicitly implemented, it can be accessed through params
.
PySpark Property Default PySpark Pandas Property Default Pandas Koheesio Property Default Koheesio Notes maxRecordsPerFile ... chunksize None max_records_per_file ... Spark property name: spark.sql.files.maxRecordsPerFile sep , sep , sep , lineSep \\n
line_terminator os.linesep lineSep (alias=line_terminator) \\n N/A ... index True index False Determines whether row labels (index) are included in the output header False header True header True quote \" quotechar \" quote (alias=quotechar) \" quoteAll False doublequote True quoteAll (alias=doublequote) False escape \\
escapechar None escapechar (alias=escape) \\ escapeQuotes True N/A N/A N/A ... Not available in Pandas ignoreLeadingWhiteSpace True N/A N/A N/A ... Not available in Pandas ignoreTrailingWhiteSpace True N/A N/A N/A ... Not available in Pandas charToEscapeQuoteEscaping escape or \u0000
N/A N/A N/A ... Not available in Pandas dateFormat yyyy-MM-dd
N/A N/A N/A ... Pandas implements Timestamp, not Date timestampFormat yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]
date_format N/A timestampFormat (alias=date_format) yyyy-MM-dd'T'HHss.SSS Follows PySpark defaults timestampNTZFormat yyyy-MM-dd'T'HH:mm:ss[.SSS]
N/A N/A N/A ... Pandas implements Timestamp, see above compression None compression infer compression None encoding utf-8 encoding utf-8 N/A ... Not explicitly implemented nullValue \"\" na_rep \"\" N/A \"\" Not explicitly implemented emptyValue \"\" na_rep \"\" N/A \"\" Not explicitly implemented N/A ... float_format N/A N/A ... Not explicitly implemented N/A ... decimal N/A N/A ... Not explicitly implemented N/A ... index_label None N/A ... Not explicitly implemented N/A ... columns N/A N/A ... Not explicitly implemented N/A ... mode N/A N/A ... Not explicitly implemented N/A ... quoting N/A N/A ... Not explicitly implemented N/A ... errors N/A N/A ... Not explicitly implemented N/A ... storage_options N/A N/A ... Not explicitly implemented differences with Pyspark: - dateFormat -> Pandas implements Timestamp, not just Date. Hence, Koheesio sets the default to the python equivalent of PySpark's default.
- compression -> Spark does not compress by default, hence Koheesio does not compress by default. Compression can be provided though.
Parameters:
Name Type Description Default header
bool
Whether to write the names of columns as the first line. In Pandas a list of strings can be given assumed to be aliases for the column names - this is not supported in this class. Instead, the column names as used in the dataframe are used as the header. Koheesio sets this default to True to match Pandas' default.
True
sep
str
Field delimiter for the output file. Default is ','.
,
quote
str
String of length 1. Character used to quote fields. In PySpark, this is called 'quote', in Pandas this is called 'quotechar'. Default is '\"'.
\"
quoteAll
bool
A flag indicating whether all values should always be enclosed in quotes in a field. Koheesio sets the default (False) to only escape values containing a quote character - this is Pyspark's default behavior. In Pandas, this is called 'doublequote'. Default is False.
False
escape
str
String of length 1. Character used to escape sep and quotechar when appropriate. Koheesio sets this default to \\
to match Pyspark's default behavior. In Pandas, this field is called 'escapechar', and defaults to None. Default is '\\'.
\\
timestampFormat
str
Sets the string that indicates a date format for datetime objects. Koheesio sets this default to a close equivalent of Pyspark's default (excluding timezone information). In Pandas, this field is called 'date_format'. Note: Pandas does not support Java Timestamps, only Python Timestamps. The Pyspark default is yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]
which mimics the iso8601 format (datetime.isoformat()
). Default is '%Y-%m-%dT%H:%M:%S.%f'.
yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]
lineSep
str, optional, default=
String of length 1. Defines the character used as line separator that should be used for writing. Default is os.linesep.
required compression
Optional[Literal['infer', 'gzip', 'bz2', 'zip', 'xz', 'zstd', 'tar']]
A string representing the compression to use for on-the-fly compression of the output data. Note: Pandas sets this default to 'infer', Koheesio sets this default to 'None' leaving the data uncompressed by default. Can be set to one of 'infer', 'gzip', 'bz2', 'zip', 'xz', 'zstd', or 'tar'. See Pandas documentation for more details.
None
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.compression","title":"compression class-attribute
instance-attribute
","text":"compression: Optional[CompressionOptions] = Field(default=None, description=\"A string representing the compression to use for on-the-fly compression of the output data.Note: Pandas sets this default to 'infer', Koheesio sets this default to 'None' leaving the data uncompressed by default. Can be set to one of 'infer', 'gzip', 'bz2', 'zip', 'xz', 'zstd', or 'tar'. See Pandas documentation for more details.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.escape","title":"escape class-attribute
instance-attribute
","text":"escape: constr(max_length=1) = Field(default='\\\\', description=\"String of length 1. Character used to escape sep and quotechar when appropriate. Koheesio sets this default to `\\\\` to match Pyspark's default behavior. In Pandas, this is called 'escapechar', and defaults to None.\", alias='escapechar')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.header","title":"header class-attribute
instance-attribute
","text":"header: bool = Field(default=True, description=\"Whether to write the names of columns as the first line. In Pandas a list of strings can be given assumed to be aliases for the column names - this is not supported in this class. Instead, the column names as used in the dataframe are used as the header. Koheesio sets this default to True to match Pandas' default.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.index","title":"index class-attribute
instance-attribute
","text":"index: bool = Field(default=False, description='Toggles whether to write row names (index). Default False in Koheesio - pandas default is True.')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.lineSep","title":"lineSep class-attribute
instance-attribute
","text":"lineSep: Optional[constr(max_length=1)] = Field(default=linesep, description='String of length 1. Defines the character used as line separator that should be used for writing.', alias='line_terminator')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.quote","title":"quote class-attribute
instance-attribute
","text":"quote: constr(max_length=1) = Field(default='\"', description=\"String of length 1. Character used to quote fields. In PySpark, this is called 'quote', in Pandas this is called 'quotechar'.\", alias='quotechar')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.quoteAll","title":"quoteAll class-attribute
instance-attribute
","text":"quoteAll: bool = Field(default=False, description=\"A flag indicating whether all values should always be enclosed in quotes in a field. Koheesio set the default (False) to only escape values containing a quote character - this is Pyspark's default behavior. In Pandas, this is called 'doublequote'.\", alias='doublequote')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.sep","title":"sep class-attribute
instance-attribute
","text":"sep: constr(max_length=1) = Field(default=',', description='Field delimiter for the output file')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.timestampFormat","title":"timestampFormat class-attribute
instance-attribute
","text":"timestampFormat: str = Field(default='%Y-%m-%dT%H:%M:%S.%f', description=\"Sets the string that indicates a date format for datetime objects. Koheesio sets this default to a close equivalent of Pyspark's default (excluding timezone information). In Pandas, this field is called 'date_format'. Note: Pandas does not support Java Timestamps, only Python Timestamps. The Pyspark default is `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]` which mimics the iso8601 format (`datetime.isoformat()`).\", alias='date_format')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.Output","title":"Output","text":"Output class for PandasCsvBufferWriter
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.Output.pandas_df","title":"pandas_df class-attribute
instance-attribute
","text":"pandas_df: Optional[DataFrame] = Field(None, description='The Pandas DataFrame that was written')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.execute","title":"execute","text":"execute()\n
Write the DataFrame to the buffer using Pandas to_csv() method. Compression is handled by pandas to_csv() method.
Source code in src/koheesio/spark/writers/buffer.py
def execute(self):\n \"\"\"Write the DataFrame to the buffer using Pandas to_csv() method.\n Compression is handled by pandas to_csv() method.\n \"\"\"\n # convert the Spark DataFrame to a Pandas DataFrame\n self.output.pandas_df = self.df.toPandas()\n\n # create csv file in memory\n file_buffer = self.output.buffer\n self.output.pandas_df.to_csv(file_buffer, **self.get_options(options_type=\"spark\"))\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasCsvBufferWriter.get_options","title":"get_options","text":"get_options(options_type: str = 'csv')\n
Returns the options to pass to Pandas' to_csv() method.
Source code in src/koheesio/spark/writers/buffer.py
def get_options(self, options_type: str = \"csv\"):\n \"\"\"Returns the options to pass to Pandas' to_csv() method.\"\"\"\n try:\n import pandas as _pd\n\n # Get the pandas version as a tuple of integers\n pandas_version = tuple(int(i) for i in _pd.__version__.split(\".\"))\n except ImportError:\n raise ImportError(\"Pandas is required to use this writer\")\n\n # Use line_separator for pandas 2.0.0 and later\n line_sep_option_naming = \"line_separator\" if pandas_version >= (2, 0, 0) else \"line_terminator\"\n\n csv_options = {\n \"header\": self.header,\n \"sep\": self.sep,\n \"quotechar\": self.quote,\n \"doublequote\": self.quoteAll,\n \"escapechar\": self.escape,\n \"na_rep\": self.emptyValue or self.nullValue,\n line_sep_option_naming: self.lineSep,\n \"index\": self.index,\n \"date_format\": self.timestampFormat,\n \"compression\": self.compression,\n **self.params,\n }\n\n if options_type == \"spark\":\n csv_options[\"lineterminator\"] = csv_options.pop(line_sep_option_naming)\n elif options_type == \"kohesio_pandas_buffer_writer\":\n csv_options[\"line_terminator\"] = csv_options.pop(line_sep_option_naming)\n\n return csv_options\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter","title":"koheesio.spark.writers.buffer.PandasJsonBufferWriter","text":"Write a Spark DataFrame to JSON file(s) using Pandas.
Takes inspiration from https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html
Note This class is not meant to be used for writing huge amounts of data. It is meant to be used for writing smaller amounts of data to more arbitrary file systems (e.g. SFTP).
Parameters:
Name Type Description Default orient
Format of the resulting JSON string. Default is 'records'.
required lines
Format output as one JSON object per line. Only used when orient='records'. Default is True. - If true, the output will be formatted as one JSON object per line. - If false, the output will be written as a single JSON object. Note: this value is only used when orient='records' and will be ignored otherwise.
required date_format
Type of date conversion. Default is 'iso'. See Date and Timestamp Formats
for a detailed description and more information.
required double_precision
Number of decimal places for encoding floating point values. Default is 10.
required force_ascii
Force encoded string to be ASCII. Default is True.
required compression
A string representing the compression to use for on-the-fly compression of the output data. Koheesio sets this default to None, leaving the data uncompressed. Can be set to 'gzip' optionally. Other compression options are currently not supported by Koheesio for JSON output.
required"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.columns","title":"columns class-attribute
instance-attribute
","text":"columns: Optional[list[str]] = Field(default=None, description='The columns to write. If None, all columns will be written.')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.compression","title":"compression class-attribute
instance-attribute
","text":"compression: Optional[Literal['gzip']] = Field(default=None, description=\"A string representing the compression to use for on-the-fly compression of the output data.Koheesio sets this default to 'None' leaving the data uncompressed by default. Can be set to 'gzip' optionally.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.date_format","title":"date_format class-attribute
instance-attribute
","text":"date_format: Literal['iso', 'epoch'] = Field(default='iso', description=\"Type of date conversion. Default is 'iso'.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.double_precision","title":"double_precision class-attribute
instance-attribute
","text":"double_precision: int = Field(default=10, description='Number of decimal places for encoding floating point values. Default is 10.')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.force_ascii","title":"force_ascii class-attribute
instance-attribute
","text":"force_ascii: bool = Field(default=True, description='Force encoded string to be ASCII. Default is True.')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.lines","title":"lines class-attribute
instance-attribute
","text":"lines: bool = Field(default=True, description=\"Format output as one JSON object per line. Only used when orient='records'. Default is True.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.orient","title":"orient class-attribute
instance-attribute
","text":"orient: Literal['split', 'records', 'index', 'columns', 'values', 'table'] = Field(default='records', description=\"Format of the resulting JSON string. Default is 'records'.\")\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.Output","title":"Output","text":"Output class for PandasJsonBufferWriter
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.Output.pandas_df","title":"pandas_df class-attribute
instance-attribute
","text":"pandas_df: Optional[DataFrame] = Field(None, description='The Pandas DataFrame that was written')\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.execute","title":"execute","text":"execute()\n
Write the DataFrame to the buffer using Pandas to_json() method.
Source code in src/koheesio/spark/writers/buffer.py
def execute(self):\n \"\"\"Write the DataFrame to the buffer using Pandas to_json() method.\"\"\"\n df = self.df\n if self.columns:\n df = df[self.columns]\n\n # convert the Spark DataFrame to a Pandas DataFrame\n self.output.pandas_df = df.toPandas()\n\n # create json file in memory\n file_buffer = self.output.buffer\n self.output.pandas_df.to_json(file_buffer, **self.get_options())\n\n # compress the buffer if compression is set\n if self.compression:\n self.output.compress()\n
"},{"location":"api_reference/spark/writers/buffer.html#koheesio.spark.writers.buffer.PandasJsonBufferWriter.get_options","title":"get_options","text":"get_options()\n
Returns the options to pass to Pandas' to_json() method.
Source code in src/koheesio/spark/writers/buffer.py
def get_options(self):\n \"\"\"Returns the options to pass to Pandas' to_json() method.\"\"\"\n json_options = {\n \"orient\": self.orient,\n \"date_format\": self.date_format,\n \"double_precision\": self.double_precision,\n \"force_ascii\": self.force_ascii,\n \"lines\": self.lines,\n **self.params,\n }\n\n # ignore the 'lines' parameter if orient is not 'records'\n if self.orient != \"records\":\n del json_options[\"lines\"]\n\n return json_options\n
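Analogous to the CSV writer, a minimal usage sketch for PandasJsonBufferWriter; `df` is assumed to be an existing, small Spark DataFrame, and the fields used are those documented above:

```python
from koheesio.spark.writers.buffer import PandasJsonBufferWriter

writer = PandasJsonBufferWriter(
    df=df,
    orient="records",    # one JSON object per record
    lines=True,          # newline-delimited output (only applies to orient='records')
    compression="gzip",  # optional; execute() compresses the buffer when this is set
)
writer.execute()

json_buffer = writer.output.buffer  # in-memory buffer with the (gzipped) JSON
```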
"},{"location":"api_reference/spark/writers/dummy.html","title":"Dummy","text":"Module for the DummyWriter class.
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter","title":"koheesio.spark.writers.dummy.DummyWriter","text":"A simple DummyWriter that performs the equivalent of a df.show() on the given DataFrame and returns the first row of data as a dict.
This Writer does not actually write anything to a source/destination, but is useful for debugging or testing purposes.
Parameters:
Name Type Description Default n
PositiveInt
Number of rows to show.
20
truncate
bool | PositiveInt
If set to True
, truncate strings longer than 20 chars by default. If set to a number greater than one, truncates long strings to length truncate
and align cells right.
True
vertical
bool
If set to True
, print output rows vertically (one line per column value).
False
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.n","title":"n class-attribute
instance-attribute
","text":"n: PositiveInt = Field(default=20, description='Number of rows to show.', gt=0)\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.truncate","title":"truncate class-attribute
instance-attribute
","text":"truncate: Union[bool, PositiveInt] = Field(default=True, description='If set to ``True``, truncate strings longer than 20 chars by default.If set to a number greater than one, truncates long strings to length ``truncate`` and align cells right.')\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.vertical","title":"vertical class-attribute
instance-attribute
","text":"vertical: bool = Field(default=False, description='If set to ``True``, print output rows vertically (one line per column value).')\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.Output","title":"Output","text":"DummyWriter output
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.Output.df_content","title":"df_content class-attribute
instance-attribute
","text":"df_content: str = Field(default=..., description='The content of the DataFrame as a string')\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.Output.head","title":"head class-attribute
instance-attribute
","text":"head: Dict[str, Any] = Field(default=..., description='The first row of the DataFrame as a dict')\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.execute","title":"execute","text":"execute() -> Output\n
Execute the DummyWriter
Source code in src/koheesio/spark/writers/dummy.py
def execute(self) -> Output:\n \"\"\"Execute the DummyWriter\"\"\"\n df: DataFrame = self.df\n\n # noinspection PyProtectedMember\n df_content = df._jdf.showString(self.n, self.truncate, self.vertical)\n\n # logs the equivalent of doing df.show()\n self.log.info(f\"content of df that was passed to DummyWriter:\\n{df_content}\")\n\n self.output.head = self.df.head().asDict()\n self.output.df_content = df_content\n
"},{"location":"api_reference/spark/writers/dummy.html#koheesio.spark.writers.dummy.DummyWriter.int_truncate","title":"int_truncate","text":"int_truncate(truncate_value) -> int\n
Truncate is either a bool or an int.
Parameters: truncate_value : int | bool, optional, default=True If int, specifies the maximum length of the string. If bool and True, defaults to a maximum length of 20 characters.
Returns: int The maximum length of the string.
Source code in src/koheesio/spark/writers/dummy.py
@field_validator(\"truncate\")\ndef int_truncate(cls, truncate_value) -> int:\n \"\"\"\n Truncate is either a bool or an int.\n\n Parameters:\n -----------\n truncate_value : int | bool, optional, default=True\n If int, specifies the maximum length of the string.\n If bool and True, defaults to a maximum length of 20 characters.\n\n Returns:\n --------\n int\n The maximum length of the string.\n\n \"\"\"\n # Same logic as what is inside DataFrame.show()\n if isinstance(truncate_value, bool) and truncate_value is True:\n return 20 # default is 20 chars\n return int(truncate_value) # otherwise 0, or whatever the user specified\n
"},{"location":"api_reference/spark/writers/kafka.html","title":"Kafka","text":"Kafka writer to write batch or streaming data into kafka topics
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter","title":"koheesio.spark.writers.kafka.KafkaWriter","text":"Kafka writer to write batch or streaming data into kafka topics
All kafka specific options can be provided as additional init params
Parameters:
Name Type Description Default broker
str
broker url of the kafka cluster
required topic
str
full topic name to write the data to
required trigger
Optional[Union[Trigger, str, Dict]]
Indicates optionally how to stream the data into kafka, continuous or batch
required checkpoint_location
str
In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs.
required Example KafkaWriter(\n    broker=\"broker.com:9500\",\n    topic=\"test-topic\",\n    trigger=Trigger(continuous=\"5 seconds\"),\n    checkpoint_location=\"s3://bucket/test-topic\",\n    **{\n        \"includeHeaders\": \"true\",\n        \"key.serializer\": \"org.apache.kafka.common.serialization.StringSerializer\",\n        \"value.serializer\": \"org.apache.kafka.common.serialization.StringSerializer\",\n        \"kafka.group.id\": \"test-group\",\n    },\n)\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.batch_writer","title":"batch_writer property
","text":"batch_writer: DataFrameWriter\n
returns a batch writer
Returns:
Type Description DataFrameWriter
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.broker","title":"broker class-attribute
instance-attribute
","text":"broker: str = Field(default=..., description='Kafka brokers to write to')\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.checkpoint_location","title":"checkpoint_location class-attribute
instance-attribute
","text":"checkpoint_location: str = Field(default=..., alias='checkpointLocation', description='In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs. You can configure a query with a checkpoint location, and the query will save all the progress information and the running aggregates to the checkpoint location. This checkpoint location has to be a path in an HDFS compatible file system that is accessible by Spark (e.g. Databricks Unity Catalog External Location)')\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.format","title":"format class-attribute
instance-attribute
","text":"format: str = 'kafka'\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.logged_option_keys","title":"logged_option_keys property
","text":"logged_option_keys\n
keys to be logged
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.options","title":"options property
","text":"options\n
retrieve the kafka options incl topic and broker.
Returns:
Type Description dict
Dict being the combination of kafka options + topic + broker
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.stream_writer","title":"stream_writer property
","text":"stream_writer: DataStreamWriter\n
returns a stream writer
Returns:
Type Description DataStreamWriter
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.streaming_query","title":"streaming_query property
","text":"streaming_query: Optional[Union[str, StreamingQuery]]\n
return the streaming query
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.topic","title":"topic class-attribute
instance-attribute
","text":"topic: str = Field(default=..., description='Kafka topic to write to')\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.trigger","title":"trigger class-attribute
instance-attribute
","text":"trigger: Optional[Union[Trigger, str, Dict]] = Field(Trigger(available_now=True), description='Set the trigger for the stream query. If not set data is processed in batch')\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.writer","title":"writer property
","text":"writer: Union[DataStreamWriter, DataFrameWriter]\n
function to get the writer of the proper type, according to whether the data to be written is a stream or not. This function will also set the trigger property in case of a data stream.
Returns:
Type Description Union[DataStreamWriter, DataFrameWriter]
In case of streaming data -> DataStreamWriter, else -> DataFrameWriter
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.Output","title":"Output","text":"Output of the KafkaWriter
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.Output.streaming_query","title":"streaming_query class-attribute
instance-attribute
","text":"streaming_query: Optional[Union[str, StreamingQuery]] = Field(default=None, description='Query ID of the stream query')\n
"},{"location":"api_reference/spark/writers/kafka.html#koheesio.spark.writers.kafka.KafkaWriter.execute","title":"execute","text":"execute()\n
Effectively write the data from the dataframe (streaming or batch) to the kafka topic.
Returns:
Type Description Output
streaming_query function can be used to gain insights on running write.
Source code in src/koheesio/spark/writers/kafka.py
def execute(self):\n \"\"\"Effectively write the data from the dataframe (streaming of batch) to kafka topic.\n\n Returns\n -------\n KafkaWriter.Output\n streaming_query function can be used to gain insights on running write.\n \"\"\"\n applied_options = {k: v for k, v in self.options.items() if k in self.logged_option_keys}\n self.log.debug(f\"Applying options {applied_options}\")\n\n self._validate_dataframe()\n\n _writer = self.writer.format(self.format).options(**self.options)\n self.output.streaming_query = _writer.start() if self.streaming else _writer.save()\n
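Building on execute() above, a hedged sketch of pushing a DataFrame to Kafka; the broker, topic and checkpoint values are placeholders, and `df` is assumed to already contain the `key`/`value` columns Kafka expects:

```python
from koheesio.spark.writers.kafka import KafkaWriter

writer = KafkaWriter(
    broker="broker.com:9500",
    topic="test-topic",
    checkpoint_location="s3://bucket/test-topic",
)
writer.write(df)

# For a streaming DataFrame, the query handle is exposed on the output
query = writer.output.streaming_query
```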
"},{"location":"api_reference/spark/writers/snowflake.html","title":"Snowflake","text":"This module contains the SnowflakeWriter class, which is used to write data to Snowflake.
"},{"location":"api_reference/spark/writers/stream.html","title":"Stream","text":"Module that holds some classes and functions to be able to write to a stream
Classes:
Name Description Trigger
class to set the trigger for a stream query
StreamWriter
abstract class for stream writers
ForEachBatchStreamWriter
class to run a writer for each batch
Functions:
Name Description writer_to_foreachbatch
function to be used as batch_function for StreamWriter (sub)classes
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.ForEachBatchStreamWriter","title":"koheesio.spark.writers.stream.ForEachBatchStreamWriter","text":"Runnable ForEachBatchWriter
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.ForEachBatchStreamWriter.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/writers/stream.py
def execute(self):\n self.streaming_query = self.writer.start()\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter","title":"koheesio.spark.writers.stream.StreamWriter","text":"ABC Stream Writer
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.batch_function","title":"batch_function class-attribute
instance-attribute
","text":"batch_function: Optional[Callable] = Field(default=None, description='allows you to run custom batch functions for each micro batch', alias='batch_function_for_each_df')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.checkpoint_location","title":"checkpoint_location class-attribute
instance-attribute
","text":"checkpoint_location: str = Field(default=..., alias='checkpointLocation', description='In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs. You can configure a query with a checkpoint location, and the query will save all the progress information and the running aggregates to the checkpoint location. This checkpoint location has to be a path in an HDFS compatible file system that is accessible by Spark (e.g. Databricks Unity Catalog External Location)')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.output_mode","title":"output_mode class-attribute
instance-attribute
","text":"output_mode: StreamingOutputMode = Field(default=APPEND, alias='outputMode', description=__doc__)\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.stream_writer","title":"stream_writer property
","text":"stream_writer: DataStreamWriter\n
Returns the stream writer for the given DataFrame and settings
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.streaming_query","title":"streaming_query class-attribute
instance-attribute
","text":"streaming_query: Optional[Union[str, StreamingQuery]] = Field(default=None, description='Query ID of the stream query')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.trigger","title":"trigger class-attribute
instance-attribute
","text":"trigger: Optional[Union[Trigger, str, Dict]] = Field(default=Trigger(available_now=True), description='Set the trigger for the stream query. If this is not set it process data as batch')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.writer","title":"writer property
","text":"writer\n
Returns the stream writer since we don't have a batch mode for streams
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.await_termination","title":"await_termination","text":"await_termination(timeout: Optional[int] = None)\n
Await termination of the stream query
Source code in src/koheesio/spark/writers/stream.py
def await_termination(self, timeout: Optional[int] = None):\n \"\"\"Await termination of the stream query\"\"\"\n self.streaming_query.awaitTermination(timeout=timeout)\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.StreamWriter.execute","title":"execute abstractmethod
","text":"execute()\n
Source code in src/koheesio/spark/writers/stream.py
@abstractmethod\ndef execute(self):\n raise NotImplementedError\n
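Because execute() is abstract, concrete stream writers supply it themselves. A minimal sketch of what a subclass could look like, modelled directly on ForEachBatchStreamWriter above (illustrative only):

```python
from koheesio.spark.writers.stream import StreamWriter

class MyStreamWriter(StreamWriter):
    """Illustrative subclass: start the already-configured stream writer."""

    def execute(self):
        # self.writer is the DataStreamWriter with trigger, outputMode and
        # checkpointLocation applied; starting it yields the streaming query
        self.streaming_query = self.writer.start()
```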
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger","title":"koheesio.spark.writers.stream.Trigger","text":"Trigger types for a stream query.
Only one trigger can be set!
Example - processingTime='5 seconds'
- continuous='5 seconds'
- availableNow=True
- once=True
See Also - https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.available_now","title":"available_now class-attribute
instance-attribute
","text":"available_now: Optional[bool] = Field(default=None, alias='availableNow', description='if set to True, set a trigger that processes all available data in multiple batches then terminates the query.')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.continuous","title":"continuous class-attribute
instance-attribute
","text":"continuous: Optional[str] = Field(default=None, description=\"a time interval as a string, e.g. '5 seconds', '1 minute'.Set a trigger that runs a continuous query with a given checkpoint interval.\")\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(validate_default=False, extra='forbid')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.once","title":"once class-attribute
instance-attribute
","text":"once: Optional[bool] = Field(default=None, deprecated=True, description='if set to True, set a trigger that processes only one batch of data in a streaming query then terminates the query. use `available_now` instead of `once`.')\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.processing_time","title":"processing_time class-attribute
instance-attribute
","text":"processing_time: Optional[str] = Field(default=None, alias='processingTime', description=\"a processing time interval as a string, e.g. '5 seconds', '1 minute'.Set a trigger that runs a microbatch query periodically based on the processing time.\")\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.triggers","title":"triggers property
","text":"triggers\n
Returns a list of tuples with the value for each trigger
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.value","title":"value property
","text":"value: Dict[str, str]\n
Returns the trigger value as a dictionary
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.execute","title":"execute","text":"execute()\n
Returns the trigger value as a dictionary This method can be skipped, as the value can be accessed directly from the value
property
Source code in src/koheesio/spark/writers/stream.py
def execute(self):\n \"\"\"Returns the trigger value as a dictionary\n This method can be skipped, as the value can be accessed directly from the `value` property\n \"\"\"\n self.log.warning(\"Trigger.execute is deprecated. Use Trigger.value directly instead\")\n self.output.value = self.value\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.from_any","title":"from_any classmethod
","text":"from_any(value)\n
Dynamically creates a Trigger class based on either another Trigger class, a passed string value, or a dictionary
This way Trigger.from_any can be used as part of a validator, without needing to worry about supported types
Source code in src/koheesio/spark/writers/stream.py
@classmethod\ndef from_any(cls, value):\n \"\"\"Dynamically creates a Trigger class based on either another Trigger class, a passed string value, or a\n dictionary\n\n This way Trigger.from_any can be used as part of a validator, without needing to worry about supported types\n \"\"\"\n if isinstance(value, Trigger):\n return value\n\n if isinstance(value, str):\n return cls.from_string(value)\n\n if isinstance(value, dict):\n return cls.from_dict(value)\n\n raise RuntimeError(f\"Unable to create Trigger based on the given value: {value}\")\n
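To illustrate, the constructors above are interchangeable; the following sketch uses the trigger strings listed in the from_string examples further below:

```python
from koheesio.spark.writers.stream import Trigger

# Three equivalent ways to express the same trigger
t1 = Trigger(processingTime="5 seconds")                 # via the field alias
t2 = Trigger.from_dict({"processingTime": "5 seconds"})
t3 = Trigger.from_any("processingTime='5 seconds'")

print(t1.value)  # e.g. {'processingTime': '5 seconds'}
```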
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.from_dict","title":"from_dict classmethod
","text":"from_dict(_dict)\n
Creates a Trigger class based on a dictionary
Source code in src/koheesio/spark/writers/stream.py
@classmethod\ndef from_dict(cls, _dict):\n \"\"\"Creates a Trigger class based on a dictionary\"\"\"\n return cls(**_dict)\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.from_string","title":"from_string classmethod
","text":"from_string(trigger: str)\n
Creates a Trigger class based on a string
Example Source code in src/koheesio/spark/writers/stream.py
@classmethod\ndef from_string(cls, trigger: str):\n \"\"\"Creates a Trigger class based on a string\n\n Example\n -------\n ### happy flow\n\n * processingTime='5 seconds'\n * processing_time=\"5 hours\"\n * processingTime=4 minutes\n * once=True\n * once=true\n * available_now=true\n * continuous='3 hours'\n * once=TrUe\n * once=TRUE\n\n ### unhappy flow\n valid values, but should fail the validation check of the class\n\n * availableNow=False\n * continuous=True\n * once=false\n \"\"\"\n import re\n\n trigger_from_string = re.compile(r\"(?P<triggerType>\\w+)=[\\'\\\"]?(?P<value>.+)[\\'\\\"]?\")\n _match = trigger_from_string.match(trigger)\n\n if _match is None:\n raise ValueError(\n f\"Cannot parse value for Trigger: '{trigger}'. \\n\"\n f\"Valid types are {', '.join(cls._all_triggers_with_alias())}\"\n )\n\n trigger_type, value = _match.groups()\n\n # strip the value of any quotes\n value = value.strip(\"'\").strip('\"')\n\n # making value a boolean when given\n value = convert_str_to_bool(value)\n\n return cls.from_dict({trigger_type: value})\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.from_string--happy-flow","title":"happy flow","text":" - processingTime='5 seconds'
- processing_time=\"5 hours\"
- processingTime=4 minutes
- once=True
- once=true
- available_now=true
- continuous='3 hours'
- once=TrUe
- once=TRUE
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.from_string--unhappy-flow","title":"unhappy flow","text":"valid values, but should fail the validation check of the class
- availableNow=False
- continuous=True
- once=false
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.validate_available_now","title":"validate_available_now","text":"validate_available_now(available_now)\n
Validate the available_now trigger value
Source code in src/koheesio/spark/writers/stream.py
@field_validator(\"available_now\", mode=\"before\")\ndef validate_available_now(cls, available_now):\n \"\"\"Validate the available_now trigger value\"\"\"\n # making value a boolean when given\n available_now = convert_str_to_bool(available_now)\n\n # adapted from `pyspark.sql.streaming.readwriter.DataStreamWriter.trigger`\n if available_now is not True:\n raise ValueError(f\"Value for availableNow must be True. Got:{available_now}\")\n return available_now\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.validate_continuous","title":"validate_continuous","text":"validate_continuous(continuous)\n
Validate the continuous trigger value
Source code in src/koheesio/spark/writers/stream.py
@field_validator(\"continuous\", mode=\"before\")\ndef validate_continuous(cls, continuous):\n \"\"\"Validate the continuous trigger value\"\"\"\n # adapted from `pyspark.sql.streaming.readwriter.DataStreamWriter.trigger` except that the if statement is not\n # split in two parts\n if not isinstance(continuous, str):\n raise ValueError(f\"Value for continuous must be a string. Got: {continuous}\")\n\n if len(continuous.strip()) == 0:\n raise ValueError(f\"Value for continuous must be a non empty string. Got: {continuous}\")\n return continuous\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.validate_once","title":"validate_once","text":"validate_once(once)\n
Validate the once trigger value
Source code in src/koheesio/spark/writers/stream.py
@field_validator(\"once\", mode=\"before\")\ndef validate_once(cls, once):\n \"\"\"Validate the once trigger value\"\"\"\n # making value a boolean when given\n once = convert_str_to_bool(once)\n\n # adapted from `pyspark.sql.streaming.readwriter.DataStreamWriter.trigger`\n if once is not True:\n raise ValueError(f\"Value for once must be True. Got: {once}\")\n return once\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.validate_processing_time","title":"validate_processing_time","text":"validate_processing_time(processing_time)\n
Validate the processing time trigger value
Source code in src/koheesio/spark/writers/stream.py
@field_validator(\"processing_time\", mode=\"before\")\ndef validate_processing_time(cls, processing_time):\n \"\"\"Validate the processing time trigger value\"\"\"\n # adapted from `pyspark.sql.streaming.readwriter.DataStreamWriter.trigger`\n if not isinstance(processing_time, str):\n raise ValueError(f\"Value for processing_time must be a string. Got: {processing_time}\")\n\n if len(processing_time.strip()) == 0:\n raise ValueError(f\"Value for processingTime must be a non empty string. Got: {processing_time}\")\n return processing_time\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.Trigger.validate_triggers","title":"validate_triggers","text":"validate_triggers(triggers: Dict)\n
Validate the trigger value
Source code in src/koheesio/spark/writers/stream.py
@model_validator(mode=\"before\")\ndef validate_triggers(cls, triggers: Dict):\n \"\"\"Validate the trigger value\"\"\"\n params = [*triggers.values()]\n\n # adapted from `pyspark.sql.streaming.readwriter.DataStreamWriter.trigger`; modified to work with pydantic v2\n if not triggers:\n raise ValueError(\"No trigger provided\")\n if len(params) > 1:\n raise ValueError(\"Multiple triggers not allowed.\")\n\n return triggers\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.writer_to_foreachbatch","title":"koheesio.spark.writers.stream.writer_to_foreachbatch","text":"writer_to_foreachbatch(writer: Writer)\n
Call writer.execute
on each batch
To be passed as batch_function for StreamWriter (sub)classes.
Example Source code in src/koheesio/spark/writers/stream.py
def writer_to_foreachbatch(writer: Writer):\n \"\"\"Call `writer.execute` on each batch\n\n To be passed as batch_function for StreamWriter (sub)classes.\n\n Example\n -------\n ### Writing to a Delta table and a Snowflake table\n ```python\n DeltaTableStreamWriter(\n table=\"my_table\",\n checkpointLocation=\"my_checkpointlocation\",\n batch_function=writer_to_foreachbatch(\n SnowflakeWriter(\n **sfOptions,\n table=\"snowflake_table\",\n insert_type=SnowflakeWriter.InsertType.APPEND,\n )\n ),\n )\n ```\n \"\"\"\n\n def inner(df, batch_id: int):\n \"\"\"Inner method\n\n As per the Spark documentation:\n In every micro-batch, the provided function will be called in every micro-batch with (i) the output rows as a\n DataFrame and (ii) the batch identifier. The batchId can be used deduplicate and transactionally write the\n output (that is, the provided Dataset) to external systems. The output DataFrame is guaranteed to exactly\n same for the same batchId (assuming all operations are deterministic in the query).\n \"\"\"\n writer.log.debug(f\"Running batch function for batch {batch_id}\")\n writer.write(df)\n\n return inner\n
"},{"location":"api_reference/spark/writers/stream.html#koheesio.spark.writers.stream.writer_to_foreachbatch--writing-to-a-delta-table-and-a-snowflake-table","title":"Writing to a Delta table and a Snowflake table","text":"DeltaTableStreamWriter(\n table=\"my_table\",\n checkpointLocation=\"my_checkpointlocation\",\n batch_function=writer_to_foreachbatch(\n SnowflakeWriter(\n **sfOptions,\n table=\"snowflake_table\",\n insert_type=SnowflakeWriter.InsertType.APPEND,\n )\n ),\n)\n
"},{"location":"api_reference/spark/writers/delta/index.html","title":"Delta","text":"This module is the entry point for the koheesio.spark.writers.delta package.
It imports and exposes the DeltaTableWriter and DeltaTableStreamWriter classes for external use.
Classes: DeltaTableWriter: Class to write data in batch mode to a Delta table. DeltaTableStreamWriter: Class to write data in streaming mode to a Delta table.
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode","title":"koheesio.spark.writers.delta.BatchOutputMode","text":"For Batch:
- append: Append the contents of the DataFrame to the output table, default option in Koheesio.
- overwrite: overwrite the existing data.
- ignore: ignore the operation (i.e. no-op).
- error or errorifexists: throw an exception at runtime.
- merge: update matching data in the table and insert rows that do not exist.
- merge_all: update matching data in the table and insert rows that do not exist.
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.APPEND","title":"APPEND class-attribute
instance-attribute
","text":"APPEND = 'append'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.ERROR","title":"ERROR class-attribute
instance-attribute
","text":"ERROR = 'error'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.ERRORIFEXISTS","title":"ERRORIFEXISTS class-attribute
instance-attribute
","text":"ERRORIFEXISTS = 'error'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.IGNORE","title":"IGNORE class-attribute
instance-attribute
","text":"IGNORE = 'ignore'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.MERGE","title":"MERGE class-attribute
instance-attribute
","text":"MERGE = 'merge'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.MERGEALL","title":"MERGEALL class-attribute
instance-attribute
","text":"MERGEALL = 'merge_all'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.MERGE_ALL","title":"MERGE_ALL class-attribute
instance-attribute
","text":"MERGE_ALL = 'merge_all'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.BatchOutputMode.OVERWRITE","title":"OVERWRITE class-attribute
instance-attribute
","text":"OVERWRITE = 'overwrite'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter","title":"koheesio.spark.writers.delta.DeltaTableStreamWriter","text":"Delta table stream writer
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter.Options","title":"Options","text":"Options for DeltaTableStreamWriter
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter.Options.allow_population_by_field_name","title":"allow_population_by_field_name class-attribute
instance-attribute
","text":"allow_population_by_field_name: bool = Field(default=True, description=' To do convert to Field and pass as .options(**config)')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter.Options.maxBytesPerTrigger","title":"maxBytesPerTrigger class-attribute
instance-attribute
","text":"maxBytesPerTrigger: Optional[str] = Field(default=None, description='How much data to be processed per trigger. The default is 1GB')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter.Options.maxFilesPerTrigger","title":"maxFilesPerTrigger class-attribute
instance-attribute
","text":"maxFilesPerTrigger: int = Field(default == 1000, description='The maximum number of new files to be considered in every trigger (default: 1000).')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableStreamWriter.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/writers/delta/stream.py
def execute(self):\n if self.batch_function:\n self.streaming_query = self.writer.start()\n else:\n self.streaming_query = self.writer.toTable(tableName=self.table.table_name)\n
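A short usage sketch for DeltaTableStreamWriter; the table name and checkpoint path are placeholders, and `streaming_df` is assumed to be a streaming DataFrame:

```python
from koheesio.spark.writers.delta import DeltaTableStreamWriter

writer = DeltaTableStreamWriter(
    table="my_catalog.my_schema.my_table",
    checkpointLocation="s3://bucket/checkpoints/my_table",
)
writer.write(streaming_df)

writer.await_termination(timeout=60)  # optionally block on the streaming query
```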
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter","title":"koheesio.spark.writers.delta.DeltaTableWriter","text":"Delta table Writer for both batch and streaming dataframes.
Example Parameters:
Name Type Description Default table
Union[DeltaTableStep, str]
The table to write to
required output_mode
Optional[Union[str, BatchOutputMode, StreamingOutputMode]]
The output mode to use. Default is BatchOutputMode.APPEND. For streaming, use StreamingOutputMode.
required params
Optional[dict]
Additional parameters to use for specific mode
required"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter--example-for-mergeall","title":"Example for MERGEALL
","text":"DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.MERGEALL,\n output_mode_params={\n \"merge_cond\": \"target.id=source.id\",\n \"update_cond\": \"target.col1_val>=source.col1_val\",\n \"insert_cond\": \"source.col_bk IS NOT NULL\",\n \"target_alias\": \"target\", # <------ DEFAULT, can be changed by providing custom value\n \"source_alias\": \"source\", # <------ DEFAULT, can be changed by providing custom value\n },\n)\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter--example-for-merge","title":"Example for MERGE
","text":"DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.MERGE,\n output_mode_params={\n 'merge_builder': (\n DeltaTable\n .forName(sparkSession=spark, tableOrViewName=<target_table_name>)\n .alias(target_alias)\n .merge(source=df, condition=merge_cond)\n .whenMatchedUpdateAll(condition=update_cond)\n .whenNotMatchedInsertAll(condition=insert_cond)\n )\n }\n )\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter--example-for-merge_1","title":"Example for MERGE
","text":"in case the table isn't created yet, first run will execute an APPEND operation
DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.MERGE,\n output_mode_params={\n \"merge_builder\": [\n {\n \"clause\": \"whenMatchedUpdate\",\n \"set\": {\"value\": \"source.value\"},\n \"condition\": \"<update_condition>\",\n },\n {\n \"clause\": \"whenNotMatchedInsert\",\n \"values\": {\"id\": \"source.id\", \"value\": \"source.value\"},\n \"condition\": \"<insert_condition>\",\n },\n ],\n \"merge_cond\": \"<merge_condition>\",\n },\n)\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter--example-for-append","title":"Example for APPEND","text":"dataframe writer options can be passed as keyword arguments
DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.APPEND,\n partitionOverwriteMode=\"dynamic\",\n mergeSchema=\"false\",\n)\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.format","title":"format class-attribute
instance-attribute
","text":"format: str = 'delta'\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.output_mode","title":"output_mode class-attribute
instance-attribute
","text":"output_mode: Optional[Union[BatchOutputMode, StreamingOutputMode]] = Field(default=APPEND, alias='outputMode', description=f'{__doc__}\n{__doc__}')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.params","title":"params class-attribute
instance-attribute
","text":"params: Optional[dict] = Field(default_factory=dict, alias='output_mode_params', description='Additional parameters to use for specific mode')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.partition_by","title":"partition_by class-attribute
instance-attribute
","text":"partition_by: Optional[List[str]] = Field(default=None, alias='partitionBy', description='The list of fields to partition the Delta table on')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.table","title":"table class-attribute
instance-attribute
","text":"table: Union[DeltaTableStep, str] = Field(default=..., description='The table to write to')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.writer","title":"writer property
","text":"writer: Union[DeltaMergeBuilder, DataFrameWriter]\n
Specify DeltaTableWriter
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/writers/delta/batch.py
def execute(self):\n _writer = self.writer\n\n if self.table.create_if_not_exists and not self.table.exists:\n _writer = _writer.options(**self.table.default_create_properties)\n\n if isinstance(_writer, DeltaMergeBuilder):\n _writer.execute()\n else:\n if options := self.params:\n # should we add options only if mode is not merge?\n _writer = _writer.options(**options)\n _writer.saveAsTable(self.table.table_name)\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.DeltaTableWriter.get_output_mode","title":"get_output_mode classmethod
","text":"get_output_mode(choice: str, options: Set[Type]) -> Union[BatchOutputMode, StreamingOutputMode]\n
Retrieve an OutputMode by validating choice
against a set of option OutputModes.
Currently supported output modes can be found in:
- BatchOutputMode
- StreamingOutputMode
Source code in src/koheesio/spark/writers/delta/batch.py
@classmethod\ndef get_output_mode(cls, choice: str, options: Set[Type]) -> Union[BatchOutputMode, StreamingOutputMode]:\n \"\"\"Retrieve an OutputMode by validating `choice` against a set of option OutputModes.\n\n Currently supported output modes can be found in:\n\n - BatchOutputMode\n - StreamingOutputMode\n \"\"\"\n for enum_type in options:\n if choice.upper() in [om.value.upper() for om in enum_type]:\n return getattr(enum_type, choice.upper())\n raise AttributeError(\n f\"\"\"\n Invalid outputMode specified '{choice}'. Allowed values are:\n Batch Mode - {BatchOutputMode.__doc__}\n Streaming Mode - {StreamingOutputMode.__doc__}\n \"\"\"\n )\n
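For example, a user-supplied string can be validated against both enums; note that the import paths below are assumptions based on the module locations in this reference:

```python
from koheesio.spark.writers import StreamingOutputMode  # assumed location
from koheesio.spark.writers.delta import BatchOutputMode, DeltaTableWriter

# 'append' resolves to BatchOutputMode.APPEND; an unknown value raises AttributeError
mode = DeltaTableWriter.get_output_mode("append", {BatchOutputMode, StreamingOutputMode})
```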
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter","title":"koheesio.spark.writers.delta.SCD2DeltaTableWriter","text":"A class used to write Slowly Changing Dimension (SCD) Type 2 data to a Delta table.
Attributes:
Name Type Description table
InstanceOf[DeltaTableStep]
The table to merge to.
merge_key
str
The key used for merging data.
include_columns
List[str]
Columns to be merged. Will be selected from DataFrame. Default is all columns.
exclude_columns
List[str]
Columns to be excluded from DataFrame.
scd2_columns
List[str]
List of attributes for SCD2 type (track changes).
scd2_timestamp_col
Optional[Column]
Timestamp column for SCD2 type (track changes). Default to current_timestamp.
scd1_columns
List[str]
List of attributes for SCD1 type (just update).
meta_scd2_struct_col_name
str
SCD2 struct name.
meta_scd2_effective_time_col_name
str
Effective col name.
meta_scd2_is_current_col_name
str
Current col name.
meta_scd2_end_time_col_name
str
End time col name.
target_auto_generated_columns
List[str]
Auto generated columns from target Delta table. Will be used to exclude from merge logic.
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.exclude_columns","title":"exclude_columns class-attribute
instance-attribute
","text":"exclude_columns: List[str] = Field(default_factory=list, description='Columns to be excluded from DataFrame')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.include_columns","title":"include_columns class-attribute
instance-attribute
","text":"include_columns: List[str] = Field(default_factory=list, description='Columns to be merged. Will be selected from DataFrame.Default is all columns')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.merge_key","title":"merge_key class-attribute
instance-attribute
","text":"merge_key: str = Field(..., description='Merge key')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.meta_scd2_effective_time_col_name","title":"meta_scd2_effective_time_col_name class-attribute
instance-attribute
","text":"meta_scd2_effective_time_col_name: str = Field(default='effective_time', description='Effective col name')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.meta_scd2_end_time_col_name","title":"meta_scd2_end_time_col_name class-attribute
instance-attribute
","text":"meta_scd2_end_time_col_name: str = Field(default='end_time', description='End time col name')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.meta_scd2_is_current_col_name","title":"meta_scd2_is_current_col_name class-attribute
instance-attribute
","text":"meta_scd2_is_current_col_name: str = Field(default='is_current', description='Current col name')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.meta_scd2_struct_col_name","title":"meta_scd2_struct_col_name class-attribute
instance-attribute
","text":"meta_scd2_struct_col_name: str = Field(default='_scd2', description='SCD2 struct name')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.scd1_columns","title":"scd1_columns class-attribute
instance-attribute
","text":"scd1_columns: List[str] = Field(default_factory=list, description='List of attributes for scd1 type (just update)')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.scd2_columns","title":"scd2_columns class-attribute
instance-attribute
","text":"scd2_columns: List[str] = Field(default_factory=list, description='List of attributes for scd2 type (track changes)')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.scd2_timestamp_col","title":"scd2_timestamp_col class-attribute
instance-attribute
","text":"scd2_timestamp_col: Optional[Column] = Field(default=None, description='Timestamp column for SCD2 type (track changes). Default to current_timestamp')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.table","title":"table class-attribute
instance-attribute
","text":"table: InstanceOf[DeltaTableStep] = Field(..., description='The table to merge to')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.target_auto_generated_columns","title":"target_auto_generated_columns class-attribute
instance-attribute
","text":"target_auto_generated_columns: List[str] = Field(default_factory=list, description='Auto generated columns from target Delta table. Will be used to exclude from merge logic')\n
"},{"location":"api_reference/spark/writers/delta/index.html#koheesio.spark.writers.delta.SCD2DeltaTableWriter.execute","title":"execute","text":"execute() -> None\n
Execute the SCD Type 2 operation.
This method executes the SCD Type 2 operation on the DataFrame. It validates the existing Delta table, prepares the merge conditions, stages the data, and then performs the merge operation.
Raises:
Type Description TypeError
If the scd2_timestamp_col is not of date or timestamp type. If the source DataFrame is missing any of the required merge columns.
Source code in src/koheesio/spark/writers/delta/scd.py
def execute(self) -> None:\n \"\"\"\n Execute the SCD Type 2 operation.\n\n This method executes the SCD Type 2 operation on the DataFrame.\n It validates the existing Delta table, prepares the merge conditions, stages the data,\n and then performs the merge operation.\n\n Raises\n ------\n TypeError\n If the scd2_timestamp_col is not of date or timestamp type.\n If the source DataFrame is missing any of the required merge columns.\n\n \"\"\"\n self.df: DataFrame\n self.spark: SparkSession\n delta_table = DeltaTable.forName(sparkSession=self.spark, tableOrViewName=self.table.table_name)\n src_alias, cross_alias, dest_alias = \"src\", \"cross\", \"tgt\"\n\n # Prepare required merge columns\n required_merge_columns = [self.merge_key]\n\n if self.scd2_columns:\n required_merge_columns += self.scd2_columns\n\n if self.scd1_columns:\n required_merge_columns += self.scd1_columns\n\n if not all(c in self.df.columns for c in required_merge_columns):\n missing_columns = [c for c in required_merge_columns if c not in self.df.columns]\n raise TypeError(f\"The source DataFrame is missing the columns: {missing_columns!r}\")\n\n # Check that required columns are present in the source DataFrame\n if self.scd2_timestamp_col is not None:\n timestamp_col_type = self.df.select(self.scd2_timestamp_col).schema.fields[0].dataType\n\n if not isinstance(timestamp_col_type, (DateType, TimestampType)):\n raise TypeError(\n f\"The scd2_timestamp_col '{self.scd2_timestamp_col}' must be of date \"\n f\"or timestamp type.Current type is {timestamp_col_type}\"\n )\n\n # Prepare columns to process\n include_columns = self.include_columns if self.include_columns else self.df.columns\n exclude_columns = self.exclude_columns\n columns_to_process = [c for c in include_columns if c not in exclude_columns]\n\n # Constructing column names for SCD2 attributes\n meta_scd2_is_current_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_is_current_col_name}\"\n meta_scd2_effective_time_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_effective_time_col_name}\"\n meta_scd2_end_time_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_end_time_col_name}\"\n\n # Constructing system merge action logic\n system_merge_action = f\"CASE WHEN tgt.{self.merge_key} is NULL THEN 'I' \"\n\n if updates_attrs_scd2 := self._prepare_attr_clause(\n attrs=self.scd2_columns, src_alias=src_alias, dest_alias=dest_alias\n ):\n system_merge_action += f\" WHEN {updates_attrs_scd2} THEN 'UC' \"\n\n if updates_attrs_scd1 := self._prepare_attr_clause(\n attrs=self.scd1_columns, src_alias=src_alias, dest_alias=dest_alias\n ):\n system_merge_action += f\" WHEN {updates_attrs_scd1} THEN 'U' \"\n\n system_merge_action += \" ELSE NULL END\"\n\n # Prepare the staged DataFrame\n staged = (\n self.df.withColumn(\n \"__meta_scd2_timestamp\",\n self._scd2_timestamp(scd2_timestamp_col=self.scd2_timestamp_col, spark=self.spark),\n )\n .transform(\n func=self._prepare_staging,\n delta_table=delta_table,\n merge_action_logic=F.expr(system_merge_action),\n meta_scd2_is_current_col=meta_scd2_is_current_col,\n columns_to_process=columns_to_process,\n src_alias=src_alias,\n dest_alias=dest_alias,\n cross_alias=cross_alias,\n )\n .transform(\n func=self._preserve_existing_target_values,\n meta_scd2_struct_col_name=self.meta_scd2_struct_col_name,\n target_auto_generated_columns=self.target_auto_generated_columns,\n src_alias=src_alias,\n cross_alias=cross_alias,\n dest_alias=dest_alias,\n logger=self.log,\n )\n .withColumn(\"__meta_scd2_end_time\", 
self._scd2_end_time(meta_scd2_end_time_col=meta_scd2_end_time_col))\n .withColumn(\"__meta_scd2_is_current\", self._scd2_is_current())\n .withColumn(\n \"__meta_scd2_effective_time\",\n self._scd2_effective_time(meta_scd2_effective_time_col=meta_scd2_effective_time_col),\n )\n .transform(\n func=self._add_scd2_columns,\n meta_scd2_struct_col_name=self.meta_scd2_struct_col_name,\n meta_scd2_effective_time_col_name=self.meta_scd2_effective_time_col_name,\n meta_scd2_end_time_col_name=self.meta_scd2_end_time_col_name,\n meta_scd2_is_current_col_name=self.meta_scd2_is_current_col_name,\n )\n )\n\n self._prepare_merge_builder(\n delta_table=delta_table,\n dest_alias=dest_alias,\n staged=staged,\n merge_key=self.merge_key,\n columns_to_process=columns_to_process,\n meta_scd2_effective_time_col=meta_scd2_effective_time_col,\n ).execute()\n
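Finally, a hedged usage sketch for SCD2DeltaTableWriter; the DeltaTableStep import path, table name and column names are assumptions for illustration, and `df` is assumed to carry the merge key plus the tracked columns:

```python
from koheesio.spark.delta import DeltaTableStep  # assumed location
from koheesio.spark.writers.delta import SCD2DeltaTableWriter

writer = SCD2DeltaTableWriter(
    table=DeltaTableStep(table="my_catalog.my_schema.dim_customer"),
    merge_key="customer_id",
    scd2_columns=["address", "segment"],  # changes are tracked as new versions
    scd1_columns=["email"],               # changes are updated in place
)
writer.write(df)
```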
"},{"location":"api_reference/spark/writers/delta/batch.html","title":"Batch","text":"This module defines the DeltaTableWriter class, which is used to write both batch and streaming dataframes to Delta tables.
DeltaTableWriter supports two output modes: MERGEALL
and MERGE
.
- The
MERGEALL
mode merges all incoming data with existing data in the table based on certain conditions. - The
MERGE
mode allows for more custom merging behavior using the DeltaMergeBuilder class from the delta.tables
library.
The output_mode_params
dictionary is used to specify conditions for merging, updating, and inserting data. The target_alias
and source_alias
keys are used to specify the aliases for the target and source dataframes in the merge conditions.
Classes:
Name Description DeltaTableWriter
A class for writing data to Delta tables.
DeltaTableStreamWriter
A class for writing streaming data to Delta tables.
Example DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.MERGEALL,\n output_mode_params={\n \"merge_cond\": \"target.id=source.id\",\n \"update_cond\": \"target.col1_val>=source.col1_val\",\n \"insert_cond\": \"source.col_bk IS NOT NULL\",\n },\n)\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter","title":"koheesio.spark.writers.delta.batch.DeltaTableWriter","text":"Delta table Writer for both batch and streaming dataframes.
Example Parameters:
Name Type Description Default table
Union[DeltaTableStep, str]
The table to write to
required output_mode
Optional[Union[str, BatchOutputMode, StreamingOutputMode]]
The output mode to use. Default is BatchOutputMode.APPEND. For streaming, use StreamingOutputMode.
required params
Optional[dict]
Additional parameters to use for specific mode
required"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter--example-for-mergeall","title":"Example for MERGEALL
","text":"DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.MERGEALL,\n output_mode_params={\n \"merge_cond\": \"target.id=source.id\",\n \"update_cond\": \"target.col1_val>=source.col1_val\",\n \"insert_cond\": \"source.col_bk IS NOT NULL\",\n \"target_alias\": \"target\", # <------ DEFAULT, can be changed by providing custom value\n \"source_alias\": \"source\", # <------ DEFAULT, can be changed by providing custom value\n },\n)\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter--example-for-merge","title":"Example for MERGE
","text":"DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.MERGE,\n output_mode_params={\n 'merge_builder': (\n DeltaTable\n .forName(sparkSession=spark, tableOrViewName=<target_table_name>)\n .alias(target_alias)\n .merge(source=df, condition=merge_cond)\n .whenMatchedUpdateAll(condition=update_cond)\n .whenNotMatchedInsertAll(condition=insert_cond)\n )\n }\n )\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter--example-for-merge_1","title":"Example for MERGE
","text":"in case the table isn't created yet, first run will execute an APPEND operation
DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.MERGE,\n output_mode_params={\n \"merge_builder\": [\n {\n \"clause\": \"whenMatchedUpdate\",\n \"set\": {\"value\": \"source.value\"},\n \"condition\": \"<update_condition>\",\n },\n {\n \"clause\": \"whenNotMatchedInsert\",\n \"values\": {\"id\": \"source.id\", \"value\": \"source.value\"},\n \"condition\": \"<insert_condition>\",\n },\n ],\n \"merge_cond\": \"<merge_condition>\",\n },\n)\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter--example-for-append","title":"Example for APPEND","text":"dataframe writer options can be passed as keyword arguments
DeltaTableWriter(\n table=\"test_table\",\n output_mode=BatchOutputMode.APPEND,\n partitionOverwriteMode=\"dynamic\",\n mergeSchema=\"false\",\n)\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.format","title":"format class-attribute
instance-attribute
","text":"format: str = 'delta'\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.output_mode","title":"output_mode class-attribute
instance-attribute
","text":"output_mode: Optional[Union[BatchOutputMode, StreamingOutputMode]] = Field(default=APPEND, alias='outputMode', description=f'{__doc__}\n{__doc__}')\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.params","title":"params class-attribute
instance-attribute
","text":"params: Optional[dict] = Field(default_factory=dict, alias='output_mode_params', description='Additional parameters to use for specific mode')\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.partition_by","title":"partition_by class-attribute
instance-attribute
","text":"partition_by: Optional[List[str]] = Field(default=None, alias='partitionBy', description='The list of fields to partition the Delta table on')\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.table","title":"table class-attribute
instance-attribute
","text":"table: Union[DeltaTableStep, str] = Field(default=..., description='The table to write to')\n
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.writer","title":"writer property
","text":"writer: Union[DeltaMergeBuilder, DataFrameWriter]\n
The writer to use for this step: a DeltaMergeBuilder when a merge output mode is configured, otherwise a regular Spark DataFrameWriter.
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/writers/delta/batch.py
def execute(self):\n _writer = self.writer\n\n if self.table.create_if_not_exists and not self.table.exists:\n _writer = _writer.options(**self.table.default_create_properties)\n\n if isinstance(_writer, DeltaMergeBuilder):\n _writer.execute()\n else:\n if options := self.params:\n # should we add options only if mode is not merge?\n _writer = _writer.options(**options)\n _writer.saveAsTable(self.table.table_name)\n
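A minimal usage sketch (assuming, as with the other Koheesio writers, that the source DataFrame is passed in via a df field): from koheesio.spark.writers.delta.batch import DeltaTableWriter\n\n# df is an existing Spark DataFrame; output_mode defaults to APPEND\nDeltaTableWriter(df=df, table=\"test_table\").execute()\n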
"},{"location":"api_reference/spark/writers/delta/batch.html#koheesio.spark.writers.delta.batch.DeltaTableWriter.get_output_mode","title":"get_output_mode classmethod
","text":"get_output_mode(choice: str, options: Set[Type]) -> Union[BatchOutputMode, StreamingOutputMode]\n
Retrieve an OutputMode by validating choice
against a set of option OutputModes.
Currently supported output modes can be found in:
- BatchOutputMode
- StreamingOutputMode
Source code in src/koheesio/spark/writers/delta/batch.py
@classmethod\ndef get_output_mode(cls, choice: str, options: Set[Type]) -> Union[BatchOutputMode, StreamingOutputMode]:\n \"\"\"Retrieve an OutputMode by validating `choice` against a set of option OutputModes.\n\n Currently supported output modes can be found in:\n\n - BatchOutputMode\n - StreamingOutputMode\n \"\"\"\n for enum_type in options:\n if choice.upper() in [om.value.upper() for om in enum_type]:\n return getattr(enum_type, choice.upper())\n raise AttributeError(\n f\"\"\"\n Invalid outputMode specified '{choice}'. Allowed values are:\n Batch Mode - {BatchOutputMode.__doc__}\n Streaming Mode - {StreamingOutputMode.__doc__}\n \"\"\"\n )\n
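For illustration, a hedged sketch of calling this classmethod directly (assuming BatchOutputMode has an APPEND member, as used elsewhere on this page): mode = DeltaTableWriter.get_output_mode(\"append\", options={BatchOutputMode, StreamingOutputMode})\n# mode is now BatchOutputMode.APPEND; an unknown choice raises AttributeError\n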
"},{"location":"api_reference/spark/writers/delta/scd.html","title":"Scd","text":"This module defines writers to write Slowly Changing Dimension (SCD) Type 2 data to a Delta table.
Slowly Changing Dimension (SCD) is a technique used in data warehousing to handle changes to dimension data over time. SCD Type 2 is one of the most common types of SCD, where historical changes are tracked by creating new records for each change.
Koheesio is a powerful data processing framework that provides advanced capabilities for working with Delta tables in Apache Spark. It offers a convenient and efficient way to handle SCD Type 2 operations on Delta tables.
To learn more about Slowly Changing Dimension and SCD Type 2, you can refer to the following resources: - Slowly Changing Dimension (SCD) - Wikipedia
By using Koheesio, you can benefit from its efficient merge logic, support for SCD Type 2 and SCD Type 1 attributes, and seamless integration with Delta tables in Spark.
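As a rough, illustrative sketch of the SCD2 writer documented below (column names are placeholders; the DeltaTableStep construction and the df input are assumptions based on the attributes listed further down): from koheesio.spark.writers.delta.scd import SCD2DeltaTableWriter\n\nSCD2DeltaTableWriter(\n    table=DeltaTableStep(table=\"customer_dim\"),  # assumed DeltaTableStep wiring\n    merge_key=\"customer_id\",\n    scd2_columns=[\"address\", \"email\"],  # tracked: changes create a new history record\n    scd1_columns=[\"phone\"],  # not tracked: changes simply overwrite the value\n    df=source_df,  # source DataFrame (assumed inherited writer input)\n).execute()\n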
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter","title":"koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter","text":"A class used to write Slowly Changing Dimension (SCD) Type 2 data to a Delta table.
Attributes:
Name Type Description table
InstanceOf[DeltaTableStep]
The table to merge to.
merge_key
str
The key used for merging data.
include_columns
List[str]
Columns to be merged. Will be selected from DataFrame. Default is all columns.
exclude_columns
List[str]
Columns to be excluded from DataFrame.
scd2_columns
List[str]
List of attributes for SCD2 type (track changes).
scd2_timestamp_col
Optional[Column]
Timestamp column for SCD2 type (track changes). Default to current_timestamp.
scd1_columns
List[str]
List of attributes for SCD1 type (just update).
meta_scd2_struct_col_name
str
SCD2 struct name.
meta_scd2_effective_time_col_name
str
Effective col name.
meta_scd2_is_current_col_name
str
Current col name.
meta_scd2_end_time_col_name
str
End time col name.
target_auto_generated_columns
List[str]
Auto generated columns from target Delta table. Will be used to exclude from merge logic.
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.exclude_columns","title":"exclude_columns class-attribute
instance-attribute
","text":"exclude_columns: List[str] = Field(default_factory=list, description='Columns to be excluded from DataFrame')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.include_columns","title":"include_columns class-attribute
instance-attribute
","text":"include_columns: List[str] = Field(default_factory=list, description='Columns to be merged. Will be selected from DataFrame.Default is all columns')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.merge_key","title":"merge_key class-attribute
instance-attribute
","text":"merge_key: str = Field(..., description='Merge key')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.meta_scd2_effective_time_col_name","title":"meta_scd2_effective_time_col_name class-attribute
instance-attribute
","text":"meta_scd2_effective_time_col_name: str = Field(default='effective_time', description='Effective col name')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.meta_scd2_end_time_col_name","title":"meta_scd2_end_time_col_name class-attribute
instance-attribute
","text":"meta_scd2_end_time_col_name: str = Field(default='end_time', description='End time col name')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.meta_scd2_is_current_col_name","title":"meta_scd2_is_current_col_name class-attribute
instance-attribute
","text":"meta_scd2_is_current_col_name: str = Field(default='is_current', description='Current col name')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.meta_scd2_struct_col_name","title":"meta_scd2_struct_col_name class-attribute
instance-attribute
","text":"meta_scd2_struct_col_name: str = Field(default='_scd2', description='SCD2 struct name')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.scd1_columns","title":"scd1_columns class-attribute
instance-attribute
","text":"scd1_columns: List[str] = Field(default_factory=list, description='List of attributes for scd1 type (just update)')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.scd2_columns","title":"scd2_columns class-attribute
instance-attribute
","text":"scd2_columns: List[str] = Field(default_factory=list, description='List of attributes for scd2 type (track changes)')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.scd2_timestamp_col","title":"scd2_timestamp_col class-attribute
instance-attribute
","text":"scd2_timestamp_col: Optional[Column] = Field(default=None, description='Timestamp column for SCD2 type (track changes). Default to current_timestamp')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.table","title":"table class-attribute
instance-attribute
","text":"table: InstanceOf[DeltaTableStep] = Field(..., description='The table to merge to')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.target_auto_generated_columns","title":"target_auto_generated_columns class-attribute
instance-attribute
","text":"target_auto_generated_columns: List[str] = Field(default_factory=list, description='Auto generated columns from target Delta table. Will be used to exclude from merge logic')\n
"},{"location":"api_reference/spark/writers/delta/scd.html#koheesio.spark.writers.delta.scd.SCD2DeltaTableWriter.execute","title":"execute","text":"execute() -> None\n
Execute the SCD Type 2 operation.
This method executes the SCD Type 2 operation on the DataFrame. It validates the existing Delta table, prepares the merge conditions, stages the data, and then performs the merge operation.
Raises:
Type Description TypeError
If the scd2_timestamp_col is not of date or timestamp type. If the source DataFrame is missing any of the required merge columns.
Source code in src/koheesio/spark/writers/delta/scd.py
def execute(self) -> None:\n \"\"\"\n Execute the SCD Type 2 operation.\n\n This method executes the SCD Type 2 operation on the DataFrame.\n It validates the existing Delta table, prepares the merge conditions, stages the data,\n and then performs the merge operation.\n\n Raises\n ------\n TypeError\n If the scd2_timestamp_col is not of date or timestamp type.\n If the source DataFrame is missing any of the required merge columns.\n\n \"\"\"\n self.df: DataFrame\n self.spark: SparkSession\n delta_table = DeltaTable.forName(sparkSession=self.spark, tableOrViewName=self.table.table_name)\n src_alias, cross_alias, dest_alias = \"src\", \"cross\", \"tgt\"\n\n # Prepare required merge columns\n required_merge_columns = [self.merge_key]\n\n if self.scd2_columns:\n required_merge_columns += self.scd2_columns\n\n if self.scd1_columns:\n required_merge_columns += self.scd1_columns\n\n if not all(c in self.df.columns for c in required_merge_columns):\n missing_columns = [c for c in required_merge_columns if c not in self.df.columns]\n raise TypeError(f\"The source DataFrame is missing the columns: {missing_columns!r}\")\n\n # Check that required columns are present in the source DataFrame\n if self.scd2_timestamp_col is not None:\n timestamp_col_type = self.df.select(self.scd2_timestamp_col).schema.fields[0].dataType\n\n if not isinstance(timestamp_col_type, (DateType, TimestampType)):\n raise TypeError(\n f\"The scd2_timestamp_col '{self.scd2_timestamp_col}' must be of date \"\n f\"or timestamp type.Current type is {timestamp_col_type}\"\n )\n\n # Prepare columns to process\n include_columns = self.include_columns if self.include_columns else self.df.columns\n exclude_columns = self.exclude_columns\n columns_to_process = [c for c in include_columns if c not in exclude_columns]\n\n # Constructing column names for SCD2 attributes\n meta_scd2_is_current_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_is_current_col_name}\"\n meta_scd2_effective_time_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_effective_time_col_name}\"\n meta_scd2_end_time_col = f\"{self.meta_scd2_struct_col_name}.{self.meta_scd2_end_time_col_name}\"\n\n # Constructing system merge action logic\n system_merge_action = f\"CASE WHEN tgt.{self.merge_key} is NULL THEN 'I' \"\n\n if updates_attrs_scd2 := self._prepare_attr_clause(\n attrs=self.scd2_columns, src_alias=src_alias, dest_alias=dest_alias\n ):\n system_merge_action += f\" WHEN {updates_attrs_scd2} THEN 'UC' \"\n\n if updates_attrs_scd1 := self._prepare_attr_clause(\n attrs=self.scd1_columns, src_alias=src_alias, dest_alias=dest_alias\n ):\n system_merge_action += f\" WHEN {updates_attrs_scd1} THEN 'U' \"\n\n system_merge_action += \" ELSE NULL END\"\n\n # Prepare the staged DataFrame\n staged = (\n self.df.withColumn(\n \"__meta_scd2_timestamp\",\n self._scd2_timestamp(scd2_timestamp_col=self.scd2_timestamp_col, spark=self.spark),\n )\n .transform(\n func=self._prepare_staging,\n delta_table=delta_table,\n merge_action_logic=F.expr(system_merge_action),\n meta_scd2_is_current_col=meta_scd2_is_current_col,\n columns_to_process=columns_to_process,\n src_alias=src_alias,\n dest_alias=dest_alias,\n cross_alias=cross_alias,\n )\n .transform(\n func=self._preserve_existing_target_values,\n meta_scd2_struct_col_name=self.meta_scd2_struct_col_name,\n target_auto_generated_columns=self.target_auto_generated_columns,\n src_alias=src_alias,\n cross_alias=cross_alias,\n dest_alias=dest_alias,\n logger=self.log,\n )\n .withColumn(\"__meta_scd2_end_time\", 
self._scd2_end_time(meta_scd2_end_time_col=meta_scd2_end_time_col))\n .withColumn(\"__meta_scd2_is_current\", self._scd2_is_current())\n .withColumn(\n \"__meta_scd2_effective_time\",\n self._scd2_effective_time(meta_scd2_effective_time_col=meta_scd2_effective_time_col),\n )\n .transform(\n func=self._add_scd2_columns,\n meta_scd2_struct_col_name=self.meta_scd2_struct_col_name,\n meta_scd2_effective_time_col_name=self.meta_scd2_effective_time_col_name,\n meta_scd2_end_time_col_name=self.meta_scd2_end_time_col_name,\n meta_scd2_is_current_col_name=self.meta_scd2_is_current_col_name,\n )\n )\n\n self._prepare_merge_builder(\n delta_table=delta_table,\n dest_alias=dest_alias,\n staged=staged,\n merge_key=self.merge_key,\n columns_to_process=columns_to_process,\n meta_scd2_effective_time_col=meta_scd2_effective_time_col,\n ).execute()\n
"},{"location":"api_reference/spark/writers/delta/stream.html","title":"Stream","text":"This module defines the DeltaTableStreamWriter class, which is used to write streaming dataframes to Delta tables.
"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter","title":"koheesio.spark.writers.delta.stream.DeltaTableStreamWriter","text":"Delta table stream writer
"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter.Options","title":"Options","text":"Options for DeltaTableStreamWriter
"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter.Options.allow_population_by_field_name","title":"allow_population_by_field_name class-attribute
instance-attribute
","text":"allow_population_by_field_name: bool = Field(default=True, description=' To do convert to Field and pass as .options(**config)')\n
"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter.Options.maxBytesPerTrigger","title":"maxBytesPerTrigger class-attribute
instance-attribute
","text":"maxBytesPerTrigger: Optional[str] = Field(default=None, description='How much data to be processed per trigger. The default is 1GB')\n
"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter.Options.maxFilesPerTrigger","title":"maxFilesPerTrigger class-attribute
instance-attribute
","text":"maxFilesPerTrigger: int = Field(default == 1000, description='The maximum number of new files to be considered in every trigger (default: 1000).')\n
"},{"location":"api_reference/spark/writers/delta/stream.html#koheesio.spark.writers.delta.stream.DeltaTableStreamWriter.execute","title":"execute","text":"execute()\n
Source code in src/koheesio/spark/writers/delta/stream.py
def execute(self):\n if self.batch_function:\n self.streaming_query = self.writer.start()\n else:\n self.streaming_query = self.writer.toTable(tableName=self.table.table_name)\n
"},{"location":"api_reference/spark/writers/delta/utils.html","title":"Utils","text":"This module provides utility functions while working with delta framework.
"},{"location":"api_reference/spark/writers/delta/utils.html#koheesio.spark.writers.delta.utils.log_clauses","title":"koheesio.spark.writers.delta.utils.log_clauses","text":"log_clauses(clauses: JavaObject, source_alias: str, target_alias: str) -> Optional[str]\n
Prepare log message for clauses of DeltaMergePlan statement.
Parameters:
Name Type Description Default clauses
JavaObject
The clauses of the DeltaMergePlan statement.
required source_alias
str
The source alias.
required target_alias
str
The target alias.
required Returns:
Type Description Optional[str]
The log message if there are clauses, otherwise None.
Notes This function prepares a log message for the clauses of a DeltaMergePlan statement. It iterates over the clauses, processes the conditions, and constructs the log message based on the clause type and columns.
If the condition is a value, it replaces the source and target aliases in the condition string. If the condition is None, it sets the condition_clause to \"No conditions required\".
The log message includes the clauses type, the clause type, the columns, and the condition.
Source code in src/koheesio/spark/writers/delta/utils.py
def log_clauses(clauses: JavaObject, source_alias: str, target_alias: str) -> Optional[str]:\n \"\"\"\n Prepare log message for clauses of DeltaMergePlan statement.\n\n Parameters\n ----------\n clauses : JavaObject\n The clauses of the DeltaMergePlan statement.\n source_alias : str\n The source alias.\n target_alias : str\n The target alias.\n\n Returns\n -------\n Optional[str]\n The log message if there are clauses, otherwise None.\n\n Notes\n -----\n This function prepares a log message for the clauses of a DeltaMergePlan statement. It iterates over the clauses,\n processes the conditions, and constructs the log message based on the clause type and columns.\n\n If the condition is a value, it replaces the source and target aliases in the condition string. If the condition is\n None, it sets the condition_clause to \"No conditions required\".\n\n The log message includes the clauses type, the clause type, the columns, and the condition.\n \"\"\"\n log_message = None\n\n if not clauses.isEmpty():\n clauses_type = clauses.last().nodeName().replace(\"DeltaMergeInto\", \"\")\n _processed_clauses = {}\n\n for i in range(0, clauses.length()):\n clause = clauses.apply(i)\n condition = clause.condition()\n\n if \"value\" in dir(condition):\n condition_clause = (\n condition.value()\n .toString()\n .replace(f\"'{source_alias}\", source_alias)\n .replace(f\"'{target_alias}\", target_alias)\n )\n elif condition.toString() == \"None\":\n condition_clause = \"No conditions required\"\n\n clause_type: str = clause.clauseType().capitalize()\n columns = \"ALL\" if clause_type == \"Delete\" else clause.actions().toList().apply(0).toString()\n\n if clause_type.lower() not in _processed_clauses:\n _processed_clauses[clause_type.lower()] = []\n\n log_message = (\n f\"{clauses_type} will perform action:{clause_type} columns ({columns}) if `{condition_clause}`\"\n )\n\n return log_message\n
"},{"location":"api_reference/sso/index.html","title":"Sso","text":""},{"location":"api_reference/sso/okta.html","title":"Okta","text":"This module contains Okta integration steps.
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.LoggerOktaTokenFilter","title":"koheesio.sso.okta.LoggerOktaTokenFilter","text":"LoggerOktaTokenFilter(okta_object: OktaAccessToken, name: str = 'OktaToken')\n
Filter which hides token value from log.
Source code in src/koheesio/sso/okta.py
def __init__(self, okta_object: OktaAccessToken, name: str = \"OktaToken\"):\n self.__okta_object = okta_object\n super().__init__(name=name)\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.LoggerOktaTokenFilter.filter","title":"filter","text":"filter(record)\n
Source code in src/koheesio/sso/okta.py
def filter(self, record):\n # noinspection PyUnresolvedReferences\n if token := self.__okta_object.output.token:\n token_value = token.get_secret_value()\n record.msg = record.msg.replace(token_value, \"<SECRET_TOKEN>\")\n\n return True\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.Okta","title":"koheesio.sso.okta.Okta","text":"Base Okta class
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.Okta.client_id","title":"client_id class-attribute
instance-attribute
","text":"client_id: str = Field(default=..., alias='okta_id', description='Okta account ID')\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.Okta.client_secret","title":"client_secret class-attribute
instance-attribute
","text":"client_secret: SecretStr = Field(default=..., alias='okta_secret', description='Okta account secret', repr=False)\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.Okta.data","title":"data class-attribute
instance-attribute
","text":"data: Optional[Union[Dict[str, str], str]] = Field(default={'grant_type': 'client_credentials'}, description='Data to be sent along with the token request')\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.OktaAccessToken","title":"koheesio.sso.okta.OktaAccessToken","text":"OktaAccessToken(**kwargs)\n
Get Okta authorization token
Example:
token = (\n OktaAccessToken(\n url=\"https://org.okta.com\",\n client_id=\"client\",\n client_secret=SecretStr(\"secret\"),\n params={\n \"p1\": \"foo\",\n \"p2\": \"bar\",\n },\n )\n .execute()\n .token\n)\n
Source code in src/koheesio/sso/okta.py
def __init__(self, **kwargs):\n _logger = LoggingFactory.get_logger(name=self.__class__.__name__, inherit_from_koheesio=True)\n logger_filter = LoggerOktaTokenFilter(okta_object=self)\n _logger.addFilter(logger_filter)\n super().__init__(**kwargs)\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.OktaAccessToken.Output","title":"Output","text":"Output class for OktaAccessToken.
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.OktaAccessToken.Output.token","title":"token class-attribute
instance-attribute
","text":"token: Optional[SecretStr] = Field(default=None, description='Okta authentication token')\n
"},{"location":"api_reference/sso/okta.html#koheesio.sso.okta.OktaAccessToken.execute","title":"execute","text":"execute()\n
Execute an HTTP Post call to Okta service and retrieve the access token.
Source code in src/koheesio/sso/okta.py
def execute(self):\n \"\"\"\n Execute an HTTP Post call to Okta service and retrieve the access token.\n \"\"\"\n HttpPostStep.execute(self)\n\n # noinspection PyUnresolvedReferences\n status_code = self.output.status_code\n # noinspection PyUnresolvedReferences\n raw_payload = self.output.raw_payload\n\n if status_code != 200:\n raise HTTPError(f\"Request failed with '{status_code}' code. Payload: {raw_payload}\")\n\n # noinspection PyUnresolvedReferences\n json_payload = self.output.json_payload\n\n if token := json_payload.get(\"access_token\"):\n self.output.token = SecretStr(token)\n else:\n raise ValueError(f\"No 'access_token' found in the Okta response: {json_payload}\")\n
"},{"location":"api_reference/steps/index.html","title":"Steps","text":"Steps Module
This module contains the definition of the Step
class, which serves as the base class for custom units of logic that can be executed. It also includes the StepOutput
class, which defines the output data model for a Step
.
The Step
class is designed to be subclassed for creating new steps in a data pipeline. Each subclass should implement the execute
method, specifying the expected inputs and outputs.
This module also exports the SparkStep
class for steps that interact with Spark
Classes: - Step: Base class for a custom unit of logic that can be executed.
- StepOutput: Defines the output data model for a
Step
.
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step","title":"koheesio.steps.Step","text":"Base class for a step
A custom unit of logic that can be executed.
The Step class is designed to be subclassed. To create a new step, one would subclass Step and implement the def execute(self)
method, specifying the expected inputs and outputs.
Note: since the Step class is meta classed, the execute method is wrapped with the do_execute
function making it always return the Step's output. Hence, an explicit return is not needed when implementing execute.
Methods and Attributes The Step class has several attributes and methods.
Background A Step is an atomic operation and serves as the building block of data pipelines built with the framework. Tasks typically consist of a series of Steps.
A step can be seen as an operation on a set of inputs, that returns a set of outputs. This however does not imply that steps are stateless (e.g. data writes)!
The diagram serves to illustrate the concept of a Step:
\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 1 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 1 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u2502 \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 2 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Step \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 2 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u2502 \u2502\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Input 3 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 3 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n
Steps are built on top of Pydantic, a data validation and settings management library that uses Python type annotations. This allows for the automatic validation of the inputs and outputs of a Step.
- Step inherits from BaseModel, which is a Pydantic class used to define data models. This allows Step to automatically validate data against the defined fields and their types.
- Step is metaclassed by StepMetaClass, which is a custom metaclass that wraps the
execute
method of the Step class with the _execute_wrapper
function. This ensures that the execute
method always returns the output of the Step along with providing logging and validation of the output. - Step has an
Output
class, which is a subclass of StepOutput
. This class is used to validate the output of the Step. The Output
class is defined as an inner class of the Step class. The Output
class can be accessed through the Step.Output
attribute. - The
Output
class can be extended to add additional fields to the output of the Step.
Examples:
class MyStep(Step):\n a: str # input\n\n class Output(StepOutput): # output\n b: str\n\n def execute(self) -> MyStep.Output:\n self.output.b = f\"{self.a}-some-suffix\"\n
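Running such a step and reading its output could then look like this (a small sketch based on the fields defined above): step = MyStep(a=\"foo\")\nstep.execute()  # or step.run()\nprint(step.output.b)  # foo-some-suffix\n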
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step--input","title":"INPUT","text":"The following fields are available by default on the Step class: - name
: Name of the Step. If not set, the name of the class will be used. - description
: Description of the Step. If not set, the docstring of the class will be used. If the docstring contains multiple lines, only the first line will be used.
When subclassing a Step, any additional pydantic field will be treated as input
to the Step. See also the explanation on the .execute()
method below.
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step--output","title":"OUTPUT","text":"Every Step has an Output
class, which is a subclass of StepOutput
. This class is used to validate the output of the Step. The Output
class is defined as an inner class of the Step class. The Output
class can be accessed through the Step.Output
attribute. The Output
class can be extended to add additional fields to the output of the Step. See also the explanation on the .execute()
.
Output
: A nested class representing the output of the Step used to validate the output of the Step and based on the StepOutput class. output
: Allows you to interact with the Output of the Step lazily (see above and StepOutput)
When subclassing a Step, any additional pydantic field added to the nested Output
class will be treated as output
of the Step. See also the description of StepOutput
for more information.
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step--methods","title":"Methods:","text":" execute
: Abstract method to implement for new steps. - The Inputs of the step can be accessed, using
self.input_name
. - The output of the step can be accessed, using
self.output.output_name
.
run
: Alias to .execute() method. You can use this to run the step, but execute is preferred. to_yaml
: YAML dump the step get_description
: Get the description of the Step
When subclassing a Step, execute
is the only method that needs to be implemented. Any additional method added to the class will be treated as a method of the Step.
Note: since the Step class is meta-classed, the execute method is automatically wrapped with the do_execute
function making it always return a StepOutput. See also the explanation on the do_execute
function.
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step--class-methods","title":"class methods:","text":" from_step
: Returns a new Step instance based on the data of another Step instance. for example: MyStep.from_step(other_step, a=\"foo\")
get_description
: Get the description of the Step
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step--dunder-methods","title":"dunder methods:","text":" __getattr__
: Allows input to be accessed through self.input_name
__repr__
and __str__
: String representation of a step
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.output","title":"output property
writable
","text":"output: Output\n
Interact with the output of the Step
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.Output","title":"Output","text":"Output class for Step
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.execute","title":"execute abstractmethod
","text":"execute()\n
Abstract method to implement for new steps.
The Inputs of the step can be accessed, using self.input_name
Note: since the Step class is meta-classed, the execute method is wrapped with the do_execute
function making it always return the Steps output
Source code in src/koheesio/steps/__init__.py
@abstractmethod\ndef execute(self):\n \"\"\"Abstract method to implement for new steps.\n\n The Inputs of the step can be accessed, using `self.input_name`\n\n Note: since the Step class is meta-classed, the execute method is wrapped with the `do_execute` function making\n it always return the Steps output\n \"\"\"\n raise NotImplementedError\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.from_step","title":"from_step classmethod
","text":"from_step(step: Step, **kwargs)\n
Returns a new Step instance based on the data of another Step or BaseModel instance
Source code in src/koheesio/steps/__init__.py
@classmethod\ndef from_step(cls, step: Step, **kwargs):\n \"\"\"Returns a new Step instance based on the data of another Step or BaseModel instance\"\"\"\n return cls.from_basemodel(step, **kwargs)\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.repr_json","title":"repr_json","text":"repr_json(simple=False) -> str\n
dump the step to json, meant for representation
Note: use to_json if you want to dump the step to json for serialization This method is meant for representation purposes only!
Examples:
>>> step = MyStep(a=\"foo\")\n>>> print(step.repr_json())\n{\"input\": {\"a\": \"foo\"}}\n
Parameters:
Name Type Description Default simple
When toggled to True, a briefer output will be produced. This is friendlier for logging purposes
False
Returns:
Type Description str
A string, which is valid json
Source code in src/koheesio/steps/__init__.py
def repr_json(self, simple=False) -> str:\n \"\"\"dump the step to json, meant for representation\n\n Note: use to_json if you want to dump the step to json for serialization\n This method is meant for representation purposes only!\n\n Examples\n --------\n ```python\n >>> step = MyStep(a=\"foo\")\n >>> print(step.repr_json())\n {\"input\": {\"a\": \"foo\"}}\n ```\n\n Parameters\n ----------\n simple: bool\n When toggled to True, a briefer output will be produced. This is friendlier for logging purposes\n\n Returns\n -------\n str\n A string, which is valid json\n \"\"\"\n model_dump_options = dict(warnings=\"none\", exclude_unset=True)\n\n _result = {}\n\n # extract input\n _input = self.model_dump(**model_dump_options)\n\n # remove name and description from input and add to result if simple is not set\n name = _input.pop(\"name\", None)\n description = _input.pop(\"description\", None)\n if not simple:\n if name:\n _result[\"name\"] = name\n if description:\n _result[\"description\"] = description\n else:\n model_dump_options[\"exclude\"] = {\"name\", \"description\"}\n\n # extract output\n _output = self.output.model_dump(**model_dump_options)\n\n # add output to result\n if _output:\n _result[\"output\"] = _output\n\n # add input to result\n _result[\"input\"] = _input\n\n class MyEncoder(json.JSONEncoder):\n \"\"\"Custom JSON Encoder to handle non-serializable types\"\"\"\n\n def default(self, o: Any) -> Any:\n try:\n return super().default(o)\n except TypeError:\n return o.__class__.__name__\n\n # Use MyEncoder when converting the dictionary to a JSON string\n json_str = json.dumps(_result, cls=MyEncoder)\n\n return json_str\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.repr_yaml","title":"repr_yaml","text":"repr_yaml(simple=False) -> str\n
dump the step to yaml, meant for representation
Note: use to_yaml if you want to dump the step to yaml for serialization This method is meant for representation purposes only!
Examples:
>>> step = MyStep(a=\"foo\")\n>>> print(step.repr_yaml())\ninput:\n a: foo\n
Parameters:
Name Type Description Default simple
When toggled to True, a briefer output will be produced. This is friendlier for logging purposes
False
Returns:
Type Description str
A string, which is valid yaml
Source code in src/koheesio/steps/__init__.py
def repr_yaml(self, simple=False) -> str:\n \"\"\"dump the step to yaml, meant for representation\n\n Note: use to_yaml if you want to dump the step to yaml for serialization\n This method is meant for representation purposes only!\n\n Examples\n --------\n ```python\n >>> step = MyStep(a=\"foo\")\n >>> print(step.repr_yaml())\n input:\n a: foo\n ```\n\n Parameters\n ----------\n simple: bool\n When toggled to True, a briefer output will be produced. This is friendlier for logging purposes\n\n Returns\n -------\n str\n A string, which is valid yaml\n \"\"\"\n json_str = self.repr_json(simple=simple)\n\n # Parse the JSON string back into a dictionary\n _result = json.loads(json_str)\n\n return yaml.dump(_result)\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.Step.run","title":"run","text":"run()\n
Alias to .execute()
Source code in src/koheesio/steps/__init__.py
def run(self):\n \"\"\"Alias to .execute()\"\"\"\n return self.execute()\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.StepMetaClass","title":"koheesio.steps.StepMetaClass","text":"StepMetaClass has to be set up as a Metaclass extending ModelMetaclass to allow Pydantic to be unaffected while allowing for the execute method to be auto-decorated with do_execute
"},{"location":"api_reference/steps/index.html#koheesio.steps.StepOutput","title":"koheesio.steps.StepOutput","text":"Class for the StepOutput model
Usage Setting up the StepOutputs class is done like this:
class YourOwnOutput(StepOutput):\n a: str\n b: int\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.StepOutput.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(validate_default=False, defer_build=True)\n
"},{"location":"api_reference/steps/index.html#koheesio.steps.StepOutput.validate_output","title":"validate_output","text":"validate_output() -> StepOutput\n
Validate the output of the Step
Essentially, this method is a wrapper around the validate method of the BaseModel class
Source code in src/koheesio/steps/__init__.py
def validate_output(self) -> StepOutput:\n \"\"\"Validate the output of the Step\n\n Essentially, this method is a wrapper around the validate method of the BaseModel class\n \"\"\"\n validated_model = self.validate()\n return StepOutput.from_basemodel(validated_model)\n
"},{"location":"api_reference/steps/dummy.html","title":"Dummy","text":"Dummy step for testing purposes.
This module contains a dummy step for testing purposes. It is used to test the Koheesio framework or to provide a simple example of how to create a new step.
Example s = DummyStep(a=\"a\", b=2)\ns.execute()\n
In this case, s.output
will be equivalent to the following dictionary: {\"a\": \"a\", \"b\": 2, \"c\": \"aa\"}\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyOutput","title":"koheesio.steps.dummy.DummyOutput","text":"Dummy output for testing purposes.
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyOutput.a","title":"a instance-attribute
","text":"a: str\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyOutput.b","title":"b instance-attribute
","text":"b: int\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep","title":"koheesio.steps.dummy.DummyStep","text":"Dummy step for testing purposes.
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep.a","title":"a instance-attribute
","text":"a: str\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep.b","title":"b instance-attribute
","text":"b: int\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep.Output","title":"Output","text":"Dummy output for testing purposes.
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep.Output.c","title":"c instance-attribute
","text":"c: str\n
"},{"location":"api_reference/steps/dummy.html#koheesio.steps.dummy.DummyStep.execute","title":"execute","text":"execute()\n
Dummy execute for testing purposes.
Source code in src/koheesio/steps/dummy.py
def execute(self):\n \"\"\"Dummy execute for testing purposes.\"\"\"\n self.output.a = self.a\n self.output.b = self.b\n self.output.c = self.a * self.b\n
"},{"location":"api_reference/steps/http.html","title":"Http","text":"This module contains a few simple HTTP Steps that can be used to perform API Calls to HTTP endpoints
Example from koheesio.steps.http import HttpGetStep\n\nresponse = HttpGetStep(url=\"https://google.com\").execute().json_payload\n
In the above example, the response
variable will contain the JSON response from the HTTP request.
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpDeleteStep","title":"koheesio.steps.http.HttpDeleteStep","text":"send DELETE requests
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpDeleteStep.method","title":"method class-attribute
instance-attribute
","text":"method: HttpMethod = DELETE\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpGetStep","title":"koheesio.steps.http.HttpGetStep","text":"send GET requests
Example response = HttpGetStep(url=\"https://google.com\").execute().json_payload\n
In the above example, the response
variable will contain the JSON response from the HTTP request."},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpGetStep.method","title":"method class-attribute
instance-attribute
","text":"method: HttpMethod = GET\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod","title":"koheesio.steps.http.HttpMethod","text":"Enumeration of allowed http methods
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod.DELETE","title":"DELETE class-attribute
instance-attribute
","text":"DELETE = 'delete'\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod.GET","title":"GET class-attribute
instance-attribute
","text":"GET = 'get'\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod.POST","title":"POST class-attribute
instance-attribute
","text":"POST = 'post'\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod.PUT","title":"PUT class-attribute
instance-attribute
","text":"PUT = 'put'\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpMethod.from_string","title":"from_string classmethod
","text":"from_string(value: str)\n
Allows for getting the right Method Enum by simply passing a string value This method is not case-sensitive
Source code in src/koheesio/steps/http.py
@classmethod\ndef from_string(cls, value: str):\n \"\"\"Allows for getting the right Method Enum by simply passing a string value\n This method is not case-sensitive\n \"\"\"\n return getattr(cls, value.upper())\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpPostStep","title":"koheesio.steps.http.HttpPostStep","text":"send POST requests
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpPostStep.method","title":"method class-attribute
instance-attribute
","text":"method: HttpMethod = POST\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpPutStep","title":"koheesio.steps.http.HttpPutStep","text":"send PUT requests
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpPutStep.method","title":"method class-attribute
instance-attribute
","text":"method: HttpMethod = PUT\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep","title":"koheesio.steps.http.HttpStep","text":"Can be used to perform API Calls to HTTP endpoints
Understanding Retries This class includes a built-in retry mechanism for handling temporary issues, such as network errors or server downtime, that might cause the HTTP request to fail. The retry mechanism is controlled by three parameters: max_retries
, initial_delay
, and backoff
.
-
max_retries
determines the number of retries after the initial request. For example, if max_retries
is set to 4, the request will be attempted a total of 5 times (1 initial attempt + 4 retries). If max_retries
is set to 0, no retries will be attempted, and the request will be tried only once.
-
initial_delay
sets the waiting period before the first retry. If initial_delay
is set to 3, the delay before the first retry will be 3 seconds. Changing the initial_delay
value directly affects the amount of delay before each retry.
-
backoff
controls the rate at which the delay increases for each subsequent retry. If backoff
is set to 2 (the default), the delay will double with each retry. If backoff
is set to 1, the delay between retries will remain constant. Changing the backoff
value affects how quickly the delay increases.
Given the default values of max_retries=3
, initial_delay=2
, and backoff=2
, the delays between retries would be 2 seconds, 4 seconds, and 8 seconds, respectively. This results in a total delay of 14 seconds before all retries are exhausted.
For example, if you set initial_delay=3
and backoff=2
, the delays before the retries would be 3 seconds
, 6 seconds
, and 12 seconds
. If you set initial_delay=2
and backoff=3
, the delays before the retries would be 2 seconds
, 6 seconds
, and 18 seconds
. If you set initial_delay=2
and backoff=1
, the delays before the retries would be 2 seconds
, 2 seconds
, and 2 seconds
.
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.data","title":"data class-attribute
instance-attribute
","text":"data: Optional[Union[Dict[str, str], str]] = Field(default_factory=dict, description='[Optional] Data to be sent along with the request', alias='body')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.headers","title":"headers class-attribute
instance-attribute
","text":"headers: Optional[Dict[str, Union[str, SecretStr]]] = Field(default_factory=dict, description='Request headers', alias='header')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.method","title":"method class-attribute
instance-attribute
","text":"method: Union[str, HttpMethod] = Field(default=GET, description=\"What type of Http call to perform. One of 'get', 'post', 'put', 'delete'. Defaults to 'get'.\")\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.params","title":"params class-attribute
instance-attribute
","text":"params: Optional[Dict[str, Any]] = Field(default_factory=dict, description='[Optional] Set of extra parameters that should be passed to HTTP request')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.session","title":"session class-attribute
instance-attribute
","text":"session: Session = Field(default_factory=Session, description='Requests session object to be used for making HTTP requests', exclude=True, repr=False)\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.timeout","title":"timeout class-attribute
instance-attribute
","text":"timeout: Optional[int] = Field(default=3, description='[Optional] Request timeout')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.url","title":"url class-attribute
instance-attribute
","text":"url: str = Field(default=..., description='API endpoint URL', alias='uri')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output","title":"Output","text":"Output class for HttpStep
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output.json_payload","title":"json_payload property
","text":"json_payload\n
Alias for response_json
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output.raw_payload","title":"raw_payload class-attribute
instance-attribute
","text":"raw_payload: Optional[str] = Field(default=None, alias='response_text', description='The raw response for the request')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output.response_json","title":"response_json class-attribute
instance-attribute
","text":"response_json: Optional[Union[Dict, List]] = Field(default=None, alias='json_payload', description='The JSON response for the request')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output.response_raw","title":"response_raw class-attribute
instance-attribute
","text":"response_raw: Optional[Response] = Field(default=None, alias='response', description='The raw requests.Response object returned by the appropriate requests.request() call')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.Output.status_code","title":"status_code class-attribute
instance-attribute
","text":"status_code: Optional[int] = Field(default=None, description='The status return code of the request')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.decode_sensitive_headers","title":"decode_sensitive_headers","text":"decode_sensitive_headers(headers)\n
Authorization headers are being converted into SecretStr under the hood to avoid dumping any sensitive content into logs by the encode_sensitive_headers
method.
However, when calling the get_headers
method, the SecretStr should be converted back to string, otherwise sensitive info would have looked like '**********'.
This method decodes values of the headers
dictionary that are of type SecretStr into plain text.
Source code in src/koheesio/steps/http.py
@field_serializer(\"headers\", when_used=\"json\")\ndef decode_sensitive_headers(self, headers):\n \"\"\"\n Authorization headers are being converted into SecretStr under the hood to avoid dumping any\n sensitive content into logs by the `encode_sensitive_headers` method.\n\n However, when calling the `get_headers` method, the SecretStr should be converted back to\n string, otherwise sensitive info would have looked like '**********'.\n\n This method decodes values of the `headers` dictionary that are of type SecretStr into plain text.\n \"\"\"\n for k, v in headers.items():\n headers[k] = v.get_secret_value() if isinstance(v, SecretStr) else v\n return headers\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.delete","title":"delete","text":"delete() -> Response\n
Execute an HTTP DELETE call
Source code in src/koheesio/steps/http.py
def delete(self) -> requests.Response:\n \"\"\"Execute an HTTP DELETE call\"\"\"\n self.method = HttpMethod.DELETE\n return self.request()\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.encode_sensitive_headers","title":"encode_sensitive_headers","text":"encode_sensitive_headers(headers)\n
Encode potentially sensitive data into pydantic.SecretStr class to prevent them being displayed as plain text in logs.
Source code in src/koheesio/steps/http.py
@field_validator(\"headers\", mode=\"before\")\ndef encode_sensitive_headers(cls, headers):\n \"\"\"\n Encode potentially sensitive data into pydantic.SecretStr class to prevent them\n being displayed as plain text in logs.\n \"\"\"\n if auth := headers.get(\"Authorization\"):\n headers[\"Authorization\"] = auth if isinstance(auth, SecretStr) else SecretStr(auth)\n return headers\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.execute","title":"execute","text":"execute() -> Output\n
Executes the HTTP request.
This method simply calls self.request()
, which includes the retry logic. If self.request()
raises an exception, it will be propagated to the caller of this method.
Raises:
Type Description (RequestException, HTTPError)
The last exception that was caught if self.request()
fails after self.max_retries
attempts.
Source code in src/koheesio/steps/http.py
def execute(self) -> Output:\n \"\"\"\n Executes the HTTP request.\n\n This method simply calls `self.request()`, which includes the retry logic. If `self.request()` raises an\n exception, it will be propagated to the caller of this method.\n\n Raises\n ------\n requests.RequestException, requests.HTTPError\n The last exception that was caught if `self.request()` fails after `self.max_retries` attempts.\n \"\"\"\n self.request()\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.get","title":"get","text":"get() -> Response\n
Execute an HTTP GET call
Source code in src/koheesio/steps/http.py
def get(self) -> requests.Response:\n \"\"\"Execute an HTTP GET call\"\"\"\n self.method = HttpMethod.GET\n return self.request()\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.get_headers","title":"get_headers","text":"get_headers()\n
Dump headers into JSON without SecretStr masking.
Source code in src/koheesio/steps/http.py
def get_headers(self):\n \"\"\"\n Dump headers into JSON without SecretStr masking.\n \"\"\"\n return json.loads(self.model_dump_json()).get(\"headers\")\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.get_options","title":"get_options","text":"get_options()\n
options to be passed to requests.request()
Source code in src/koheesio/steps/http.py
def get_options(self):\n \"\"\"options to be passed to requests.request()\"\"\"\n return {\n \"url\": self.url,\n \"headers\": self.get_headers(),\n \"data\": self.data,\n \"timeout\": self.timeout,\n **self.params, # type: ignore\n }\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.get_proper_http_method_from_str_value","title":"get_proper_http_method_from_str_value","text":"get_proper_http_method_from_str_value(method_value)\n
Converts string value to HttpMethod enum value
Source code in src/koheesio/steps/http.py
@field_validator(\"method\")\ndef get_proper_http_method_from_str_value(cls, method_value):\n \"\"\"Converts string value to HttpMethod enum value\"\"\"\n if isinstance(method_value, str):\n try:\n method_value = HttpMethod.from_string(method_value)\n except AttributeError as e:\n raise AttributeError(\n \"Only values from HttpMethod class are allowed! \"\n f\"Provided value: '{method_value}', allowed values: {', '.join(HttpMethod.__members__.keys())}\"\n ) from e\n\n return method_value\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.post","title":"post","text":"post() -> Response\n
Execute an HTTP POST call
Source code in src/koheesio/steps/http.py
def post(self) -> requests.Response:\n \"\"\"Execute an HTTP POST call\"\"\"\n self.method = HttpMethod.POST\n return self.request()\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.put","title":"put","text":"put() -> Response\n
Execute an HTTP PUT call
Source code in src/koheesio/steps/http.py
def put(self) -> requests.Response:\n \"\"\"Execute an HTTP PUT call\"\"\"\n self.method = HttpMethod.PUT\n return self.request()\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.request","title":"request","text":"request(method: Optional[HttpMethod] = None) -> Response\n
Executes the HTTP request with retry logic.
Actual http_method execution is abstracted into this method. This is to avoid unnecessary code duplication. Allows to centrally log, set outputs, and validated.
This method will try to execute requests.request
up to self.max_retries
times. If self.request()
raises an exception, it logs a warning message and the error message, then waits for self.initial_delay * (self.backoff ** i)
seconds before retrying. The delay increases exponentially after each failed attempt due to the self.backoff ** i
term.
If self.request()
still fails after self.max_retries
attempts, it logs an error message and re-raises the last exception that was caught.
This is a good way to handle temporary issues that might cause self.request()
to fail, such as network errors or server downtime. The exponential backoff ensures that you're not constantly bombarding a server with requests if it's struggling to respond.
Parameters:
Name Type Description Default method
HttpMethod
Optional parameter that allows calls to different HTTP methods and bypassing class level method
parameter.
None
Raises:
Type Description (RequestException, HTTPError)
The last exception that was caught if requests.request()
fails after self.max_retries
attempts.
Source code in src/koheesio/steps/http.py
def request(self, method: Optional[HttpMethod] = None) -> requests.Response:\n \"\"\"\n Executes the HTTP request with retry logic.\n\n Actual http_method execution is abstracted into this method.\n This is to avoid unnecessary code duplication. Allows to centrally log, set outputs, and validated.\n\n This method will try to execute `requests.request` up to `self.max_retries` times. If `self.request()` raises\n an exception, it logs a warning message and the error message, then waits for\n `self.initial_delay * (self.backoff ** i)` seconds before retrying. The delay increases exponentially\n after each failed attempt due to the `self.backoff ** i` term.\n\n If `self.request()` still fails after `self.max_retries` attempts, it logs an error message and re-raises the\n last exception that was caught.\n\n This is a good way to handle temporary issues that might cause `self.request()` to fail, such as network errors\n or server downtime. The exponential backoff ensures that you're not constantly bombarding a server with\n requests if it's struggling to respond.\n\n Parameters\n ----------\n method : HttpMethod\n Optional parameter that allows calls to different HTTP methods and bypassing class level `method`\n parameter.\n\n Raises\n ------\n requests.RequestException, requests.HTTPError\n The last exception that was caught if `requests.request()` fails after `self.max_retries` attempts.\n \"\"\"\n _method = (method or self.method).value.upper()\n options = self.get_options()\n\n self.log.debug(f\"Making {_method} request to {options['url']} with headers {options['headers']}\")\n\n response = self.session.request(method=_method, **options)\n response.raise_for_status()\n\n self.log.debug(f\"Received response with status code {response.status_code} and body {response.text}\")\n self.set_outputs(response)\n\n return response\n
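The backoff schedule described above works out as follows: with initial_delay=2 and backoff=2, the waits between attempts are 2, 4, 8, ... seconds. As an illustration only, a minimal sketch of such a retry loop (not the library's exact implementation):

```python
import time

import requests


def request_with_retries(do_request, max_retries=3, initial_delay=2, backoff=2):
    """Illustrative retry loop with exponential backoff (a sketch, not Koheesio's exact code)."""
    last_exception = None
    for i in range(max_retries):
        try:
            return do_request()
        except requests.RequestException as e:  # includes requests.HTTPError
            last_exception = e
            delay = initial_delay * (backoff**i)  # 2, 4, 8, ... seconds with the defaults
            print(f"Attempt {i + 1} failed: {e}. Retrying in {delay} seconds...")
            time.sleep(delay)
    # re-raise the last exception once all attempts are exhausted
    raise last_exception


# Hypothetical usage against an illustrative endpoint:
# request_with_retries(lambda: requests.get("https://api.example.com/data", timeout=30))
```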
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.HttpStep.set_outputs","title":"set_outputs","text":"set_outputs(response)\n
Types of response output
Source code in src/koheesio/steps/http.py
def set_outputs(self, response):\n \"\"\"\n Types of response output\n \"\"\"\n self.output.response_raw = response\n self.output.raw_payload = response.text\n self.output.status_code = response.status_code\n\n # Only decode non empty payloads to avoid triggering decoding error unnecessarily.\n if self.output.raw_payload:\n try:\n self.output.response_json = response.json()\n\n except json.decoder.JSONDecodeError as e:\n self.log.info(f\"An error occurred while processing the JSON payload. Error message:\\n{e.msg}\")\n
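Putting the pieces above together, a minimal usage sketch of HttpStep. The endpoint and token are hypothetical; the fields used (method, url, headers) follow from get_options and the method validator shown above, and the remaining fields (data, timeout, params) are assumed to have defaults:

```python
from koheesio.steps.http import HttpStep

step = HttpStep(
    method="get",                                  # coerced to the HttpMethod enum by the validator
    url="https://api.example.com/data",            # hypothetical endpoint
    headers={"Authorization": "Bearer <token>"},   # hypothetical token
)
response = step.request()       # performs the call and populates the outputs via set_outputs
print(step.output.status_code)
print(step.output.response_json)
```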
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep","title":"koheesio.steps.http.PaginatedHtppGetStep","text":"Represents a paginated HTTP GET step.
Parameters:
Name Type Description Default paginate
bool
Whether to paginate the API response. Defaults to False.
required pages
int
Number of pages to paginate. Defaults to 1.
required offset
int
Offset for paginated API calls. Offset determines the starting page. Defaults to 1.
required limit
int
Limit for paginated API calls. Defaults to 100.
required"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.limit","title":"limit class-attribute
instance-attribute
","text":"limit: Optional[int] = Field(default=100, description='Limit for paginated API calls. The url should (optionally) contain a named limit parameter, for example: api.example.com/data?limit={limit}')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.offset","title":"offset class-attribute
instance-attribute
","text":"offset: Optional[int] = Field(default=1, description=\"Offset for paginated API calls. Offset determines the starting page. Defaults to 1. The url can (optionally) contain a named 'offset' parameter, for example: api.example.com/data?offset={offset}\")\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.pages","title":"pages class-attribute
instance-attribute
","text":"pages: Optional[int] = Field(default=1, description='Number of pages to paginate. Defaults to 1')\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.paginate","title":"paginate class-attribute
instance-attribute
","text":"paginate: Optional[bool] = Field(default=False, description=\"Whether to paginate the API response. Defaults to False. When set to True, the API response will be paginated. The url should contain a named 'page' parameter for example: api.example.com/data?page={page}\")\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.execute","title":"execute","text":"execute() -> Output\n
Executes the HTTP GET request and handles pagination.
Returns:
Type Description Output
The output of the HTTP GET request.
Source code in src/koheesio/steps/http.py
def execute(self) -> HttpGetStep.Output:\n \"\"\"\n Executes the HTTP GET request and handles pagination.\n\n Returns\n -------\n HttpGetStep.Output\n The output of the HTTP GET request.\n \"\"\"\n # Set up pagination parameters\n offset, pages = (self.offset, self.pages + 1) if self.paginate else (1, 1) # type: ignore\n data = []\n _basic_url = self.url\n\n for page in range(offset, pages):\n if self.paginate:\n self.log.info(f\"Fetching page {page} of {pages - 1}\")\n\n self.url = self._url(basic_url=_basic_url, page=page)\n self.request()\n\n if isinstance(self.output.response_json, list):\n data += self.output.response_json\n else:\n data.append(self.output.response_json)\n\n self.url = _basic_url\n self.output.response_json = data\n self.output.response_raw = None\n self.output.raw_payload = None\n self.output.status_code = None\n
"},{"location":"api_reference/steps/http.html#koheesio.steps.http.PaginatedHtppGetStep.get_options","title":"get_options","text":"get_options()\n
Returns the options to be passed to the requests.request() function.
Returns:
Type Description dict
The options.
Source code in src/koheesio/steps/http.py
def get_options(self):\n \"\"\"\n Returns the options to be passed to the requests.request() function.\n\n Returns\n -------\n dict\n The options.\n \"\"\"\n options = {\n \"url\": self.url,\n \"headers\": self.get_headers(),\n \"data\": self.data,\n \"timeout\": self.timeout,\n **self._adjust_params(), # type: ignore\n }\n\n return options\n
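For illustration, a minimal usage sketch of PaginatedHtppGetStep against a hypothetical endpoint. The url carries the named page and limit placeholders described by the paginate and limit fields above; how the placeholders are substituted is implied by those field descriptions:

```python
from koheesio.steps.http import PaginatedHtppGetStep

step = PaginatedHtppGetStep(
    url="https://api.example.com/data?page={page}&limit={limit}",  # hypothetical endpoint
    paginate=True,   # substitute {page} for each page in the range
    pages=3,         # fetch pages 1 through 3 (offset .. offset + pages - 1)
    offset=1,        # starting page
    limit=100,       # value substituted into {limit}
)
step.execute()
print(len(step.output.response_json))  # combined results from all fetched pages
```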
"},{"location":"community/approach-documentation.html","title":"Approach documentation","text":"","tags":["doctype/explanation"]},{"location":"community/approach-documentation.html#scope","title":"Scope","text":"","tags":["doctype/explanation"]},{"location":"community/approach-documentation.html#the-system","title":"The System","text":"We will be adopting \"The Documentation System\".
From documentation.divio.com:
There is a secret that needs to be understood in order to write good software documentation: there isn\u2019t one thing called documentation, there are four.
They are: tutorials, how-to guides, technical reference and explanation. They represent four different purposes or functions, and require four different approaches to their creation. Understanding the implications of this will help improve most documentation - often immensely.
About the system The documentation system outlined here is a simple, comprehensive and nearly universally-applicable scheme. It is proven in practice across a wide variety of fields and applications.
There are some very simple principles that govern documentation that are very rarely if ever spelled out. They seem to be a secret, though they shouldn\u2019t be.
If you can put these principles into practice, it will make your documentation better and your project, product or team more successful - that\u2019s a promise.
The system is widely adopted for large and small, open and proprietary documentation projects.
Video Presentation on YouTube:
","tags":["doctype/explanation"]},{"location":"community/contribute.html","title":"Contribute","text":""},{"location":"community/contribute.html#how-to-contribute","title":"How to contribute","text":"There are a few guidelines that we need contributors to follow so that we are able to process requests as efficiently as possible. If you have any questions or concerns please feel free to contact us at opensource@nike.com.
"},{"location":"community/contribute.html#getting-started","title":"Getting Started","text":" - Review our Code of Conduct
- Make sure you have a GitHub account
- Submit a ticket for your issue, assuming one does not already exist.
- Clearly describe the issue including steps to reproduce when it is a bug.
- Make sure you fill in the earliest version that you know has the issue.
- Fork the repository on GitHub
"},{"location":"community/contribute.html#making-changes","title":"Making Changes","text":" - Create a feature branch off of
main
before you start your work. - Please avoid working directly on the
main
branch.
- Setup the required package manager hatch
- Setup the dev environment see below
- Make commits of logical units.
- You may be asked to squash unnecessary commits down to logical units.
- Check for unnecessary whitespace with
git diff --check
before committing. - Write meaningful, descriptive commit messages.
- Please follow existing code conventions when working on a file
- Make sure to check the standards on the code, see below
- Make sure to test the code before you push changes see below
"},{"location":"community/contribute.html#submitting-changes","title":"\ud83e\udd1d Submitting Changes","text":" - Push your changes to a topic branch in your fork of the repository.
- Submit a pull request to the repository in the Nike-Inc organization.
- After feedback has been given we expect responses within two weeks. After two weeks we may close the pull request if it isn't showing any activity.
- Bug fixes or features that lack appropriate tests may not be considered for merge.
- Changes that lower test coverage may not be considered for merge.
"},{"location":"community/contribute.html#make-commands","title":"\ud83d\udd28 Make commands","text":"We use make
for managing different steps of setup and maintenance in the project. You can install make by following the instructions here
For a full list of available make commands, you can run:
make help\n
"},{"location":"community/contribute.html#package-manager","title":"\ud83d\udce6 Package manager","text":"We use hatch
as our package manager.
Note: Please DO NOT use pip or conda to install the dependencies. Instead, use hatch.
To install hatch, run the following command:
make init\n
or,
make hatch-install\n
This will install hatch using brew if you are on a Mac.
If you are on a different OS, you can follow the instructions here
"},{"location":"community/contribute.html#dev-environment-setup","title":"\ud83d\udccc Dev Environment Setup","text":"To ensure our standards, make sure to install the required packages.
make dev\n
This will install all the required packages for development in the project under the .venv
directory. Use this virtual environment to run the code and tests during local development.
"},{"location":"community/contribute.html#linting-and-standards","title":"\ud83e\uddf9 Linting and Standards","text":"We use ruff
, pylint
, isort
, black
and mypy
to maintain standards in the codebase.
Run the following two commands to check the codebase for any issues:
make check\n
This will run all the checks including pylint and mypy. make fmt\n
This will format the codebase using black, isort, and ruff. Make sure that the linters and formatters do not report any errors or warnings before submitting a pull request.
"},{"location":"community/contribute.html#testing","title":"\ud83e\uddea Testing","text":"We use pytest
to test our code.
You can run the tests by running one of the following commands:
make cov # to run the tests and check the coverage\nmake all-tests # to run all the tests\nmake spark-tests # to run the spark tests\nmake non-spark-tests # to run the non-spark tests\n
Make sure that all tests pass and that you have adequate coverage before submitting a pull request.
"},{"location":"community/contribute.html#additional-resources","title":"Additional Resources","text":" - General GitHub documentation
- GitHub pull request documentation
- Nike's Code of Conduct
- Nike's Individual Contributor License Agreement
- Nike OSS
"},{"location":"includes/glossary.html","title":"Glossary","text":""},{"location":"includes/glossary.html#pydantic","title":"Pydantic","text":"Pydantic is a Python library for data validation and settings management using Python type annotations. It allows Koheesio to bring in strong typing and a high level of type safety. Essentially, it allows Koheesio to consider configurations of a pipeline (i.e. the settings used inside Steps, Tasks, etc.) as data that can be validated and structured.
"},{"location":"includes/glossary.html#pyspark","title":"PySpark","text":"PySpark is a Python library for Apache Spark, a powerful open-source data processing engine. It allows Koheesio to handle large-scale data processing tasks efficiently.
"},{"location":"misc/info.html","title":"Info","text":"{{ macros_info() }}
"},{"location":"reference/concepts/concepts.html","title":"Concepts","text":"The framework architecture is built from a set of core components. Each of the implementations that the framework provides out of the box, can be swapped out for custom implementations as long as they match the API.
The core components are the following:
Note: click on the 'Concept' to take you to the corresponding module. The module documentation will have greater detail on the specifics of the implementation
"},{"location":"reference/concepts/concepts.html#step","title":"Step","text":"A custom unit of logic that can be executed. A Step is an atomic operation and serves as the building block of data pipelines built with the framework. A step can be seen as an operation on a set of inputs, and returns a set of outputs. This does not imply that steps are stateless (e.g. data writes)! This concept is visualized in the figure below.
\nflowchart LR\n\n%% Should render like this\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 1 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 1 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n%% \u2502 \u2502\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 2 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Step \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 2 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n%% \u2502 \u2502\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 3 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 3 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n\n%% is for increasing the box size without having to mess with CSS settings\nStep[\"\n \n \n \nStep\n \n \n \n\"]\n\nI1[\"Input 1\"] ---> Step\nI2[\"Input 2\"] ---> Step\nI3[\"Input 3\"] ---> Step\n\nStep ---> O1[\"Output 1\"]\nStep ---> O2[\"Output 2\"]\nStep ---> O3[\"Output 3\"]\n
Step is the core abstraction of the framework. Meaning, that it is the core building block of the framework and is used to define all the operations that can be executed.
Please see the Step documentation for more details.
"},{"location":"reference/concepts/concepts.html#task","title":"Task","text":"The unit of work of one execution of the framework.
An execution usually consists of an Extract - Transform - Load
approach of one data object. Tasks typically consist of a series of Steps.
Please see the Task documentation for more details.
"},{"location":"reference/concepts/concepts.html#context","title":"Context","text":"The Context is used to configure the environment where a Task or Step runs.
It is often based on configuration files and can be used to adapt behaviour of a Task or Step based on the environment it runs in.
Please see the Context documentation for more details.
"},{"location":"reference/concepts/concepts.html#logger","title":"logger","text":"A logger object to log messages with different levels.
Please see the Logging documentation for more details.
The interactions between the base concepts of the model is visible in the below diagram:
---\ntitle: Koheesio Class Diagram\n---\nclassDiagram\n Step .. Task\n Step .. Transformation\n Step .. Reader\n Step .. Writer\n\n class Context\n\n class LoggingFactory\n\n class Task{\n <<abstract>>\n + List~Step~ steps\n ...\n + execute() Output\n }\n\n class Step{\n <<abstract>>\n ...\n Output: ...\n + execute() Output\n }\n\n class Transformation{\n <<abstract>>\n + df: DataFrame\n ...\n Output:\n + df: DataFrame\n + transform(df: DataFrame) DataFrame\n }\n\n class Reader{\n <<abstract>>\n ...\n Output:\n + df: DataFrame\n + read() DataFrame\n }\n\n class Writer{\n <<abstract>>\n + df: DataFrame\n ...\n + write(df: DataFrame)\n }
"},{"location":"reference/concepts/context.html","title":"Context in Koheesio","text":"In the Koheesio framework, the Context
class plays a pivotal role. It serves as a flexible and powerful tool for managing configuration data and shared variables across tasks and steps in your application.
Context
behaves much like a Python dictionary, but with additional features that enhance its usability and flexibility. It allows you to store and retrieve values, including complex Python objects, with ease. You can access these values using dictionary-like methods or as class attributes, providing a simple and intuitive interface.
Moreover, Context
supports nested keys and recursive merging of contexts, making it a versatile tool for managing complex configurations. It also provides serialization and deserialization capabilities, allowing you to easily save and load configurations in JSON, YAML, or TOML formats.
Whether you're setting up the environment for a Task or Step, or managing variables shared across multiple tasks, Context
provides a robust and efficient solution.
This document will guide you through its key features and show you how to leverage its capabilities in your Koheesio applications.
"},{"location":"reference/concepts/context.html#api-reference","title":"API Reference","text":"See API Reference for a detailed description of the Context
class and its methods.
"},{"location":"reference/concepts/context.html#key-features","title":"Key Features","text":" -
Accessing Values: Context
simplifies accessing configuration values. You can access them using dictionary-like methods or as class attributes. This allows for a more intuitive interaction with the Context
object. For example:
context = Context({\"bronze_table\": \"catalog.schema.table_name\"})\nprint(context.bronze_table) # Outputs: catalog.schema.table_name\n
-
Nested Keys: Context
supports nested keys, allowing you to access and add nested keys in a straightforward way. This is useful when dealing with complex configurations that require a hierarchical structure. For example:
context = Context({\"bronze\": {\"table\": \"catalog.schema.table_name\"}})\nprint(context.bronze.table) # Outputs: catalog.schema.table_name\n
-
Merging Contexts: You can merge two Contexts
together, with the incoming Context
having priority. Recursive merging is also supported. This is particularly useful when you want to update a Context
with new data without losing the existing values. For example:
context1 = Context({\"bronze_table\": \"catalog.schema.table_name\"})\ncontext2 = Context({\"silver_table\": \"catalog.schema.table_name\"})\ncontext1.merge(context2)\nprint(context1.silver_table) # Outputs: catalog.schema.table_name\n
-
Adding Keys: You can add keys to a Context by using the add
method. This allows you to dynamically update the Context
as needed. For example:
context.add(\"silver_table\", \"catalog.schema.table_name\")\n
-
Checking Key Existence: You can check if a key exists in a Context by using the contains
method. This is useful when you want to ensure a key is present before attempting to access its value. For example:
context.contains(\"silver_table\") # Returns: True\n
-
Getting Key-Value Pair: You can get a key-value pair from a Context by using the get_item
method. This can be useful when you want to extract a specific piece of data from the Context
. For example:
context.get_item(\"silver_table\") # Returns: {\"silver_table\": \"catalog.schema.table_name\"}\n
-
Converting to Dictionary: You can convert a Context to a dictionary by using the to_dict
method. This can be useful when you need to interact with code that expects a standard Python dictionary. For example:
context_dict = context.to_dict()\n
-
Creating from Dictionary: You can create a Context from a dictionary by using the from_dict
method. This allows you to easily convert existing data structures into a Context
. For example:
context = Context.from_dict({\"bronze_table\": \"catalog.schema.table_name\"})\n
"},{"location":"reference/concepts/context.html#advantages-over-a-dictionary","title":"Advantages over a Dictionary","text":"While a dictionary can be used to store configuration values, Context
provides several advantages:
-
Support for nested keys: Unlike a standard Python dictionary, Context
allows you to access nested keys as if they were attributes. This makes it easier to work with complex, hierarchical data.
-
Recursive merging of two Contexts
: Context
allows you to merge two Contexts
together, with the incoming Context
having priority. This is useful when you want to update a Context
with new data without losing the existing values.
-
Accessing keys as if they were class attributes: This provides a more intuitive way to interact with the Context
, as you can use dot notation to access values.
-
Code completion in IDEs: Because you can access keys as if they were attributes, IDEs can provide code completion for Context
keys. This can make your coding process more efficient and less error-prone.
-
Easy creation from a YAML, JSON, or TOML file: Context
provides methods to easily load data from YAML or JSON files, making it a great tool for managing configuration data.
"},{"location":"reference/concepts/context.html#data-formats-and-serialization","title":"Data Formats and Serialization","text":"Context
leverages JSON, YAML, and TOML for serialization and deserialization. These formats are widely used in the industry and provide a balance between readability and ease of use.
-
JSON: A lightweight data-interchange format that is easy for humans to read and write and easy for machines to parse and generate. It's widely used for APIs and web-based applications.
-
YAML: A human-friendly data serialization standard often used for configuration files. It's more readable than JSON and supports complex data structures.
-
TOML: A minimal configuration file format that's easy to read due to its clear and simple syntax. It's often used for configuration files in Python applications.
"},{"location":"reference/concepts/context.html#examples","title":"Examples","text":"In this section, we provide a variety of examples to demonstrate the capabilities of the Context
class in Koheesio.
"},{"location":"reference/concepts/context.html#basic-operations","title":"Basic Operations","text":"Here are some basic operations you can perform with Context
. These operations form the foundation of how you interact with a Context
object:
# Create a Context\ncontext = Context({\"bronze_table\": \"catalog.schema.table_name\"})\n\n# Access a value\nvalue = context.bronze_table\n\n# Add a key\ncontext.add(\"silver_table\", \"catalog.schema.table_name\")\n\n# Merge two Contexts\ncontext.merge(Context({\"silver_table\": \"catalog.schema.table_name\"}))\n
"},{"location":"reference/concepts/context.html#serialization-and-deserialization","title":"Serialization and Deserialization","text":"Context
supports serialization and deserialization to and from JSON, YAML, and TOML formats. This allows you to easily save and load Context
data:
# Load context from a JSON file\ncontext = Context.from_json(\"path/to/context.json\")\n\n# Save context to a JSON file\ncontext.to_json(\"path/to/context.json\")\n\n# Load context from a YAML file\ncontext = Context.from_yaml(\"path/to/context.yaml\")\n\n# Save context to a YAML file\ncontext.to_yaml(\"path/to/context.yaml\")\n\n# Load context from a TOML file\ncontext = Context.from_toml(\"path/to/context.toml\")\n\n# Save context to a TOML file\ncontext.to_toml(\"path/to/context.toml\")\n
"},{"location":"reference/concepts/context.html#nested-keys","title":"Nested Keys","text":"Context
supports nested keys, allowing you to create hierarchical configurations. This is useful when dealing with complex data structures:
# Create a Context with nested keys\ncontext = Context({\n \"database\": {\n \"bronze_table\": \"catalog.schema.bronze_table\",\n \"silver_table\": \"catalog.schema.silver_table\",\n \"gold_table\": \"catalog.schema.gold_table\"\n }\n})\n\n# Access a nested key\nprint(context.database.bronze_table) # Outputs: catalog.schema.bronze_table\n
"},{"location":"reference/concepts/context.html#recursive-merging","title":"Recursive Merging","text":"Context
also supports recursive merging, allowing you to merge two Contexts
together at all levels of their hierarchy. This is particularly useful when you want to update a Context
with new data without losing the existing values:
# Create two Contexts with nested keys\ncontext1 = Context({\n \"database\": {\n \"bronze_table\": \"catalog.schema.bronze_table\",\n \"silver_table\": \"catalog.schema.silver_table\"\n }\n})\n\ncontext2 = Context({\n \"database\": {\n \"silver_table\": \"catalog.schema.new_silver_table\",\n \"gold_table\": \"catalog.schema.gold_table\"\n }\n})\n\n# Merge the two Contexts\ncontext1.merge(context2)\n\n# Print the merged Context\nprint(context1.to_dict()) \n# Outputs: \n# {\n# \"database\": {\n# \"bronze_table\": \"catalog.schema.bronze_table\",\n# \"silver_table\": \"catalog.schema.new_silver_table\",\n# \"gold_table\": \"catalog.schema.gold_table\"\n# }\n# }\n
"},{"location":"reference/concepts/context.html#jsonpickle-and-complex-python-objects","title":"Jsonpickle and Complex Python Objects","text":"The Context
class in Koheesio also uses jsonpickle
for serialization and deserialization of complex Python objects to and from JSON. This allows you to convert complex Python objects, including custom classes, into a format that can be easily stored and transferred.
Here's an example of how this works:
# Import necessary modules\nfrom koheesio.context import Context\n\n# Initialize SnowflakeReader and store in a Context\nsnowflake_reader = SnowflakeReader(...) # fill in with necessary arguments\ncontext = Context({\"snowflake_reader\": snowflake_reader})\n\n# Serialize the Context to a JSON string\njson_str = context.to_json()\n\n# Print the serialized Context\nprint(json_str)\n\n# Deserialize the JSON string back into a Context\ndeserialized_context = Context.from_json(json_str)\n\n# Access the deserialized SnowflakeReader\ndeserialized_snowflake_reader = deserialized_context.snowflake_reader\n\n# Now you can use the deserialized SnowflakeReader as you would the original\n
This feature is particularly useful when you need to save the state of your application, transfer it over a network, or store it in a database. When you're ready to use the stored data, you can easily convert it back into the original Python objects.
However, there are a few things to keep in mind:
-
The classes you're serializing must be importable (i.e., they must be in the Python path) when you're deserializing the JSON. jsonpickle
needs to be able to import the class to reconstruct the object. This holds true for most Koheesio classes, as they are designed to be importable and reconstructible.
-
Not all Python objects can be serialized. For example, objects that hold a reference to a file or a network connection can't be serialized because their state can't be easily captured in a static file.
-
As mentioned in the code comments, jsonpickle
is not secure against malicious data. You should only deserialize data that you trust.
So, while the Context
class provides a powerful tool for handling complex Python objects, it's important to be aware of these limitations.
"},{"location":"reference/concepts/context.html#conclusion","title":"Conclusion","text":"In this document, we've covered the key features of the Context
class in the Koheesio framework, including its ability to handle complex Python objects, support for nested keys and recursive merging, and its serialization and deserialization capabilities.
Whether you're setting up the environment for a Task or Step, or managing variables shared across multiple tasks, Context
provides a robust and efficient solution.
"},{"location":"reference/concepts/context.html#further-reading","title":"Further Reading","text":"For more information, you can refer to the following resources:
- Python jsonpickle Documentation
- Python JSON Documentation
- Python YAML Documentation
- Python TOML Documentation
Refer to the API documentation for more details on the Context
class and its methods.
"},{"location":"reference/concepts/logger.html","title":"Python Logger Code Instructions","text":"Here you can find instructions on how to use the Koheesio Logging Factory.
"},{"location":"reference/concepts/logger.html#logging-factory","title":"Logging Factory","text":"The LoggingFactory
class is a factory for creating and configuring loggers. To use it, follow these steps:
-
Import the necessary modules:
from koheesio.logger import LoggingFactory\n
-
Initialize logging factory for koheesio modules:
factory = LoggingFactory(name=\"replace_koheesio_parent_name\", env=\"local\", logger_id=\"your_run_id\")\n# Or use default \nfactory = LoggingFactory()\n# Or just specify log level for koheesio modules\nfactory = LoggingFactory(level=\"DEBUG\")\n
-
Create a logger by calling the create_logger
method of the LoggingFactory
class, you can inherit from koheesio logger:
python logger = LoggingFactory.get_logger(name=factory.LOGGER_NAME) # Or for koheesio modules logger = LoggingFactory.get_logger(name=factory.LOGGER_NAME,inherit_from_koheesio=True)
-
You can now use the logger
object to log messages:
logger.debug(\"Debug message\")\nlogger.info(\"Info message\")\nlogger.warning(\"Warning message\")\nlogger.error(\"Error message\")\nlogger.critical(\"Critical message\")\n
-
(Optional) You can add additional handlers to the logger by calling the add_handlers
method of the LoggingFactory
class:
handlers = [\n (\"your_handler_module.YourHandlerClass\", {\"level\": \"INFO\"}),\n # Add more handlers if needed\n]\nfactory.add_handlers(handlers)\n
-
(Optional) You can create child loggers based on the parent logger by calling the get_logger
method of the LoggingFactory
class:
child_logger = factory.get_logger(name=\"your_child_logger_name\")\n
-
(Optional) Get an independent logger without inheritance
If you need an independent logger without inheriting from the LoggingFactory
logger, you can use the get_logger
method:
your_logger = factory.get_logger(name=\"your_logger_name\", inherit=False)\n
By setting inherit
to False
, you will obtain a logger that is not tied to the LoggingFactory
logger hierarchy, only format of message will be the same, but you can also change it. This allows you to have an independent logger with its own configuration. You can use the your_logger
object to log messages:
```python\nyour_logger.debug(\"Debug message\")\nyour_logger.info(\"Info message\")\nyour_logger.warning(\"Warning message\")\nyour_logger.error(\"Error message\")\nyour_logger.critical(\"Critical message\")\n```\n
-
(Optional) You can use Masked types to masked secrets/tokens/passwords in output. The Masked types are special types provided by the koheesio library to handle sensitive data that should not be logged or printed in plain text. They are used to wrap sensitive data and override their string representation to prevent accidental exposure of the data.Here are some examples of how to use Masked types:
import logging\nfrom koheesio.logger import MaskedString, MaskedInt, MaskedFloat, MaskedDict\n\n# Set up logging\nlogger = logging.getLogger(__name__)\nlogging.basicConfig(level=logging.INFO)\n\n# Using MaskedString\nmasked_string = MaskedString(\"my secret string\")\nlogger.info(masked_string) # This will not log the actual string\n\n# Using MaskedInt\nmasked_int = MaskedInt(12345)\nlogger.info(masked_int) # This will not log the actual integer\n\n# Using MaskedFloat\nmasked_float = MaskedFloat(3.14159)\nlogger.info(masked_float) # This will not log the actual float\n\n# Using MaskedDict\nmasked_dict = MaskedDict({\"key\": \"value\"})\nlogger.info(masked_dict) # This will not log the actual dictionary\n
Please make sure to replace \"your_logger_name\", \"your_run_id\", \"your_handler_module.YourHandlerClass\", \"your_child_logger_name\", and other placeholders with your own values according to your application's requirements.
By following these steps, you can obtain an independent logger without inheriting from the LoggingFactory
logger. This allows you to customize the logger configuration and use it separately in your code.
Note: Ensure that you have imported the necessary modules, instantiated the LoggingFactory
class, and customized the logger name and other parameters according to your application's requirements.
"},{"location":"reference/concepts/logger.html#example","title":"Example","text":"import logging\n\n# Step 2: Instantiate the LoggingFactory class\nfactory = LoggingFactory(env=\"local\")\n\n# Step 3: Create an independent logger with a custom log level\nyour_logger = factory.get_logger(\"your_logger\", inherit_from_koheesio=False)\nyour_logger.setLevel(logging.DEBUG)\n\n# Step 4: Create a logger using the create_logger method from LoggingFactory with a different log level\nfactory_logger = LoggingFactory(level=\"WARNING\").get_logger(name=factory.LOGGER_NAME)\n\n# Step 5: Create a child logger with a debug level\nchild_logger = factory.get_logger(name=\"child\")\nchild_logger.setLevel(logging.DEBUG)\n\nchild2_logger = factory.get_logger(name=\"child2\")\nchild2_logger.setLevel(logging.INFO)\n\n# Step 6: Log messages at different levels for both loggers\nyour_logger.debug(\"Debug message\") # This message will be displayed\nyour_logger.info(\"Info message\") # This message will be displayed\nyour_logger.warning(\"Warning message\") # This message will be displayed\nyour_logger.error(\"Error message\") # This message will be displayed\nyour_logger.critical(\"Critical message\") # This message will be displayed\n\nfactory_logger.debug(\"Debug message\") # This message will not be displayed\nfactory_logger.info(\"Info message\") # This message will not be displayed\nfactory_logger.warning(\"Warning message\") # This message will be displayed\nfactory_logger.error(\"Error message\") # This message will be displayed\nfactory_logger.critical(\"Critical message\") # This message will be displayed\n\nchild_logger.debug(\"Debug message\") # This message will be displayed\nchild_logger.info(\"Info message\") # This message will be displayed\nchild_logger.warning(\"Warning message\") # This message will be displayed\nchild_logger.error(\"Error message\") # This message will be displayed\nchild_logger.critical(\"Critical message\") # This message will be displayed\n\nchild2_logger.debug(\"Debug message\") # This message will be displayed\nchild2_logger.info(\"Info message\") # This message will be displayed\nchild2_logger.warning(\"Warning message\") # This message will be displayed\nchild2_logger.error(\"Error message\") # This message will be displayed\nchild2_logger.critical(\"Critical message\") # This message will be displayed\n
Output:
[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [DEBUG] [your_logger] {__init__.py:<module>:118} - Debug message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [INFO] [your_logger] {__init__.py:<module>:119} - Info message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [WARNING] [your_logger] {__init__.py:<module>:120} - Warning message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [ERROR] [your_logger] {__init__.py:<module>:121} - Error message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [CRITICAL] [your_logger] {__init__.py:<module>:122} - Critical message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [WARNING] [koheesio] {__init__.py:<module>:126} - Warning message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [ERROR] [koheesio] {__init__.py:<module>:127} - Error message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [CRITICAL] [koheesio] {__init__.py:<module>:128} - Critical message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [DEBUG] [koheesio.child] {__init__.py:<module>:130} - Debug message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [INFO] [koheesio.child] {__init__.py:<module>:131} - Info message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [WARNING] [koheesio.child] {__init__.py:<module>:132} - Warning message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [ERROR] [koheesio.child] {__init__.py:<module>:133} - Error message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [CRITICAL] [koheesio.child] {__init__.py:<module>:134} - Critical message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [INFO] [koheesio.child2] {__init__.py:<module>:137} - Info message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [WARNING] [koheesio.child2] {__init__.py:<module>:138} - Warning message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [ERROR] [koheesio.child2] {__init__.py:<module>:139} - Error message\n[a7d79f7a-f16f-4d2a-8430-1134830f61be] [2023-06-26 20:29:39,267] [CRITICAL] [koheesio.child2] {__init__.py:<module>:140} - Critical message\n
"},{"location":"reference/concepts/logger.html#loggeridfilter-class","title":"LoggerIDFilter Class","text":"The LoggerIDFilter
class is a filter that injects run_id
information into the log. To use it, follow these steps:
-
Import the necessary modules:
import logging\n
-
Create an instance of the LoggerIDFilter
class:
logger_filter = LoggerIDFilter()\n
-
Set the LOGGER_ID
attribute of the LoggerIDFilter
class to the desired run ID:
LoggerIDFilter.LOGGER_ID = \"your_run_id\"\n
-
Add the logger_filter
to your logger or handler:
logger = logging.getLogger(\"your_logger_name\")\nlogger.addFilter(logger_filter)\n
"},{"location":"reference/concepts/logger.html#loggingfactory-set-up-optional","title":"LoggingFactory Set Up (Optional)","text":" -
Import the LoggingFactory
class in your application code.
-
Set the value for the LOGGER_FILTER
variable:
- If you want to assign a specific
logging.Filter
instance, replace None
with your desired filter instance. -
If you want to keep the default value of None
, leave it unchanged.
-
Set the value for the LOGGER_LEVEL
variable:
- If you want to use the value from the
\"KOHEESIO_LOGGING_LEVEL\"
environment variable, leave the code as is. -
If you want to use a different environment variable or a specific default value, modify the code accordingly.
-
Set the value for the LOGGER_ENV
variable:
-
Replace \"local\"
with your desired environment name.
-
Set the value for the LOGGER_FORMAT
variable:
- If you want to customize the log message format, modify the value within the double quotes.
-
The format should follow the desired log message format pattern.
-
Set the value for the LOGGER_FORMATTER
variable:
- If you want to assign a specific
Formatter
instance, replace Formatter(LOGGER_FORMAT)
with your desired formatter instance. -
If you want to keep the default formatter with the defined log message format, leave it unchanged.
-
Set the value for the CONSOLE_HANDLER
variable:
- If you want to assign a specific
logging.Handler
instance, replace None
with your desired handler instance. - If you want to keep the default value of
None
, leave it unchanged.
-
Set the value for the ENV
variable:
- Replace
None
with your desired environment value if applicable. - If you don't need to set this variable, leave it as
None
.
-
Save the changes to the file.
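A hedged sketch of what such a set-up could look like, assuming the class-level variables named above can be assigned before the factory is instantiated (all values are illustrative):

```python
from logging import Formatter

from koheesio.logger import LoggingFactory

# Illustrative values only; adjust to your application's requirements.
LoggingFactory.LOGGER_ENV = "dev"
LoggingFactory.LOGGER_LEVEL = "INFO"  # instead of reading the KOHEESIO_LOGGING_LEVEL environment variable
LoggingFactory.LOGGER_FORMAT = "[%(asctime)s] [%(levelname)s] [%(name)s] - %(message)s"
LoggingFactory.LOGGER_FORMATTER = Formatter(LoggingFactory.LOGGER_FORMAT)

factory = LoggingFactory(name="my_app", env=LoggingFactory.LOGGER_ENV)
logger = factory.get_logger(name="my_app.pipeline")
logger.info("Logger configured with customized defaults")
```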
"},{"location":"reference/concepts/step.html","title":"Steps in Koheesio","text":"In the Koheesio framework, the Step
class and its derivatives play a crucial role. They serve as the building blocks for creating data pipelines, allowing you to define custom units of logic that can be executed. This document will guide you through its key features and show you how to leverage its capabilities in your Koheesio applications.
Several type of Steps are available in Koheesio, including Reader
, Transformation
, Writer
, and Task
.
"},{"location":"reference/concepts/step.html#what-is-a-step","title":"What is a Step?","text":"A Step
is an atomic operation serving as the building block of data pipelines built with the Koheesio framework. Tasks typically consist of a series of Steps.
A step can be seen as an operation on a set of inputs, that returns a set of outputs. This does not imply that steps are stateless (e.g. data writes)! This concept is visualized in the figure below.
\nflowchart LR\n\n%% Should render like this\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 1 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 1 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n%% \u2502 \u2502\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 2 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 Step \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 2 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502 \u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n%% \u2502 \u2502\n%% \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502 \u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n%% \u2502 Input 3 \u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502 \u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u25b6\u2502Output 3 \u2502\n%% \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n\n%% is for increasing the box size without having to mess with CSS settings\nStep[\"\n \n \n \nStep\n \n \n \n\"]\n\nI1[\"Input 1\"] ---> Step\nI2[\"Input 2\"] ---> Step\nI3[\"Input 3\"] ---> Step\n\nStep ---> O1[\"Output 1\"]\nStep ---> O2[\"Output 2\"]\nStep ---> O3[\"Output 3\"]\n
"},{"location":"reference/concepts/step.html#how-to-read-a-step","title":"How to Read a Step?","text":"A Step
in Koheesio is a class that represents a unit of work in a data pipeline. It's similar to a Python built-in data class, but with additional features for execution, validation, and logging.
When you look at a Step
, you'll typically see the following components:
-
Class Definition: The Step
is defined as a class that inherits from the base Step
class in Koheesio. For example, class MyStep(Step):
.
-
Input Fields: These are defined as class attributes with type annotations, similar to attributes in a Python data class. These fields represent the inputs to the Step
. For example, a: str
defines an input field a
of type str
. Additionally, you will often see these fields defined using Pydantic's Field
class, which allows for more detailed validation and documentation as well as default values and aliasing.
-
Output Fields: These are defined in a nested class called Output
that inherits from StepOutput
. This class represents the output of the Step
. For example, class Output(StepOutput): b: str
defines an output field b
of type str
.
-
Execute Method: This is a method that you need to implement when you create a new Step
. It contains the logic of the Step
and is where you use the input fields and populate the output fields. For example, def execute(self): self.output.b = f\"{self.a}-some-suffix\"
.
Here's an example of a Step
:
class MyStep(Step):\n a: str # input\n\n class Output(StepOutput): # output\n b: str\n\n def execute(self) -> MyStep.Output:\n self.output.b = f\"{self.a}-some-suffix\"\n
In this Step
, a
is an input field of type str
, b
is an output field of type str
, and the execute
method appends -some-suffix
to the input a
and assigns it to the output b
.
When you see a Step
, you can think of it as a function where the class attributes are the inputs, the Output
class defines the outputs, and the execute
method is the function body. The main difference is that a Step
also includes automatic validation of inputs and outputs (thanks to Pydantic), logging, and error handling.
"},{"location":"reference/concepts/step.html#understanding-inheritance-in-steps","title":"Understanding Inheritance in Steps","text":"Inheritance is a core concept in object-oriented programming where a class (child or subclass) inherits properties and methods from another class (parent or superclass). In the context of Koheesio, when you create a new Step
, you're creating a subclass that inherits from the base Step
class.
When a new Step is defined (like class MyStep(Step):
), it inherits all the properties and methods from the Step
class. This includes the execute
method, which is then overridden to provide the specific functionality for that Step.
Here's a simple breakdown:
-
Parent Class (Superclass): This is the Step
class in Koheesio. It provides the basic structure and functionalities of a Step, including input and output validation, logging, and error handling.
-
Child Class (Subclass): This is the new Step you define, like MyStep
. It inherits all the properties and methods from the Step
class and can add or override them as needed.
-
Inheritance: This is the process where MyStep
inherits the properties and methods from the Step
class. In Python, this is done by mentioning the parent class in parentheses when defining the child class, like class MyStep(Step):
.
-
Overriding: This is when you provide a new implementation of a method in the child class that is already defined in the parent class. In the case of Steps, you override the execute
method to define the specific logic of your Step.
Understanding inheritance is key to understanding how Steps work in Koheesio. It allows you to leverage the functionalities provided by the Step
class and focus on implementing the specific logic of your Step.
"},{"location":"reference/concepts/step.html#benefits-of-using-steps-in-data-pipelines","title":"Benefits of Using Steps in Data Pipelines","text":"The concept of a Step
is beneficial when creating Data Pipelines or Data Products for several reasons:
-
Modularity: Each Step
represents a self-contained unit of work, which makes the pipeline modular. This makes it easier to understand, test, and maintain the pipeline. If a problem arises, you can pinpoint which step is causing the issue.
-
Reusability: Steps can be reused across different pipelines. Once a Step
is defined, it can be used in any number of pipelines. This promotes code reuse and consistency across projects.
-
Readability: Steps make the pipeline code more readable. Each Step
has a clear input, output, and execution logic, which makes it easier to understand what each part of the pipeline is doing.
-
Validation: Steps automatically validate their inputs and outputs. This ensures that the data flowing into and out of each step is of the expected type and format, which can help catch errors early.
-
Logging: Steps automatically log the start and end of their execution, along with the input and output data. This can be very useful for debugging and understanding the flow of data through the pipeline.
-
Error Handling: Steps provide built-in error handling. If an error occurs during the execution of a step, it is caught, logged, and then re-raised. This provides a clear indication of where the error occurred.
-
Scalability: Steps can be easily parallelized or distributed, which is crucial for processing large datasets. This is especially true for steps that are designed to work with distributed computing frameworks like Apache Spark.
By using the concept of a Step
, you can create data pipelines that are modular, reusable, readable, and robust, while also being easier to debug and scale.
"},{"location":"reference/concepts/step.html#compared-to-a-regular-pydantic-basemodel","title":"Compared to a regular Pydantic Basemodel","text":"A Step
in Koheesio, while built on top of Pydantic's BaseModel
, provides additional features specifically designed for creating data pipelines. Here are some key differences:
-
Execution Method: A Step
includes an execute
method that needs to be implemented. This method contains the logic of the step and is automatically decorated with functionalities such as logging and output validation.
-
Input and Output Validation: A Step
uses Pydantic models to define and validate its inputs and outputs. This ensures that the data flowing into and out of the step is of the expected type and format.
-
Automatic Logging: A Step
automatically logs the start and end of its execution, along with the input and output data. This is done through the do_execute
decorator applied to the execute
method.
-
Error Handling: A Step
provides built-in error handling. If an error occurs during the execution of the step, it is caught, logged, and then re-raised. This should help in debugging and understanding the flow of data.
-
Serialization: A Step
can be serialized to a YAML string using the to_yaml
method. This can be useful for saving and loading steps.
-
Lazy Mode Support: The StepOutput
class in a Step
supports lazy mode, which allows validation of the items stored in the class to be called at will instead of being forced to run it upfront.
In contrast, a regular Pydantic BaseModel
is a simple data validation model that doesn't include these additional features. It's used for data parsing and validation, but doesn't include methods for execution, automatic logging, error handling, or serialization to YAML.
"},{"location":"reference/concepts/step.html#key-features-of-a-step","title":"Key Features of a Step","text":""},{"location":"reference/concepts/step.html#defining-a-step","title":"Defining a Step","text":"To define a new step, you subclass the Step
class and implement the execute
method. The inputs of the step can be accessed using self.input_name
. The output of the step can be accessed using self.output.output_name
. For example:
class MyStep(Step):\n input1: str = Field(...)\n input2: int = Field(...)\n\n class Output(StepOutput):\n output1: str = Field(...)\n\n def execute(self):\n # Your logic here\n self.output.output1 = \"result\"\n
"},{"location":"reference/concepts/step.html#running-a-step","title":"Running a Step","text":"To run a step, you can call the execute
method. You can also use the run
method, which is an alias to execute
. For example:
step = MyStep(input1=\"value1\", input2=2)\nstep.execute()\n
"},{"location":"reference/concepts/step.html#accessing-step-output","title":"Accessing Step Output","text":"The output of a step can be accessed using self.output.output_name
. For example:
step = MyStep(input1=\"value1\", input2=2)\nstep.execute()\nprint(step.output.output1) # Outputs: \"result\"\n
"},{"location":"reference/concepts/step.html#serializing-a-step","title":"Serializing a Step","text":"You can serialize a step to a YAML string using the to_yaml
method. For example:
step = MyStep(input1=\"value1\", input2=2)\nyaml_str = step.to_yaml()\n
"},{"location":"reference/concepts/step.html#getting-step-description","title":"Getting Step Description","text":"You can get the description of a step using the get_description
method. For example:
step = MyStep(input1=\"value1\", input2=2)\ndescription = step.get_description()\n
"},{"location":"reference/concepts/step.html#defining-a-step-with-multiple-inputs-and-outputs","title":"Defining a Step with Multiple Inputs and Outputs","text":"Here's an example of how to define a new step with multiple inputs and outputs:
class MyStep(Step):\n input1: str = Field(...)\n input2: int = Field(...)\n input3: int = Field(...)\n\n class Output(StepOutput):\n output1: str = Field(...)\n output2: int = Field(...)\n\n def execute(self):\n # Your logic here\n self.output.output1 = \"result\"\n self.output.output2 = self.input2 + self.input3\n
"},{"location":"reference/concepts/step.html#running-a-step-with-multiple-inputs","title":"Running a Step with Multiple Inputs","text":"To run a step with multiple inputs, you can do the following:
step = MyStep(input1=\"value1\", input2=2, input3=3)\nstep.execute()\n
"},{"location":"reference/concepts/step.html#accessing-multiple-step-outputs","title":"Accessing Multiple Step Outputs","text":"The outputs of a step can be accessed using self.output.output_name
. For example:
step = MyStep(input1=\"value1\", input2=2, input3=3)\nstep.execute()\nprint(step.output.output1) # Outputs: \"result\"\nprint(step.output.output2) # Outputs: 5\n
"},{"location":"reference/concepts/step.html#special-features","title":"Special Features","text":""},{"location":"reference/concepts/step.html#the-execute-method","title":"The Execute method","text":"The execute
method in the Step
class is automatically decorated with the StepMetaClass._execute_wrapper
function due to the metaclass StepMetaClass
. This provides several advantages:
-
Automatic Output Validation: The decorator ensures that the output of the execute
method is always a StepOutput
instance. This means that the output is automatically validated against the defined output model, ensuring data integrity and consistency.
-
Logging: The decorator provides automatic logging at the start and end of the execute
method. This includes logging the input and output of the step, which can be useful for debugging and understanding the flow of data.
-
Error Handling: If an error occurs during the execution of the Step
, the decorator catches the exception and logs an error message before re-raising the exception. This provides a clear indication of where the error occurred.
-
Simplifies Step Implementation: Since the decorator handles output validation, logging, and error handling, the user can focus on implementing the logic of the execute
method without worrying about these aspects.
-
Consistency: By automatically decorating the execute
method, the library ensures that these features are consistently applied across all steps, regardless of who implements them or how they are used. This makes the behavior of steps predictable and consistent.
-
Prevents Double Wrapping: The decorator checks if the function is already wrapped with StepMetaClass._execute_wrapper
and prevents double wrapping. This ensures that the decorator doesn't interfere with itself if execute
is overridden in subclasses.
Notice that you never have to explicitly return anything from the execute
method. The StepMetaClass._execute_wrapper
decorator takes care of that for you.
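To see this in practice, here is a small self-contained sketch that mirrors the MyStep pattern shown earlier. The Multiply class and its fields are illustrative, and the nested Output follows the same convention as the SparkStep.Output annotation used later in these docs:
from koheesio import Step\n\nclass Multiply(Step):\n    a: int\n    b: int\n\n    class Output(Step.Output):  # assumes the base Step exposes its output model as Step.Output\n        product: int\n\n    def execute(self):\n        # no explicit return: the metaclass wrapper validates and exposes the Output for us\n        self.output.product = self.a * self.b\n\nstep = Multiply(a=6, b=7)\nstep.execute()\nprint(step.output.product)  # 42\n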
Below are implementation examples of a custom metaclass that can be used to override the default behavior of StepMetaClass._execute_wrapper
:
class MyMetaClass(StepMetaClass):\n    @classmethod\n    def _log_end_message(cls, step: Step, skip_logging: bool = False, *args, **kwargs):\n        print(\"It's me from custom meta class\")\n        super()._log_end_message(step, skip_logging, *args, **kwargs)\n\nclass MyMetaClass2(StepMetaClass):\n    @classmethod\n    def _validate_output(cls, step: Step, skip_validating: bool = False, *args, **kwargs):\n        # always add a dummy value to the output\n        step.output.dummy_value = \"dummy\"\n\nclass YourClassWithCustomMeta(Step, metaclass=MyMetaClass):\n    def execute(self):\n        self.log.info(f\"This is from the execute method of {self.__class__.__name__}\")\n\nclass YourClassWithCustomMeta2(Step, metaclass=MyMetaClass2):\n    def execute(self):\n        self.log.info(f\"This is from the execute method of {self.__class__.__name__}\")\n
"},{"location":"reference/concepts/step.html#sparkstep","title":"SparkStep","text":"The SparkStep
class is a subclass of Step
that is designed for steps that interact with Spark. It extends the Step
class with SparkSession support. Spark steps are expected to return a Spark DataFrame as output. The spark
property is available to access the active SparkSession instance. The Output of a SparkStep is expected to contain a DataFrame, although this is optional.
"},{"location":"reference/concepts/step.html#using-a-sparkstep","title":"Using a SparkStep","text":"Here's an example of how to use a SparkStep
:
class MySparkStep(SparkStep):\n input1: str = Field(...)\n\n class Output(StepOutput):\n output1: DataFrame = Field(...)\n\n def execute(self):\n # Your logic here\n df = self.spark.read.text(self.input1)\n self.output.output1 = df\n
To run a SparkStep
, you can do the following:
step = MySparkStep(input1=\"path/to/textfile\")\nstep.execute()\n
To access the output of a SparkStep
, you can do the following:
step = MySparkStep(input1=\"path/to/textfile\")\nstep.execute()\ndf = step.output.output1\ndf.show()\n
"},{"location":"reference/concepts/step.html#conclusion","title":"Conclusion","text":"In this document, we've covered the key features of the Step
class in the Koheesio framework, including its ability to define custom units of logic, manage inputs and outputs, and support for serialization. The automatic decoration of the execute
method provides several advantages that simplify step implementation and ensure consistency across all steps.
Whether you're defining a new operation in your data pipeline or managing the flow of data between steps, Step
provides a robust and efficient solution.
"},{"location":"reference/concepts/step.html#further-reading","title":"Further Reading","text":"For more information, you can refer to the following resources:
- Python Pydantic Documentation
- Python YAML Documentation
Refer to the API documentation for more details on the Step
class and its methods.
"},{"location":"reference/spark/readers.html","title":"Reader Module","text":"The Reader
module in Koheesio provides a set of classes for reading data from various sources. A Reader
is a type of SparkStep
that reads data from a source based on the input parameters and stores the result in self.output.df
for subsequent steps.
"},{"location":"reference/spark/readers.html#what-is-a-reader","title":"What is a Reader?","text":"A Reader
is a subclass of SparkStep
that reads data from a source and stores the result. The source could be a file, a database, a web API, or any other data source. The result is stored in a DataFrame, which is accessible through the df
property of the Reader
.
"},{"location":"reference/spark/readers.html#api-reference","title":"API Reference","text":"See API Reference for a detailed description of the Reader
class and its methods.
"},{"location":"reference/spark/readers.html#key-features-of-a-reader","title":"Key Features of a Reader","text":" - Read Method: The
Reader
class provides a read
method that calls the execute
method and returns the result. Essentially, calling .read()
is a shorthand for calling .execute().output.df
. This allows you to read data from a Reader
without having to call the execute
method directly. This is a convenience method that simplifies the usage of a Reader
.
Here's an example of how to use the .read()
method:
# Instantiate the Reader\nmy_reader = MyReader()\n\n# Use the .read() method to get the data as a DataFrame\ndf = my_reader.read()\n\n# Now df is a DataFrame with the data read by MyReader\n
In this example, MyReader
is a subclass of Reader
that you've defined. After creating an instance of MyReader
, you call the .read()
method to read the data and get it back as a DataFrame. The DataFrame df
now contains the data read by MyReader
.
- DataFrame Property: The
Reader
class provides a df
property as a shorthand for accessing self.output.df
. If self.output.df
is None
, the execute
method is run first. This property ensures that the data is loaded and ready to be used, even if the execute
method hasn't been explicitly called.
Here's an example of how to use the df
property:
# Instantiate the Reader\nmy_reader = MyReader()\n\n# Use the df property to get the data as a DataFrame\ndf = my_reader.df\n\n# Now df is a DataFrame with the data read by MyReader\n
In this example, MyReader
is a subclass of Reader
that you've defined. After creating an instance of MyReader
, you access the df
property to get the data as a DataFrame. The DataFrame df
now contains the data read by MyReader
.
- SparkSession: Every
Reader
has a SparkSession
available as self.spark
. This is the currently active SparkSession
, which can be used to perform distributed data processing tasks.
Here's an example of how to use the spark
property:
# Instantiate the Reader\nmy_reader = MyReader()\n\n# Use the spark property to get the SparkSession\nspark = my_reader.spark\n\n# Now spark is the SparkSession associated with MyReader\n
In this example, MyReader
is a subclass of Reader
that you've defined. After creating an instance of MyReader
, you access the spark
property to get the SparkSession
. The SparkSession
spark
can now be used to perform distributed data processing tasks.
"},{"location":"reference/spark/readers.html#how-to-define-a-reader","title":"How to Define a Reader?","text":"To define a Reader
, you create a subclass of the Reader
class and implement the execute
method. The execute
method should read from the source and store the result in self.output.df
. This is an abstract method, which means it must be implemented in any subclass of Reader
.
Here's an example of a Reader
:
class MyReader(Reader):\n def execute(self):\n # read data from source\n data = read_from_source()\n # store result in self.output.df\n self.output.df = data\n
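In practice, read_from_source would be replaced by real logic. As a minimal, hedged sketch (the CsvFileReader name and its path field are illustrative; the import follows the koheesio.steps.readers module referenced elsewhere in these docs), a simple file-based reader could look like this:
from koheesio.steps.readers import Reader\n\nclass CsvFileReader(Reader):\n    # illustrative input field; declared like any other Step input\n    path: str\n\n    def execute(self):\n        # use the active SparkSession to read the file and store the result\n        self.output.df = self.spark.read.csv(self.path, header=True, inferSchema=True)\n\ndf = CsvFileReader(path=\"data/input.csv\").read()\n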
"},{"location":"reference/spark/readers.html#understanding-inheritance-in-readers","title":"Understanding Inheritance in Readers","text":"Just like a Step
, a Reader
is defined as a subclass that inherits from the base Reader
class. This means it inherits all the properties and methods from the Reader
class and can add or override them as needed. The main method that needs to be overridden is the execute
method, which should implement the logic for reading data from the source and storing it in self.output.df
.
"},{"location":"reference/spark/readers.html#benefits-of-using-readers-in-data-pipelines","title":"Benefits of Using Readers in Data Pipelines","text":"Using Reader
classes in your data pipelines has several benefits:
-
Simplicity: Readers abstract away the details of reading data from various sources, allowing you to focus on the logic of your pipeline.
-
Consistency: By using Readers, you ensure that data is read in a consistent manner across different parts of your pipeline.
-
Flexibility: Readers can be easily swapped out for different data sources without changing the rest of your pipeline.
-
Efficiency: Readers automatically manage resources like connections and file handles, ensuring efficient use of resources.
By using the concept of a Reader
, you can create data pipelines that are simple, consistent, flexible, and efficient.
"},{"location":"reference/spark/readers.html#examples-of-reader-classes-in-koheesio","title":"Examples of Reader Classes in Koheesio","text":"Koheesio provides a variety of Reader
subclasses for reading data from different sources. Here are just a few examples:
-
Teradata Reader: A Reader
subclass for reading data from Teradata databases. It's defined in the koheesio/steps/readers/teradata.py
file.
-
Snowflake Reader: A Reader
subclass for reading data from Snowflake databases. It's defined in the koheesio/steps/readers/snowflake.py
file.
-
Box Reader: A Reader
subclass for reading data from Box. It's defined in the koheesio/steps/integrations/box.py
file.
These are just a few examples of the many Reader
subclasses available in Koheesio. Each Reader
subclass is designed to read data from a specific source. They all inherit from the base Reader
class and implement the execute
method to read data from their respective sources and store it in self.output.df
.
Please note that this is not an exhaustive list. Koheesio provides many more Reader
subclasses for a wide range of data sources. For a complete list, please refer to the Koheesio documentation or the source code.
More readers can be found in the koheesio/steps/readers
module.
"},{"location":"reference/spark/transformations.html","title":"Transformation Module","text":"The Transformation
module in Koheesio provides a set of classes for transforming data within a DataFrame. A Transformation
is a type of SparkStep
that takes a DataFrame as input, applies a transformation, and returns a DataFrame as output. The transformation logic is implemented in the execute
method of each Transformation
subclass.
"},{"location":"reference/spark/transformations.html#what-is-a-transformation","title":"What is a Transformation?","text":"A Transformation
is a subclass of SparkStep
that applies a transformation to a DataFrame and stores the result. The transformation could be any operation that modifies the data or structure of the DataFrame, such as adding a new column, filtering rows, or aggregating data.
Using Transformation
classes ensures that data is transformed in a consistent manner across different parts of your pipeline. This can help avoid errors and make your code easier to understand and maintain.
"},{"location":"reference/spark/transformations.html#api-reference","title":"API Reference","text":"See API Reference for a detailed description of the Transformation
classes and their methods.
"},{"location":"reference/spark/transformations.html#types-of-transformations","title":"Types of Transformations","text":"There are three main types of transformations in Koheesio:
-
Transformation
: This is the base class for all transformations. It takes a DataFrame as input and returns a DataFrame as output. The transformation logic is implemented in the execute
method.
-
ColumnsTransformation
: This is an extended Transformation
class with a preset validator for handling column(s) data. It standardizes the input for a single column or multiple columns. If more than one column is passed, the transformation will be run in a loop against all the given columns.
-
ColumnsTransformationWithTarget
: This is an extended ColumnsTransformation
class with an additional target_column
field. This field can be used to store the result of the transformation in a new column. If the target_column
is not provided, the result will be stored in the source column.
Each type of transformation has its own use cases and advantages. The right one to use depends on the specific requirements of your data pipeline.
"},{"location":"reference/spark/transformations.html#how-to-define-a-transformation","title":"How to Define a Transformation","text":"To define a Transformation
, you create a subclass of the Transformation
class and implement the execute
method. The execute
method should take a DataFrame from self.input.df
, apply a transformation, and store the result in self.output.df
.
Transformation
classes abstract away some of the details of transforming data, allowing you to focus on the logic of your pipeline. This can make your code cleaner and easier to read.
Here's an example of a Transformation
:
class MyTransformation(Transformation):\n def execute(self):\n # get data from self.input.df\n data = self.input.df\n # apply transformation\n transformed_data = apply_transformation(data)\n # store result in self.output.df\n self.output.df = transformed_data\n
In this example, MyTransformation
is a subclass of Transformation
that you've defined. The execute
method gets the data from self.input.df
, applies a transformation called apply_transformation
(undefined in this example), and stores the result in self.output.df
.
"},{"location":"reference/spark/transformations.html#how-to-define-a-columnstransformation","title":"How to Define a ColumnsTransformation","text":"To define a ColumnsTransformation
, you create a subclass of the ColumnsTransformation
class and implement the execute
method. The execute
method should apply a transformation to the specified columns of the DataFrame.
ColumnsTransformation
classes can be easily swapped out for different data transformations without changing the rest of your pipeline. This can make your pipeline more flexible and easier to modify or extend.
Here's an example of a ColumnsTransformation
:
from pyspark.sql import functions as f\nfrom koheesio.steps.transformations import ColumnsTransformation\n\nclass AddOne(ColumnsTransformation):\n def execute(self):\n for column in self.get_columns():\n self.output.df = self.df.withColumn(column, f.col(column) + 1)\n
In this example, AddOne
is a subclass of ColumnsTransformation
that you've defined. The execute
method adds 1 to each column in self.get_columns()
.
The ColumnsTransformation
class has a ColumnConfig
class that can be used to configure the behavior of the class. This class has the following fields:
run_for_all_data_type
: Runs the transformation for all columns of a given type. limit_data_type
: Limits the transformation to a specific data type. data_type_strict_mode
: Toggles strict mode for data type validation. Will only work if limit_data_type
is set.
Note that data types need to be specified as a SparkDatatype
enum. Users should not have to interact with the ColumnConfig
class directly.
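For authors of a transformation subclass, a hedged sketch of how these fields might be set is shown below. It assumes ColumnConfig can be overridden as a nested class and that the SparkDatatype enum is importable; the exact import path may differ between Koheesio versions:
from pyspark.sql import functions as f\nfrom koheesio.steps.transformations import ColumnsTransformation\nfrom koheesio.spark.utils import SparkDatatype  # assumed import path; may differ per version\n\nclass AddOneToIntegers(ColumnsTransformation):\n    class ColumnConfig(ColumnsTransformation.ColumnConfig):\n        # run against all integer columns and reject other data types\n        run_for_all_data_type = [SparkDatatype.INTEGER]\n        limit_data_type = [SparkDatatype.INTEGER]\n        data_type_strict_mode = True\n\n    def execute(self):\n        for column in self.get_columns():\n            self.output.df = self.df.withColumn(column, f.col(column) + 1)\n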
"},{"location":"reference/spark/transformations.html#how-to-define-a-columnstransformationwithtarget","title":"How to Define a ColumnsTransformationWithTarget","text":"To define a ColumnsTransformationWithTarget
, you create a subclass of the ColumnsTransformationWithTarget
class and implement the func
method. The func
method should return the transformation that will be applied to the column(s). The execute
method, which is already preset, will use the get_columns_with_target
method to loop over all the columns and apply this function to transform the DataFrame.
Here's an example of a ColumnsTransformationWithTarget
:
from pyspark.sql import Column\nfrom koheesio.steps.transformations import ColumnsTransformationWithTarget\n\nclass AddOneWithTarget(ColumnsTransformationWithTarget):\n def func(self, col: Column):\n return col + 1\n
In this example, AddOneWithTarget
is a subclass of ColumnsTransformationWithTarget
that you've defined. The func
method adds 1 to the values of a given column.
The ColumnsTransformationWithTarget
class has an additional target_column
field. This field can be used to store the result of the transformation in a new column. If the target_column
is not provided, the result will be stored in the source column. If more than one column is passed, the target_column
will be used as a suffix. Leaving this blank will result in the original columns being renamed.
The ColumnsTransformationWithTarget
class also has a get_columns_with_target
method. This method returns an iterator of the columns and handles the target_column
as well.
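Building on the AddOneWithTarget example above, a usage sketch could look like the following. The df, column, and target_column parameter names follow the patterns used in the other examples on this page, but exact aliases may vary:
from pyspark.sql import Column, SparkSession\nfrom koheesio.steps.transformations import ColumnsTransformationWithTarget\n\nclass AddOneWithTarget(ColumnsTransformationWithTarget):\n    def func(self, col: Column):\n        return col + 1\n\nspark = SparkSession.builder.getOrCreate()\ndf = spark.createDataFrame([(18,), (21,)], [\"age\"])\n\n# add 1 to the age column and store the result in a new column named age_plus_one\noutput_df = AddOneWithTarget(df=df, column=\"age\", target_column=\"age_plus_one\").execute().df\noutput_df.show()\n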
"},{"location":"reference/spark/transformations.html#key-features-of-a-transformation","title":"Key Features of a Transformation","text":" -
Execute Method: The Transformation
class provides an execute
method to implement in your subclass. This method should take a DataFrame from self.input.df
, apply a transformation, and store the result in self.output.df
.
For ColumnsTransformation
and ColumnsTransformationWithTarget
, the execute
method is already implemented in the base class. Instead of overriding execute
, you implement a func
method in your subclass. This func
method should return the transformation to be applied to each column. The execute
method will then apply this func to each column in a loop.
-
DataFrame Property: The Transformation
class provides a df
property as a shorthand for accessing self.input.df
. This property ensures that the data is ready to be transformed, even if the execute
method hasn't been explicitly called. This is useful for 'early validation' of the input data.
-
SparkSession: Every Transformation
has a SparkSession
available as self.spark
. This is the currently active SparkSession
, which can be used to perform distributed data processing tasks.
-
Columns Property: The ColumnsTransformation
and ColumnsTransformationWithTarget
classes provide a columns
property. This property standardizes the input for a single column or multiple columns. If more than one column is passed, the transformation will be run in a loop against all the given columns.
-
Target Column Property: The ColumnsTransformationWithTarget
class provides a target_column
property. This field can be used to store the result of the transformation in a new column. If the target_column
is not provided, the result will be stored in the source column.
"},{"location":"reference/spark/transformations.html#examples-of-transformation-classes-in-koheesio","title":"Examples of Transformation Classes in Koheesio","text":"Koheesio provides a variety of Transformation
subclasses for transforming data in different ways. Here are some examples:
-
DataframeLookup
: This transformation joins two dataframes together based on a list of join mappings. It allows you to specify the join type and join hint, and it supports selecting specific target columns from the right dataframe.
Here's an example of how to use the DataframeLookup
transformation:
from pyspark.sql import SparkSession\nfrom koheesio.steps.transformations import DataframeLookup, JoinMapping, TargetColumn, JoinType\n\nspark = SparkSession.builder.getOrCreate()\nleft_df = spark.createDataFrame([(1, \"A\"), (2, \"B\")], [\"id\", \"value\"])\nright_df = spark.createDataFrame([(1, \"A\"), (3, \"C\")], [\"id\", \"value\"])\n\nlookup = DataframeLookup(\n df=left_df,\n other=right_df,\n on=JoinMapping(source_column=\"id\", other_column=\"id\"),\n targets=TargetColumn(target_column=\"value\", target_column_alias=\"right_value\"),\n how=JoinType.LEFT,\n)\n\noutput_df = lookup.execute().df\n
-
HashUUID5
: This transformation is a subclass of Transformation
and provides an interface to generate a UUID5 hash for each row in the DataFrame. The hash is generated based on the values of the specified source columns.
Here's an example of how to use the HashUUID5
transformation:
from pyspark.sql import SparkSession\nfrom koheesio.steps.transformations import HashUUID5\n\nspark = SparkSession.builder.getOrCreate()\ndf = spark.createDataFrame([(1, \"A\"), (2, \"B\")], [\"id\", \"value\"])\n\nhash_transform = HashUUID5(\n df=df,\n source_columns=[\"id\", \"value\"],\n target_column=\"hash\"\n)\n\noutput_df = hash_transform.execute().df\n
In this example, HashUUID5
is a subclass of Transformation
. After creating an instance of HashUUID5
, you call the execute
method to apply the transformation. The execute
method generates a UUID5 hash for each row in the DataFrame based on the values of the id
and value
columns and stores the result in a new column named hash
.
"},{"location":"reference/spark/transformations.html#benefits-of-using-koheesio-transformations","title":"Benefits of using Koheesio Transformations","text":"Using a Koheesio Transformation
over plain Spark provides several benefits:
-
Consistency: By using Transformation
classes, you ensure that data is transformed in a consistent manner across different parts of your pipeline. This can help avoid errors and make your code easier to understand and maintain.
-
Abstraction: Transformation
classes abstract away the details of transforming data, allowing you to focus on the logic of your pipeline. This can make your code cleaner and easier to read.
-
Flexibility: Transformation
classes can be easily swapped out for different data transformations without changing the rest of your pipeline. This can make your pipeline more flexible and easier to modify or extend.
-
Early Input Validation: As a Transformation
is a type of SparkStep
, which in turn is a Step
and a type of Pydantic BaseModel
, all inputs are validated when an instance of a Transformation
class is created. This early validation helps catch errors related to invalid input, such as an invalid column name, before the PySpark pipeline starts executing. This can help avoid unnecessary computation and make your data pipelines more robust and reliable.
-
Ease of Testing: Transformation
classes are designed to be easily testable. This can make it easier to write unit tests for your data pipeline, helping to ensure its correctness and reliability.
-
Robustness: Koheesio has been extensively tested with hundreds of unit tests, ensuring that the Transformation
classes work as expected under a wide range of conditions. This makes your data pipelines more robust and less likely to fail due to unexpected inputs or edge cases.
By using the concept of a Transformation
, you can create data pipelines that are simple, consistent, flexible, efficient, and reliable.
"},{"location":"reference/spark/transformations.html#advanced-usage-of-transformations","title":"Advanced Usage of Transformations","text":"Transformations can be combined and chained together to create complex data processing pipelines. Here's an example of how to chain transformations:
from pyspark.sql import SparkSession\nfrom koheesio.steps.transformations import HashUUID5\nfrom koheesio.steps.transformations import DataframeLookup, JoinMapping, TargetColumn, JoinType\n\n# Create a SparkSession\nspark = SparkSession.builder.getOrCreate()\n\n# Define two DataFrames\ndf1 = spark.createDataFrame([(1, \"A\"), (2, \"B\")], [\"id\", \"value\"])\ndf2 = spark.createDataFrame([(1, \"C\"), (3, \"D\")], [\"id\", \"value\"])\n\n# Define the first transformation\nlookup = DataframeLookup(\n other=df2,\n on=JoinMapping(source_column=\"id\", other_column=\"id\"),\n targets=TargetColumn(target_column=\"value\", target_column_alias=\"right_value\"),\n how=JoinType.LEFT,\n)\n\n# Apply the first transformation\noutput_df = lookup.transform(df1)\n\n# Define the second transformation\nhash_transform = HashUUID5(\n source_columns=[\"id\", \"value\", \"right_value\"],\n target_column=\"hash\"\n)\n\n# Apply the second transformation\noutput_df2 = hash_transform.transform(output_df)\n
In this example, DataframeLookup
is a subclass of ColumnsTransformation
and HashUUID5
is a subclass of Transformation
. After creating instances of DataframeLookup
and HashUUID5
, you call the transform
method to apply each transformation. The transform
method of DataframeLookup
performs a left join with df2
on the id
column and adds the value
column from df2
to the result DataFrame as right_value
. The transform
method of HashUUID5
generates a UUID5 hash for each row in the DataFrame based on the values of the id
, value
, and right_value
columns and stores the result in a new column named hash
.
"},{"location":"reference/spark/transformations.html#troubleshooting-transformations","title":"Troubleshooting Transformations","text":"If you encounter an error when using a transformation, here are some steps you can take to troubleshoot:
-
Check the Input Data: Make sure the input DataFrame to the transformation is correct. You can use the show
method of the DataFrame to print the first few rows of the DataFrame.
-
Check the Transformation Parameters: Make sure the parameters passed to the transformation are correct. For example, if you're using a DataframeLookup
, make sure the join mappings and target columns are correctly specified.
-
Check the Transformation Logic: If the input data and parameters are correct, there might be an issue with the transformation logic. You can use PySpark's logging utilities to log intermediate results and debug the transformation logic.
-
Check the Output Data: If the transformation executes without errors but the output data is not as expected, you can use the show
method of the DataFrame to print the first few rows of the output DataFrame. This can help you identify any issues with the transformation logic.
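For example, the checks above can often be as simple as a few lines of standard PySpark around the transformation, where input_df and my_transformation are placeholders for your own objects:
# inspect the input before running the transformation\ninput_df.show(5, truncate=False)\ninput_df.printSchema()\n\n# run the transformation and inspect the output\noutput_df = my_transformation.transform(input_df)\noutput_df.show(5, truncate=False)\n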
"},{"location":"reference/spark/transformations.html#conclusion","title":"Conclusion","text":"The Transformation
module in Koheesio provides a powerful and flexible way to transform data in a DataFrame. By using Transformation
classes, you can create data pipelines that are simple, consistent, flexible, efficient, and reliable. Whether you're performing simple transformations like adding a new column, or complex transformations like joining multiple DataFrames, the Transformation
module has you covered.
"},{"location":"reference/spark/writers.html","title":"Writer Module","text":"The Writer
module in Koheesio provides a set of classes for writing data to various destinations. A Writer
is a type of SparkStep
that takes data from self.input.df
and writes it to a destination based on the output parameters.
"},{"location":"reference/spark/writers.html#what-is-a-writer","title":"What is a Writer?","text":"A Writer
is a subclass of SparkStep
that writes data to a destination. The data to be written is taken from a DataFrame, which is accessible through the df
property of the Writer
.
"},{"location":"reference/spark/writers.html#how-to-define-a-writer","title":"How to Define a Writer?","text":"To define a Writer
, you create a subclass of the Writer
class and implement the execute
method. The execute
method should take data from self.input.df
and write it to the destination.
Here's an example of a Writer
:
class MyWriter(Writer):\n def execute(self):\n # get data from self.input.df\n data = self.input.df\n # write data to destination\n write_to_destination(data)\n
"},{"location":"reference/spark/writers.html#key-features-of-a-writer","title":"Key Features of a Writer","text":" -
Write Method: The Writer
class provides a write
method that calls the execute
method and writes the data to the destination. Essentially, calling .write()
is a shorthand for calling .execute()
, which performs the actual write. This allows you to use a Writer
without having to call the execute
method directly. This is a convenience method that simplifies the usage of a Writer
.
Here's an example of how to use the .write()
method:
# Instantiate the Writer\nmy_writer = MyWriter()\n\n# Use the .write() method to write the data\nmy_writer.write()\n\n# The data from MyWriter's DataFrame is now written to the destination\n
In this example, MyWriter
is a subclass of Writer
that you've defined. After creating an instance of MyWriter
, you call the .write()
method to write the data to the destination. The data from MyWriter
's DataFrame is now written to the destination.
-
DataFrame Property: The Writer
class provides a df
property as a shorthand for accessing self.input.df
. This property ensures that the data is ready to be written, even if the execute
method hasn't been explicitly called.
Here's an example of how to use the df
property:
# Instantiate the Writer\nmy_writer = MyWriter()\n\n# Use the df property to get the data as a DataFrame\ndf = my_writer.df\n\n# Now df is a DataFrame with the data that will be written by MyWriter\n
In this example, MyWriter
is a subclass of Writer
that you've defined. After creating an instance of MyWriter
, you access the df
property to get the data as a DataFrame. The DataFrame df
now contains the data that will be written by MyWriter
.
-
SparkSession: Every Writer
has a SparkSession
available as self.spark
. This is the currently active SparkSession
, which can be used to perform distributed data processing tasks.
Here's an example of how to use the spark
property:
# Instantiate the Writer\nmy_writer = MyWriter()\n\n# Use the spark property to get the SparkSession\nspark = my_writer.spark\n\n# Now spark is the SparkSession associated with MyWriter\n
In this example, MyWriter
is a subclass of Writer
that you've defined. After creating an instance of MyWriter
, you access the spark
property to get the SparkSession
. The SparkSession
spark
can now be used to perform distributed data processing tasks.
"},{"location":"reference/spark/writers.html#understanding-inheritance-in-writers","title":"Understanding Inheritance in Writers","text":"Just like a Step
, a Writer
is defined as a subclass that inherits from the base Writer
class. This means it inherits all the properties and methods from the Writer
class and can add or override them as needed. The main method that needs to be overridden is the execute
method, which should implement the logic for writing data from self.input.df
to the destination.
"},{"location":"reference/spark/writers.html#examples-of-writer-classes-in-koheesio","title":"Examples of Writer Classes in Koheesio","text":"Koheesio provides a variety of Writer
subclasses for writing data to different destinations. Here are just a few examples:
BoxFileWriter
DeltaTableStreamWriter
DeltaTableWriter
DummyWriter
ForEachBatchStreamWriter
KafkaWriter
SnowflakeWriter
StreamWriter
Please note that this is not an exhaustive list. Koheesio provides many more Writer
subclasses for a wide range of data destinations. For a complete list, please refer to the Koheesio documentation or the source code.
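As an illustration only (not one of the built-in writers listed above), a minimal Parquet writer could be built on the base Writer contract; the ParquetFileWriter name and its path and mode fields are assumptions for this sketch, while the import path matches the testing examples elsewhere in these docs:
from koheesio.steps.writers import Writer\n\nclass ParquetFileWriter(Writer):\n    # illustrative fields describing where and how to write\n    path: str\n    mode: str = \"overwrite\"\n\n    def execute(self):\n        # self.df is the documented shorthand for self.input.df\n        self.df.write.mode(self.mode).parquet(self.path)\n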
"},{"location":"reference/spark/writers.html#benefits-of-using-writers-in-data-pipelines","title":"Benefits of Using Writers in Data Pipelines","text":"Using Writer
classes in your data pipelines has several benefits:
- Simplicity: Writers abstract away the details of writing data to various destinations, allowing you to focus on the logic of your pipeline.
- Consistency: By using Writers, you ensure that data is written in a consistent manner across different parts of your pipeline.
- Flexibility: Writers can be easily swapped out for different data destinations without changing the rest of your pipeline.
- Efficiency: Writers automatically manage resources like connections and file handles, ensuring efficient use of resources.
- Early Input Validation: As a
Writer
is a type of SparkStep
, which in turn is a Step
and a type of Pydantic BaseModel
, all inputs are validated when an instance of a Writer
class is created. This early validation helps catch errors related to invalid input, such as an invalid URL for a database, before the PySpark pipeline starts executing. This can help avoid unnecessary computation and make your data pipelines more robust and reliable.
By using the concept of a Writer
, you can create data pipelines that are simple, consistent, flexible, efficient, and reliable.
"},{"location":"tutorials/advanced-data-processing.html","title":"Advanced Data Processing with Koheesio","text":"In this guide, we will explore some advanced data processing techniques using Koheesio. We will cover topics such as complex transformations, handling large datasets, and optimizing performance.
"},{"location":"tutorials/advanced-data-processing.html#complex-transformations","title":"Complex Transformations","text":"Koheesio provides a variety of built-in transformations, but sometimes you may need to perform more complex operations on your data. In such cases, you can create custom transformations.
Here's an example of a custom transformation that normalizes a column in a DataFrame:
from pyspark.sql import DataFrame\nfrom koheesio.spark.transformations.transform import Transform\n\ndef normalize_column(df: DataFrame, column: str) -> DataFrame:\n max_value = df.agg({column: \"max\"}).collect()[0][0]\n min_value = df.agg({column: \"min\"}).collect()[0][0]\n return df.withColumn(column, (df[column] - min_value) / (max_value - min_value))\n\n\nclass NormalizeColumnTransform(Transform):\n column: str\n\n def transform(self, df: DataFrame) -> DataFrame:\n return normalize_column(df, self.column)\n
"},{"location":"tutorials/advanced-data-processing.html#handling-large-datasets","title":"Handling Large Datasets","text":"When working with large datasets, it's important to manage resources effectively to ensure good performance. Koheesio provides several features to help with this.
"},{"location":"tutorials/advanced-data-processing.html#partitioning","title":"Partitioning","text":"Partitioning is a technique that divides your data into smaller, more manageable pieces, called partitions. Koheesio allows you to specify the partitioning scheme for your data when writing it to a target.
from koheesio.steps.writers.delta import DeltaTableWriter\nfrom koheesio.tasks.etl_task import EtlTask\n\nclass MyTask(EtlTask):\n target = DeltaTableWriter(table=\"my_table\", partitionBy=[\"column1\", \"column2\"])\n
"},{"location":"tutorials/getting-started.html","title":"Getting Started with Koheesio","text":""},{"location":"tutorials/getting-started.html#requirements","title":"Requirements","text":" - Python 3.9+
"},{"location":"tutorials/getting-started.html#installation","title":"Installation","text":""},{"location":"tutorials/getting-started.html#poetry","title":"Poetry","text":"If you're using Poetry, add the following entry to the pyproject.toml
file:
pyproject.toml[[tool.poetry.source]]\nname = \"nike\"\nurl = \"https://artifactory.nike.com/artifactory/api/pypi/python-virtual/simple\"\nsecondary = true\n
poetry add koheesio\n
"},{"location":"tutorials/getting-started.html#pip","title":"pip","text":"If you're using pip, run the following command to install Koheesio:
Requires pip.
pip install koheesio\n
"},{"location":"tutorials/getting-started.html#basic-usage","title":"Basic Usage","text":"Once you've installed Koheesio, you can start using it in your Python scripts. Here's a basic example:
from koheesio import Step\n\n# Define a step\nclass MyStep(Step):\n    def execute(self):\n        # Your step logic here\n        ...\n\n# Create an instance of the step\nstep = MyStep()\n\n# Run the step\nstep.execute()\n
"},{"location":"tutorials/getting-started.html#advanced-usage","title":"Advanced Usage","text":"from pyspark.sql.functions import lit\nfrom pyspark.sql import DataFrame, SparkSession\n\n# Step 1: import Koheesio dependencies\nfrom koheesio.context import Context\nfrom koheesio.steps.readers.dummy import DummyReader\nfrom koheesio.steps.transformations.camel_to_snake import CamelToSnakeTransformation\nfrom koheesio.steps.writers.dummy import DummyWriter\nfrom koheesio.tasks.etl_task import EtlTask\n\n# Step 2: Set up a SparkSession\nspark = SparkSession.builder.getOrCreate()\n\n# Step 3: Configure your Context\ncontext = Context({\n \"source\": DummyReader(),\n \"transformations\": [CamelToSnakeTransformation()],\n \"target\": DummyWriter(),\n \"my_favorite_movie\": \"inception\",\n})\n\n# Step 4: Create a Task\nclass MyFavoriteMovieTask(EtlTask):\n my_favorite_movie: str\n\n def transform(self, df: DataFrame = None) -> DataFrame:\n df = df.withColumn(\"MyFavoriteMovie\", lit(self.my_favorite_movie))\n return super().transform(df)\n\n# Step 5: Run your Task\ntask = MyFavoriteMovieTask(**context)\ntask.run()\n
"},{"location":"tutorials/getting-started.html#contributing","title":"Contributing","text":"If you want to contribute to Koheesio, check out the CONTRIBUTING.md
file in this repository. It contains guidelines for contributing, including how to submit issues and pull requests.
"},{"location":"tutorials/getting-started.html#testing","title":"Testing","text":"To run the tests for Koheesio, use the following command:
make dev-test\n
This will run all the tests in the tests
directory.
"},{"location":"tutorials/hello-world.html","title":"Simple Examples","text":""},{"location":"tutorials/hello-world.html#creating-a-custom-step","title":"Creating a Custom Step","text":"This example demonstrates how to use the SparkStep
class from the koheesio
library to create a custom step named HelloWorldStep
.
"},{"location":"tutorials/hello-world.html#code","title":"Code","text":"from koheesio.steps.step import SparkStep\n\nclass HelloWorldStep(SparkStep):\n message: str\n\n def execute(self) -> SparkStep.Output:\n # create a DataFrame with a single row containing the message\n self.output.df = self.spark.createDataFrame([(1, self.message)], [\"id\", \"message\"])\n
"},{"location":"tutorials/hello-world.html#usage","title":"Usage","text":"hello_world_step = HelloWorldStep(message=\"Hello, World!\")\nhello_world_step.execute()\n\nhello_world_step.output.df.show()\n
"},{"location":"tutorials/hello-world.html#understanding-the-code","title":"Understanding the Code","text":"The HelloWorldStep
class is a SparkStep
in Koheesio, designed to generate a DataFrame with a single row containing a custom message. Here's a more detailed overview:
HelloWorldStep
inherits from SparkStep
, a fundamental building block in Koheesio for creating data processing steps with Apache Spark. - It has a
message
attribute. When creating an instance of HelloWorldStep
, you can pass a custom message that will be used in the DataFrame. SparkStep
has a spark
attribute, which is the active SparkSession. This is the entry point for any Spark functionality, allowing the step to interact with the Spark cluster. SparkStep
also includes an Output
class, used to store the output of the step. In this case, Output
has a df
attribute to store the output DataFrame. - The
execute
method creates a DataFrame with the custom message and stores it in output.df
. It doesn't return a value explicitly; instead, the output DataFrame can be accessed via output.df
. - Koheesio uses pydantic for automatic validation of the step's input and output, ensuring they are correctly defined and of the correct types.
Note: Pydantic is a data validation library that provides a way to validate that the data (in this case, the input and output of the step) conforms to the expected format.
"},{"location":"tutorials/hello-world.html#creating-a-custom-task","title":"Creating a Custom Task","text":"This example demonstrates how to use the EtlTask
from the koheesio
library to create a custom task named MyFavoriteMovieTask
.
"},{"location":"tutorials/hello-world.html#code_1","title":"Code","text":"from typing import Any\nfrom pyspark.sql import DataFrame, functions as f\nfrom koheesio.steps.transformations import Transform\nfrom koheesio.tasks.etl_task import EtlTask\n\n\ndef add_column(df: DataFrame, target_column: str, value: Any):\n return df.withColumn(target_column, f.lit(value))\n\n\nclass MyFavoriteMovieTask(EtlTask):\n my_favorite_movie: str\n\n def transform(self, df: Optional[DataFrame] = None) -> DataFrame:\n df = df or self.extract()\n\n # pre-transformations specific to this class\n pre_transformations = [\n Transform(add_column, target_column=\"myFavoriteMovie\", value=self.my_favorite_movie)\n ]\n\n # execute transformations one by one\n for t in pre_transformations:\n df = t.transform(df)\n\n self.output.transform_df = df\n return df\n
"},{"location":"tutorials/hello-world.html#configuration","title":"Configuration","text":"Here is the sample.yaml
configuration file used in this example:
raw_layer:\n catalog: development\n schema: my_favorite_team\n table: some_random_table\nmovies:\n favorite: Office Space\nhash_settings:\n source_columns:\n - id\n - foo\n target_column: hash_uuid5\nsource:\n range: 4\n
"},{"location":"tutorials/hello-world.html#usage_1","title":"Usage","text":"from pyspark.sql import SparkSession\nfrom koheesio.context import Context\nfrom koheesio.steps.readers import DummyReader\nfrom koheesio.steps.writers.dummy import DummyWriter\n\ncontext = Context.from_yaml(\"sample.yaml\")\n\nSparkSession.builder.getOrCreate()\n\nmy_fav_mov_task = MyFavoriteMovieTask(\n source=DummyReader(**context.raw_layer),\n target=DummyWriter(truncate=False),\n my_favorite_movie=context.movies.favorite,\n)\nmy_fav_mov_task.execute()\n
"},{"location":"tutorials/hello-world.html#understanding-the-code_1","title":"Understanding the Code","text":"This example creates a MyFavoriteMovieTask
that adds a column named myFavoriteMovie
to the DataFrame. The value for this column is provided when the task is instantiated.
The MyFavoriteMovieTask
class is a custom task that extends the EtlTask
from the koheesio
library. It demonstrates how to add a custom transformation to a DataFrame. Here's a detailed breakdown:
-
MyFavoriteMovieTask
inherits from EtlTask
, a base class in Koheesio for creating Extract-Transform-Load (ETL) tasks with Apache Spark.
-
It has a my_favorite_movie
attribute. When creating an instance of MyFavoriteMovieTask
, you can pass a custom movie title that will be used in the DataFrame.
-
The transform
method is where the main logic of the task is implemented. It first extracts the data (if not already provided), then applies a series of transformations to the DataFrame.
-
In this case, the transformation is adding a new column to the DataFrame named myFavoriteMovie
, with the value set to the my_favorite_movie
attribute. This is done using the add_column
function and the Transform
class from Koheesio.
-
The transformed DataFrame is then stored in self.output.transform_df
.
-
The sample.yaml
configuration file is used to provide the context for the task, including the source data and the favorite movie title.
-
In the usage example, an instance of MyFavoriteMovieTask
is created with a DummyReader
as the source, a DummyWriter
as the target, and the favorite movie title from the context. The task is then executed, which runs the transformations and stores the result in self.output.transform_df
.
"},{"location":"tutorials/learn-koheesio.html","title":"Learn Koheesio","text":"Koheesio is designed to simplify the development of data engineering pipelines. It provides a structured way to define and execute data processing tasks, making it easier to build, test, and maintain complex data workflows.
"},{"location":"tutorials/learn-koheesio.html#core-concepts","title":"Core Concepts","text":"Koheesio is built around several core concepts:
- Step: The fundamental unit of work in Koheesio. It represents a single operation in a data pipeline, taking in inputs and producing outputs.
See the Step documentation for more information.
- Context: A configuration class used to set up the environment for a Task. It can be used to share variables across tasks and adapt the behavior of a Task based on its environment.
See the Context documentation for more information.
- Logger: A class for logging messages at different levels.
See the Logger documentation for more information.
The Logger and Context classes provide support, enabling detailed logging of the pipeline's execution and customization of the pipeline's behavior based on the environment, respectively.
"},{"location":"tutorials/learn-koheesio.html#implementations","title":"Implementations","text":"In the context of Koheesio, an implementation refers to a specific way of executing Steps, the fundamental units of work in Koheesio. Each implementation uses a different technology or approach to process data along with its own set of Steps, designed to work with the specific technology or approach used by the implementation.
For example, the Spark implementation includes Steps for reading data from a Spark DataFrame, transforming the data using Spark operations, and writing the data to a Spark-supported destination.
Currently, Koheesio supports two implementations: Spark, and AsyncIO.
"},{"location":"tutorials/learn-koheesio.html#spark","title":"Spark","text":"Requires: Apache Spark (pyspark) Installation: pip install koheesio[spark]
Module: koheesio.spark
This implementation uses Apache Spark, a powerful open-source unified analytics engine for large-scale data processing.
Steps that use this implementation can leverage Spark's capabilities for distributed data processing, making it suitable for handling large volumes of data. The Spark implementation includes the following types of Steps:
-
Reader: from koheesio.spark.readers import Reader
A type of Step that reads data from a source and stores the result (to make it available for subsequent steps). For more information, see the Reader documentation.
-
Writer: from koheesio.spark.writers import Writer
This controls how data is written to the output in both batch and streaming contexts. For more information, see the Writer documentation.
-
Transformation: from koheesio.spark.transformations import Transformation
A type of Step that takes a DataFrame as input and returns a DataFrame as output. For more information, see the Transformation documentation.
In any given pipeline, you can expect to use Readers, Writers, and Transformations to express the ETL logic. Readers are responsible for extracting data from various sources, such as databases, files, or APIs. Transformations then process this data, performing operations like filtering, aggregation, or conversion. Finally, Writers handle the loading of the transformed data to the desired destination, which could be a database, a file, or a data stream.
"},{"location":"tutorials/learn-koheesio.html#async","title":"Async","text":"Module: koheesio.asyncio
This implementation uses Python's asyncio library for writing single-threaded concurrent code using coroutines, multiplexing I/O access over sockets and other resources, running network clients and servers, and other related primitives. Steps that use this implementation can perform data processing tasks asynchronously, which can be beneficial for IO-bound tasks.
"},{"location":"tutorials/learn-koheesio.html#best-practices","title":"Best Practices","text":"Here are some best practices for using Koheesio:
-
Use Context: The Context
class in Koheesio is designed to behave like a dictionary, but with added features. It's a good practice to use Context
to customize the behavior of a task. This allows you to share variables across tasks and adapt the behavior of a task based on its environment; for example, by changing the source or target of the data between development and production environments.
-
Modular Design: Each step in the pipeline (reading, transformation, writing) should be encapsulated in its own class, making the code easier to understand and maintain. This also promotes re-usability as steps can be reused across different tasks.
-
Error Handling: Koheesio provides a consistent way to handle errors and exceptions in data processing tasks. Make sure to leverage this feature to make your pipelines robust and fault-tolerant.
-
Logging: Use the built-in logging feature in Koheesio to log information and errors in data processing tasks. This can be very helpful for debugging and monitoring the pipeline. Koheesio sets the log level to WARNING
by default, but you can change it to INFO
or DEBUG
as needed.
-
Testing: Each step can be tested independently, making it easier to write unit tests. It's a good practice to write tests for your steps to ensure they are working as expected.
-
Use Transformations: The Transform
class in Koheesio allows you to define transformations on your data. It's a good practice to encapsulate your transformation logic in Transform
classes for better readability and maintainability.
-
Consistent Structure: Koheesio enforces a consistent structure for data processing tasks. Stick to this structure to make your codebase easier to understand for new developers.
-
Use Readers and Writers: Use the built-in Reader
and Writer
classes in Koheesio to handle data extraction and loading. This not only simplifies your code but also makes it more robust and efficient.
Remember, these are general best practices and might need to be adapted based on your specific use case and requirements.
"},{"location":"tutorials/learn-koheesio.html#pydantic","title":"Pydantic","text":"Koheesio Steps are Pydantic models, which means they can be validated and serialized. This makes it easy to define the inputs and outputs of a Step, and to validate them before running the Step. Pydantic models also provide a consistent way to define the schema of the data that a Step expects and produces, making it easier to understand and maintain the code.
Learn more about Pydantic here.
"},{"location":"tutorials/onboarding.html","title":"Onboarding","text":"tags: - doctype/how-to
"},{"location":"tutorials/onboarding.html#onboarding-to-koheesio","title":"Onboarding to Koheesio","text":"Koheesio is a Python library that simplifies the development of data engineering pipelines. It provides a structured way to define and execute data processing tasks, making it easier to build, test, and maintain complex data workflows.
This guide will walk you through the process of transforming a traditional Spark application into a Koheesio pipeline along with explaining the advantages of using Koheesio over raw Spark.
"},{"location":"tutorials/onboarding.html#traditional-spark-application","title":"Traditional Spark Application","text":"First let's create a simple Spark application that you might use to process data.
The following Spark application reads a CSV file, performs a transformation, and writes the result to a Delta table. The transformation includes filtering data where age is greater than 18 and performing an aggregation to calculate the average salary per country. The result is then written to a Delta table partitioned by country.
from pyspark.sql import SparkSession\nfrom pyspark.sql.functions import col, avg\n\nspark = SparkSession.builder.getOrCreate()\n\n# Read data from CSV file\ndf = spark.read.csv(\"input.csv\", header=True, inferSchema=True)\n\n# Filter data where age is greater than 18\ndf = df.filter(col(\"age\") > 18)\n\n# Perform aggregation\ndf = df.groupBy(\"country\").agg(avg(\"salary\").alias(\"average_salary\"))\n\n# Write data to Delta table with partitioning\ndf.write.format(\"delta\").partitionBy(\"country\").save(\"/path/to/delta_table\")\n
"},{"location":"tutorials/onboarding.html#transforming-to-koheesio","title":"Transforming to Koheesio","text":"The same pipeline can be rewritten using Koheesio's EtlTask
. In this version, each step (reading, transformations, writing) is encapsulated in its own class, making the code easier to understand and maintain.
First, a CsvReader
is defined to read the input CSV file. Then, a DeltaTableWriter
is defined to write the result to a Delta table partitioned by country.
Two transformations are defined: 1. one to filter data where age is greater than 18 2. and, another to calculate the average salary per country.
These transformations are then passed to an EtlTask
along with the reader and writer. Finally, the EtlTask
is executed to run the pipeline.
from koheesio.spark.etl_task import EtlTask\nfrom koheesio.spark.readers.file_loader import CsvReader\nfrom koheesio.spark.writers.delta.batch import DeltaTableWriter\nfrom koheesio.spark.transformations.transform import Transform\nfrom pyspark.sql.functions import col, avg\n\n# Define reader\nreader = CsvReader(path=\"input.csv\", header=True, inferSchema=True)\n\n# Define writer\nwriter = DeltaTableWriter(table=\"delta_table\", partition_by=[\"country\"])\n\n# Define transformations\nage_transformation = Transform(\n func=lambda df: df.filter(col(\"age\") > 18)\n)\navg_salary_per_country = Transform(\n func=lambda df: df.groupBy(\"country\").agg(avg(\"salary\").alias(\"average_salary\"))\n)\n\n# Define and execute EtlTask\ntask = EtlTask(\n source=reader, \n target=writer, \n transformations=[\n age_transformation,\n avg_salary_per_country\n ]\n)\ntask.execute()\n
This approach with Koheesio provides several advantages. It makes the code more modular and easier to test. Each step can be tested independently and reused across different tasks. It also makes the pipeline more readable and easier to maintain."},{"location":"tutorials/onboarding.html#advantages-of-koheesio","title":"Advantages of Koheesio","text":"Using Koheesio instead of raw Spark has several advantages:
- Modularity: Each step in the pipeline (reading, transformation, writing) is encapsulated in its own class, making the code easier to understand and maintain.
- Reusability: Steps can be reused across different tasks, reducing code duplication.
- Testability: Each step can be tested independently, making it easier to write unit tests.
- Flexibility: The behavior of a task can be customized using a
Context
class. - Consistency: Koheesio enforces a consistent structure for data processing tasks, making it easier for new developers to understand the codebase.
- Error Handling: Koheesio provides a consistent way to handle errors and exceptions in data processing tasks.
- Logging: Koheesio provides a consistent way to log information and errors in data processing tasks.
In contrast, using the plain PySpark API for transformations can lead to more verbose and less structured code, which can be harder to understand, maintain, and test. It also doesn't provide the same level of error handling, logging, and flexibility as the Koheesio Transform class.
"},{"location":"tutorials/onboarding.html#using-a-context-class","title":"Using a Context Class","text":"Here's a simple example of how to use a Context
class to customize the behavior of a task. The Context class in Koheesio is designed to behave like a dictionary, but with added features.
from koheesio import Context\nfrom koheesio.spark.etl_task import EtlTask\nfrom koheesio.spark.readers.file_loader import CsvReader\nfrom koheesio.spark.writers.delta import DeltaTableWriter\nfrom koheesio.spark.transformations.transform import Transform\n\ncontext = Context({ # this could be stored in a JSON or YAML\n \"age_threshold\": 18,\n \"reader_options\": {\n \"path\": \"input.csv\",\n \"header\": True,\n \"inferSchema\": True\n },\n \"writer_options\": {\n \"table\": \"delta_table\",\n \"partition_by\": [\"country\"]\n }\n})\n\ntask = EtlTask(\n source = CsvReader(**context.reader_options),\n target = DeltaTableWriter(**context.writer_options),\n transformations = [\n Transform(func=lambda df: df.filter(df[\"age\"] > context.age_threshold))\n ]\n)\n\ntask.execute()\n
In this example, we're using CsvReader
to read the input data, DeltaTableWriter
to write the output data, and a Transform
step to filter the data based on the age threshold. The options for the reader and writer are stored in a Context
object, which can be easily updated or loaded from a JSON or YAML file.
"},{"location":"tutorials/testing-koheesio-steps.html","title":"Testing Koheesio Tasks","text":"Testing is a crucial part of any software development process. Koheesio provides a structured way to define and execute data processing tasks, which makes it easier to build, test, and maintain complex data workflows. This guide will walk you through the process of testing Koheesio tasks.
"},{"location":"tutorials/testing-koheesio-steps.html#unit-testing","title":"Unit Testing","text":"Unit testing involves testing individual components of the software in isolation. In the context of Koheesio, this means testing individual tasks or steps.
Here's an example of how to unit test a Koheesio task:
from koheesio.tasks.etl_task import EtlTask\nfrom koheesio.steps.readers import DummyReader\nfrom koheesio.steps.writers.dummy import DummyWriter\nfrom koheesio.steps.transformations import Transform\nfrom pyspark.sql import SparkSession, DataFrame\nfrom pyspark.sql.functions import col\n\n\ndef filter_age(df: DataFrame) -> DataFrame:\n return df.filter(col(\"Age\") > 18)\n\n\ndef test_etl_task():\n # Initialize SparkSession\n spark = SparkSession.builder.getOrCreate()\n\n # Create a DataFrame for testing\n data = [(\"John\", 19), (\"Anna\", 20), (\"Tom\", 18)]\n df = spark.createDataFrame(data, [\"Name\", \"Age\"])\n\n # Define the task\n task = EtlTask(\n source=DummyReader(df=df),\n target=DummyWriter(),\n transformations=[\n Transform(filter_age)\n ]\n )\n\n # Execute the task\n task.execute()\n\n # Assert the result\n result_df = task.output.df\n assert result_df.count() == 2\n assert result_df.filter(\"Name == 'Tom'\").count() == 0\n
In this example, we're testing an EtlTask that reads data from a DataFrame, applies a filter transformation, and writes the result to another DataFrame. The test asserts that the task correctly filters out rows where the age is less than or equal to 18.
"},{"location":"tutorials/testing-koheesio-steps.html#integration-testing","title":"Integration Testing","text":"Integration testing involves testing the interactions between different components of the software. In the context of Koheesio, this means testing the entirety of data flowing through one or more tasks.
We'll create a simple test for a hypothetical EtlTask that uses DeltaReader and DeltaWriter. We'll use pytest and unittest.mock to mock the responses of the reader and writer. First, let's assume that you have an EtlTask defined in a module named my_module. This task reads data from a Delta table, applies some transformations, and writes the result to another Delta table.
Here's an example of how to write an integration test for this task:
# my_module.py\nfrom koheesio.tasks.etl_task import EtlTask\nfrom koheesio.spark.readers.delta import DeltaReader\nfrom koheesio.steps.writers.delta import DeltaWriter\nfrom koheesio.steps.transformations import Transform\nfrom koheesio.context import Context\nfrom pyspark.sql.functions import col\n\n\ndef filter_age(df):\n return df.filter(col(\"Age\") > 18)\n\n\ncontext = Context({\n \"reader_options\": {\n \"table\": \"input_table\"\n },\n \"writer_options\": {\n \"table\": \"output_table\"\n }\n})\n\ntask = EtlTask(\n source=DeltaReader(**context.reader_options),\n target=DeltaWriter(**context.writer_options),\n transformations=[\n Transform(filter_age)\n ]\n)\n
Now, let's create a test for this task. We'll patch the read and write methods of the reader and writer so that no real Delta tables are needed, and we'll use pytest fixtures to provide a SparkSession and a test DataFrame.
# test_my_module.py\nimport pytest\nfrom unittest.mock import patch\nfrom pyspark.sql import SparkSession\nfrom koheesio.steps.readers import Reader\nfrom koheesio.steps.writers import Writer\n\nfrom my_module import task\n\n@pytest.fixture(scope=\"module\")\ndef spark():\n return SparkSession.builder.getOrCreate()\n\n@pytest.fixture(scope=\"module\")\ndef test_df(spark):\n data = [(\"John\", 19), (\"Anna\", 20), (\"Tom\", 18)]\n return spark.createDataFrame(data, [\"Name\", \"Age\"])\n\ndef test_etl_task(spark, test_df):\n # Mock the read method of the Reader class so no input Delta table is needed\n with patch.object(Reader, \"read\", return_value=test_df) as mock_read:\n # Mock the write method of the Writer class so nothing is written out\n with patch.object(Writer, \"write\") as mock_write:\n # Execute the task\n task.execute()\n\n # Assert the result\n result_df = task.output.df\n assert result_df.count() == 2\n assert result_df.filter(\"Name == 'Tom'\").count() == 0\n\n # Assert that the reader and writer were each invoked exactly once\n mock_read.assert_called_once()\n mock_write.assert_called_once()\n
In this test, we're patching the read method of the reader to return a test DataFrame and the write method of the writer so that no Delta tables are touched, and we check that each method is invoked exactly once. We're also asserting that the task correctly filters out rows where the age is less than or equal to 18.
"},{"location":"misc/tags.html","title":"{{ page.title }}","text":""},{"location":"misc/tags.html#doctypeexplanation","title":"doctype/explanation","text":" - Approach documentation
"},{"location":"misc/tags.html#doctypehow-to","title":"doctype/how-to","text":" - How to
"}]}
\ No newline at end of file