
Feat/6 restructure module structure #9

Merged
merged 47 commits on May 24, 2024
47 commits
3f28915
fix: python classifier
mikita-sakalouski May 22, 2024
5eaa14f
fix: tests
mikita-sakalouski May 22, 2024
8725feb
fix: adjust tests
mikita-sakalouski May 22, 2024
ef7c991
fix: github actions
mikita-sakalouski May 22, 2024
58e4d63
fix: github action
mikita-sakalouski May 22, 2024
1d753d1
chore: Fix typo in GitHub Actions workflow file
mikita-sakalouski May 22, 2024
b96f7c1
chore: Remove unnecessary retry delay in test.yaml workflow
mikita-sakalouski May 22, 2024
176b7d3
feat: adjust action
mikita-sakalouski May 22, 2024
e99014c
chore: Update Python versions in test.yaml workflow
mikita-sakalouski May 22, 2024
28d167b
chore: Update Python versions in test.yaml workflow
mikita-sakalouski May 22, 2024
c03138a
fix: exclude envs
mikita-sakalouski May 22, 2024
28781cd
fix: test to fail
mikita-sakalouski May 22, 2024
5f1ae8e
fix: partialmethod in python>=3.11
mikita-sakalouski May 22, 2024
598ef9e
fix: pydantic type check
mikita-sakalouski May 22, 2024
a3a91d0
fix: pandas dependencies
mikita-sakalouski May 22, 2024
0b58a40
feat: add code owners file
mikita-sakalouski May 22, 2024
ed37d0c
refactor: apply format
mikita-sakalouski May 22, 2024
c86eeca
randomize tests
dannymeijer May 23, 2024
81c970e
Merge remote-tracking branch 'origin/main' into feat/5-switch-testing…
mikita-sakalouski May 23, 2024
9129de2
Merge branch 'feat/5-switch-testing-to-native-hatch-command' of perso…
mikita-sakalouski May 23, 2024
afd8b0c
feat: switch tests
mikita-sakalouski May 23, 2024
3612e58
version bump
dannymeijer May 23, 2024
46f0ce8
pathing fix
dannymeijer May 23, 2024
fa65f10
feat: add logo
mikita-sakalouski May 23, 2024
c648956
Move spark Steps to their own module
dannymeijer May 23, 2024
74f7acd
asyncio as it's own module: koheesio.asyncio
dannymeijer May 23, 2024
e79080a
revamped step module
dannymeijer May 23, 2024
0a3de26
moved koheesio to src folder
dannymeijer May 24, 2024
591e1ad
fix: tests and formats
mikita-sakalouski May 24, 2024
a49f270
version bump
dannymeijer May 23, 2024
803afe2
Move spark Steps to their own module
dannymeijer May 23, 2024
de970e9
asyncio as it's own module: koheesio.asyncio
dannymeijer May 23, 2024
2dcc2eb
revamped step module
dannymeijer May 23, 2024
825b488
moved koheesio to src folder
dannymeijer May 24, 2024
f66c9ec
fix: apply fmt
mikita-sakalouski May 24, 2024
b73ce3d
fix: mute warnings
mikita-sakalouski May 24, 2024
9efd2ab
fix: warnings filter in pytest
mikita-sakalouski May 24, 2024
b7a43ac
small update
dannymeijer May 24, 2024
36e1705
Merge branch 'feat/6-restructure-module-structure' of personal.github…
mikita-sakalouski May 24, 2024
f6a8708
refactor: apply fmt
mikita-sakalouski May 24, 2024
54c2548
attempting to fix merge conflicts
dannymeijer May 24, 2024
3755d00
chore: Update pytest warnings filter for Koheesio Snowflake
mikita-sakalouski May 24, 2024
449be70
refcator: appy fmt
mikita-sakalouski May 24, 2024
7b21fb0
fixes
dannymeijer May 24, 2024
8abc171
Merge remote-tracking branch 'origin/feat/6-restructure-module-struct…
dannymeijer May 24, 2024
48fe74c
small update
dannymeijer May 24, 2024
0499aca
small update
dannymeijer May 24, 2024
6 changes: 3 additions & 3 deletions docs/reference/concepts/transformations.md
@@ -161,13 +161,13 @@ Koheesio provides a variety of `Transformation` subclasses for transforming data
examples:

- `DataframeLookup`: This transformation joins two dataframes together based on a list of join mappings. It allows you
to specify the join type and join hint, and it supports selecting specific target columns from the right dataframe.
to specify the join type and join hint, and it supports selecting specific target columns from the right dataframe.

Here's an example of how to use the `DataframeLookup` transformation:

```python
from pyspark.sql import SparkSession
-from koheesio.steps.transformations.lookup import DataframeLookup, JoinMapping, TargetColumn, JoinType
+from koheesio.steps.transformations import DataframeLookup, JoinMapping, TargetColumn, JoinType

spark = SparkSession.builder.getOrCreate()
left_df = spark.createDataFrame([(1, "A"), (2, "B")], ["id", "value"])
@@ -246,7 +246,7 @@ how to chain transformations:
```python
from pyspark.sql import SparkSession
from koheesio.steps.transformations import HashUUID5
-from koheesio.steps.transformations.lookup import DataframeLookup, JoinMapping, TargetColumn, JoinType
+from koheesio.steps.transformations import DataframeLookup, JoinMapping, TargetColumn, JoinType

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()
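
Since this hunk re-exports the lookup classes from the package root, downstream code written against the pre-restructure layout will break on import. A minimal compatibility shim, not part of this PR, is sketched below; the assumption that the old submodule path still exists in older koheesio releases is mine, not the diff's.

```python
# Hedged sketch: prefer the new package-level import introduced by this PR,
# falling back to the pre-restructure submodule path for older installs.
try:
    from koheesio.steps.transformations import (
        DataframeLookup,
        JoinMapping,
        JoinType,
        TargetColumn,
    )
except ImportError:
    from koheesio.steps.transformations.lookup import (
        DataframeLookup,
        JoinMapping,
        JoinType,
        TargetColumn,
    )
```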
7 changes: 5 additions & 2 deletions docs/tutorials/advanced-data-processing.md
@@ -10,13 +10,15 @@ Here's an example of a custom transformation that normalizes a column in a DataFrame

```python
from pyspark.sql import DataFrame
-from koheesio.steps.transformations.transform import Transform
+from koheesio.steps.transformations import Transform


def normalize_column(df: DataFrame, column: str) -> DataFrame:
max_value = df.agg({column: "max"}).collect()[0][0]
min_value = df.agg({column: "min"}).collect()[0][0]
return df.withColumn(column, (df[column] - min_value) / (max_value - min_value))


class NormalizeColumnTransform(Transform):
column: str

@@ -45,7 +47,8 @@ Caching is another technique that can improve performance by storing the result
doesn't have to be recomputed each time it's used. You can use the cache method to cache the result of a transformation.

```python
-from koheesio.steps.transformations.cache import CacheTransformation
+from koheesio.steps.transformations import CacheTransformation


class MyTask(EtlTask):
transformations = [NormalizeColumnTransform(column="my_column"), CacheTransformation()]
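
The `normalize_column` helper in the hunk above is plain PySpark and can be sanity-checked on its own, independent of the (truncated) `NormalizeColumnTransform` wrapper. A minimal sketch with made-up data, purely for illustration:

```python
from pyspark.sql import DataFrame, SparkSession


def normalize_column(df: DataFrame, column: str) -> DataFrame:
    # Min-max scale the column into [0, 1], as in the tutorial snippet above.
    max_value = df.agg({column: "max"}).collect()[0][0]
    min_value = df.agg({column: "min"}).collect()[0][0]
    return df.withColumn(column, (df[column] - min_value) / (max_value - min_value))


spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10.0), (2, 20.0), (3, 30.0)], ["id", "score"])
normalize_column(df, "score").show()  # score column becomes 0.0, 0.5, 1.0
```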
6 changes: 4 additions & 2 deletions docs/tutorials/hello-world.md
@@ -52,12 +52,14 @@ This example demonstrates how to use the `EtlTask` from the `koheesio` library t
```python
from typing import Any
from pyspark.sql import DataFrame, functions as f
-from koheesio.steps.transformations.transform import Transform
+from koheesio.steps.transformations import Transform
from koheesio.tasks.etl_task import EtlTask


def add_column(df: DataFrame, target_column: str, value: Any):
return df.withColumn(target_column, f.lit(value))


class MyFavoriteMovieTask(EtlTask):
my_favorite_movie: str

@@ -102,7 +104,7 @@ source:
```python
from pyspark.sql import SparkSession
from koheesio.context import Context
-from koheesio.steps.readers.dummy import DummyReader
+from koheesio.steps.readers import DummyReader
from koheesio.steps.writers.dummy import DummyWriter

context = Context.from_yaml("sample.yaml")
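
The doc hunks in this PR repeatedly swap deep submodule imports for package-level ones. For a downstream repo with many such references, a one-off sweep like the sketch below can apply the same rewrites; the old-to-new mapping is taken from the hunks in this diff and is likely incomplete.

```python
# Hedged sketch of a one-off import-path sweep; the mapping is inferred from
# this PR's doc changes only. Adjust the glob to the files you own.
from pathlib import Path

REWRITES = {
    "koheesio.steps.transformations.lookup": "koheesio.steps.transformations",
    "koheesio.steps.transformations.transform": "koheesio.steps.transformations",
    "koheesio.steps.transformations.cache": "koheesio.steps.transformations",
    "koheesio.steps.readers.dummy": "koheesio.steps.readers",
    "koheesio.steps.readers.delta": "koheesio.spark.readers.delta",
}

for path in Path(".").rglob("*.py"):
    text = path.read_text()
    updated = text
    for old, new in REWRITES.items():
        updated = updated.replace(f"from {old} import", f"from {new} import")
    if updated != text:
        path.write_text(updated)
```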
12 changes: 8 additions & 4 deletions docs/tutorials/testing-koheesio-tasks.md
@@ -10,15 +10,17 @@ Here's an example of how to unit test a Koheesio task:

```python
from koheesio.tasks.etl_task import EtlTask
-from koheesio.steps.readers.dummy import DummyReader
+from koheesio.steps.readers import DummyReader
from koheesio.steps.writers.dummy import DummyWriter
-from koheesio.steps.transformations.transform import Transform
+from koheesio.steps.transformations import Transform
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col


def filter_age(df: DataFrame) -> DataFrame:
return df.filter(col("Age") > 18)


def test_etl_task():
# Initialize SparkSession
spark = SparkSession.builder.getOrCreate()
@@ -61,15 +63,17 @@ Here's an example of how to write an integration test for this task:
```python
# my_module.py
from koheesio.tasks.etl_task import EtlTask
-from koheesio.steps.readers.delta import DeltaReader
+from koheesio.spark.readers.delta import DeltaReader
from koheesio.steps.writers.delta import DeltaWriter
-from koheesio.steps.transformations.transform import Transform
+from koheesio.steps.transformations import Transform
from koheesio.context import Context
from pyspark.sql.functions import col


def filter_age(df):
return df.filter(col("Age") > 18)


context = Context({
"reader_options": {
"table": "input_table"
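
Restructures like this one are easy to regress later; a cheap guard is an import smoke test that checks the public paths used throughout these docs still resolve. A hedged sketch, with the module/name pairs taken from the hunks above:

```python
# test_import_paths.py: smoke test over the import paths used in the docs.
# Pairs are taken from this PR's doc hunks; extend as the docs grow.
import importlib

import pytest

DOCUMENTED_IMPORTS = [
    ("koheesio.context", "Context"),
    ("koheesio.steps.readers", "DummyReader"),
    ("koheesio.steps.writers.dummy", "DummyWriter"),
    ("koheesio.steps.writers.delta", "DeltaWriter"),
    ("koheesio.spark.readers.delta", "DeltaReader"),
    ("koheesio.steps.transformations", "Transform"),
    ("koheesio.tasks.etl_task", "EtlTask"),
]


@pytest.mark.parametrize("module,name", DOCUMENTED_IMPORTS)
def test_documented_import_resolves(module, name):
    mod = importlib.import_module(module)
    assert hasattr(mod, name), f"{module} no longer exports {name}"
```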
28 changes: 0 additions & 28 deletions koheesio/steps/__init__.py

This file was deleted.

51 changes: 0 additions & 51 deletions koheesio/steps/transformations/rank_dedup.py

This file was deleted.

20 changes: 0 additions & 20 deletions koheesio/tasks/__init__.py

This file was deleted.
