Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[Unreleased]

Added

  • Fixed Databricks.create_table_from_pandas() failing to overwrite a table in some cases even with replace="True".
  • Enabled Databricks Connect in the image. To enable, follow this guide
  • Added Databricks source to the library.
  • Added ExchangeRates source to the library.
  • Added from_df() method to Azure Data Lake source.
  • Added SAPRFC source to the library.
  • Added S3 source to the library.
  • Added RedshiftSpectrum source to the library.
  • Added upload() and download() methods to S3 source.
  • Added Genesys source to library.

Changed

  • Added SQLServerToDF task
  • Added SQLServerToDuckDB flow, which downloads data from a SQL Server table, saves it to a Parquet file, and then uploads it to DuckDB
  • Added complete proxy set up in SAPRFC example (viadot/examples/sap_rfc)
  • Added Databricks/Spark setup to the image. See README for setup & usage instructions.
  • Added rollback feature to Databricks source.
  • Changed all Prefect logging instances in the sources directory to native Python logging.
  • Changed rm(), from_df(), to_df() methods in S3 Source
  • Changed get_request() to handle_api_request() in utils.py.

Removed

  • Removed the env param from Databricks source, as user can now store multiple configs for the same source using different config keys.
  • Removed Prefect dependency from the library (Python library, Docker base image)

[0.4.3] - 2022-04-28

Added

  • Added adls_file_name in SupermetricsToADLS and SharepointToADLS flows
  • Added BigQueryToADLS flow class which enables extracting data from BigQuery.
  • Added Salesforce source
  • Added SalesforceUpsert task
  • Added SalesforceBulkUpsert task
  • Added C4C secret handling to CloudForCustomersReportToADLS flow (c4c_credentials_secret parameter)

Fixed

  • Fixed get_flow_last_run_date() incorrectly parsing the date
  • Fixed C4C secret handling (tasks now correctly read the secret as the credentials, rather than assuming the secret is a container for credentials for all environments and trying to access specific key inside it). In other words, tasks now assume the secret holds credentials, rather than a dict of the form {env: credentials, env2: credentials2}
  • Fixed utils.gen_bulk_insert_query_from_df() failing with > 1000 rows due to INSERT clause limit by chunking the data into multiple INSERTs
  • Fixed MultipleFlows when one flow is passed and when last flow fails.
  • Fixed issue with async usage in Genesys.genesys_generate_exports() (#669).

[0.4.2] - 2022-04-08

Added

  • Added AzureDataLakeRemove task

Changed

  • Changed name of task file from prefect to prefect_date_range

Fixed

  • Fixed out of range issue in prefect_date_range

[0.4.1] - 2022-04-07

Changed

  • bumped version

[0.4.0] - 2022-04-07

Added

  • Added custom_mail_state_handler function that sends mail notification using custom smtp server.
  • Added new function df_clean_column that cleans data frame columns from special characters
  • Added df_clean_column util task that removes special characters from a pandas DataFrame
  • Added MultipleFlows flow class which enables running multiple flows in a given order.
  • Added GetFlowNewDateRange task to change date range based on Prefect flows
  • Added check_col_order parameter in ADLSToAzureSQL
  • Added new source ASElite
  • Added KeyVault support in CloudForCustomers tasks
  • Added SQLServer source
  • Added DuckDBToDF task
  • Added DuckDBTransform flow
  • Added SQLServerCreateTable task
  • Added credentials param to BCPTask
  • Added get_sql_dtypes_from_df and update_dict util tasks
  • Added DuckDBToSQLServer flow
  • Added if_exists="append" option to DuckDB.create_table_from_parquet()
  • Added get_flow_last_run_date util function
  • Added df_to_dataset task util for writing DataFrames to data lakes using pyarrow
  • Added retries to Cloud for Customers tasks
  • Added chunksize parameter to C4CToDF task to allow pulling data in chunks
  • Added chunksize parameter to BCPTask task to allow more control over the load process
  • Added support for SQL Server's custom datetimeoffset type
  • Added AzureSQLToDF task
  • Added AzureDataLakeRemove task
  • Added AzureSQLUpsert task
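The chunksize parameter added to C4CToDF pulls data in chunks. A rough sketch of how such paging might work with the OData "$top" and "$skip" query options is shown below. The function iter_records_in_chunks and the fetch_page callable are hypothetical and stand in for whatever request logic the task actually uses.

```python
from typing import Callable, Dict, Iterator, List


def iter_records_in_chunks(
    fetch_page: Callable[[Dict[str, int]], List[dict]],
    chunksize: int = 1000,
) -> Iterator[List[dict]]:
    """Yield pages of records, requesting at most `chunksize` records per call.

    `fetch_page` is a hypothetical callable that takes the paging params
    ({"$top": ..., "$skip": ...}) and returns a list of records.
    """
    if chunksize <= 0:
        raise ValueError("chunksize must be a positive integer.")
    skip = 0
    while True:
        page = fetch_page({"$top": chunksize, "$skip": skip})
        if not page:
            break
        yield page
        # A short page means the source has no more records.
        if len(page) < chunksize:
            break
        skip += chunksize
```

Each yielded page can be converted to a DataFrame and concatenated or written out separately, which keeps memory use bounded by the chunk size.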

Changed

  • Changed the base class of AzureSQL to SQLServer
  • df_to_parquet() task now creates directories if needed
  • Added several more separators to check for automatically in SAPRFC.to_df()
  • Upgraded duckdb version to 0.3.2
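The df_to_parquet() change above (creating directories if needed) follows a common pattern that can be sketched with the standard library. The helper name ensure_parent_dir is an assumption for illustration and is not part of the library.

```python
from pathlib import Path


def ensure_parent_dir(path: str) -> Path:
    """Create the parent directories of `path` if they are missing and return the path."""
    target = Path(path)
    # parents=True creates intermediate directories; exist_ok=True makes repeated calls safe.
    target.parent.mkdir(parents=True, exist_ok=True)
    return target
```

A caller could then write, for example, df.to_parquet(ensure_parent_dir("data/raw/sales.parquet")) without creating the data/raw directory beforehand.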

Fixed

  • Fixed bug with CheckColumnOrder task
  • Fixed OpenSSL config for old SQL Servers still using TLS < 1.2
  • BCPTask now correctly handles custom SQL Server port
  • Fixed SAPRFC.to_df() ignoring user-specified separator
  • Fixed temporary CSV generated by the DuckDBToSQLServer flow not being cleaned up
  • Fixed some mappings in get_sql_dtypes_from_df() and optimized performance
  • Fixed BCPTask - the case when the file path contained a space
  • Fixed credential evaluation logic (credentials is now evaluated before config_key)
  • Fixed "$top" and "$skip" values being ignored by C4CToDF task if provided in the params parameter
  • Fixed SQL.to_df() incorrectly handling queries that begin with whitespace
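The credential evaluation order described above (credentials first, then config_key) can be sketched as follows. The function resolve_credentials and its config argument are hypothetical. The sketch only shows the precedence, not how the library actually loads its configuration.

```python
from typing import Any, Dict, Optional


def resolve_credentials(
    credentials: Optional[Dict[str, Any]] = None,
    config_key: Optional[str] = None,
    config: Optional[Dict[str, Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Return explicitly passed credentials, falling back to a config lookup."""
    # Explicit credentials win, even when a config_key is also given.
    if credentials is not None:
        return credentials
    if config_key is not None:
        try:
            return (config or {})[config_key]
        except KeyError:
            raise ValueError(
                f"No credentials found under config key '{config_key}'."
            ) from None
    raise ValueError("Either credentials or config_key must be provided.")
```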

Removed

  • Removed autopick_sep parameter from SAPRFC functions. The separator is now always picked automatically if not provided.
  • Removed dtypes_to_json task from task_utils.py

[0.3.2] - 2022-02-17

Fixed

  • fixed an issue with schema info within CheckColumnOrder class.

[0.3.1] - 2022-02-17

Changed

  • ADLSToAzureSQL - added remove_tab parameter to remove unnecessary tab separators from data.

Fixed

  • fixed an issue with return df within CheckColumnOrder class.

[0.3.0] - 2022-02-16

Added

  • new source SAPRFC for connecting with SAP using the pyRFC library (requires pyrfc as well as the SAP NW RFC library that can be downloaded here)
  • new source DuckDB for connecting with the DuckDB database
  • new task SAPRFCToDF for loading data from SAP to a pandas DataFrame
  • new tasks, DuckDBQuery and DuckDBCreateTableFromParquet, for interacting with DuckDB
  • new flow SAPToDuckDB for moving data from SAP to DuckDB
  • Added CheckColumnOrder task
  • C4C connection with url and report_url documentation
  • SQLiteInsert now checks whether the DataFrame is empty or the object is not a DataFrame
  • KeyVault support in SharepointToDF task
  • KeyVault support in CloudForCustomers tasks

Changed

  • pinned Prefect version to 0.15.11
  • df_to_csv now creates dirs if they don't exist
  • ADLSToAzureSQL - unnecessary "\t" characters in CSV column data are now removed

Fixed

  • fixed an issue with duckdb calls seeing initial db snapshot instead of the updated state (#282)
  • C4C connection with url and report_url optimization
  • column mapper in C4C source

[0.2.15] - 2022-01-12

Added

  • new option to ADLSToAzureSQL Flow - if_exists="delete"
  • SQL source: create_table() already handles if_exists; now it handles a new option for if_exists()
  • C4CToDF and C4CReportToDF tasks are provided as a class instead of function

Fixed

  • Appending issue within CloudForCustomers source
  • An early return bug in UKCarbonIntensity in to_df method

[0.2.14] - 2021-12-01

Fixed

  • authorization issue within CloudForCustomers source

[0.2.13] - 2021-11-30

Added

  • Added support for file path to CloudForCustomersReportToADLS flow
  • Added flow_of_flows list handling
  • Added support for JSON files in AzureDataLakeToDF

Fixed

  • Supermetrics source: to_df() now correctly handles if_empty in case of empty results

Changed

  • Sharepoint and CloudForCustomers sources will now provide an informative CredentialError which is also raised early. This will make issues with input credentials immediately clear to the user.
  • Removed set_key_value from CloudForCustomersReportToADLS flow

[0.2.12] - 2021-11-25

Added

  • Added Sharepoint source
  • Added SharepointToDF task
  • Added SharepointToADLS flow
  • Added CloudForCustomers source
  • Added c4c_report_to_df task
  • Added c4c_to_df task
  • Added CloudForCustomersReportToADLS flow
  • Added df_to_csv task to task_utils.py
  • Added df_to_parquet task to task_utils.py
  • Added dtypes_to_json task to task_utils.py

[0.2.11] - 2021-10-30

Fixed

  • ADLSToAzureSQL - fixed path to csv issue.
  • SupermetricsToADLS - fixed local json path issue.

[0.2.10] - 2021-10-29

Release due to CI/CD error

[0.2.9] - 2021-10-29

Release due to CI/CD error

[0.2.8] - 2021-10-29

Changed

  • CI/CD: dev image is now only published on push to the dev branch
  • Docker:
    • updated registry links to use the new ghcr.io domain
    • run.sh now also accepts the -t option. When run in standard mode, it will only spin up the viadot_jupyter_lab service. When run with -t dev, it will also spin up viadot_testing and viadot_docs containers.

Fixed

  • ADLSToAzureSQL - fixed path parameter issue.

[0.2.7] - 2021-10-04

Added

  • Added SQLiteQuery task
  • Added CloudForCustomers source
  • Added CloudForCustomersToDF and CloudForCustomersToCSV tasks
  • Added CloudForCustomersToADLS flow
  • Added support for parquet in CloudForCustomersToDF
  • Added style guidelines to the README
  • Added local setup and commands to the README

Changed

  • Changed CI/CD algorithm
    • the latest Docker image is now only updated on release and is the same exact image as the latest release
    • the dev image is released only on pushes and PRs to the dev branch (so dev branch = dev image)
  • Modified ADLSToAzureSQL - read_sep and write_sep parameters added to the flow.

Fixed

  • Fixed ADLSToAzureSQL breaking in "append" mode if the table didn't exist (#145).
  • Fixed ADLSToAzureSQL breaking in promotion path for csv files.

[0.2.6] - 2021-09-22

Added

  • Added flows library docs to the references page

Changed

  • Moved task library docs page to topbar
  • Updated docs for task and flows

[0.2.5] - 2021-09-20

Added

  • Added start and end_date parameters to SupermetricsToADLS flow
  • Added a tutorial on how to pull data from Supermetrics

[0.2.4] - 2021-09-06

Added

  • Added documentation (both docstrings and MKDocs docs) for multiple tasks
  • Added start_date and end_date parameters to the SupermetricsToAzureSQL flow
  • Added a temporary workaround df_to_csv_task task to the SupermetricsToADLS flow to handle mixed dtype columns not handled automatically by DataFrame's to_parquet() method

[0.2.3] - 2021-08-19

Changed

  • Modified RunGreatExpectationsValidation task to use the built in support for evaluation parameters added in Prefect v0.15.3
  • Modified SupermetricsToADLS and ADLSGen1ToAzureSQLNew flows to align with this recipe for reading the expectation suite JSON. The suite now has to be loaded before flow initialization in the flow's Python file and passed as an argument to the flow's constructor.
  • Modified RunGreatExpectationsValidation's expectations_path parameter to point to the directory containing the expectation suites instead of the Great Expectations project directory, which was confusing. The project directory is now only used internally and not exposed to the user
  • Changed the logging of docs URL for RunGreatExpectationsValidation task to use GE's recipe from the docs

Added

  • Added a test for SupermetricsToADLS flow
  • Added a test for AzureDataLakeList task
  • Added PR template for new PRs
  • Added a write_to_json util task to the SupermetricsToADLS flow. This task dumps the input expectations dict to the local filesystem as is required by Great Expectations. This allows the user to simply pass a dict with their expectations and not worry about the project structure required by Great Expectations
  • Added Shapely and imagehash dependencies required for full visions functionality (installing visions[all] breaks the build)
  • Added more parameters to control CSV parsing in the ADLSGen1ToAzureSQLNew flow
  • Added keep_output parameter to the RunGreatExpectationsValidation task to control Great Expectations output to the filesystem
  • Added keep_validation_output parameter and cleanup_validation_clutter task to the SupermetricsToADLS flow to control Great Expectations output to the filesystem
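The write_to_json util task mentioned above dumps the expectations dict to the local filesystem. A minimal sketch of that idea follows. The signature and the directory-creation step are assumptions, and the real task may differ (for example, it would be wrapped as a Prefect task).

```python
import json
from pathlib import Path


def write_to_json(dict_: dict, path: str) -> None:
    """Dump `dict_` to a JSON file at `path`, creating parent directories if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", encoding="utf-8") as f:
        json.dump(dict_, f, indent=2)
```

The user can then pass a plain dict of expectations and let the flow place the file where Great Expectations looks for it.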

Removed

  • Removed SupermetricsToAzureSQLv2 and SupermetricsToAzureSQLv3 flows
  • Removed geopy dependency

[0.2.2] - 2021-07-27

Added

  • Added support for parquet in AzureDataLakeToDF
  • Added proper logging to the RunGreatExpectationsValidation task
  • Added the viz Prefect extra to requirements to allow flow visualization
  • Added a few utility tasks in task_utils
  • Added geopy dependency
  • Tasks:
    • AzureDataLakeList - for listing files in an ADLS directory
  • Flows:
    • ADLSToAzureSQL - promoting files to conformed, operations, creating an SQL table and inserting the data into it
    • ADLSContainerToContainer - copying files between ADLS containers

Changed

  • Renamed ReadAzureKeyVaultSecret and RunAzureSQLDBQuery tasks to match Prefect naming style
  • Flows:
    • SupermetricsToADLS - changed csv to parquet file extension. File and schema info are loaded to the RAW container.

Fixed

  • Removed the broken version autobump from CI

[0.2.1] - 2021-07-14

Added

  • Flows:
    • SupermetricsToADLS - supporting immutable ADLS setup

Changed

  • A default value for the ds_user parameter in SupermetricsToAzureSQLv3 can now be specified in the SUPERMETRICS_DEFAULT_USER secret
  • Updated multiple dependencies

Fixed

  • Fixed "Local run of SupermetricsToAzureSQLv3 skips all tasks after union_dfs_task" (#59)
  • Fixed the release GitHub action

[0.2.0] - 2021-07-12

Added

  • Sources:

    • AzureDataLake (supports gen1 & gen2)
    • SQLite
  • Tasks:

    • DownloadGitHubFile
    • AzureDataLakeDownload
    • AzureDataLakeUpload
    • AzureDataLakeToDF
    • ReadAzureKeyVaultSecret
    • CreateAzureKeyVaultSecret
    • DeleteAzureKeyVaultSecret
    • SQLiteInsert
    • SQLiteSQLtoDF
    • AzureSQLCreateTable
    • RunAzureSQLDBQuery
    • BCPTask
    • RunGreatExpectationsValidation
    • SupermetricsToDF
  • Flows:

    • SupermetricsToAzureSQLv1
    • SupermetricsToAzureSQLv2
    • SupermetricsToAzureSQLv3
    • AzureSQLTransform
    • Pipeline
    • ADLSGen1ToGen2
    • ADLSGen1ToAzureSQL
    • ADLSGen1ToAzureSQLNew
  • Examples:

    • Hello world flow
    • Supermetrics Google Ads extract

Changed

  • Tasks now use secrets for credential management (azure tasks use Azure Key Vault secrets)
  • SQL source now has a default query timeout of 1 hour

Fixed

  • Fix SQLite tests
  • Multiple stability improvements with retries and timeouts

[0.1.12] - 2021-05-08

Changed

  • Moved from poetry to pip

Fixed

  • Fixed AzureBlobStorage's to_storage() method missing the final upload blob part