All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Fixed `Databricks.create_table_from_pandas()` failing to overwrite a table in some cases even with `replace="True"`.
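  A minimal sketch of the call this fix targets, assuming a `Databricks` source configured via a config key; the `df`, `schema`, and `table` parameter names are assumptions:

  ```python
  import pandas as pd
  from viadot.sources import Databricks

  df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

  databricks = Databricks(config_key="databricks_dev")  # hypothetical config key
  # With the fix, an existing table is overwritten when replace is requested.
  databricks.create_table_from_pandas(df=df, schema="raw", table="demo", replace="True")
  ```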
- Enabled Databricks Connect in the image. To enable, follow this guide.
- Added `Databricks` source to the library.
- Added `ExchangeRates` source to the library.
- Added `from_df()` method to the Azure Data Lake source.
- Added `SAPRFC` source to the library.
- Added `S3` source to the library.
- Added `RedshiftSpectrum` source to the library.
- Added `upload()` and `download()` methods to the `S3` source.
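  A minimal usage sketch for the new `S3` methods; the constructor arguments and the `from_path`/`to_path` parameter names are assumptions:

  ```python
  from viadot.sources import S3

  s3 = S3(config_key="s3_dev")  # hypothetical config key

  # Upload a local file to a bucket path, then download it back.
  s3.upload(from_path="data/report.parquet", to_path="s3://my-bucket/raw/report.parquet")
  s3.download(from_path="s3://my-bucket/raw/report.parquet", to_path="data/report_copy.parquet")
  ```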
- Added `Genesys` source to the library.
- Added `SQLServerToDF` task.
- Added `SQLServerToDuckDB` flow, which downloads data from a SQL Server table, loads it to a Parquet file, and then uploads it to DuckDB.
- Added complete proxy setup in the `SAPRFC` example (`viadot/examples/sap_rfc`).
- Added Databricks/Spark setup to the image. See the README for setup & usage instructions.
- Added rollback feature to the `Databricks` source.
- Changed all Prefect logging instances in the `sources` directory to native Python logging.
- Changed `rm()`, `from_df()`, and `to_df()` methods in the `S3` source.
- Changed `get_request()` to `handle_api_request()` in `utils.py`.
- Removed the `env` param from the `Databricks` source, as users can now store multiple configs for the same source using different config keys.
- Removed the Prefect dependency from the library (Python library, Docker base image).
- Added `adls_file_name` in `SupermetricsToADLS` and `SharepointToADLS` flows
- Added `BigQueryToADLS` flow class which enables extracting data from BigQuery.
- Added `Salesforce` source
- Added `SalesforceUpsert` task
- Added `SalesforceBulkUpsert` task
- Added C4C secret handling to `CloudForCustomersReportToADLS` flow (`c4c_credentials_secret` parameter)
- Fixed `get_flow_last_run_date()` incorrectly parsing the date
- Fixed C4C secret handling (tasks now correctly read the secret as the credentials, rather than assuming the secret is a container for credentials for all environments and trying to access a specific key inside it). In other words, tasks now assume the secret holds credentials, rather than a dict of the form `{env: credentials, env2: credentials2}`
- Fixed `utils.gen_bulk_insert_query_from_df()` failing with > 1000 rows due to the INSERT clause limit by chunking the data into multiple INSERTs
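  An illustrative sketch of the chunking idea behind this fix (not viadot's actual implementation): SQL Server accepts at most 1000 row value expressions per `INSERT ... VALUES` statement, so larger DataFrames are split into multiple statements.

  ```python
  import pandas as pd

  def bulk_insert_queries(df: pd.DataFrame, table: str, chunksize: int = 1000) -> list:
      """Build one INSERT statement per chunk of at most `chunksize` rows."""
      queries = []
      columns = ", ".join(df.columns)
      for start in range(0, len(df), chunksize):
          chunk = df.iloc[start : start + chunksize]
          values = ",\n".join(
              "(" + ", ".join(repr(v) for v in row) + ")"
              for row in chunk.itertuples(index=False)
          )
          queries.append(f"INSERT INTO {table} ({columns})\nVALUES\n{values}")
      return queries
  ```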
- Fixed `MultipleFlows` when one flow is passed and when the last flow fails.
- Fixed an issue with async usage in `Genesys.genesys_generate_exports()` (#669).
- Added `AzureDataLakeRemove` task
- Changed name of the task file from `prefect` to `prefect_date_range`
- Fixed out-of-range issue in `prefect_date_range`
- Bumped version
- Added `custom_mail_state_handler` function that sends mail notifications using a custom SMTP server.
- Added new function `df_clean_column` that cleans DataFrame columns from special characters
- Added `df_clean_column` util task that removes special characters from a pandas DataFrame
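  An illustrative sketch of the kind of cleaning `df_clean_column` performs (not viadot's actual implementation): stripping special characters such as newlines and tabs from string columns.

  ```python
  import pandas as pd

  def clean_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
      """Remove newline and tab characters from the specified string columns."""
      df = df.copy()
      for col in columns:
          df[col] = df[col].str.replace(r"[\n\t]", "", regex=True)
      return df

  df = pd.DataFrame({"comment": ["line one\nline two", "tab\there", "plain"]})
  print(clean_columns(df, ["comment"]))
  ```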
- Added `MultipleFlows` flow class which enables running multiple flows in a given order.
- Added `GetFlowNewDateRange` task to change the date range based on Prefect flows
- Added `check_col_order` parameter in `ADLSToAzureSQL`
- Added new source `ASElite`
- Added KeyVault support in `CloudForCustomers` tasks
- Added `SQLServer` source
- Added `DuckDBToDF` task
- Added `DuckDBTransform` flow
- Added `SQLServerCreateTable` task
- Added `credentials` param to `BCPTask`
- Added `get_sql_dtypes_from_df` and `update_dict` util tasks
- Added `DuckDBToSQLServer` flow
- Added `if_exists="append"` option to `DuckDB.create_table_from_parquet()`
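  A minimal sketch of the new option, assuming a `DuckDB` source configured with a local database file; the constructor and the `schema`/`table`/`path` parameter names are assumptions:

  ```python
  from viadot.sources import DuckDB

  duckdb = DuckDB(credentials={"database": "local.duckdb"})  # hypothetical credentials

  # With if_exists="append", new Parquet data is appended to an existing table
  # instead of the call failing or replacing the table.
  duckdb.create_table_from_parquet(
      schema="staging",
      table="sales",
      path="data/sales_2022.parquet",
      if_exists="append",
  )
  ```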
- Added `get_flow_last_run_date` util function
- Added `df_to_dataset` task util for writing DataFrames to data lakes using `pyarrow`
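  An illustrative sketch of the `pyarrow` mechanism such a util builds on: writing a DataFrame to a partitioned dataset (the paths and partition columns here are examples):

  ```python
  import pandas as pd
  import pyarrow as pa
  import pyarrow.parquet as pq

  df = pd.DataFrame({"country": ["PL", "DE", "PL"], "sales": [10, 20, 30]})
  table = pa.Table.from_pandas(df)

  # Write the table as a dataset partitioned by country.
  pq.write_to_dataset(table, root_path="datalake/sales", partition_cols=["country"])
  ```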
- Added retries to Cloud for Customers tasks
- Added `chunksize` parameter to `C4CToDF` task to allow pulling data in chunks
- Added `chunksize` parameter to `BCPTask` task to allow more control over the load process
- Added support for SQL Server's custom `datetimeoffset` type
- Added `AzureSQLToDF` task
- Added `AzureDataLakeRemove` task
- Added `AzureSQLUpsert` task
- Changed the base class of `AzureSQL` to `SQLServer`
- `df_to_parquet()` task now creates directories if needed
- Added several more separators to check for automatically in `SAPRFC.to_df()`
- Upgraded `duckdb` version to 0.3.2
- Fixed bug with `CheckColumnOrder` task
- Fixed OpenSSL config for old SQL Servers still using TLS < 1.2
- `BCPTask` now correctly handles custom SQL Server port
- Fixed `SAPRFC.to_df()` ignoring user-specified separator
- Fixed temporary CSV generated by the `DuckDBToSQLServer` flow not being cleaned up
- Fixed some mappings in `get_sql_dtypes_from_df()` and optimized performance
- Fixed `BCPTask` - the case when the file path contained a space
- Fixed credential evaluation logic (`credentials` is now evaluated before `config_key`)
- Fixed "$top" and "$skip" values being ignored by the `C4CToDF` task if provided in the `params` parameter
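  A minimal sketch of passing OData paging options via `params`; the task's other arguments and the exact URL are assumptions:

  ```python
  from viadot.tasks import C4CToDF

  c4c_to_df = C4CToDF()
  # With the fix, user-provided "$top" and "$skip" are honored instead of being overwritten.
  df = c4c_to_df.run(
      url="https://my-tenant.crm.ondemand.com/sap/c4c/odata/v1/c4codataapi/AccountCollection",
      params={"$top": "100", "$skip": "200"},
  )
  ```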
- Fixed `SQL.to_df()` incorrectly handling queries that begin with whitespace
- Removed `autopick_sep` parameter from `SAPRFC` functions. The separator is now always picked automatically if not provided.
- Moved `dtypes_to_json` task to `task_utils.py`
- Fixed an issue with schema info within the `CheckColumnOrder` class.
- `ADLSToAzureSQL` - added `remove_tab` parameter to remove unnecessary tab separators from data.
- Fixed an issue with the returned df within the `CheckColumnOrder` class.
- New source `SAPRFC` for connecting with SAP using the `pyRFC` library (requires `pyrfc` as well as the SAP NW RFC library that can be downloaded here)
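  A minimal usage sketch, assuming the source exposes `query()` and `to_df()` methods and takes connection details via credentials (the constructor arguments are assumptions):

  ```python
  from viadot.sources import SAPRFC

  sap = SAPRFC(creds={"user": "...", "passwd": "...", "ashost": "...", "sysnr": "00"})
  # Stage a query, then download the result into a pandas DataFrame.
  sap.query("SELECT MATNR, MAKTX FROM MAKT WHERE SPRAS = 'E'")
  df = sap.to_df()
  ```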
- New source `DuckDB` for connecting with the DuckDB database
- New task `SAPRFCToDF` for loading data from SAP to a pandas DataFrame
- New tasks, `DuckDBQuery` and `DuckDBCreateTableFromParquet`, for interacting with DuckDB
- New flow `SAPToDuckDB` for moving data from SAP to DuckDB
- Added `CheckColumnOrder` task
- C4C connection with url and report_url documentation
- `SQLiteInsert` check if DataFrame is empty or object is not a DataFrame
- KeyVault support in `SharepointToDF` task
- KeyVault support in `CloudForCustomers` tasks
- Pinned Prefect version to 0.15.11
- `df_to_csv` now creates dirs if they don't exist
- `ADLSToAzureSQL` - removes unnecessary "\t" separators from CSV column data
- Fixed an issue with duckdb calls seeing the initial db snapshot instead of the updated state (#282)
- C4C connection with url and report_url optimization
- Column mapper in C4C source
- New option to `ADLSToAzureSQL` flow - `if_exists="delete"`
- `SQL` source: `create_table()` already handles `if_exists`; now it handles a new option for `if_exists()`
- `C4CToDF` and `C4CReportToDF` tasks are provided as a class instead of a function
- Appending issue within the `CloudForCustomers` source
- An early return bug in `UKCarbonIntensity` in the `to_df` method
- Authorization issue within the `CloudForCustomers` source
- Added support for file path to `CloudForCustomersReportToADLS` flow
- Added `flow_of_flows` list handling
- Added support for JSON files in `AzureDataLakeToDF`
- `Supermetrics` source: `to_df()` now correctly handles `if_empty` in case of empty results
- `Sharepoint` and `CloudForCustomers` sources will now provide an informative `CredentialError` which is also raised early. This will make issues with input credentials immediately clear to the user.
- Removed `set_key_value` from `CloudForCustomersReportToADLS` flow
- Added `Sharepoint` source
- Added `SharepointToDF` task
- Added `SharepointToADLS` flow
- Added `CloudForCustomers` source
- Added `c4c_report_to_df` task
- Added `c4c_to_df` task
- Added `CloudForCustomersReportToADLS` flow
- Added `df_to_csv` task to `task_utils.py`
- Added `df_to_parquet` task to `task_utils.py`
- Added `dtypes_to_json` task to `task_utils.py`
- `ADLSToAzureSQL` - fixed path to csv issue.
- `SupermetricsToADLS` - fixed local json path issue.
- CI/CD: the `dev` image is now only published on push to the `dev` branch
- Docker:
  - updated registry links to use the new `ghcr.io` domain
  - `run.sh` now also accepts the `-t` option. When run in standard mode, it will only spin up the `viadot_jupyter_lab` service. When run with `-t dev`, it will also spin up the `viadot_testing` and `viadot_docs` containers.
- `ADLSToAzureSQL` - fixed path parameter issue.
- Added `SQLiteQuery` task
- Added `CloudForCustomers` source
- Added `CloudForCustomersToDF` and `CloudForCustomersToCSV` tasks
- Added `CloudForCustomersToADLS` flow
- Added support for parquet in `CloudForCustomersToDF`
- Added style guidelines to the `README`
- Added local setup and commands to the `README`
- Changed CI/CD algorithm
  - the `latest` Docker image is now only updated on release and is the exact same image as the latest release
  - the `dev` image is released only on pushes and PRs to the `dev` branch (so dev branch = dev image)
- Modified `ADLSToAzureSQL` - `read_sep` and `write_sep` parameters added to the flow.
- Fixed `ADLSToAzureSQL` breaking in `"append"` mode if the table didn't exist (#145).
- Fixed `ADLSToAzureSQL` breaking in the promotion path for csv files.
- Added flows library docs to the references page
- Moved task library docs page to topbar
- Updated docs for tasks and flows
- Added `start` and `end_date` parameters to `SupermetricsToADLS` flow
- Added a tutorial on how to pull data from `Supermetrics`
- Added documentation (both docstrings and MkDocs docs) for multiple tasks
- Added `start_date` and `end_date` parameters to the `SupermetricsToAzureSQL` flow
- Added a temporary workaround `df_to_csv_task` task to the `SupermetricsToADLS` flow to handle mixed dtype columns not handled automatically by DataFrame's `to_parquet()` method
- Modified `RunGreatExpectationsValidation` task to use the built-in support for evaluation parameters added in Prefect v0.15.3
- Modified `SupermetricsToADLS` and `ADLSGen1ToAzureSQLNew` flows to align with this recipe for reading the expectation suite JSON. The suite now has to be loaded before flow initialization in the flow's Python file and passed as an argument to the flow's constructor.
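  A sketch of the new pattern, assuming the flow accepts the loaded suite via an `expectation_suite` argument (the path and the remaining flow parameters are placeholders):

  ```python
  import json

  from viadot.flows import SupermetricsToADLS

  # Load the expectation suite JSON before flow initialization...
  with open("expectations/failure.json") as f:
      expectation_suite = json.load(f)

  # ...and pass it to the flow's constructor.
  flow = SupermetricsToADLS(
      "Supermetrics extract",
      expectation_suite=expectation_suite,
      # remaining Supermetrics and ADLS parameters omitted
  )
  ```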
- Modified `RunGreatExpectationsValidation`'s `expectations_path` parameter to point to the directory containing the expectation suites instead of the Great Expectations project directory, which was confusing. The project directory is now only used internally and not exposed to the user
- Changed the logging of the docs URL for the `RunGreatExpectationsValidation` task to use GE's recipe from the docs
- Added a test for `SupermetricsToADLS` flow
- Added a test for `AzureDataLakeList` task
- Added PR template for new PRs
- Added a `write_to_json` util task to the `SupermetricsToADLS` flow. This task dumps the input expectations dict to the local filesystem as is required by Great Expectations. This allows the user to simply pass a dict with their expectations and not worry about the project structure required by Great Expectations
- Added `Shapely` and `imagehash` dependencies required for full `visions` functionality (installing `visions[all]` breaks the build)
- Added more parameters to control CSV parsing in the `ADLSGen1ToAzureSQLNew` flow
- Added `keep_output` parameter to the `RunGreatExpectationsValidation` task to control Great Expectations output to the filesystem
- Added `keep_validation_output` parameter and `cleanup_validation_clutter` task to the `SupermetricsToADLS` flow to control Great Expectations output to the filesystem
- Removed `SupermetricsToAzureSQLv2` and `SupermetricsToAzureSQLv3` flows
- Removed `geopy` dependency
- Added support for parquet in `AzureDataLakeToDF`
- Added proper logging to the `RunGreatExpectationsValidation` task
- Added the `viz` Prefect extra to requirements to allow flow visualization
- Added a few utility tasks in `task_utils`
- Added `geopy` dependency
- Tasks:
  - `AzureDataLakeList` - for listing files in an ADLS directory
- Flows:
  - `ADLSToAzureSQL` - promoting files to conformed, operations, creating an SQL table and inserting the data into it
  - `ADLSContainerToContainer` - copying files between ADLS containers
- Renamed `ReadAzureKeyVaultSecret` and `RunAzureSQLDBQuery` tasks to match Prefect naming style
- Flows:
  - `SupermetricsToADLS` - changed csv to parquet file extension. File and schema info are loaded to the `RAW` container.
- Removed the broken version autobump from CI
- Flows:
  - `SupermetricsToADLS` - supporting immutable ADLS setup
- A default value for the `ds_user` parameter in `SupermetricsToAzureSQLv3` can now be specified in the `SUPERMETRICS_DEFAULT_USER` secret
- Updated multiple dependencies
- Fixed "Local run of `SupermetricsToAzureSQLv3` skips all tasks after `union_dfs_task`" (#59)
- Fixed the `release` GitHub action
- Sources:
  - `AzureDataLake` (supports gen1 & gen2)
  - `SQLite`
- Tasks:
  - `DownloadGitHubFile`
  - `AzureDataLakeDownload`
  - `AzureDataLakeUpload`
  - `AzureDataLakeToDF`
  - `ReadAzureKeyVaultSecret`
  - `CreateAzureKeyVaultSecret`
  - `DeleteAzureKeyVaultSecret`
  - `SQLiteInsert`
  - `SQLiteSQLtoDF`
  - `AzureSQLCreateTable`
  - `RunAzureSQLDBQuery`
  - `BCPTask`
  - `RunGreatExpectationsValidation`
  - `SupermetricsToDF`
- Flows:
  - `SupermetricsToAzureSQLv1`
  - `SupermetricsToAzureSQLv2`
  - `SupermetricsToAzureSQLv3`
  - `AzureSQLTransform`
  - `Pipeline`
  - `ADLSGen1ToGen2`
  - `ADLSGen1ToAzureSQL`
  - `ADLSGen1ToAzureSQLNew`
- Examples:
  - Hello world flow
  - Supermetrics Google Ads extract
- Tasks now use secrets for credential management (azure tasks use Azure Key Vault secrets)
- SQL source now has a default query timeout of 1 hour
- Fixed `SQLite` tests
- Multiple stability improvements with retries and timeouts
- Moved from poetry to pip
- Fixed `AzureBlobStorage`'s `to_storage()` method missing the final upload blob part