Optionally maintain checksums of CSV files for faster updates #85

dkaoster · 2022-01-21T08:19:04Z

I wanted to see whether there is interest in a patch that speeds up our workflows significantly, or whether there are ideas for improving on such a feature. If this is out of scope for the project, I'm happy to continue maintaining my fork.

Use Case

We currently maintain a folder of more than 200 CSV files totaling a few hundred megabytes, with a CI step that builds these CSVs into a SQLite database. The files are updated 2-3 times a day, but each update makes only small changes. Running csvs-to-sqlite with the --replace-tables flag currently takes roughly 6-7 minutes, which is too long for our use case.

Solution

Add a --update-tables flag that maintains a checksum of each CSV file in a table called .csvs-meta (happy to rename this or make it configurable), and only reads the CSV and loads the dataframe if the checksum has changed.
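To illustrate the idea, here is a minimal sketch of checksum-based change detection. It is not the code from the fork. The hash algorithm (SHA-256), the column names, and the helper names `file_checksum`, `stale_csvs`, and `record_checksum` are assumptions made for this example; only the `.csvs-meta` table name comes from the proposal above.

```python
import hashlib
import sqlite3

# Table name from the proposal; it must be double-quoted in SQL
# because of the leading dot and the hyphen.
META_TABLE = ".csvs-meta"


def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stale_csvs(conn, csv_paths):
    """Return (path, checksum) pairs for files that are new or have changed."""
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{META_TABLE}" '
        "(csv_path TEXT PRIMARY KEY, checksum TEXT NOT NULL)"
    )
    # Hypothetical schema: one row per CSV path with its last loaded checksum.
    stored = dict(conn.execute(f'SELECT csv_path, checksum FROM "{META_TABLE}"'))
    stale = []
    for path in csv_paths:
        checksum = file_checksum(path)
        if stored.get(str(path)) != checksum:
            stale.append((path, checksum))
    return stale


def record_checksum(conn, path, checksum):
    """Store the checksum once the corresponding table has been rebuilt."""
    conn.execute(
        f'INSERT OR REPLACE INTO "{META_TABLE}" (csv_path, checksum) VALUES (?, ?)',
        (str(path), checksum),
    )
    conn.commit()
```

In this sketch a caller would rebuild only the tables for the paths returned by `stale_csvs`, then call `record_checksum` for each one after its load succeeds, so a failed load is retried on the next run.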

Forked Version Here

Add logic for maintaining a separate table with checksums and conditional updates (ff166f0)

This was referenced Sep 13, 2023

Add --skip-existing-tables command line flag #50

Closed

Add command line flag to skip existing table #75

Closed

putbullet mentioned this pull request Apr 16, 2025

Optimize utils.py for Better Performance #100

Open
