
Go CSV Importer: Multi-threaded Fast CSV to Database Tool

Overview

Go CSV Importer is a powerful, multi-threaded command-line tool designed to import large CSV files into databases quickly and efficiently. It supports various database types, offers customizable import modes, and is optimized for performance.

Key Features

  • Support for Large CSV Files: Handles millions of rows with ease.
  • Multi-Database Compatibility: Works with SQLite, MySQL, PostgreSQL, and Firebird.
  • Configurable Import Modes:
    • Transactional or non-transactional imports.
    • Batch SQL inserts or row-by-row operations.
    • Single or multiple database connections.
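The batch-insert mode above amounts to grouping many rows into one multi-row INSERT statement, so each database round trip carries a whole batch instead of a single row. The sketch below is illustrative only (the function name and placeholder style are assumptions, not the importer's actual code):

```go
package main

import (
	"fmt"
	"strings"
)

// buildBatchInsert builds one multi-row INSERT statement with
// positional placeholders, so a single round trip covers rowCount rows.
// Illustrative sketch; not the importer's internal implementation.
func buildBatchInsert(table string, cols []string, rowCount int) string {
	placeholders := "(" + strings.TrimRight(strings.Repeat("?,", len(cols)), ",") + ")"
	rows := make([]string, rowCount)
	for i := range rows {
		rows[i] = placeholders
	}
	return fmt.Sprintf(
		"INSERT INTO %s (%s) VALUES %s",
		table,
		strings.Join(cols, ", "),
		strings.Join(rows, ", "),
	)
}

func main() {
	fmt.Println(buildBatchInsert("vehicles", []string{"make", "model"}, 3))
	// INSERT INTO vehicles (make, model) VALUES (?,?), (?,?), (?,?)
}
```

Row-by-row mode would instead prepare a single-row statement once and execute it per record; batching trades a larger statement for far fewer round trips.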

Installation

Install the latest version directly from Go:

go install github.com/olbrichattila/gocsvimporter/cmd/csvimporter@latest

How It Works

  1. Analyze the CSV: Determines field names, data types, and field sizes from the data.
  2. Prepare the Database: Automatically creates the necessary table, dropping it if it already exists.
  3. Import Data: Optimizes the process based on database type and chosen parameters.
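Step 2 above boils down to generating DDL from the analyzed fields: drop the table if it exists, then recreate it. A minimal sketch, assuming a simple field struct (the names and layout here are illustrative, not the importer's internal types):

```go
package main

import (
	"fmt"
	"strings"
)

// field mirrors one analyzed CSV column: a name, an SQL type,
// and a maximum observed length. Assumed shape for illustration.
type field struct {
	Name   string
	Type   string
	Length int
}

// createTableSQL emits the DDL the prepare step needs: drop the
// table if it already exists, then recreate it from the fields.
func createTableSQL(table string, fields []field) string {
	defs := make([]string, len(fields))
	for i, f := range fields {
		if f.Type == "VARCHAR" {
			defs[i] = fmt.Sprintf("%s VARCHAR(%d)", f.Name, f.Length)
		} else {
			defs[i] = fmt.Sprintf("%s %s", f.Name, f.Type)
		}
	}
	return fmt.Sprintf(
		"DROP TABLE IF EXISTS %s;\nCREATE TABLE %s (%s);",
		table, table, strings.Join(defs, ", "),
	)
}

func main() {
	fmt.Println(createTableSQL("vehicles", []field{
		{Name: "make", Type: "VARCHAR", Length: 20},
		{Name: "year", Type: "INT"},
	}))
}
```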

Usage

Run the tool with:

csvimporter <csv_file> <table_name> [delimiter]

Parameters:

  1. CSV File: Path to the CSV file.
  2. Table Name: Target database table name.
  3. Delimiter (Optional): CSV delimiter (default: ,).

Example:

csvimporter data.csv vehicles ";"

Field Header Management for Faster Imports

The CSV importer command-line tool now includes a powerful feature for handling CSV headers efficiently.

Feature Overview

When importing CSV files, the tool automatically:

  1. Analyzes the CSV content: Deduplicates field names, determines proper field types, and calculates field sizes based on the data.
  2. Optimizes imports: Speeds up processing for repeated file formats by allowing you to save and reuse header definitions.

This analysis can take time, especially with large files. If you're importing the same file format multiple times, you can now save the calculated headers for reuse, bypassing recalculation for subsequent imports.
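The type and size analysis described above can be pictured as a scan over each column's values: if every value parses as an integer the column becomes INT, otherwise VARCHAR sized to the longest value seen. This is a simplified sketch of that idea, not the tool's exact inference rules:

```go
package main

import (
	"fmt"
	"strconv"
)

// inferField guesses an SQL type and length for one CSV column:
// INT if every value parses as an integer, otherwise VARCHAR sized
// to the longest value. Simplified sketch of the analysis step.
func inferField(values []string) (sqlType string, length int) {
	allInt := true
	for _, v := range values {
		if len(v) > length {
			length = len(v)
		}
		if _, err := strconv.Atoi(v); err != nil {
			allInt = false
		}
	}
	if allInt && len(values) > 0 {
		return "INT", length
	}
	return "VARCHAR", length
}

func main() {
	t, l := inferField([]string{"Budapest", "London"})
	fmt.Println(t, l) // VARCHAR 8
}
```

Scanning every row this way is exactly why the analysis is slow on large files, and why caching the result in a JSON header file pays off.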

How It Works

Save Field Definitions:

Use the -sh= parameter to save the field definitions in a JSON file. You can optionally modify this JSON file to fine-tune your header configurations.

csvimporter <csv_file> <table_name> [delimiter] -sh=customerFieldDefinition

This will generate a JSON file with the field definitions.

Load Field Definitions:

Use the -lh= parameter to load the pre-saved field definitions, skipping the recalculation step.

csvimporter <csv_file> <table_name> [delimiter] -lh=customerFieldDefinition

When this parameter is provided, the import process starts immediately, but progress percentage updates will not be displayed.

Example JSON Structure

Here’s an example of the auto-saved JSON file generated by the tool:

[
 {
  "name": "index",
  "Type": "INT",
  "Length": 7
 },
 {
  "name": "customer_id",
  "Type": "VARCHAR",
  "Length": 15
 },
 {
  "name": "first_name",
  "Type": "VARCHAR",
  "Length": 11
 },
 {
  "name": "last_name",
  "Type": "VARCHAR",
  "Length": 11
 },
 {
  "name": "company",
  "Type": "VARCHAR",
  "Length": 38
 },
 {
  "name": "city",
  "Type": "VARCHAR",
  "Length": 24
 },
 {
  "name": "country",
  "Type": "VARCHAR",
  "Length": 51
 },
 {
  "name": "phone_a1",
  "Type": "VARCHAR",
  "Length": 22
 },
 {
  "name": "phone_a2",
  "Type": "VARCHAR",
  "Length": 22
 },
 {
  "name": "email",
  "Type": "VARCHAR",
  "Length": 44
 },
 {
  "name": "subscription_date",
  "Type": "VARCHAR",
  "Length": 10
 },
 {
  "name": "website",
  "Type": "VARCHAR",
  "Length": 40
 }
]

Benefits

  • Time-Saving: Skip repetitive calculations for consistent file formats.
  • Customizable: Modify the saved JSON to align with specific import requirements.
  • Improved Efficiency: Ideal for workflows requiring frequent CSV imports.

Notes

  • The -lh parameter disables progress percentage display.
  • Ensure the JSON file accurately reflects your data structure for seamless imports.

By leveraging this feature, the CSV importer command-line tool becomes even more efficient for managing recurring import tasks and large datasets.

Supported Databases

SQLite

DB_CONNECTION=sqlite
DB_DATABASE=./database/database.sqlite

MySQL

DB_CONNECTION=mysql
DB_HOST=127.0.0.1
DB_PORT=3306
DB_DATABASE=mydb
DB_USERNAME=myuser
DB_PASSWORD=mypassword

PostgreSQL

DB_CONNECTION=pgsql
DB_HOST=127.0.0.1
DB_PORT=5432
DB_DATABASE=postgres
DB_USERNAME=postgres
DB_PASSWORD=postgres
DB_SSLMODE=disable

Firebird

DB_CONNECTION=firebird
DB_HOST=127.0.0.1
DB_PORT=3050
DB_DATABASE=/path/to/database.fdb
DB_USERNAME=SYSDBA
DB_PASSWORD=masterkey

Configuration

Create a .env.csvimporter file in the application's directory or export environment variables:

BATCH_SIZE=500               # Rows per batch (default: 100)
MAX_CONNECTION_COUNT=25      # Maximum connections (default: 10)
BATCH_INSERT=on              # Enable/disable batch insert
MULTIPLE_CONNECTIONS=on      # Enable/disable multi-threading
TRANSACTIONAL=off            # Enable/disable transactions

Note: Unsupported options for certain databases (e.g., SQLite) are ignored.


Performance

Speed Test: 2 Million Rows

System Configuration:

  • OS: Ubuntu Linux
  • Processor: Intel® Core™ i7-3770S @ 3.10GHz
  • Storage: SSD
Database     Duration     Mode                            Threads
SQLite       52 seconds   Transactional, Batch SQL        1
MySQL        65 seconds   Transactional, Multi-Threaded   10
PostgreSQL   43 seconds   Transactional, Multi-Threaded   10
Firebird     5m 42s       Transactional, Multi-Threaded   10

Makefile Targets

Test Imports

make vehicles
make customers

Switch Environments

make switch-sqlite
make switch-mysql
make switch-pgsql
make switch-firebird

Local Testing with Docker

A docker-compose setup is provided for testing:

cd docker
docker-compose up -d

Roadmap

Planned Improvements

  • Distributed Import: Split CSV files across multiple instances (pods/servers) for faster parallel imports.
  • Enhanced Configuration: Support more advanced database-specific settings.

Start importing your CSV files faster and more efficiently today!
