BsyncGcs (Billion Sync GCS)

BsyncGcs synchronizes very large numbers of small files from Google Cloud Storage (GCS) to Amazon S3. It distributes the work across many workers through an AWS SQS queue, which makes it suitable for migrations involving billions of files.

How It Works

BsyncGcs operates in two modes:

Populator Mode: Reads a compressed CSV file from S3 that lists the names of the files stored in GCS, and populates an AWS SQS queue with one message per file to be transferred.
Worker Mode: Consumes messages from the SQS queue and transfers the corresponding files from GCS to S3.

This separation allows for scalable and efficient processing, enabling multiple workers to run in parallel and handle large volumes of data.
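The two modes can be pictured with a minimal in-memory sketch. It is not the tool's actual code: a buffered Go channel stands in for the SQS queue, and `populate`, `runWorkers`, and the `transfer` callback are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// populate pushes one message per file name onto the queue, then closes it.
// The channel is a stand-in for the SQS queue (assumption for this sketch).
func populate(names []string, queue chan<- string) {
	for _, name := range names {
		queue <- name
	}
	close(queue)
}

// runWorkers starts n workers that drain the queue and call transfer for each
// message. It returns the number of successful transfers and any errors.
func runWorkers(n int, queue <-chan string, transfer func(name string) error) (int, []error) {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		done int
		errs []error
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for name := range queue {
				err := transfer(name)
				mu.Lock()
				if err != nil {
					errs = append(errs, fmt.Errorf("transfer %s: %w", name, err))
				} else {
					done++
				}
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return done, errs
}

func main() {
	names := []string{"a/1.json", "a/2.json", "b/3.json"}
	queue := make(chan string, 2)
	go populate(names, queue)

	copied, errs := runWorkers(4, queue, func(name string) error {
		return nil // hypothetical: copy the object from GCS to S3 here
	})
	fmt.Printf("transferred %d files, %d errors\n", copied, len(errs))
}
```

Because the populator and the workers only share the queue, either side can be scaled or restarted without touching the other.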

Prerequisites

Go: Version 1.16 or higher.
AWS Account: With access to S3 and SQS services.
Google Cloud Account: With access to the GCS bucket containing the files.
AWS Credentials: Configured via AWS CLI or environment variables.
Google Cloud Credentials: Required only when the GCS bucket is not public.

Installation

Clone the Repository

```
git clone https://github.com/grep-juub/bsyncgcs.git
cd bsyncgcs
```

Install Dependencies
```
go mod download
```

Configuration

BsyncGcs uses environment variables for configuration. These variables can be set directly in your shell or via a .env file.

Common Environment Variables

MIGRATION_BUCKET_AWS_REGION: AWS region where your resources are located.
MIGRATION_BUCKET_SQS_QUEUE_URL: URL of the SQS queue.
MIGRATION_BUCKET_S3_BUCKET_NAME: Name of the S3 bucket.

Populator-Specific Variables

MIGRATION_BUCKET_MODE: Set to populator.
MIGRATION_BUCKET_OBJECT_NAME: S3 object key of the compressed CSV file containing the list of files.
MIGRATION_BUCKET_WORKER_QUEUE_LENGTH: (Optional) Size of the worker queue.
MIGRATION_BUCKET_MESSAGE_LIMIT: (Optional) Limit the number of messages to process.

Worker-Specific Variables

MIGRATION_BUCKET_MODE: Set to worker.
MIGRATION_BUCKET_GCS_BUCKET_NAME: Name of the GCS bucket containing the files.
MIGRATION_BUCKET_SQS_BATCH_SIZE: Number of SQS messages to process per batch.
MIGRATION_BUCKET_WORKER_QUEUE_LENGTH: (Optional) Size of the worker queue.
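As an illustration of how these variables fit together, the sketch below loads and validates them in Go. It is not taken from the repository's `config` package: the `Config` struct, `loadConfig`, and the fallback values (10 for the batch size, 100 for the queue length, borrowed from the example scripts) are assumptions.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Config groups the documented variables. Field names are hypothetical.
type Config struct {
	Mode, Region, QueueURL, S3Bucket string
	ObjectName                       string // populator only
	GCSBucket                        string // worker only
	BatchSize, WorkerQueueLen        int
	MessageLimit                     int // 0 means no limit
}

// intFromEnv parses an optional integer variable, falling back to def.
func intFromEnv(getenv func(string) string, key string, def int) (int, error) {
	raw := getenv(key)
	if raw == "" {
		return def, nil
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("%s: %w", key, err)
	}
	return n, nil
}

// loadConfig reads the MIGRATION_BUCKET_* variables through getenv and checks
// that the ones required for the selected mode are present.
func loadConfig(getenv func(string) string) (Config, error) {
	const prefix = "MIGRATION_BUCKET_"
	cfg := Config{
		Mode:       getenv(prefix + "MODE"),
		Region:     getenv(prefix + "AWS_REGION"),
		QueueURL:   getenv(prefix + "SQS_QUEUE_URL"),
		S3Bucket:   getenv(prefix + "S3_BUCKET_NAME"),
		ObjectName: getenv(prefix + "OBJECT_NAME"),
		GCSBucket:  getenv(prefix + "GCS_BUCKET_NAME"),
	}
	type field struct{ key, val string }
	required := []field{
		{"AWS_REGION", cfg.Region},
		{"SQS_QUEUE_URL", cfg.QueueURL},
		{"S3_BUCKET_NAME", cfg.S3Bucket},
	}
	switch cfg.Mode {
	case "populator":
		required = append(required, field{"OBJECT_NAME", cfg.ObjectName})
	case "worker":
		required = append(required, field{"GCS_BUCKET_NAME", cfg.GCSBucket})
	default:
		return Config{}, fmt.Errorf("%sMODE must be populator or worker, got %q", prefix, cfg.Mode)
	}
	for _, f := range required {
		if f.val == "" {
			return Config{}, fmt.Errorf("%s%s is required", prefix, f.key)
		}
	}
	var err error
	// Default values below are assumptions taken from the example scripts.
	if cfg.BatchSize, err = intFromEnv(getenv, prefix+"SQS_BATCH_SIZE", 10); err != nil {
		return Config{}, err
	}
	if cfg.WorkerQueueLen, err = intFromEnv(getenv, prefix+"WORKER_QUEUE_LENGTH", 100); err != nil {
		return Config{}, err
	}
	if cfg.MessageLimit, err = intFromEnv(getenv, prefix+"MESSAGE_LIMIT", 0); err != nil {
		return Config{}, err
	}
	return cfg, nil
}

// loadConfigFromEnv reads the real process environment.
func loadConfigFromEnv() (Config, error) { return loadConfig(os.Getenv) }

func main() {
	cfg, err := loadConfigFromEnv()
	if err != nil {
		fmt.Fprintln(os.Stderr, "config error:", err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", cfg)
}
```

Passing `getenv` as a function keeps the validation testable without modifying the process environment.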

Usage

Running the Populator

The populator reads a CSV file from S3, which contains a list of file names, and populates an SQS queue with messages for each file.
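The core of that step can be sketched as follows, under several assumptions: the CSV has no header row, the file name is in the first column, and messages are grouped in tens because SQS batch sends accept at most 10 entries. `batchNames`, `gzipLines`, and the `send` callback are hypothetical names; a real populator would read the object from S3 and call SQS inside `send`.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/csv"
	"fmt"
	"io"
)

const sqsMaxBatch = 10 // SQS batch sends accept at most 10 entries

// batchNames streams a gzip-compressed CSV, takes the first column of each
// row as a file name, and hands batches of up to sqsMaxBatch names to send.
// limit caps the number of names processed; 0 means no limit.
func batchNames(r io.Reader, limit int, send func(batch []string) error) (int, error) {
	gz, err := gzip.NewReader(r)
	if err != nil {
		return 0, fmt.Errorf("open gzip stream: %w", err)
	}
	defer gz.Close()

	rows := csv.NewReader(gz)
	rows.FieldsPerRecord = -1 // tolerate rows with differing column counts

	total := 0
	batch := make([]string, 0, sqsMaxBatch)
	flush := func() error {
		if len(batch) == 0 {
			return nil
		}
		if err := send(batch); err != nil {
			return err
		}
		total += len(batch)
		batch = make([]string, 0, sqsMaxBatch) // fresh slice: send may keep the old one
		return nil
	}

	for limit == 0 || total+len(batch) < limit {
		rec, err := rows.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			return total, fmt.Errorf("read csv: %w", err)
		}
		if rec[0] == "" {
			continue // skip rows without a file name
		}
		batch = append(batch, rec[0])
		if len(batch) == sqsMaxBatch {
			if err := flush(); err != nil {
				return total, err
			}
		}
	}
	err = flush()
	return total, err
}

// gzipLines builds an in-memory compressed CSV for the demo below.
func gzipLines(lines []string) *bytes.Buffer {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	for _, l := range lines {
		zw.Write([]byte(l + "\n"))
	}
	zw.Close()
	return &buf
}

func main() {
	var names []string
	for i := 0; i < 25; i++ {
		names = append(names, fmt.Sprintf("dir/file-%02d.json", i))
	}
	total, err := batchNames(gzipLines(names), 0, func(batch []string) error {
		fmt.Printf("would send %d messages, first=%s\n", len(batch), batch[0])
		return nil
	})
	fmt.Println("queued:", total, "err:", err)
}
```

Streaming the file row by row keeps memory use flat regardless of how many names the CSV contains.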

Set Environment Variables

Configure the environment variables for your setup. See the Example Scripts section for guidance.

Run the Populator
```
go run main.go
```

Running the Worker

The worker consumes messages from the SQS queue and transfers the specified files from GCS to S3.
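A worker loop of this kind usually follows a receive, copy, delete pattern. The sketch below shows that pattern against an in-memory fake; the `Queue` interface, `Copier`, `drain`, and `memQueue` are hypothetical and do not reflect the repository's `awsutils` package. With real SQS, a single receive call returns at most 10 messages, and a message that is not deleted becomes visible again after its visibility timeout.

```go
package main

import (
	"errors"
	"fmt"
)

// errSimulated is used by the demo to fake a failed copy.
var errSimulated = errors.New("simulated copy failure")

// Message carries the two SQS fields a worker needs: the body (here, the
// object name) and the receipt handle used to delete it after success.
type Message struct {
	Body          string
	ReceiptHandle string
}

// Queue is a hypothetical abstraction over SQS receive and delete calls.
type Queue interface {
	Receive(max int) ([]Message, error)
	Delete(receiptHandle string) error
}

// Copier copies one object from GCS to S3 (hypothetical signature).
type Copier func(name string) error

// drain polls until a receive returns no messages. A message is deleted only
// after its copy succeeds; failed messages are left for redelivery.
func drain(q Queue, batchSize int, copyObject Copier) (int, int, error) {
	copied, failed := 0, 0
	for {
		msgs, err := q.Receive(batchSize)
		if err != nil {
			return copied, failed, fmt.Errorf("receive: %w", err)
		}
		if len(msgs) == 0 {
			// With real SQS an empty response does not prove the queue is
			// empty; a long-running worker would keep polling instead.
			return copied, failed, nil
		}
		for _, m := range msgs {
			if err := copyObject(m.Body); err != nil {
				failed++
				continue
			}
			if err := q.Delete(m.ReceiptHandle); err != nil {
				return copied, failed, fmt.Errorf("delete %s: %w", m.Body, err)
			}
			copied++
		}
	}
}

// memQueue is an in-memory fake. Unlike SQS it never redelivers messages.
type memQueue struct {
	pending []Message
	deleted []string
}

func (q *memQueue) Receive(max int) ([]Message, error) {
	if max > len(q.pending) {
		max = len(q.pending)
	}
	out := q.pending[:max]
	q.pending = q.pending[max:]
	return out, nil
}

func (q *memQueue) Delete(receiptHandle string) error {
	q.deleted = append(q.deleted, receiptHandle)
	return nil
}

func main() {
	q := &memQueue{}
	for i := 0; i < 23; i++ {
		name := fmt.Sprintf("obj-%d", i)
		q.pending = append(q.pending, Message{Body: name, ReceiptHandle: "rh-" + name})
	}
	copied, failed, err := drain(q, 10, func(name string) error {
		if name == "obj-7" {
			return errSimulated
		}
		return nil
	})
	fmt.Println("copied:", copied, "failed:", failed, "err:", err)
}
```

Deleting only after a successful copy means a crashed worker loses no work: its in-flight messages reappear on the queue and another worker picks them up.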

Set Environment Variables

Configure the environment variables for your setup.

Run the Worker
```
go run main.go
```

Example Scripts

Below are example scripts to run BsyncGcs in both populator and worker modes.

Populator Mode Script

Create a file named run_populator.sh:

```
#!/bin/bash

# Set environment variables for Populator mode
export MIGRATION_BUCKET_AWS_REGION="us-west-2"
export MIGRATION_BUCKET_SQS_QUEUE_URL="https://sqs.us-west-2.amazonaws.com/123456789012/your-queue"
export MIGRATION_BUCKET_S3_BUCKET_NAME="your-s3-bucket-name"
export MIGRATION_BUCKET_OBJECT_NAME="path/to/yourfile.csv.gz"
export MIGRATION_BUCKET_WORKER_QUEUE_LENGTH="100"
export MIGRATION_BUCKET_MESSAGE_LIMIT="0"  # Set to 0 for no limit
export MIGRATION_BUCKET_MODE="populator"

# Run the application
go run main.go
```

Worker Mode Script

Create a file named run_worker.sh:

```
#!/bin/bash

# Set environment variables for Worker mode
export MIGRATION_BUCKET_AWS_REGION="us-west-2"
export MIGRATION_BUCKET_SQS_QUEUE_URL="https://sqs.us-west-2.amazonaws.com/123456789012/your-queue"
export MIGRATION_BUCKET_S3_BUCKET_NAME="your-s3-bucket-name"
export MIGRATION_BUCKET_GCS_BUCKET_NAME="your-gcs-bucket-name"
export MIGRATION_BUCKET_SQS_BATCH_SIZE="10"
export MIGRATION_BUCKET_WORKER_QUEUE_LENGTH="100"
export MIGRATION_BUCKET_MODE="worker"

# Run the application
go run main.go
```

Make the Scripts Executable:

```
chmod +x run_populator.sh
chmod +x run_worker.sh
```

Context and Motivation

When a very large number of small files has to be migrated between cloud storage providers, traditional tools and methods can be inefficient and slow. This was the case when we needed to transfer approximately 300 TB of data, consisting of nearly 1 billion files, from GCS to S3.

Due to the small size of each file (~200 KB), the overhead of transferring files individually became a significant bottleneck. To overcome this, we designed BsyncGcs to:

Parallelize Processing: Run multiple workers concurrently to handle file transfers.
Manage the Queue Efficiently: Use AWS SQS to distribute tasks among workers.
Scale Horizontally: Deploy on an AWS EKS cluster so the number of workers can grow with the workload.

By implementing BsyncGcs, we achieved a high-throughput migration process, significantly reducing the total time required for the data transfer.
