This repository contains code for constructing and evaluating a market report generation dataset. The pipeline processes financial market data from various sources and pairs it with market reports to create training datasets for language models.
Due to copyright restrictions, some data must be collected manually:
Download the following U.S. Treasury yield series from the Wall Street Journal (2018/01/01–2023/06/30) into data/raw/wsj/:
| Bond Yield | URL | File Name |
|---|---|---|
| 1-Year | Link | us_1_year_bond_yield.csv |
| 2-Year | Link | us_2_year_bond_yield.csv |
| 3-Year | Link | us_3_year_bond_yield.csv |
| 5-Year | Link | us_5_year_bond_yield.csv |
| 7-Year | Link | us_7_year_bond_yield.csv |
| 10-Year | Link | us_10_year_bond_yield.csv |
| 30-Year | Link | us_30_year_bond_yield.csv |
Download the CME Feeder Cattle Index from LRP Advisors (use the "ALL" button, then the "CSV" export) to data/raw/cme/feeder_cattle_index.csv
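Once everything is downloaded, a quick sanity check can confirm that all manually collected files are in place. This is a minimal sketch that assumes only the directory layout and file names listed above:

```python
from pathlib import Path

# Expected manually collected files (see the table above).
wsj_dir = Path("data/raw/wsj")
maturities = ["1", "2", "3", "5", "7", "10", "30"]
expected = [wsj_dir / f"us_{m}_year_bond_yield.csv" for m in maturities]
expected.append(Path("data/raw/cme/feeder_cattle_index.csv"))

missing = [str(p) for p in expected if not p.exists()]
if missing:
    raise FileNotFoundError(f"Missing manually collected files: {missing}")
print("All manually collected files are present.")
```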
The pipeline consists of four main scripts that process data sequentially:
# Set up your DataBento API key first
export DATABENTO_API_KEY=your_key_here
# Download market data
python download_data.py
Objectives:
- Download market data for individual tickers across the specified time span
- Process data from multiple sources (Yahoo Finance, CME, etc.)
- Combine all market data into a single processed file
Output Structure:
data/
├── table_data/
│   └── raw/
│       └── <market>/
│           └── <ticker>.csv          # Individual ticker data
└── intermediate/
    └── processed_data.csv            # Combined market data
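For the Yahoo Finance portion, the per-ticker download amounts to fetching daily data over the collection window and writing one CSV per ticker. Here is a minimal sketch using yfinance; the ticker, market name, and output path are illustrative placeholders, not the script's actual configuration:

```python
from pathlib import Path

import yfinance as yf

# Hypothetical example: one ticker over the collection window.
ticker, market = "^GSPC", "equity"
df = yf.download(ticker, start="2018-01-01", end="2023-06-30")

out_path = Path("data/table_data/raw") / market / f"{ticker}.csv"
out_path.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(out_path)
```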
python process_table_data.py
Objectives:
- Extract historical data for each specified time span
- Process and format data for each market report
- Organize data by market and data source
Output Structure:
data/
└── table_data/
    └── report_table_data/
        └── <historical_time_span>/
            └── <split>/
                └── <market-report_data_source>/
                    └── <report_date>.csv
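Conceptually, each report date receives the trailing window of market data that precedes it. A rough pandas sketch; the column name, window length, and output path here are assumptions, with the real values set inside process_table_data.py:

```python
import pandas as pd

# Assumes a "date" column in the combined data; adjust to the actual schema.
data = pd.read_csv("data/intermediate/processed_data.csv", parse_dates=["date"])

report_date = pd.Timestamp("2023-06-15")
window_days = 30  # hypothetical <historical_time_span>
mask = (data["date"] > report_date - pd.Timedelta(days=window_days)) & (
    data["date"] <= report_date
)
window = data.loc[mask]

window.to_csv(f"data/table_data/report_table_data/{window_days}d/train/"
              f"example_market-example_source/{report_date.date()}.csv", index=False)
```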
python construct_dataset.py
Objectives:
- Combine table data with corresponding market reports
- Format prompts for model training
- Include relevant metadata
Output Structure:
data/
└── processed_dataset/
    └── <historical_time_span>/
        └── <split>.json              # Contains tables, prompts, reports, and metadata
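Each split file holds one record per report. The field names below are illustrative rather than the script's actual schema; a hedged sketch of what a record might look like:

```python
import json

# Hypothetical record layout; the real keys are defined in construct_dataset.py.
record = {
    "report_date": "2023-06-15",
    "table": "<serialized table data for the window>",
    "prompt": "<formatted prompt built from the table>",
    "report": "<gold market report text>",
    "metadata": {"market": "<market>", "source": "<data source>"},
}

with open("data/processed_dataset/30d/train.json", "w") as f:
    json.dump([record], f, indent=2)
```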
python tokenize_dataset.py
Objectives:
- Convert processed dataset into tokenized format
- Prepare data for training open-source language models
- Support multiple tokenizer options
Output Structure:
data/
└── tokenized_dataset/
    └── <historical_time_span>/
        └── <tokenizer>/
            └── <split>/              # Tokenized data ready for training
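The tokenization step can be pictured with Hugging Face transformers and datasets. A minimal sketch, where the model name, column names, and sequence length are assumptions rather than the script's defaults:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Hypothetical tokenizer and paths; tokenize_dataset.py supports multiple tokenizers.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
dataset = load_dataset("json", data_files={"train": "data/processed_dataset/30d/train.json"})

def tokenize(example):
    # Concatenate the prompt and the gold report into one training sequence.
    return tokenizer(example["prompt"] + example["report"],
                     truncation=True, max_length=2048)

tokenized = dataset.map(tokenize)
tokenized.save_to_disk("data/tokenized_dataset/30d/llama-2-7b/")
```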