2012-2016 Russian enterprises financial reports

The code collects corporate data from the Rosstat statistics office web site and stores full or sliced CSV files for further analysis in R/pandas/Eviews.

Easy path: novice users can download smaller subsets of Rosstat data as csv/xlsx files (fewer variables, fewer companies, from 3-5 Mb to 10-20 Mb per year).

Hard way: more experienced users can reproduce a clean version of the full Rosstat dataset on a local computer (300 Mb-1.3 Gb per year).

Latest data:

http://www.gks.ru/opendata/dataset/7708234640-bdboo2016

Variable descriptions:

https://github.com/epogrebnyak/data-rosstat-boo-2013

Source data

For every year in 2012-2015 we have a file with column names and an archived CSV file with data. Column names are the same for all 4 years.
Each data file is 1-2 Gb when unpacked and has more than 250 columns and 1 to 2 mln rows.

The source dataset is a bit dirty:
- a small part of the rows uses different monetary units (rub and mln rub instead of thousand rub); this is the main data transformation issue
- several rows are corrupted in the source files (see "Known issues" below)
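The unit adjustment can be sketched as follows. This is a minimal illustration, not the repository's code: it assumes each row carries a unit code from the OKEI classifier (383 for rub, 384 for thousand rub, 385 for mln rub), and the name `to_thousand_rub` is hypothetical.

```python
# Hypothetical sketch: bring numeric values to thousand rub.
# Assumption: unit codes follow OKEI (383 = rub, 384 = thousand rub, 385 = mln rub).
# Each code maps to a (numerator, denominator) pair to avoid inexact float factors.
FACTORS = {"383": (1, 1000), "384": (1, 1), "385": (1000, 1)}


def to_thousand_rub(values, unit_code):
    """Return values rescaled to thousand rub, rounded to whole numbers."""
    if unit_code not in FACTORS:
        raise ValueError(f"Unknown unit code: {unit_code!r}")
    num, den = FACTORS[unit_code]
    return [round(v * num / den) for v in values]


print(to_thousand_rub([1500000, 250000], "383"))  # values reported in rub
print(to_thousand_rub([3, 12], "385"))            # values reported in mln rub
```

Rows already reported in thousand rub pass through unchanged, so the same function can be applied to every row.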

Usage

Use the code below to obtain the 2012 dataset. Supported years are 2012-2015, but older files are smaller, so try running 2012 or 2013 before 2015.

from remote import RawDataset
from rows import Dataset

year = 2012
RawDataset(year).download().unpack()
Dataset(year).to_csv()
df = Dataset(year).read_df()

Note: you will be working with large datasets. Creating the files may take 2-3 minutes on a fast computer and much longer on laptops and older machines. Consider downloading smaller datasets [here] if this code hangs on your machine.

Download and unrar raw csv

Download rar file
Unpack raw csv from rar file

Make local csv file

Purge broken lines from raw csv (company has no INN field, wrong number of columns)
Transform data:
- adjust numeric values to '000 rub
- produce file with fewer columns (controlled by columns.RENAMER)
- add new text columns (okved levels, title, year, region by inn)
Keep INN and region codes as strings
Add headers, datacolumns as in Columns().RENAMER
Save as local CSV file
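The purge step above can be sketched with the standard `csv` module. This is an illustration under stated assumptions, not the repository's parser: the column count, the INN position, the `;` delimiter and the sample rows are all made up, and `is_valid` and `clean_rows` are hypothetical names.

```python
import csv
import io

# Assumptions for illustration only: the real Rosstat files have more than
# 250 columns and their own layout.
N_COLUMNS = 4
INN_POSITION = 1


def is_valid(row):
    """Keep rows with the expected number of columns and a non-empty INN."""
    return len(row) == N_COLUMNS and row[INN_POSITION].strip() != ""


def clean_rows(lines):
    """Yield valid rows and skip broken ones; all values stay strings."""
    for row in csv.reader(lines, delimiter=";"):
        if is_valid(row):
            yield row


# Made-up sample rows.
raw = io.StringIO(
    "Company A;0274062111;100;200\n"
    "Company B;;5;6\n"              # no INN
    "Company C;7708234640;1;2;3\n"  # too many columns
    "\n"                            # empty last row
)
rows = list(clean_rows(raw))
print(rows)
```

Because `csv.reader` returns every field as a string, an INN with a leading zero survives this step unchanged.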

Read local csv file as pandas dataframe

Read dataframe using pd.read_csv with dtypes (it loads file faster)
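A minimal sketch of reading with explicit dtypes, assuming hypothetical column names (`inn`, `region`, `sales`) and a made-up in-memory sample instead of the real local CSV file:

```python
import io

import pandas as pd

# Made-up sample standing in for the local CSV file.
csv_text = "inn,region,sales\n0274062111,02,100\n7708234640,77,250\n"

# Explicit dtypes: identifiers stay strings, numeric columns are parsed as integers.
# pandas does not have to infer column types when they are given up front.
dtypes = {"inn": str, "region": str, "sales": "int64"}
df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

print(df.dtypes)
print(df["inn"].tolist())
```

Without the `dtype` argument, `inn` and `region` in this sample would be parsed as integers and lose their leading zeros.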

Subsets: parts of the dataset

A dataframe like df=Dataset(year).read_df() is still very big, noisy and slow to explore.
Subsets allow creating row slices of the dataset; column names stay the same.

from reader import Subset
Subset(2015, 'test1').to_csv()

Known issues

1. The key field INN must be 10 digits long but sometimes starts with 0, so we try to keep it as a string, not an int. An alternative is to cast all INNs to int. In practice, df.merge(on='inn') loses some matches, probably because of inconsistent INN types.

2. Reading source csv file:

- one line has more elements than there are columns
- several lines have no INN field
- the CSV may end with an empty row

3. Full-length datasets do not fit in memory in pandas on many computers.

4. The latest revisions of the dataset wrongly mix units, so fake large companies appear.
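For the INN typing problem in issue 1, one possible workaround is to normalize the key on both sides before merging. The sketch below assumes 10-digit INNs with no missing values; `normalize_inn` is a hypothetical helper and the dataframes are made up.

```python
import pandas as pd


def normalize_inn(series):
    """Cast INN values to zero-padded 10-character strings."""
    return series.astype(str).str.zfill(10)


# One side read the INN as int (leading zero lost), the other kept it as a string.
left = pd.DataFrame({"inn": [274062111, 7708234640], "sales": [100, 250]})
right = pd.DataFrame({"inn": ["0274062111", "7708234640"], "title": ["A", "B"]})

left["inn"] = normalize_inn(left["inn"])
right["inn"] = normalize_inn(right["inn"])

# indicator=True adds a _merge column that shows which rows found a match.
merged = left.merge(right, on="inn", how="left", indicator=True)
print(merged)
```

Rows marked `left_only` in the `_merge` column are the lost matches, which makes the problem easy to measure.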
