Website: policychangeindex.org
Authors: Julian TszKin Chan and Weifeng Zhong
Please email all comments/questions to julian.chan [AT] policychangeindex.org or weifeng.zhong [AT] policychangeindex.org
China's industrialization process has long been a product of government direction, be it coercive central planning or ambitious industrial policy. For the first time in the literature, we develop a quantitative indicator of China's policy priorities over a long period of time, which we call the Policy Change Index for China (PCI-China). The PCI-China is a leading indicator that runs from 1951 to the most recent quarter and can be updated in the future. In other words, the PCI-China not only helps us understand the past of China's industrialization but also allows us to make short-term predictions about its future directions.
The design of the PCI-China has two building blocks: (1) it takes as input data the full text of the People's Daily, the official newspaper of the Communist Party of China, since it was founded in 1946; (2) it employs a set of machine learning techniques to "read" the articles and detect changes in the way the newspaper prioritizes policy issues.
The PCI-China's predictive power rests on the fact that the People's Daily is at the nerve center of China's propaganda system and that propaganda changes often precede policy changes. Before the great transformation from central planning under Mao to the economic reform program after Mao, for example, the Chinese government made considerable efforts to promote the idea of reform, move public opinion, and mobilize resources toward the new agenda. Therefore, by detecting (real-time) changes in propaganda, the PCI-China is, effectively, predicting (future) changes in policy.
For details about the methodology and findings of this project, please see the following research paper:
- Chan, Julian TszKin and Weifeng Zhong. 2019. "Reading China: Predicting Policy Change with Machine Learning." AEI Economics Working Paper No. 2018-11 (latest version available here).
Results will change as the underlying models improve. A fundamental reason for adopting open source methods in this project is so that people from all backgrounds can contribute to the models that our society uses to assess and predict changes in public policy; when community-contributed improvements are incorporated, the model will produce better results.
The first step for everyone (users and developers) is to open a free GitHub account. You can then choose how you want to "watch" the PCI-China repository by clicking the Watch button in the upper-right corner of the repository's main page.
The second step is to get familiar with the PCI-China repository by reading the documentation.
If you want to ask a question or report a bug, create a new issue here and post your question or tell us what you think is wrong with the repository.
If you want to request an enhancement, create a new issue here and provide details on what you think should be added to the repository.
First, install the dependencies and set up the proper environment by running the following command in the shell:
./PCI-China>conda env create -f environment.yml
Second, activate the new environment pci_env:
./PCI-China>conda activate pci_env
Third, run the following in the pci_env environment:
./PCI-China>sh run_all.sh
The above command will perform the following tasks: (1) processing data, (2) training models for two-, five-, and ten-year rolling windows, (3) compiling results, (4) creating text output, and (5) visualizing results.
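To make the rolling windows in task (2) concrete, the sketch below lists the year-quarters that a window of a given length would cover. It assumes that an N-year window consists of the 4N quarters immediately preceding the target year-quarter; that definition and the function name `window_quarters` are illustrative assumptions, and the repository's own window logic may differ.

```python
def window_quarters(year, quarter, window_years):
    """Return the (year, quarter) pairs in the window that ends just before the target.

    Assumption: an N-year window is the 4*N quarters immediately preceding
    the target year-quarter. Illustrative only, not the repository's code.
    """
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1, 2, 3, or 4")
    # Place every quarter on a single integer axis so the arithmetic is simple.
    target = year * 4 + (quarter - 1)
    quarters = []
    for index in range(target - 4 * window_years, target):
        quarters.append((index // 4, index % 4 + 1))
    return quarters


if __name__ == "__main__":
    # The two-, five-, and ten-year window lengths mentioned above.
    for years in (2, 5, 10):
        window = window_quarters(2018, 1, years)
        print(f"{years}-year window: {window[0]} to {window[-1]}, {len(window)} quarters")
```

Under this assumption, a two-year window for the first quarter of 2018 would span the first quarter of 2016 through the fourth quarter of 2017.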
If you do not have the People's Daily data, you can run our tests, which estimate a PCI using a simulated data set:
./PCI-China>pytest
Notes
- The default setting uses the first GPU to run the code. If you don't have a GPU, the code can be run on the CPU by changing the GPU setting to -1 (see details below).
- One of the packages imported by PCI (jieba-fast) requires Visual Studio C++ Build Tools. Please check out jieba-fast's website for details.
The Python scripts and the R script listed below are called by the run_all.sh file. Users can also run them individually to perform the following tasks, respectively.
- proc_pd.py: Process and prepare the raw data from the People's Daily for building the neural network models.
- pci.py: Train a neural network model to construct the PCI-China for a specified year-quarter, using a specified rolling window length.
- compile_tuning.py: Compile the results from all models and export them to a .csv file.
- create_text_output.py: Generate the raw data together with the model's classification result for each article in a specified year-quarter.
- gen_figures.R: Generate figures.
- create_plotly.py: Create an interactive Plotly figure.
For the pci.py file, users can also check the descriptions of its command-line arguments using the --help option:
./PCI-China>python pci.py --help
Using TensorFlow backend.
usage: pci.py [-h] [--model MODEL] [--year YEAR] [--month MONTH] [--gpu GPU]
              [--iterator ITERATOR] [--root ROOT] [--temperature TEMPERATURE]
              [--discount DISCOUNT] [--bandwidth BANDWIDTH]

optional arguments:
  -h, --help            show this help message and exit
  --model MODEL         Model name: window_5_years_quarterly,
                        window_10_years_quarterly, window_2_years_quarterly
  --year YEAR           Target year
  --month MONTH         Target month
  --gpu GPU             Which gpu to use
  --iterator ITERATOR   Iterator in simulated annealing
  --root ROOT           Root directory
  --temperature TEMPERATURE
                        Temperature in simulated annealing
  --discount DISCOUNT   Discount factor in simulated annealing
  --bandwidth BANDWIDTH
                        Bandwidth in simulated annealing
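As a rough illustration of how these arguments fit together, the sketch below builds the command line for one pci.py run per model, using only flags that appear in the --help output above. The helper name `build_pci_command` and the year and month values are hypothetical; which month values pci.py accepts for a quarterly model is not stated here, so treat the example values as placeholders.

```python
import shlex

# Model names copied from the --help output above.
MODELS = (
    "window_2_years_quarterly",
    "window_5_years_quarterly",
    "window_10_years_quarterly",
)


def build_pci_command(model, year, month, gpu=None, root=None):
    """Build the argument list for one pci.py run (hypothetical helper).

    Flags left as None are omitted so that pci.py falls back to its own defaults.
    """
    if model not in MODELS:
        raise ValueError(f"unknown model name: {model}")
    command = ["python", "pci.py", "--model", model,
               "--year", str(year), "--month", str(month)]
    if gpu is not None:
        command += ["--gpu", str(gpu)]
    if root is not None:
        command += ["--root", root]
    return command


if __name__ == "__main__":
    # gpu=-1 selects the CPU, as described in the notes on GPU settings.
    for model in MODELS:
        print(shlex.join(build_pci_command(model, 2018, 3, gpu=-1)))
```

Each returned list can be passed to `subprocess.run(command, check=True)` from the PCI-China directory inside the pci_env environment.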
The raw data of the People's Daily, which are not provided in this repository, should be placed in the sub-folder PCI-China/Input/pd/. Each file in this sub-folder should contain one year-quarter of data, be named by the respective year-quarter, and be in the .pkl format. For example, the raw data for the first quarter of 2018 should be in the file 2018_Q1.pkl. Below is the list of column names and types of each raw data file:
>>> import pandas as pd
>>> df1 = pd.read_pickle("./PCI-China/Input/pd/pd_1946_1975.pkl")
>>> df1.dtypes
date datetime64[ns]
year int64
month int64
day int64
page int64
title object
body object
id int64
dtype: object
where title and body are the Chinese texts of the title and body of each article.
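For users preparing their own data, the following sketch shows one way to derive the expected file name for a year-quarter and to check a file's columns against the listing above. The helper names `raw_file_name` and `schema_problems` are hypothetical and not part of the repository; the check works on a plain mapping of column names to dtype strings, so it runs without pandas.

```python
# Column names and dtypes copied from the listing above.
EXPECTED_DTYPES = {
    "date": "datetime64[ns]",
    "year": "int64",
    "month": "int64",
    "day": "int64",
    "page": "int64",
    "title": "object",
    "body": "object",
    "id": "int64",
}


def raw_file_name(year, quarter):
    """Return the raw data file name for one year-quarter, e.g. 2018_Q1.pkl."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1, 2, 3, or 4")
    return f"{year}_Q{quarter}.pkl"


def schema_problems(dtypes):
    """Compare a {column: dtype string} mapping against EXPECTED_DTYPES.

    Returns a list of human-readable problems; an empty list means every
    expected column is present with the expected dtype.
    """
    problems = []
    for column, expected in EXPECTED_DTYPES.items():
        if column not in dtypes:
            problems.append(f"missing column: {column}")
        elif dtypes[column] != expected:
            problems.append(f"{column}: expected {expected}, found {dtypes[column]}")
    return problems


if __name__ == "__main__":
    print(raw_file_name(2018, 1))
    print(schema_problems({"date": "datetime64[ns]", "page": "float64"}))
```

With pandas installed, the mapping for a loaded file can be obtained with `df1.dtypes.astype(str).to_dict()`; dtype strings may vary across pandas versions, so a reported mismatch is a prompt to inspect the file rather than proof of an error.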
The processed data of the People's Daily, which are not provided in this repository, should be stored in the file PCI-China/data/Output/database.db. The file is in SQLite format. The schema of the database is shown in the table below:
import sqlite3
import pandas as pd
conn = sqlite3.connect("data/output/database.db")
pd.read_sql_query("PRAGMA TABLE_INFO(main)", conn)
index | cid | name | type | notnull | dflt_value | pk |
---|---|---|---|---|---|---|
0 | 0 | date | TIMESTAMP | 0 | None | 0 |
1 | 1 | id | INTEGER | 0 | None | 0 |
2 | 2 | page | REAL | 0 | None | 0 |
3 | 3 | title | TEXT | 0 | None | 0 |
4 | 4 | body | TEXT | 0 | None | 0 |
5 | 5 | strata | INTEGER | 0 | None | 0 |
6 | 6 | title_seg | TEXT | 0 | None | 0 |
7 | 7 | body_seg | TEXT | 0 | None | 0 |
8 | 8 | year | INTEGER | 0 | None | 0 |
9 | 9 | quarter | INTEGER | 0 | None | 0 |
10 | 10 | month | INTEGER | 0 | None | 0 |
11 | 11 | day | INTEGER | 0 | None | 0 |
12 | 12 | weekday | INTEGER | 0 | None | 0 |
13 | 13 | frontpage | INTEGER | 0 | None | 0 |
14 | 14 | page1to3 | INTEGER | 0 | None | 0 |
15 | 15 | title_len | INTEGER | 0 | None | 0 |
16 | 16 | body_len | INTEGER | 0 | None | 0 |
17 | 17 | n_articles_that_day | INTEGER | 0 | None | 0 |
18 | 18 | n_pages_that_day | REAL | 0 | None | 0 |
19 | 19 | n_frontpage_articles_that_day | INTEGER | 0 | None | 0 |
where title_int and body_int are the word embeddings (numeric vectors) of the title and body of each article.
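As an example of working with this schema, the sketch below counts all articles and front-page articles per year-quarter in the `main` table. It runs against an in-memory stand-in database that holds only a subset of the columns and a few made-up rows, and it assumes that `frontpage` is a 0/1 indicator; the helper names are hypothetical.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory stand-in for database.db (subset of columns, made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE main (id INTEGER, title TEXT, year INTEGER, "
        "quarter INTEGER, frontpage INTEGER)"
    )
    rows = [
        (1, "a", 2018, 1, 1),
        (2, "b", 2018, 1, 0),
        (3, "c", 2018, 2, 1),
    ]
    conn.executemany("INSERT INTO main VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def frontpage_counts(conn):
    """Return (year, quarter, n_articles, n_frontpage) tuples, one per year-quarter.

    Assumption: frontpage is 1 for front-page articles and 0 otherwise.
    """
    query = """
        SELECT year, quarter, COUNT(*) AS n_articles, SUM(frontpage) AS n_frontpage
        FROM main
        GROUP BY year, quarter
        ORDER BY year, quarter
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in frontpage_counts(make_demo_db()):
        print(row)
```

To run the same query on real processed data, replace `make_demo_db()` with `sqlite3.connect(...)` pointing at your own database.db file.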
The summary statistics for the processed data can be found in the following .csv file:
https://github.com/PSLmodels/PCI-China/blob/master/PCI-China/figures/Summary%20statistics.csv
Neither the raw data nor the processed data of the People's Daily can be released by the authors. Users who have questions about applying the repository to their own data are welcome to contact the authors:
- Julian TszKin Chan: julian.chan [AT] policychangeindex.org;
- Weifeng Zhong: weifeng.zhong [AT] policychangeindex.org.
Please cite the source of the latest PCI-China by the website: https://policychangeindex.org.
For academic work, please cite the following research paper:
- Chan, Julian TszKin and Weifeng Zhong. 2019. "Reading China: Predicting Policy Change with Machine Learning." AEI Economics Working Paper No. 2018-11 (latest version available here).