This repository contains a PyTorch implementation of CPC v1 for Natural Language (section 3.3) from the paper Representation Learning with Contrastive Predictive Coding.
I followed the details given in section 3.3 and obtained the missing details directly from one of the paper's authors.
Embedding layer
- vocabulary size: 20 000
- dimension: 620
Encoder layer (g_enc)
- 1D-convolution + ReLU + mean-pooling
- output dimension: 2400
Recurrent Layer (g_ar)
- GRU
- dimension: 2400
Prediction Layer {W_k}
- Fully connected
- timesteps: 3
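For concreteness, here is a minimal PyTorch sketch of how these layers could fit together. It only mirrors the dimensions listed above; the convolution kernel size, padding, and pooling details are assumptions, not the repository's exact code.

```python
import torch
import torch.nn as nn

class CPCModel(nn.Module):
    def __init__(self, vocab_size=20000, emb_dim=620, hidden_dim=2400, timesteps=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)  # 20 000 x 620
        # g_enc: 1D convolution + ReLU + mean-pooling (kernel size assumed)
        self.conv = nn.Conv1d(emb_dim, hidden_dim, kernel_size=3, padding=1)
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)  # g_ar
        # {W_k}: one fully connected layer per prediction step k = 1..3
        self.predictors = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(timesteps)]
        )

    def forward(self, x):
        # x: (batch, n_sentences, seq_len) token ids, e.g. 6 sentences of up to 32 tokens
        b, n, t = x.shape
        e = self.embedding(x.view(b * n, t)).transpose(1, 2)  # (b*n, emb_dim, t)
        z = torch.relu(self.conv(e)).mean(dim=2)              # mean-pool -> (b*n, 2400)
        z = z.view(b, n, -1)                                  # one vector per sentence
        c, _ = self.gru(z)                                    # context vectors (b, n, 2400)
        preds = [w(c) for w in self.predictors]               # k-step predictions
        return z, c, preds
```

Mean-pooling over the token dimension turns each sentence into a single 2400-dimensional vector, so g_ar runs over a sequence of sentence embeddings rather than over tokens.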
Training details
- input is 6 sentences
- maximum sequence length of 32
- negative samples are drawn from both the batch and time dimensions of the minibatch
- uses the Adam optimizer with a learning rate of 2e-4
- trained on 8 GPUs, each with a batch size of 64
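To make the negative-sampling detail concrete, here is a hedged sketch of an InfoNCE training step in which every prediction is scored against every encoded sentence in the minibatch, so negatives come from both the batch and time dimensions. Names and shapes follow the model sketch above and are illustrative only.

```python
import torch
import torch.nn.functional as F

def info_nce(z, c, predictors):
    """z: (b, n, d) encoded sentences; c: (b, n, d) context; predictors: list of W_k."""
    b, n, d = z.shape
    loss = 0.0
    for k, w in enumerate(predictors, start=1):
        pred = w(c[:, :-k, :]).reshape(-1, d)  # predictions for step t+k
        target = z[:, k:, :].reshape(-1, d)    # true future encodings
        # Scoring each prediction against all targets draws negatives from
        # both the batch and the time dimension of the minibatch.
        logits = pred @ target.t()             # (m, m) similarity matrix
        labels = torch.arange(logits.size(0), device=z.device)
        loss = loss + F.cross_entropy(logits, labels)
    return loss / len(predictors)
```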
Configuration File
This implementation uses a configuration file to conveniently set up the model. The config_cpc.yaml file includes the original parameters by default.
You have to adjust the following parameters to get started:
- logging_dir: directory for logging files
- books_path: directory containing the dataset
Optionally, if you want to log your experiments with comet.ml, you just need to install the library and set your api_key.
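As an illustration, the relevant entries of config_cpc.yaml might look like this (the paths and key are placeholders):

```yaml
logging_dir: ./logs          # directory for logging files
books_path: ./BookCorpus     # directory containing the dataset
api_key: YOUR_COMET_API_KEY  # optional, only if you log with comet.ml
```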
Dataset
This model uses the BookCorpus dataset for pretraining. You have to organize your data according to the following structure:
├── BookCorpus
│ └── data
│ ├── file_1.txt
│ ├── file_2.txt
Then you have to set the path of your dataset in the books_path parameter of the config_cpc.yaml file.
Note: You can use the publicly available files provided by Igor Brigadir at your own risk.
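As a rough sketch of how such a layout can be consumed, the snippet below walks the data folder and yields groups of 6 sentences truncated to 32 tokens. The one-sentence-per-line assumption and whitespace tokenization are placeholders, not the repository's actual pipeline.

```python
from pathlib import Path

def iter_examples(books_path, n_sentences=6, max_len=32):
    """Yield groups of consecutive sentences, each truncated to max_len tokens."""
    for txt_file in sorted(Path(books_path, "data").glob("*.txt")):
        with open(txt_file, encoding="utf-8") as f:
            # Placeholder: one sentence per line, whitespace tokenization
            sentences = [line.split()[:max_len] for line in f if line.strip()]
        for i in range(0, len(sentences) - n_sentences + 1, n_sentences):
            yield sentences[i:i + n_sentences]
```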
Training
When you have completed all the steps above, you can run:
python main.py
The implementation automatically saves a log of the experiment under the name cpc-date-hour and also saves the model checkpoints under the same name.
Resume Training
If you want to resume your model training, you just need to write the name of your experiment (cpc-date-hour) in the resume_name parameter of the config_cpc.yaml file and then run train.py.
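For example, the relevant entry might look like this (the experiment name below is a placeholder):

```yaml
resume_name: cpc-2020-01-01-12-00  # an existing run name of the form cpc-date-hour
```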
Vocabulary Expansion
The CPC model employs vocabulary expansion in the same way as the Skip-Thought model. You just need to set the run_name and word2vec_path parameters and then execute:
python vocab_expansion.py
The result is a numpy file of embeddings and a pickle file of the vocabulary. They will appear in a folder named vocab_expansion/.
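Skip-Thought-style vocabulary expansion fits a linear map from the word2vec space into the model's embedding space on the shared vocabulary, then projects out-of-vocabulary words through that map. The sketch below illustrates the idea; the function and argument names are hypothetical, not this repository's API.

```python
# Hypothetical sketch of Skip-Thought-style vocabulary expansion.
import numpy as np
from sklearn.linear_model import LinearRegression

def expand_vocab(model_emb, model_vocab, w2v):
    """model_emb: (V, 620) array; model_vocab: word -> row index; w2v: word -> vector."""
    shared = [w for w in model_vocab if w in w2v]
    X = np.stack([w2v[w] for w in shared])                      # word2vec side
    Y = np.stack([model_emb[model_vocab[w]] for w in shared])   # model side
    reg = LinearRegression().fit(X, Y)                          # learn the linear map
    words = list(w2v)
    projected = reg.predict(np.stack([w2v[w] for w in words]))  # map every w2v word
    expanded = dict(zip(words, projected))
    expanded.update({w: model_emb[i] for w, i in model_vocab.items()})  # keep originals
    return expanded
```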
Classification
Configuration File
This implementation uses a configuration file to set up the classifier. You have to set the following parameters in the config_clf.yaml file:
- logging_dir: directory for logging files
- cpc_path: path to the pretrained CPC model file
- expanded_vocab: True if you want to use the expanded vocabulary
- dataset_path: directory containing all the benchmark datasets
- dataset_name: name of the task (e.g. CR, TREC, etc.)
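For example, a filled-in config_clf.yaml excerpt might look like this (all paths and the checkpoint filename are placeholders):

```yaml
logging_dir: ./logs
cpc_path: ./checkpoints/cpc-date-hour.pt  # pretrained CPC model
expanded_vocab: True
dataset_path: ./benchmarks
dataset_name: TREC
```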
Dataset
This classifier uses common NLP benchmarks. You have to organize your data according to the following structure:
├── dataset_name
│ └── data
│ └── task_name
│ ├── task_name.train.txt
│ ├── task_name.dev.txt
Then you have to set the path of your data (dataset_path) and the task name (dataset_name) in the config_clf.yaml file.
Note: You can use the publicly available files provided by zenRRan.
Training
When you have completed the steps above, you can run:
python main_clf.py
The implementation automatically saves a log of the experiment under the name cpc-clf-date-hour and also saves the model checkpoints under the same name.
Results
The model should be trained for 1e8 steps with a batch size of 64 on each of 8 GPUs. The authors provided me with a snapshot of the first 1M training steps, which you can find here, and you can find the results of my implementation here. There is a slight difference, which may be due to factors such as the dataset or initialization. I have not been able to train the model entirely, so I have not replicated the benchmark results.
If anyone manages to fully train the model, feel free to share the results. I will be attentive to any questions or comments.
References
- Representation Learning with Contrastive Predictive Coding (van den Oord et al., 2018)
- Part of the code is borrowed from https://github.com/jefflai108/Contrastive-Predictive-Coding-PyTorch
- Part of the code is borrowed from https://github.com/ryankiros/skip-thoughts