Seq2Seq-Vis

A visual debugging tool for Sequence-to-Sequence models

By IBM Research in Cambridge and Harvard SEAS. More info: seq2seq-vis.io

Install and run with conda

We require Miniconda to create a virtual environment and install all dependencies via scripts. Seq2Seq-Vis currently works with a version of OpenNMT-py modified by Sebastian Gehrmann. We provide a script to install this special branch.

After installation, you should have a file structure like this:

MyS2S/Seq2Seq-Vis                   ==> the tool
MyS2S/Seq2Seq-Vis/0316-fakedates/   ==> example data
MyS2S/OpenNMT-py                    ==> modified OpenNMT

1 - Install dependencies (server and client) and create virtual environment

Create a root directory (MyS2S) and then:

git clone https://github.com/HendrikStrobelt/Seq2Seq-Vis.git
cd Seq2Seq-Vis

Then run in /Seq2Seq-Vis:

source setup_cpu.sh

2 - Install custom OpenNMT-py version

Return to the root directory:

cd ..
source Seq2Seq-Vis/setup_onmt_custom.sh

3 - Download some example data

We provide example data for a character-based dataset that converts date strings (e.g. "March 03, 1999", "03/03/99") into the base form "mm-dd-yyyy". Download the archive (~177 MB), save it to /Seq2Seq-Vis, and unzip it:

unzip fakedates.zip

4 - Run the system

python3 server.py --dir 0316-fakedates/

Then open: http://localhost:8080/client/index.html?in=M a r c h _ 0 3 , 1 9 9 9

You should see:

Enjoy exploring!
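The example URL above passes the input sequence in the `in` query parameter, with tokens separated by spaces. When generating such links from a script, the spaces and commas need percent-encoding. The following is a minimal sketch using only the Python standard library; the helper name `client_url` and its `host`/`port` arguments are hypothetical, and it assumes the server decodes percent-encoded query strings the way a browser-submitted URL would be decoded.

```python
from urllib.parse import quote, urlencode


def client_url(tokens, host="localhost", port=8080):
    """Return the client URL with `tokens` percent-encoded in the `in` parameter.

    `tokens` is the already tokenized input string, e.g. space-separated
    characters as in the example above. This helper is illustrative only.
    """
    # quote_via=quote encodes spaces as %20 instead of '+'.
    query = urlencode({"in": tokens}, quote_via=quote)
    return f"http://{host}:{port}/client/index.html?{query}"


url = client_url("M a r c h _ 0 3 , 1 9 9 9")
print(url)
```

Pasting the unencoded URL into a browser address bar works as well, since browsers encode the spaces themselves.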

Install and run with docker

Thanks to Samuel Gratzl for contributing a Docker configuration and image. Here are the steps:

  1. Pull the image: docker pull sgratzl/seq2seq-vis
  2. Download the example data (~177 MB) and unzip it: unzip fakedates.zip
  3. Run the container with the data directory mounted:
    docker run --rm -it -v "${PWD}/0316-fakedates:/data" -p "8080:8080" sgratzl/seq2seq-vis

Prepare and run own models

1 - Prepare your data

You can use any model trained with OpenNMT-py to extract your own data. To gain access to the extraction scripts, follow the instructions above to install the modified OpenNMT-py version.

First, create a folder s2s that will be used to save all the extractions by calling mkdir s2s.

Then, call

python extract_context.py -src $your_input_file \
                          -tgt $your_target_file \
                          -model $your_model.pt \
                          -gpu $your_GPU_id \
                          -batch_size $your_batch_size

The -gpu option can be omitted for CPU extraction.

You can customize the maximum sequence lengths by setting max_src_len and max_tgt_len in the script. If you want to restrict the number of examples in your state file, uncomment the following lines and set the limit to your desired size:

# if bcounter > 100:
#     break

The script creates a file in the location s2s/states.h5. This file is what you need to create the indices for searching.

The script for this is scripts/h5_to_faiss.py in this repository. Call it three times (once for each type of state) with the parameters

-states s2s/states.h5 # Your states file location
-data [decoder_out, encoder_out, cstar] # One of the three datasets within the states h5 file per call
-output $your_index_name # We recommend naming them decoder.faiss, encoder.faiss, and context.faiss
-stepsize 100 # Number of batches added to the index at once; can be increased, limited by available memory
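The three calls differ only in the dataset name and the output file, so they can be generated in a loop. The sketch below only builds the command lines and does not run them. It uses the flags listed above; the function name `index_commands`, writing the indices into s2s/, and the pairing of `cstar` with context.faiss are assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Dataset name inside states.h5 -> recommended index file name.
# Pairing cstar with context.faiss is an assumption based on the naming.
INDEX_NAMES = {
    "decoder_out": "decoder.faiss",
    "encoder_out": "encoder.faiss",
    "cstar": "context.faiss",
}


def index_commands(states="s2s/states.h5", out_dir="s2s", stepsize=100):
    """Build one h5_to_faiss.py command line per state type (nothing is executed)."""
    return [
        [
            "python", "scripts/h5_to_faiss.py",
            "-states", str(states),
            "-data", dataset,
            "-output", str(PurePosixPath(out_dir) / index_name),
            "-stepsize", str(stepsize),
        ]
        for dataset, index_name in INDEX_NAMES.items()
    ]


for cmd in index_commands():
    print(" ".join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root once the states file exists.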

To generate the dictionary and embedding files, modify this line with the location of your model and call

python VisServer.py

This will also test that your model works with our server as it calls the same API. The script will create three files:

  • s2s/embs.h5
  • s2s/src.dict
  • s2s/tgt.dict

2 - Create an s2s.yaml file to describe the project

# -- minimal config 
model: date_acc_100.00_ppl_1.00_e7.pt  # model file
dicts:
 src: src.dict  		# source dictionary file
 tgt: tgt.dict  		# target dictionary file
embeddings: embs.h5  	# word embeddings for src and tgt
train: train.h5			# training data 

# -- OPTIONAL: FAISS indices for Neighborhoods
indexType: faiss		# index type should be 'faiss' (or 'annoy')
indices:
 decoder: decoder.faiss		# index for decoder states
 encoder: encoder.faiss		# index for encoder states

# -- OPTIONAL: model for linear projection
project_model: linear_projection.pkl		# pickled scikit-learn model
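Before starting the server, it can help to confirm that every file named in s2s.yaml is actually present. The sketch below works on an already parsed config (for example the dictionary a YAML parser returns) and uses only the keys shown above. The function names are hypothetical, and it assumes the file names are resolved relative to the directory that contains s2s.yaml.

```python
from pathlib import Path

# The example config from above, as the dictionary a YAML parser would return.
EXAMPLE_CONFIG = {
    "model": "date_acc_100.00_ppl_1.00_e7.pt",
    "dicts": {"src": "src.dict", "tgt": "tgt.dict"},
    "embeddings": "embs.h5",
    "train": "train.h5",
    "indexType": "faiss",
    "indices": {"decoder": "decoder.faiss", "encoder": "encoder.faiss"},
    "project_model": "linear_projection.pkl",
}


def referenced_files(config):
    """Collect every file name mentioned in the config (optional keys included)."""
    files = [
        config["model"],
        config["dicts"]["src"],
        config["dicts"]["tgt"],
        config["embeddings"],
        config["train"],
    ]
    files += list(config.get("indices", {}).values())  # optional index files
    if "project_model" in config:                      # optional projection model
        files.append(config["project_model"])
    return files


def missing_files(config, project_dir):
    """Return the referenced files that do not exist inside project_dir.

    Assumption: paths are relative to the directory holding s2s.yaml.
    """
    root = Path(project_dir)
    return [name for name in referenced_files(config) if not (root / name).is_file()]


print(missing_files(EXAMPLE_CONFIG, "0316-fakedates"))
```

An empty list means all referenced files were found under the given directory.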

3 - Command Line Parameters

usage: server.py [-h] [--nodebug NODEBUG] [--port PORT]
                 [--dir DIR]

optional arguments:
  --nodebug 	TRUE if not in debug mode
  --port 		port to run system (default: 8080)
  --dir  		directory with s2s.yaml file
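For readers who want to script around these options, the usage text corresponds roughly to the following `argparse` setup. This is a hypothetical reconstruction from the help output above, not the actual code in server.py; in particular, parsing the port as an integer and leaving `--nodebug` as a plain string are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="server.py")
    # Assumption: the value is kept as a string such as "TRUE".
    parser.add_argument("--nodebug", default=None, help="TRUE if not in debug mode")
    # Assumption: the port is parsed as an integer.
    parser.add_argument("--port", type=int, default=8080,
                        help="port to run system (default: 8080)")
    parser.add_argument("--dir", help="directory with s2s.yaml file")
    return parser


args = build_parser().parse_args(["--dir", "0316-fakedates/", "--port", "8888"])
print(args.dir, args.port, args.nodebug)
```

The equivalent invocation of the real server would be `python3 server.py --dir 0316-fakedates/ --port 8888`.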

Cite us

@ARTICLE{seq2seqvisv1,
   author = {{Strobelt}, H. and {Gehrmann}, S. and {Behrisch}, M. and {Perer}, A. and {Pfister}, H. and {Rush}, A.~M.},
    title = "{Seq2Seq-Vis: A Visual Debugging Tool for Sequence-to-Sequence Models}",
  journal = {ArXiv e-prints},
archivePrefix = "arXiv",
   eprint = {1804.09299v1},
 primaryClass = "cs.CL",
 keywords = {Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Neural and Evolutionary Computing},
     year = 2018,
    month = April
}

Contributors

  • Hendrik Strobelt (IBM Research & MIT-IBM Watson AI Lab)

  • Sebastian Gehrmann (Harvard NLP)

  • Alexander M. Rush (Harvard NLP)

  • Michael Behrisch (Harvard VCG), Adam Perer (IBM Research), Hanspeter Pfister (Harvard VCG)

  • PR #16 signed-off-by: Samuel Gratzl

License

Seq2Seq-Vis is licensed under the Apache 2 license.