A neural network to detect and analyze ciphers from historical texts.
This project contains code for the detection and classification of ciphers to classical algorithms by using a neural network. In Future other parts of the cryptanalysis will be implemented. An online version of the neural networks will be officially published on https://www.cryptool.org/ncid.
This software and the online version on https://www.cryptool.org/ncid are licensed with the GPLv3 license. Private use of this software is allowed. Software using parts of the code from this repository must not be commercially used and also must be GPLv3 licensed.
Publications on websites and the like MUST be explicitly allowed by the author. For further information contact me at [email protected].
-
Clone this repository and enter it:
git clone https://github.com/dITySoftware/ncid cd ncid
-
Install the recommended and tested versions by using requirements.txt:
pip3 install -r requirements.txt
First of all this usage is not recommended as first option. Try running the train.py
or eval.py
script with the argument --download_dataset=True
, if you only want to train or test on the filtered dataset. Optionally you can download the dataset on your own from here
If you'd like to create your own plaintexts, you can use the generatePlainTextFiles.py
script. Therefore you first need to download some texts, for example the Gutenberg Library. You can do that by using following command, which downloads all English e-books compressed with zip. Note that this script can take a while and dumps about 14gb of files into ./data/gutenberg_en
and 5.3gb additionaly if you do not delete the gutenberg_en.zip
.
wget -m -H -nd "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en" -e use_proxy=yes -e http_proxy=46.101.1.221:80 > /tmp/wget-log 2>&1
The generatePlainTextFiles.py
script automatically unpacks the zips, with the parameter --restructure_directory
. Every line in a plaintext is seperated by a '\n', so be sure to save it in the right format or use the generatePlainTextFiles.py
script to reformat all files from '\r\n' to '\n'. For further description read the help by using the --help
parameter. Example usage:
python3 generatePlainTextFiles.py --directory=../gutenberg_en --restructure_directory=true
You might want to predict with one or more models by using the same ciphertext files. The generateCipherTextFiles.py
script encrypts plaintext files to multiple ciphertext files. The naming convention is fileName-cipherType-minLenXXX-maxLenXXX-keyLenXX.txt. This script generates ciphertexts out of plaintexts. If a line is not long enough it is concatenated with the next line. If a line is too long it is sliced into max_text_len length. For further description read the help by using the --help
parameter. Example usage:
python3 generateCipherTextFiles.py --min_text_len=100 --max_text_len=100 --max_files_count=100
To evaluate multiple models in the most comparable way, the features and ciphertexts are precalculated and saved into files using the generateCalculatedFeatures.py
script.
python3 generateCalculatedFeatures.py --dataset_workers=50 --min_len=100 --max_len=100 --save_directory=../data/generated_data --batch_size=512 --dataset_size=64960 --max_iter=10000000 > ../data/generate_data.txt 2> ../data/err_generate_data.txt
Here are our NCID models for all ACA ciphers and texts with exact length of 100. (released on March 20th, 2021): models100.zip
Here are our NCID models for the ACA ciphers amsco, bazeries, beaufort, bifid, cmbifid, digrafid, foursquare, fractionated_morse, gromark, gronsfeld, homophonic, monome_dinome, morbit, myszkowski, nicodemus, nihilist_substitution, periodic_gromark, phillips, playfair, pollux, porta, portax, progressive_key, quagmire2, quagmire3, quagmire4, ragbaby, redefence, seriated_playfair, slidefair, swagman, tridigital, trifid, tri_square, two_square, vigenere and texts with the lengths of 51-428. (released on March 20th, 2021): aca_models428.zip
There are multiple ways to evaluate the model. First of all it is needed to put the corresponding model file in the ../data/models
directory and run one of the following commands:
-
benchmark - Use this argument to create ciphertexts on the fly, like in training mode, and evaluate them with the model. This option is optimized for large throughput to test the model. Example usages:
python3 eval.py --model=../data/models/t128_ffnn_final_100.h5 --max_iter=1000000 --dataset_size=1120 benchmark --dataset_workers=10 --min_text_len=100 --max_text_len=100
python3 eval.py --architecture=Ensemble --models ../data/models/t128_ffnn_final_100.h5 ../data/models/t129_lstm_final_100.h5 ../data/models/t128_nb_final_100.h5 ../data/models/t99_rf_final_100.h5 ../data/models/t96_transformer_final_100.h5 --architectures FFNN LSTM NB RF Transformer --strategy=weighted --batch_size=512 --max_iter=1000000 --dataset_size=64960 benchmark --dataset_workers=10 --min_text_len=100 --max_text_len=100 > ../data/benchmark.txt 2> ../data/err_benchmark.txt
-
evaluate - Use this argument to evaluate cipher types for directories with ciphertext files in it. There are two evaluation_modes:
-
per_file - every file is evaluated on it's own. The average of all evaluations of that file is the output.
-
summarized - all files are evaluated and the average is printed at the end of the script. This mode is like benchmark, but is more reproducible and comparable, because the inputs are consistent.
Example usages:
python3 eval.py --model=../data/models/t128_ffnn_final_100.h5 --max_iter=100000 evaluate --evaluation_mode=per_file --ciphertext_folder=../data/generated_data
python3 eval.py --architecture=Ensemble --models ../data/models/t128_ffnn_final_100.h5 ../data/models/t129_lstm_final_100.h5 ../data/models/t128_nb_final_100.h5 ../data/models/t99_rf_final_100.h5 ../data/models/t96_transformer_final_100.h5 --architectures FFNN LSTM NB RF Transformer --strategy=weighted --batch_size=512 --dataset_size=64960 --max_iter=10000000 evaluate --data_folder=../data/generated_data --evaluation_mode=per_file > ../data/eval.txt 2> ../data/err_eval.txt
-
-
single_line - Use this argument to predict a single line of ciphertext. The difference of this command is, that in contrast to the other modes, the results are predicted without knowledge of the real cipher type. There are two types of data this command can process:
- ciphertext - A single line of ciphertext to be predicted by the model. Example usages:
python3 eval.py --model=../data/models/t128_ffnn_final_100.h5 single_line --ciphertext=yingraobhoarhthosortmkicwhaslcbpirpocuedcfthcezvoryyrsrdyaffcleaetiaaeuhtyegeadsneanmatedbtrdndres
python3 eval.py --architecture=Ensemble --models ../data/models/t128_ffnn_final_100.h5 ../data/models/t129_lstm_final_100.h5 ../data/models/t128_nb_final_100.h5 ../data/models/t99_rf_final_100.h5 ../data/models/t96_transformer_final_100.h5 --architectures FFNN LSTM NB RF Transformer --strategy=weighted --batch_size=512 single_line --file=../data/generated_data/aca_features.txt --verbose=False > weights/../data/predict.txt 2> weights/err_predict.txt
- file - A file with mixed lines of ciphertext to be predicted line by line by the model. Example usages:
python3 eval.py --model=../data/models/t128_ffnn_final_100.h5 single_line --verbose=false --file=../data/cipherstexts.txt
To see all options of eval.py
, run the --help
or -h
command.
python3 eval.py --help
By default we train the models to identify ACA ciphers listed here. The plaintexts used are already filtered and automatically downloaded in the train.py or eval.py scripts. You can turn off this behavior by setting --download_dataset=False
.
To see all options of train.py
, run the --help
or -h
command.
python3 train.py --help
-
python3 train.py --dataset_workers=50 --train_dataset_size=64960 --batch_size=512 --max_iter=1000000000 --min_train_len=100 --max_train_len=100 --min_test_len=100 --max_test_len=100 --model_name=t30.h5 > weights/t30.txt 2> weights/err_t30.txt &
-
python3 train.py --model_name=mtc3_model.h5 --ciphers=mtc3 # first config.py must be adapted
-
python3 train.py --architecture=FFNN --dataset_workers=50 --train_dataset_size=64960 --batch_size=512 --max_iter=1000000000 --min_train_len=100 --max_train_len=100 --min_test_len=100 --max_test_len=100 --model_name=t30.h5 > weights/t30.txt 2> weights/err_t30.txt &
NCID now supports multiple GPUs seamlessly during training:
Before running any of the scripts, run: export CUDA_VISIBLE_DEVICES=[gpus]
- Where you should replace [gpus] with a comma separated list of the index of each GPU you want to use (e.g., 0,1,2,3).
- You should still do this if only using 1 GPU.
- You can check the indices of your GPUs with
nvidia-smi
.
Multiple Unit-Tests ensure the functionality of the implemented ciphers and the TextLine2CipherStatisticsDataset.
Every test case can be executed by using following command in the main directory:
python3 -m unittest discover -s unit -p '*Test.py'
Single test classes can be executed with this command:
python3 -m unittest <path/to/test/class>
for example:
python3 -m unittest unit/cipherTypeDetection/textLine2CipherStatisticsDataset.py
Following are our training results from a DGX-1 with 2 GPUs on the models with length 100 and 6 GPUs on models with length 51-428. Models are differentiated into feature-engineering (FFNN, RF and NB) and feature-extracting (LSTM and Transformer) models. Models are evaluated with a dataset of 10 million self generated records.
Model Name | Accuracy in % | Iterations in Mio. | Training Time |
---|---|---|---|
t128_ffnn_final_100 | 78.31 | 181 | 7d 11h 14m |
t96_transformer_final_100 | 72.33 | 303 | 5d 8h 58m |
t99_rf_final_100 | 73.50 | 2.5 | 3h 24m |
t128_nb_final_100 | 52.79 | 181 | 7d 11h 14m |
t129_lstm_final_100 | 72.16 | 162 | 2d 21h 31m |
ensemble_mean_100 | 82.67 | - | - |
ensemble_weighted_100 | 82.78 | - | - |
t142_final_aca428_ffnn | 67.43 | 100 | 4d 5h 17m |
t145_transformer_final_aca428 | 59.54 | 114 | 8h 8m |
t144_rf_final_aca428 | 59.15 | 2.5 | 3h 18m |
t142_final_aca428_nb | 50.71 | 100 | 4d 5h 17m |
t143_lstm_final_aca428 | 63.41 | 89 | 9h 6m |
ensemble_mean428 | 70.79 | - | - |
ensemble_weighted428 | 70.78 | - | - |
AusDM 2021: Detection of Classical Cipher Types with Feature-Learning Approaches
If you use ncid in a scientific publication, we would appreciate using the following citations:
@inproceedings{leierzopf2021-2,
author="Leierzopf, Ernst and Mikhalev, Vasily and Kopal, Nils and Esslinger, Bernhard and Lampesberger, Harald and Hermann, Eckehard",
editor="Xu, Yue and Wang, Rosalindand Lord, Anton and Boo, Yee Ling and Nayak, Richi and Zhao, Yanchang and Williams, Graham",
title="Detection of Classical Cipher Types with Feature-Learning Approaches",
booktitle="Data Mining",
year="2021",
month="Dec",
publisher="Springer Singapore",
address="Singapore",
pages="152--164",
isbn="978-981-16-8531-6",
url="https://link.springer.com/chapter/10.1007/978-981-16-8531-6_11",
doi="10.1007/978-981-16-8531-6_11"
}
@inproceedings{leierzopf2021-1,
title = "A Massive Machine-Learning Approach For Classical Cipher Type Detection Using Feature Engineering",
author = "Ernst Leierzopf and Nils Kopal and Bernhard Esslinger and Harald Lampesberger and Eckehard Hermann",
year = "2021",
month = aug,
doi = "10.3384/ecp183164",
language = "English",
pages = "111--120",
booktitle = "Proceedings of the 4th International Conference on Historical Cryptology",
}