All STS datasets are already included in the data folder.
- For additional information on hard* and images*, please visit this link. (complete_corpus/headlines, images)
- For details on free-test, refer to this paper.
The COSTRA dataset should be downloaded automatically. For more information about the dataset, please visit this link or the GitHub repository.
The CFD dataset should be downloaded automatically upon first evaluation. If it is not, you can manually download it and unzip it into the data/ folder using the following commands:
curl --remote-name-all https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11858/00-097C-0000-0022-FE82-7{/facebook.zip}
unzip facebook.zip -d data/
For more information about the dataset, please visit this link.
The CTDC dataset will NOT be downloaded automatically. You can obtain it by following the steps outlined here. Once obtained, save the contents of the decompressed TGZ file as data/czech_text_document_corpus_v20.
The DareCzech dataset will NOT be downloaded automatically. However, you can obtain it by following the instructions provided here. Save the DareCzech dataset as data/dareczech.
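Because CTDC and DareCzech have to be placed by hand, it can help to confirm that both paths exist before starting an evaluation. The following is a minimal sketch, not part of the repository: the function name check_manual_datasets is hypothetical, and the only paths it checks are the two named above. It tests for existence only, since the exact layout inside each path is not assumed here.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): report whether the two
# manually obtained datasets are present under the given data directory.
# The path names are taken from the instructions above.
check_manual_datasets() {
    data_dir="${1:-data}"   # defaults to ./data when no argument is given
    missing=0
    for name in czech_text_document_corpus_v20 dareczech; do
        if [ -e "$data_dir/$name" ]; then
            echo "found: $data_dir/$name"
        else
            echo "missing: $data_dir/$name"
            missing=1
        fi
    done
    # Exit status is 0 when both paths exist, 1 otherwise.
    return "$missing"
}
```

After sourcing the snippet, running check_manual_datasets data from the repository root prints one line per dataset and returns a non-zero status if either path is missing.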