Datasets

STS

All STS datasets are already included in the data folder.

  • For additional information on hard* and images*, please visit this link. (complete_corpus / headlines, images)
  • For details on free-test, refer to this paper.

COSTRA

The COSTRA dataset should be downloaded automatically. For more information about the dataset, please visit this link or the GitHub repository.

CFD

The CFD dataset should be automatically downloaded upon first evaluation. If not, you can manually download and unzip it into the data/ folder using the following commands:

curl --remote-name-all https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11858/00-097C-0000-0022-FE82-7{/facebook.zip}
unzip facebook.zip

For more information about the dataset, please visit this link.
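If curl and unzip are not available, the same manual download can be sketched in Python with the standard library only. The URL is the one from the curl command with the brace expanded. The function names `unzip_into` and `fetch_cfd` are hypothetical helpers and not part of this repository. The layout of the files inside the archive is not checked here.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL from the curl command above, with the {/facebook.zip} brace expanded.
CFD_URL = (
    "https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/"
    "11858/00-097C-0000-0022-FE82-7/facebook.zip"
)


def unzip_into(zip_path, dest_dir):
    """Extract a zip archive into dest_dir and return the archive member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


def fetch_cfd(data_dir="data"):
    """Download facebook.zip into data_dir and unzip it there (hypothetical helper)."""
    data = Path(data_dir)
    data.mkdir(parents=True, exist_ok=True)
    zip_path = data / "facebook.zip"
    urllib.request.urlretrieve(CFD_URL, zip_path)
    return unzip_into(zip_path, data)
```

Calling `fetch_cfd()` from the repository root would place the extracted files under data/, assuming the server is reachable.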

CTDC

The CTDC dataset will NOT be downloaded automatically. You can obtain it by following the steps outlined here. Once obtained, save the contents of the decompressed TGZ file as data/czech_text_document_corpus_v20.
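The extraction step can be sketched in Python as follows, assuming the TGZ archive has already been obtained and sits somewhere on disk. `extract_ctdc` is a hypothetical helper, not part of this repository. If the archive wraps its files in a top-level folder, that folder will appear inside the target directory and may need to be flattened by hand.

```python
import tarfile
from pathlib import Path


def extract_ctdc(archive_path, target_dir="data/czech_text_document_corpus_v20"):
    """Extract a .tgz archive into target_dir and list the top-level entries."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # "r:gz" opens a gzip-compressed tar archive for reading.
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target)
    return sorted(entry.name for entry in target.iterdir())
```

The returned list gives a quick way to confirm that the corpus files landed directly in the expected folder.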

DareCzech

The DareCzech dataset will NOT be downloaded automatically. You can obtain it by following the instructions provided here. Once obtained, save it as data/dareczech.
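Because CTDC and DareCzech have to be placed by hand, a small check before running an evaluation can catch a missing or misnamed folder early. The sketch below only tests that the two directories named in this document exist. `missing_manual_datasets` is a hypothetical helper, and it does not validate the contents of the folders.

```python
from pathlib import Path

# Folder names as given in this document, relative to the data/ folder.
MANUAL_DATASETS = {
    "CTDC": "czech_text_document_corpus_v20",
    "DareCzech": "dareczech",
}


def missing_manual_datasets(data_dir="data"):
    """Return the names of manually installed datasets whose folder is absent."""
    root = Path(data_dir)
    return [
        name
        for name, folder in MANUAL_DATASETS.items()
        if not (root / folder).is_dir()
    ]
```

An empty list means both folders are in place. Any name returned points to a dataset that still needs to be downloaded and saved under data/.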