expansion.for.pdseasrch

This project aims to enhance product representations using the content of product pages (text, images, and more). Retrieval methods will be based on sparse retrieval and learned sparse retrieval.


Document Sheet Slide


Dataset

We use the dataset collected by the TREC Product Search Track. The files we use are stored at product-search huggingface.

The files are linked to my datasets directory: /home/jhju/datasets/

| Original Files | # Examples |
|---|---|
| /home/jhju/datasets/pdsearch/corpus.jsonl | 1118658 |
| /home/jhju/datasets/pdsearch/images/* | 791108 (795498) |
| /home/jhju/datasets/pdsearch/qid2query.tsv | 30734 |
| /home/jhju/datasets/pdsearch/qid2query-dev-filtered.tsv | 8940 |
| /home/jhju/datasets/pdsearch/product-search-dev-filtered.qrels | 169718 |

| Training Files | # Examples |
|---|---|
| data/trec-pds.train.m2t.product2query.jsonl | 224719 |
| data/trec-pds.train.i2t.product2query.jsonl | ?????? |
| data/trec-pds.train.t2t.product2query.jsonl | 307492 |

Note that some of our preprocessed datasets/files can be found at this huggingface hub.

  1. simplified_corpus/corpus.jsonl [huggingface_hub] A few products (38396) are missing a description/title; we index only the remaining products.
python3 text2text/filter_corpus.py \
    --input_jsonl data/corpus.jsonl \
    --output_jsonl data/simplified_corpus/corpus.jsonl
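
The filtering step above can be sketched in a few lines of Python. This is a minimal illustration, not the contents of `text2text/filter_corpus.py`: the field names `id`, `title`, and `description` are assumptions about the corpus schema, and dropping a product when either field is empty is one possible reading of "missing".

```python
import json

def has_text(product, fields=("title", "description")):
    """Return True when every listed field is a non-empty string.

    The field names are assumed, not taken from the real corpus schema.
    """
    return all(
        isinstance(product.get(f), str) and bool(product[f].strip())
        for f in fields
    )

def filter_corpus(lines):
    """Yield parsed products that pass has_text; blank lines are skipped."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        product = json.loads(line)
        if has_text(product):
            yield product

# Tiny in-memory stand-in for corpus.jsonl (hypothetical records).
sample = [
    '{"id": "p1", "title": "Steel bottle", "description": "Keeps drinks cold."}',
    '{"id": "p2", "title": "", "description": "No title here."}',
    '{"id": "p3", "title": "Only a title"}',
]
kept = list(filter_corpus(sample))
print([p["id"] for p in kept])
```

Passing `fields=("title",)` to `has_text` would keep products that have a title but no description, if that is the intended rule.
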
  1. trec-pds.train.product2query.jsonl huggingface_hub This training file contains 307492 examples (3K randomly picked examples are used for validation). We take the train qrels labels (product-search-train.qrels) and convert them into seq2seq format. You can find the jsonl file here (huggingface's hub) or run the following script.
python text2text/convert_qrel_to_seq2seq.py \
    --collection data/corpus.jsonl \
    --query data/qid2query.tsv \
    --qrel data/product-search-train.qrels \
    --output data/trec-pds.train.product2query.jsonl
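
As a rough sketch of what a qrels-to-seq2seq conversion involves, the snippet below joins relevance judgments with queries and products. It assumes the usual four-column TREC qrels layout (`qid 0 docid rel`), a `qid<TAB>query` TSV, and hypothetical `title`/`description` fields; the output keys and the `min_rel` threshold are illustrative and may differ from what `convert_qrel_to_seq2seq.py` writes.

```python
import json

def load_queries(tsv_lines):
    """Map qid -> query text from 'qid<TAB>query' lines."""
    queries = {}
    for line in tsv_lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # skip rows without a tab separator
        qid, text = line.split("\t", 1)
        queries[qid] = text
    return queries

def qrels_to_examples(qrel_lines, queries, corpus, min_rel=1):
    """Yield one product-to-query example per relevant (query, product) pair.

    min_rel is an assumed relevance threshold, not a documented setting.
    """
    for line in qrel_lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip malformed rows
        qid, _, docid, rel = parts
        if int(rel) < min_rel or qid not in queries or docid not in corpus:
            continue
        product = corpus[docid]
        yield {
            "title": product.get("title", ""),
            "description": product.get("description", ""),
            "query": queries[qid],
        }

# Hypothetical toy inputs standing in for the real files.
queries = load_queries(["q1\twater bottle\n", "q2\ttrail shoes\n"])
corpus = {"p1": {"title": "Steel bottle", "description": "Keeps drinks cold."}}
qrels = ["q1 0 p1 2", "q1 0 p9 1", "q2 0 p1 0"]

examples = list(qrels_to_examples(qrels, queries, corpus))
for example in examples:
    print(json.dumps(example))
```

Each emitted record pairs the product text (the source side) with the query a user issued for it (the target side), which is the direction the product2query model is trained on.
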
  1. Qrels filtered We remove 13 queries from the original dev-qrels that have no particular information need (the log below lists 14 entries, one of which is an empty string). The filtered dev-qrels were produced by this command.
python3 tools/filter_invalid_queries.py \
    --query data/qid2query.tsv \
    --qrels data/product-search-dev.qrels \
    --qrels_filtered data/product-search-dev-filtered.qrels \
    --query_filtered data/qid2query-dev-filtered.tsv
# Output
169952it [00:00, 463853.96it/s]
Filtered query:
['B07SDGB8XG', '', 'B01LE7U1PG', 'B074M44VZ6', 'B07R5H8QSY', 'B087CZZNDJ', 'B00MEHLYY8', 'B079SHC4SM', 'B086X41FSY', 'B07H2JS63P', 'B004V23YV0', 'B06XXZWR52', 'B00RINP9HG', 'B00HKC17R6']
Number of query filtered: 14
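
The filtered queries in the log above are either empty or look like bare product identifiers. A hedged sketch of such a filter follows; the identifier pattern is a guess inferred from that list, and `is_valid_query` is a hypothetical helper, not necessarily the rule used by `tools/filter_invalid_queries.py`.

```python
import re

# Assumed pattern: exactly ten uppercase letters or digits, as in the log.
ID_LIKE = re.compile(r"[A-Z0-9]{10}")

def is_valid_query(text):
    """Reject empty queries and queries that are only an identifier."""
    text = text.strip()
    if not text:
        return False
    return ID_LIKE.fullmatch(text) is None

# Hypothetical qid -> query mapping and (qid, docid, rel) judgments.
queries = {"q1": "B07SDGB8XG", "q2": "", "q3": "wireless earbuds"}
qrels = [("q1", "p1", 1), ("q3", "p2", 2)]

kept_queries = {qid: q for qid, q in queries.items() if is_valid_query(q)}
kept_qrels = [row for row in qrels if row[0] in kept_queries]

print(sorted(kept_queries))
print(kept_qrels)
```

Filtering the qrels by the surviving query ids keeps the two output files consistent with each other.
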

Text-to-text Method

  1. Fine-tune on the constructed dataset.
TRAIN_SEQ2SEQ=data/trec-pds.train.product2query.jsonl
MODEL_PATH=models/t5-base-product2query 

python3 text2text/train.py \
    --model_name_or_path t5-base \
    --config_name t5-base \
    --tokenizer_name t5-base \
    --train_file ${TRAIN_SEQ2SEQ} \
    --max_src_length 384  \
    --max_tgt_length 32 \
    --output_dir ${MODEL_PATH} \
    --do_train --do_eval \
    --save_strategy steps \
    --max_steps 50000 \
    --save_steps 10000 \
    --eval_steps 500 \
    --evaluation_strategy steps \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --optim adafactor \
    --learning_rate 1e-3 \
    --lr_scheduler_type linear \
    --warmup_steps 1000 \
    --remove_unused_columns false \
    --report_to wandb \
    --template "summarize: title: {0} description: {1}"
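
The `--template` value uses positional placeholders. Assuming `text2text/train.py` fills them with Python's `str.format` (an assumption; the script itself is not shown here), the model input for one product would be built roughly like this. `build_source` is a hypothetical helper name.

```python
TEMPLATE = "summarize: title: {0} description: {1}"

def build_source(title, description, template=TEMPLATE):
    """Fill the positional slots: {0} gets the title, {1} the description."""
    return template.format(title.strip(), description.strip())

source = build_source("Steel bottle ", "Keeps drinks cold.")
print(source)
# summarize: title: Steel bottle description: Keeps drinks cold.
```

The resulting string is what gets tokenized and truncated to `--max_src_length`, while the paired query is the target truncated to `--max_tgt_length`.
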
  1. Append the predicted texts to build a new corpus. We combine the predictions with the title only, since also including the description made no significant difference.
python3 tools/concat_predict_to_corpus.py \
    --input_jsonl data/corpus.jsonl \
    --prediction_jsonl predictions/corpus.pred.jsonl \
    --output_dir data/title+prod2query_old \
    --use_title
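
Conceptually, the expansion step concatenates each product's title with its predicted queries. The sketch below illustrates that idea under assumed field names (`title`, `description`) and a hypothetical `expand_product` helper; the real input and output formats of `concat_predict_to_corpus.py` may differ.

```python
def expand_product(product, predicted_queries, use_title=True, use_description=False):
    """Build the expanded text for one product.

    Field names and flags are illustrative assumptions.
    """
    parts = []
    if use_title:
        parts.append(product.get("title", "").strip())
    if use_description:
        parts.append(product.get("description", "").strip())
    parts.extend(q.strip() for q in predicted_queries)
    # Drop empty pieces so missing fields do not leave stray spaces.
    return " ".join(p for p in parts if p)

product = {"title": "Steel bottle", "description": "Keeps drinks cold."}
predictions = ["insulated water bottle", "metal flask"]

expanded = expand_product(product, predictions)
print(expanded)
# Steel bottle insulated water bottle metal flask
```

The expanded text is then what a sparse index would see in place of the original product text.
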

The fine-tuned text-to-text model checkpoint.

python3 tools/concat_predict_to_corpus.py \
    --input_jsonl data/corpus.jsonl \
    --prediction_jsonl predictions/corpus.pred.jsonl \
    --output_dir data/title+prod2query_old \
    --use_title

References:

Text-based

Image-based

  • BLIP
  • GIT

Multimodal

Benchmark datasets

Others