This project aims to enhance product representations using the contents of product pages (text, images, and more). Retrieval methods are based on sparse retrieval and learned sparse retrieval.
We use the dataset collected by the TREC Product Search Track. The files we used are stored on the product-search Hugging Face hub.
The files are linked to my datasets directory:
/home/jhju/datasets/
| Original Files | # Examples |
|---|---|
| /home/jhju/datasets/pdsearch/corpus.jsonl | 1118658 |
| /home/jhju/datasets/pdsearch/images/* | 791108 (795498) |
| /home/jhju/datasets/pdsearch/qid2query.tsv | 30734 |
| /home/jhju/datasets/pdsearch/qid2query-dev-filtered.tsv | 8940 |
| /home/jhju/datasets/pdsearch/product-search-dev-filtered.qrels | 169718 |
| Training Files | # Examples |
|---|---|
| data/trec-pds.train.m2t.product2query.jsonl | 224719 |
| data/trec-pds.train.i2t.product2query.jsonl | ?????? |
| data/trec-pds.train.t2t.product2query.jsonl | 307492 |
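For reference, each line of corpus.jsonl is one JSON product record. A minimal loading sketch; the `title`/`description` field names mirror the fine-tuning template used below, and any other key names are assumptions:

```python
import json

def load_corpus(path):
    """Yield one product record per line of a JSONL corpus file."""
    with open(path) as f:
        for line in f:
            yield json.loads(line)

for product in load_corpus("/home/jhju/datasets/pdsearch/corpus.jsonl"):
    # `title` / `description` mirror the fine-tuning template used below;
    # other key names depend on the release and are assumptions here.
    print(product.get("title"), "|", product.get("description"))
    break
```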
Note that some of our preprocessed datasets/files can be found on the Hugging Face hub.
- simplified_corpus/corpus.jsonl [huggingface_hub](#): A few products are missing a description or title (38396 of them); we perform indexing only on the rest. A sketch of the filtering logic follows the command.
```sh
python3 text2text/filter_corpus.py \
    --input_jsonl data/corpus.jsonl \
    --output_jsonl data/simplified_corpus/corpus.jsonl
```
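A minimal sketch of the filtering logic, assuming `title` and `description` fields in each corpus record (the real text2text/filter_corpus.py may differ):

```python
import argparse
import json

# Drop products with a missing title or description (38396 such records);
# only the remaining records are indexed.
parser = argparse.ArgumentParser()
parser.add_argument("--input_jsonl", required=True)
parser.add_argument("--output_jsonl", required=True)
args = parser.parse_args()

kept, dropped = 0, 0
with open(args.input_jsonl) as fin, open(args.output_jsonl, "w") as fout:
    for line in fin:
        product = json.loads(line)
        # Field names are assumptions; empty strings count as missing.
        if product.get("title") and product.get("description"):
            fout.write(json.dumps(product) + "\n")
            kept += 1
        else:
            dropped += 1

print(f"kept={kept} dropped={dropped}")
```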
- trec-pds.train.product2query.jsonl [huggingface_hub](#): This training file contains 307492 examples (we randomly pick 3K examples as validation). We use the training qrels labels in product-search-train.qrels and convert them into seq2seq format. You can find the jsonl file on the Hugging Face hub or run the following script; a sketch of the conversion follows the command.
```sh
python3 text2text/convert_qrel_to_seq2seq.py \
    --collection data/corpus.jsonl \
    --query data/qid2query.tsv \
    --qrel data/product-search-train.qrels \
    --output data/trec-pds.train.product2query.jsonl
```
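A minimal sketch of the conversion, assuming standard TREC qrels lines (`qid iter docid rel`); the corpus `id` key and the output keys are assumptions, and the real text2text/convert_qrel_to_seq2seq.py may differ:

```python
import csv
import json

# Map query-id -> query text from the tab-separated query file.
queries = {}
with open("data/qid2query.tsv") as f:
    for qid, text in csv.reader(f, delimiter="\t"):
        queries[qid] = text

# Load the corpus keyed by doc-id (the "id" key name is an assumption).
corpus = {}
with open("data/corpus.jsonl") as f:
    for line in f:
        record = json.loads(line)
        corpus[record["id"]] = record

# Each relevant (query, product) pair becomes one seq2seq example:
# the product text is the source, the query is the generation target.
with open("data/product-search-train.qrels") as fin, \
     open("data/trec-pds.train.product2query.jsonl", "w") as fout:
    for line in fin:
        qid, _, docid, rel = line.split()  # TREC qrels: qid iter docid rel
        if int(rel) > 0 and qid in queries and docid in corpus:
            record = corpus[docid]
            fout.write(json.dumps({
                "title": record.get("title", ""),
                "description": record.get("description", ""),
                "query": queries[qid],
            }) + "\n")
```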
- Qrels filtered: We remove 14 queries from the original dev qrels that have no particular information need (the product-ID-like queries listed in the output below, plus one empty query). The filtered dev qrels were produced by this command; a sketch of the validity check follows the output.
```sh
python3 tools/filter_invalid_queries.py \
    --query data/qid2query.tsv \
    --qrels data/product-search-dev.qrels \
    --qrels_filtered data/product-search-dev-filtered.qrels \
    --query_filtered data/qid2query-dev-filtered.tsv
```
```
# Output
169952it [00:00, 463853.96it/s]
Filtered query:
['B07SDGB8XG', '', 'B01LE7U1PG', 'B074M44VZ6', 'B07R5H8QSY', 'B087CZZNDJ', 'B00MEHLYY8', 'B079SHC4SM', 'B086X41FSY', 'B07H2JS63P', 'B004V23YV0', 'B06XXZWR52', 'B00RINP9HG', 'B00HKC17R6']
Number of query filtered: 14
```
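The queries removed above are either empty or bare ASIN-style product IDs, which express no real information need. A minimal sketch of such a validity check (an assumption; the exact rule in tools/filter_invalid_queries.py may differ):

```python
import re

# Bare Amazon ASINs (e.g., "B07SDGB8XG") and empty strings are treated
# as invalid queries; this pattern is an assumption based on the output above.
ASIN = re.compile(r"B0[A-Z0-9]{8}")

def is_invalid(query: str) -> bool:
    q = query.strip()
    return not q or ASIN.fullmatch(q) is not None

assert is_invalid("B07SDGB8XG") and is_invalid("")
assert not is_invalid("wireless noise cancelling headphones")
```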
- Fine-tune on the constructed dataset.
```sh
TRAIN_SEQ2SEQ=data/trec-pds.train.product2query.jsonl
MODEL_PATH=models/t5-base-product2query

python3 text2text/train.py \
    --model_name_or_path t5-base \
    --config_name t5-base \
    --tokenizer_name t5-base \
    --train_file ${TRAIN_SEQ2SEQ} \
    --max_src_length 384 \
    --max_tgt_length 32 \
    --output_dir ${MODEL_PATH} \
    --do_train --do_eval \
    --max_steps 50000 \
    --save_strategy steps \
    --save_steps 10000 \
    --evaluation_strategy steps \
    --eval_steps 500 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --optim adafactor \
    --learning_rate 1e-3 \
    --lr_scheduler_type linear \
    --warmup_steps 1000 \
    --remove_unused_columns false \
    --report_to wandb \
    --template "summarize: title: {0} description: {1}"
```
- Append the predicted texts as a new corpus. We only append to the title, as there is no significant difference between using the title alone and using it together with the description.
```sh
python3 tools/concat_predict_to_corpus.py \
    --input_jsonl data/corpus.jsonl \
    --prediction_jsonl predictions/corpus.pred.jsonl \
    --output_dir data/title+prod2query_old \
    --use_title
```
The fine-tuned text-to-text model checkpoint is used to generate these predictions.
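A minimal sketch of the concatenation itself, assuming the prediction file is line-aligned with the corpus and stores the generated query under a "prediction" key (both assumptions; the real tools/concat_predict_to_corpus.py may differ):

```python
import json

# Append each product's predicted query to its title, producing the
# expanded corpus used for indexing (mirrors the --use_title flag).
with open("data/corpus.jsonl") as corpus_f, \
     open("predictions/corpus.pred.jsonl") as pred_f, \
     open("data/title+prod2query_old/corpus.jsonl", "w") as out_f:
    # Assumes the prediction file is line-aligned with the corpus and
    # stores the generated query under a "prediction" key.
    for corpus_line, pred_line in zip(corpus_f, pred_f):
        product = json.loads(corpus_line)
        predicted = json.loads(pred_line).get("prediction", "")
        product["title"] = f"{product.get('title', '')} {predicted}".strip()
        out_f.write(json.dumps(product) + "\n")
```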
- Document Expansion by Query Prediction (Nogueira et al., 2019)
- SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking (Formal et al., 2021)
- Generative Query Reformulation for Effective Adhoc Search (Wang et al., 2023)
- Leveraging Customer Reviews for E-commerce Query Generation (Lien et al., 2022)
- Lexically-Accelerated Dense Retrieval (Kulkarni et al., 2023)
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (Li et al., 2022)
- GIT: A Generative Image-to-text Transformer for Vision and Language (Wang et al., 2022)
- MSMO: Multimodal Summarization with Multimodal Output (Zhu et al., 2018)
- Exploiting Pseudo Image Captions for Multimodal Summarization (Jiang et al., 2023)
- Flava: A foundational language and vision alignment model (Singh et al., 2022)
- Kosmos-2: Grounding Multimodal Large Language Models to the World (Peng et al., 2023)
- Understanding Guided Image Captioning Performance across Domains (Ng et al., 2021)
- Query Generation for Multimodal Documents (Kim et al., 2021)
- Retrieval-augmented Image Captioning (Ramos et al., 2023)
- FAIR-PMD (dataset)
- GeneralAI-GRIT (details)