This project aims to enhance product representations using the contents of product pages (text, images, and more). Retrieval methods are based on sparse retrieval and learned sparse retrieval.
We use the dataset collected by the TREC Product Search Track. The files we used are stored on the product-search Hugging Face hub.
The files are linked to my datasets directory:
/home/jhju/datasets/
| Original Files | # Examples |
|---|---|
| /home/jhju/datasets/pdsearch/corpus.jsonl | 1118658 |
| /home/jhju/datasets/pdsearch/images/* | 791108 (795498) |
| /home/jhju/datasets/pdsearch/qid2query.tsv | 30734 |
| /home/jhju/datasets/pdsearch/qid2query-dev-filtered.tsv | 8940 |
| /home/jhju/datasets/pdsearch/product-search-dev-filtered.qrels | 169718 |
| Training Files | # Examples |
|---|---|
| data/trec-pds.train.m2t.product2query.jsonl | 224719 |
| data/trec-pds.train.i2t.product2query.jsonl | ?????? |
| data/trec-pds.train.t2t.product2query.jsonl | 307492 |
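For reference, each line of corpus.jsonl is one JSON product record. A minimal loading sketch; the `title`/`description` field names mirror the fine-tuning template used below, and any other key names are assumptions:

```python
import json

def load_corpus(path):
    """Yield one product record per line of a JSONL corpus file."""
    with open(path) as f:
        for line in f:
            yield json.loads(line)

for product in load_corpus("/home/jhju/datasets/pdsearch/corpus.jsonl"):
    # `title` / `description` mirror the fine-tuning template used below;
    # other key names depend on the release and are assumptions here.
    print(product.get("title"), "|", product.get("description"))
    break
```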
Note that some of our preprocessed datasets/files can be found on the Hugging Face hub.
- simplified_corpus/corpus.jsonl [huggingface_hub](#): A few products are missing a description or title (38396 of them); we perform indexing only on the rest. A sketch of the filtering logic follows the command.
```sh
python3 text2text/filter_corpus.py \
    --input_jsonl data/corpus.jsonl \
    --output_jsonl data/simplified_corpus/corpus.jsonl
```
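A minimal sketch of the filtering logic, assuming `title` and `description` fields in each corpus record (the real text2text/filter_corpus.py may differ):

```python
import argparse
import json

# Drop products with a missing title or description (38396 such records);
# only the remaining records are indexed.
parser = argparse.ArgumentParser()
parser.add_argument("--input_jsonl", required=True)
parser.add_argument("--output_jsonl", required=True)
args = parser.parse_args()

kept, dropped = 0, 0
with open(args.input_jsonl) as fin, open(args.output_jsonl, "w") as fout:
    for line in fin:
        product = json.loads(line)
        # Field names are assumptions; empty strings count as missing.
        if product.get("title") and product.get("description"):
            fout.write(json.dumps(product) + "\n")
            kept += 1
        else:
            dropped += 1

print(f"kept={kept} dropped={dropped}")
```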
- trec-pds.train.product2query.jsonl [huggingface_hub](#): This training file contains 307492 examples (we randomly pick 3K examples as validation). We use the training qrels labels in product-search-train.qrels and convert them into seq2seq format. You can find the jsonl file on the Hugging Face hub or run the following script; a sketch of the conversion follows the command.
```sh
python3 text2text/convert_qrel_to_seq2seq.py \
    --collection data/corpus.jsonl \
    --query data/qid2query.tsv \
    --qrel data/product-search-train.qrels \
    --output data/trec-pds.train.product2query.jsonl
```
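A minimal sketch of the conversion, assuming standard TREC qrels lines (`qid iter docid rel`); the corpus `id` key and the output keys are assumptions, and the real text2text/convert_qrel_to_seq2seq.py may differ:

```python
import csv
import json

# Map query-id -> query text from the tab-separated query file.
queries = {}
with open("data/qid2query.tsv") as f:
    for qid, text in csv.reader(f, delimiter="\t"):
        queries[qid] = text

# Load the corpus keyed by doc-id (the "id" key name is an assumption).
corpus = {}
with open("data/corpus.jsonl") as f:
    for line in f:
        record = json.loads(line)
        corpus[record["id"]] = record

# Each relevant (query, product) pair becomes one seq2seq example:
# the product text is the source, the query is the generation target.
with open("data/product-search-train.qrels") as fin, \
     open("data/trec-pds.train.product2query.jsonl", "w") as fout:
    for line in fin:
        qid, _, docid, rel = line.split()  # TREC qrels: qid iter docid rel
        if int(rel) > 0 and qid in queries and docid in corpus:
            record = corpus[docid]
            fout.write(json.dumps({
                "title": record.get("title", ""),
                "description": record.get("description", ""),
                "query": queries[qid],
            }) + "\n")
```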
- Qrels filtered: We remove 14 queries from the original dev qrels that have no particular information need (the product-ID-like queries listed in the output below, plus one empty query). The filtered dev qrels were produced by this command; a sketch of the validity check follows the output.
```sh
python3 tools/filter_invalid_queries.py \
    --query data/qid2query.tsv \
    --qrels data/product-search-dev.qrels \
    --qrels_filtered data/product-search-dev-filtered.qrels \
    --query_filtered data/qid2query-dev-filtered.tsv
```
```
# Output
169952it [00:00, 463853.96it/s]
Filtered query:
['B07SDGB8XG', '', 'B01LE7U1PG', 'B074M44VZ6', 'B07R5H8QSY', 'B087CZZNDJ', 'B00MEHLYY8', 'B079SHC4SM', 'B086X41FSY', 'B07H2JS63P', 'B004V23YV0', 'B06XXZWR52', 'B00RINP9HG', 'B00HKC17R6']
Number of query filtered: 14
```
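The queries removed above are either empty or bare ASIN-style product IDs, which express no real information need. A minimal sketch of such a validity check (an assumption; the exact rule in tools/filter_invalid_queries.py may differ):

```python
import re

# Bare Amazon ASINs (e.g., "B07SDGB8XG") and empty strings are treated
# as invalid queries; this pattern is an assumption based on the output above.
ASIN = re.compile(r"B0[A-Z0-9]{8}")

def is_invalid(query: str) -> bool:
    q = query.strip()
    return not q or ASIN.fullmatch(q) is not None

assert is_invalid("B07SDGB8XG") and is_invalid("")
assert not is_invalid("wireless noise cancelling headphones")
```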
- Fine-tune on the constructed dataset.
```sh
TRAIN_SEQ2SEQ=data/trec-pds.train.product2query.jsonl
MODEL_PATH=models/t5-base-product2query

python3 text2text/train.py \
    --model_name_or_path t5-base \
    --config_name t5-base \
    --tokenizer_name t5-base \
    --train_file ${TRAIN_SEQ2SEQ} \
    --max_src_length 384 \
    --max_tgt_length 32 \
    --output_dir ${MODEL_PATH} \
    --do_train --do_eval \
    --max_steps 50000 \
    --save_strategy steps \
    --save_steps 10000 \
    --evaluation_strategy steps \
    --eval_steps 500 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --optim adafactor \
    --learning_rate 1e-3 \
    --lr_scheduler_type linear \
    --warmup_steps 1000 \
    --remove_unused_columns false \
    --report_to wandb \
    --template "summarize: title: {0} description: {1}"
```
- Append the predicted texts as a new corpus. We only append to the title, as there is no significant difference between using the title alone and using it together with the description.
```sh
python3 tools/concat_predict_to_corpus.py \
    --input_jsonl data/corpus.jsonl \
    --prediction_jsonl predictions/corpus.pred.jsonl \
    --output_dir data/title+prod2query_old \
    --use_title
```
The fine-tuned text-to-text model checkpoint is used to generate these predictions.
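A minimal sketch of the concatenation itself, assuming the prediction file is line-aligned with the corpus and stores the generated query under a "prediction" key (both assumptions; the real tools/concat_predict_to_corpus.py may differ):

```python
import json

# Append each product's predicted query to its title, producing the
# expanded corpus used for indexing (mirrors the --use_title flag).
with open("data/corpus.jsonl") as corpus_f, \
     open("predictions/corpus.pred.jsonl") as pred_f, \
     open("data/title+prod2query_old/corpus.jsonl", "w") as out_f:
    # Assumes the prediction file is line-aligned with the corpus and
    # stores the generated query under a "prediction" key.
    for corpus_line, pred_line in zip(corpus_f, pred_f):
        product = json.loads(corpus_line)
        predicted = json.loads(pred_line).get("prediction", "")
        product["title"] = f"{product.get('title', '')} {predicted}".strip()
        out_f.write(json.dumps(product) + "\n")
```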
- Document Expansion by Query Prediction (Nogueira et al., 2019)
- SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking (Formal et al., 2021)
- Generative Query Reformulation for Effective Adhoc Search (Wang et al., 2023)
- Leveraging Customer Reviews for E-commerce Query Generation (Lien et al., 2022)
- Lexically-Accelerated Dense Retrieval (Kulkarni et al., 2023)
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (Li et al., 2022)
- GIT: A Generative Image-to-text Transformer for Vision and Language (Wang et al., 2022)
- MSMO: Multimodal Summarization with Multimodal Output (Zhu et al., 2018)
- Exploiting Pseudo Image Captions for Multimodal Summarization (Jiang et al., 2023)
- Flava: A foundational language and vision alignment model (Singh et al., 2022)
- Kosmos-2: Grounding Multimodal Large Language Models to the World (Peng et al., 2023)
- Understanding Guided Image Captioning Performance across Domains (Ng et al., 2021)
- Query Generation for Multimodal Documents (Kim et al., 2021)
- Retrieval-augmented Image Captioning (Ramos et al., 2023)
- FAIR-PMD (dataset)
- GeneralAI-GRIT (details)