This repository contains tools and scripts for quantifying pairwise similarity between samples based on text descriptions

krishnanlab/similarity_metrics_pipeline

Similarity Metrics Pipeline

Comparing Text Descriptions Using BLEU, WMD, and Cosine Similarities


Overview

This repository provides a pipeline for calculating pairwise text similarities between biological sample and study descriptions. Given the ground truth labels for the samples, the pipeline computes similarities using the following methods:

  • BLEU Score (Bilingual Evaluation Understudy)
  • Word Mover's Distance (WMD)
  • Cosine Similarity
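As a rough illustration of two of the metrics listed above, the following is a minimal pure-Python sketch of a simplified sentence-level BLEU score and a bag-of-words cosine similarity. It is not the code in `calculate_bleu_scores.py`; the function names, whitespace tokenization, uniform n-gram weights, and lack of smoothing are all assumptions made for the example. WMD is left out because it requires pretrained word embeddings.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(reference, candidate, max_n=2):
    """Simplified BLEU: uniform weights up to max_n, no smoothing (assumption)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or overlap == 0:
            return 0.0  # without smoothing, one zero precision gives BLEU = 0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty lowers the score of candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


def cosine_similarity(text_a, text_b):
    """Cosine similarity between raw token-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Both functions return values in the range 0 to 1, with identical descriptions scoring 1 and descriptions that share no tokens scoring 0.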

The workflow ensures comparisons occur only between samples from different experiments (GSEs).
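The cross-experiment restriction can be sketched as follows. This is only an illustration: the `(sample_id, gse_id)` tuple layout, the function name, and the placeholder IDs are hypothetical and do not reflect the actual columns of `processed_descriptions.tsv`.

```python
from itertools import combinations


def cross_experiment_pairs(samples):
    """Yield (id_a, id_b) for every unordered pair of samples from different GSEs.

    `samples` is a list of (sample_id, gse_id) tuples (hypothetical layout).
    """
    for (id_a, gse_a), (id_b, gse_b) in combinations(samples, 2):
        if gse_a != gse_b:
            yield id_a, id_b


# Placeholder identifiers, not real accessions.
samples = [("GSM_1", "GSE_A"), ("GSM_2", "GSE_A"), ("GSM_3", "GSE_B")]
pairs = list(cross_experiment_pairs(samples))
# GSM_1 and GSM_2 share an experiment, so that pair is skipped.
```

Here `pairs` contains only the two pairs that cross the experiment boundary: `GSM_1` with `GSM_3`, and `GSM_2` with `GSM_3`.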

The results are used to evaluate how well these similarity metrics reflect the true relationships between samples and studies.
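One possible way to run such an evaluation, shown here purely as an assumption and not as the method this repository uses, is to compare the mean similarity of pairs that share a ground truth label against pairs that do not. The function name, the `(id_a, id_b, score)` tuple layout, and the example values are hypothetical.

```python
from statistics import mean


def mean_score_by_label_match(scored_pairs, labels):
    """Return (mean score of same-label pairs, mean score of different-label pairs).

    `scored_pairs` holds (id_a, id_b, score) tuples and `labels` maps each
    sample ID to its ground truth label (both layouts are hypothetical).
    A group with no pairs yields None.
    """
    same, different = [], []
    for id_a, id_b, score in scored_pairs:
        if labels[id_a] == labels[id_b]:
            same.append(score)
        else:
            different.append(score)
    return (mean(same) if same else None,
            mean(different) if different else None)


# Made-up scores for illustration only.
labels = {"s1": "liver", "s2": "liver", "s3": "brain"}
scored_pairs = [("s1", "s2", 0.8), ("s1", "s3", 0.2), ("s2", "s3", 0.4)]
same_mean, diff_mean = mean_score_by_label_match(scored_pairs, labels)
```

If a metric tracks the ground truth well, the same-label mean would be expected to sit clearly above the different-label mean.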

Note: I am starting with BLEU score calculations and will implement the rest of the metrics later.

Project Structure (current)

project_repo/
├── data/                            # Input and processed data
│   ├── processed_descriptions.tsv   # Preprocessed sample descriptions
│   └── bleu_scores.csv.gz           # BLEU similarity scores
├── bin/                             # Executable scripts
│   └── tissue_labels_CL:0000000.csv # Tissue input file
│
├── scripts/                         # Python scripts for similarity calculations
│   ├── calculate_bleu_scores.py          # Calculates BLEU scores
│   └── filter_ground_truth_pairs_bleu.py # Filters BLEU pairs based on ground truth labels
│
├── run/                             # Executable scripts
│   ├── submit_bleu_job.sbatch            # SLURM script for BLEU similarity job arrays
│   └── test_job_tissue_CL0000000.sbatch  # SLURM script for filter_ground_truth_pairs_bleu.py job submission
│
├── results/                         # Output results
│   └── bleu_scores_with_category.csv.gz  # Final product: ground truth filtered BLEU pairs
│
└── README.md                        # Project documentation (this file)


Project Structure (goal)

project_repo/
├── data/                            # Input and processed data
│   ├── processed_descriptions.tsv   # Preprocessed sample descriptions
│   └── similarity_results/          # Output folder for similarity scores
│
├── scripts/                         # Python scripts for similarity calculations
│   ├── calculate_bleu_scores.py          # Calculates BLEU scores
│   ├── calculate_wmd_scores.py           # Calculates WMD similarity
│   ├── calculate_cosine_scores.py        # Calculates Cosine similarity
│   ├── filter_ground_truth_pairs_bleu.py # Filters BLEU pairs based on ground truth labels
│   ├── filter_ground_truth_pairs_wmd.py  # Filters WMD pairs based on ground truth labels
│   └── filter_ground_truth_pairs_cosine.py # Filters Cosine pairs based on ground truth labels
│
├── bin/                             # Executable scripts and tools
│   ├── submit_bleu_job.sbatch       # SLURM script for BLEU similarity job arrays
│   ├── submit_wmd_job.sbatch        # SLURM script for WMD similarity job arrays
│   ├── submit_cosine_job.sbatch     # SLURM script for Cosine similarity job arrays
│   └── readme.txt                   # Details about external tools and their setup
│
├── results/                         # Output results
│   ├── bleu_scores.csv.gz           # BLEU similarity scores
│   ├── wmd_scores.csv.gz            # WMD similarity scores
│   ├── cosine_scores.csv.gz         # Cosine similarity scores
│   ├── filtered_pairs_bleu.csv.gz   # Ground truth filtered BLEU pairs
│   ├── filtered_pairs_wmd.csv.gz    # Ground truth filtered WMD pairs
│   └── filtered_pairs_cosine.csv.gz # Ground truth filtered Cosine pairs
│
└── README.md                        
