Showing 1 changed file with 332 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,332 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "raw", | ||
"metadata": { | ||
"vscode": { | ||
"languageId": "raw" | ||
} | ||
}, | ||
"source": [ | ||
"+++\n", | ||
"title = 'Minimizers are Just Fancy K-mers'\n", | ||
"date = 2024-09-04T08:00:00+00:00\n", | ||
"draft = true\n", | ||
"+++" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"\n", | ||
"Today I am picking up an old but influential paper. Cited over 400 times, the paper \"Reducing storage requirements for biological sequence comparison\" by [Roberts et al. (2004)](https://academic.oup.com/bioinformatics/article/20/18/3363/202143) has had considerable impact on the sequencing community. If you are using modern aligners, you relied on the ideas published in that paper. \n", | ||
"\n", | ||
"One notable paper citing this reference is the publication of Minimap2. It is the first citation in its method section, and as such definitely worth a read. \n", | ||
"\n", | ||
"## The Core Idea\n", | ||
"\n", | ||
"The core idea in the paper is that to compare sequences in the age of next generation sequencing (NGS) and large datasets one should use a smart way of reducing data to avoid having to compare each sequence to all other sequences. \n", | ||
"\n", | ||
"To solve that issue the authors present the concept of minimizer. \n", | ||
"\n", | ||
"Today I will implement that concept and see if I can identify meaningful minimizers.\n", | ||
"\n", | ||
"\n", | ||
"### Minimizers, what are they?\n", | ||
"\n", | ||
"Minimizers are a relatively simple idea. The key problem they try to solve is to provide good seeds, meaning locations where two sequences are identical, to kick-start an alignment of those two sequences. \n", | ||
"\n", | ||
"The most naive way of solving this would be to compute all k-mers of both sequences, find the common k-mers and try to align the sequences starting at each k-mer. But those k-mer databases would become huge. \n", | ||
"\n", | ||
"So to reduce that solution space, instead of storing all k-mers, minimizers are the \"smallest\" k-mers in a window. As long as the sequences share stretches of identical nucleotides large enough, they will also share the odd \"smallest\" k-mer. In reality one could also store the \"biggest\" k-mer. It really does not matter as long as the index and the query operation computes the order of the k-mers the same way. \n", | ||
"\n", | ||
"All that is left to do for each minimizer is to store its location in the sequence so that one can use it as a seed for an alignment if it is found in the query.\n", | ||
"\n", | ||
"\n", | ||
"## A Simple Implementation" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 6, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import hashlib\n", | ||
"\n", | ||
"from pydantic import BaseModel\n", | ||
"\n", | ||
"\n", | ||
"class Kmer(BaseModel):\n", | ||
" kmer: str\n", | ||
"\n", | ||
" def __hash__(self):\n", | ||
" # Hashing function I am using to sort k-mers\n", | ||
" # Sorting lexicographically will lead to\n", | ||
" # uninformative k-mers such as AAAA\n", | ||
" return int(hashlib.md5(self.kmer.encode()).hexdigest(), 16)\n", | ||
"\n", | ||
" def __len__(self):\n", | ||
" return len(self.kmer)\n", | ||
"\n", | ||
" def __str__(self):\n", | ||
" return self.kmer\n", | ||
"\n", | ||
"\n", | ||
"class Minimizer(BaseModel):\n", | ||
" kmer: Kmer\n", | ||
" sequence_id: str\n", | ||
" position: int\n", | ||
"\n", | ||
" def __hash__(self):\n", | ||
" return int(\n", | ||
" hashlib.md5(f\"{self.kmer}{self.position}\".encode()).hexdigest(), 16\n", | ||
" )\n", | ||
"\n", | ||
" def __lt__(self, other):\n", | ||
" return hash(self.kmer) < hash(other.kmer)\n", | ||
"\n", | ||
" def __eq__(self, other):\n", | ||
" if not isinstance(other, Minimizer):\n", | ||
" raise ValueError(\"Can only compare to other Minimizer\")\n", | ||
" return self.kmer == other.kmer and self.position == other.position\n", | ||
" \n", | ||
" def __str__(self):\n", | ||
" return f\"'{self.kmer}' @ {self.sequence_id}: {self.position}\"\n", | ||
"\n", | ||
"\n", | ||
"minimizer = Minimizer(kmer=Kmer(kmer=\"ATCG\"), sequence_id=\"seq_1\", position=99)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"These classes of Kmer and Minimizer implement the basic functionality that I need next, when I want to find the smallest Minimizer in a sequence. The way I implemented it, I can sort a list of Minimizers based on the hash." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 7, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"The k-mer with the lowest value is: 'B'\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"kmers = [\"A\", \"B\", \"C\"]\n", | ||
"minimizers = [\n", | ||
" Minimizer(kmer=Kmer(kmer=kmer), position=position, sequence_id=\"1\")\n", | ||
" for position, kmer in enumerate(kmers)\n", | ||
"]\n", | ||
"sorted_minimizers = sorted(minimizers)\n", | ||
"print(f\"The k-mer with the lowest value is: '{sorted_minimizers[0].kmer}'\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"That's working well. Now I can get the Minimizers for a sequence along its windows and store them:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 8, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"def windows(sequence: str, window_size: int):\n", | ||
" for i in range(len(sequence) - window_size + 1):\n", | ||
" yield i\n", | ||
"\n", | ||
"\n", | ||
"def get_minimizer(\n", | ||
" sequence_id: str, sequence: str, k: int, offset: int\n", | ||
") -> Minimizer:\n", | ||
" all_kmers = []\n", | ||
" for i in range(len(sequence) - k + 1):\n", | ||
" kmer = sequence[i : i + k]\n", | ||
" all_kmers.append(\n", | ||
" Minimizer(\n", | ||
" kmer=Kmer(kmer=kmer),\n", | ||
" sequence_id=sequence_id,\n", | ||
" position=offset + i,\n", | ||
" )\n", | ||
" )\n", | ||
" all_kmers.sort()\n", | ||
"\n", | ||
" return all_kmers[0]\n", | ||
"\n", | ||
"\n", | ||
"def minimizers(\n", | ||
" sequence_id: str, sequence: str, w: int, k: int, unique: bool = True\n", | ||
") -> list[dict]:\n", | ||
" minimizers = []\n", | ||
"\n", | ||
" for offset, start in enumerate(windows(sequence, w)):\n", | ||
" window = sequence[start : start + w]\n", | ||
" minimizers.append(get_minimizer(sequence_id, window, k, offset))\n", | ||
"\n", | ||
" if unique:\n", | ||
" minimizers = list(set(minimizers))\n", | ||
" return minimizers" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"These basic functions is all I really need to get the Minimizers of a sequence given a `window size` and a `k`." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 15, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Collected 7 minimizers\n", | ||
"The first three Minimizers:\n", | ||
"'Hello W' @ seq_1: 0\n", | ||
"'llo Wor' @ seq_1: 2\n", | ||
"' World,' @ seq_1: 5\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"example_sequence = \"Hello World, this is a sequence.\"\n", | ||
"\n", | ||
"k = 7\n", | ||
"window_size = 12\n", | ||
"\n", | ||
"sequence_minimizers = minimizers(\n", | ||
" sequence_id=\"seq_1\",\n", | ||
" sequence=example_sequence,\n", | ||
" w=window_size,\n", | ||
" k=k,\n", | ||
")\n", | ||
"sequence_minimizers.sort(key=lambda x: x.position) # sort by position for displaying\n", | ||
"\n", | ||
"print(f\"Collected {len(sequence_minimizers)} minimizers\")\n", | ||
"print(\"The first three Minimizers:\")\n", | ||
"for i in range(3):\n", | ||
" print(sequence_minimizers[i])" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Now that I can get the Minimizers I can also find the common Minimizers between two sequences." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 16, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Common kmer: ATACGCAT\n", | ||
"atgctagcATACGCATcacgcatc\n", | ||
"ggatcagctcgagcATACGCATacgcatcgcatcgat\n", | ||
"\n", | ||
"Common kmer: ATACGCAT\n", | ||
"atgctagcATACGCATcacgcatc\n", | ||
"ggatcagctcgagcatacgcATACGCATcgcatcgat\n", | ||
"\n", | ||
"Common kmer: TACGCATC\n", | ||
"atgctagcaTACGCATCacgcatc\n", | ||
"ggatcagctcgagcatacgcaTACGCATCgcatcgat\n", | ||
"\n", | ||
"Common kmer: AGCATACG\n", | ||
"atgctAGCATACGcatcacgcatc\n", | ||
"ggatcagctcgAGCATACGcatacgcatcgcatcgat\n", | ||
"\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"def visualize_minimizer(seq: str, minimizer: Minimizer) -> str:\n", | ||
" before = seq[: minimizer.position].lower()\n", | ||
" after = seq[minimizer.position + len(minimizer.kmer) :].lower()\n", | ||
" return f\"{before}{str(minimizer.kmer.kmer).upper()}{after}\"\n", | ||
"\n", | ||
"\n", | ||
"def compare_sequences(seq1: str, seq2: str, w: int, k: int) -> list[dict]:\n", | ||
" minimizers1 = minimizers(\"seq_1\", seq1, w, k)\n", | ||
" minimizers2 = minimizers(\"seq_2\", seq2, w, k)\n", | ||
"\n", | ||
" minimizer_set1 = {m.kmer for m in minimizers1}\n", | ||
" minimizer_set2 = {m.kmer for m in minimizers2}\n", | ||
"\n", | ||
" common_minimizers = minimizer_set1.intersection(minimizer_set2)\n", | ||
"\n", | ||
" # Prepare results\n", | ||
" comparison_results = []\n", | ||
" for minimizer in common_minimizers:\n", | ||
" # find the substrings that match and show that\n", | ||
" for i in [m for m in minimizers1 if m.kmer == minimizer]:\n", | ||
" for j in [m for m in minimizers2 if m.kmer == minimizer]:\n", | ||
" print(f\"Common kmer: {minimizer.kmer}\")\n", | ||
" print(visualize_minimizer(seq1, i))\n", | ||
" print(visualize_minimizer(seq2, j))\n", | ||
" print(\"\")\n", | ||
"\n", | ||
" return comparison_results\n", | ||
"\n", | ||
"\n", | ||
"sequence1 = \"ATGCTAGCATACGCATCACGCATC\"\n", | ||
"sequence2 = \"GGATCAGCTCGAGCATACGCATACGCATCGCATCGAT\"\n", | ||
"\n", | ||
"w = 10 # Window size\n", | ||
"k = 8 # k-mer size\n", | ||
"comparison_results = compare_sequences(sequence1, sequence2, w, k)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"This implementation shows how easy it is for this approach to find appropriate seed locations for starting a pairwise alignment. \n", | ||
"\n", | ||
"The paper goes more into detail and also introduces the concept of End-minimizers. I can only recommend checking it out.\n", | ||
"\n", | ||
"Thats all I have today. I hope it was interesting. " | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "reproduce_hic", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.0" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |