Commit cc303f3

tracking a minimizer post
1 parent 5c2441e commit cc303f3

1 file changed, 332 additions and 0 deletions
@@ -0,0 +1,332 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "raw",
5+
"metadata": {
6+
"vscode": {
7+
"languageId": "raw"
8+
}
9+
},
10+
"source": [
11+
"+++\n",
12+
"title = 'Minimizers are Just Fancy K-mers'\n",
13+
"date = 2024-09-04T08:00:00+00:00\n",
14+
"draft = true\n",
15+
"+++"
16+
]
17+
},
18+
{
19+
"cell_type": "markdown",
20+
"metadata": {},
21+
"source": [
22+
"\n",
23+
"Today I am picking up an old but influential paper. Cited over 400 times, \"Reducing storage requirements for biological sequence comparison\" by [Roberts et al. (2004)](https://academic.oup.com/bioinformatics/article/20/18/3363/202143) has had considerable impact on the sequencing community. If you use a modern aligner, you rely on the ideas published in that paper.\n",
24+
"\n",
25+
"One notable paper citing it is the Minimap2 publication, where it is the first citation in the methods section. That alone makes it worth a read.\n",
26+
"\n",
27+
"## The Core Idea\n",
28+
"\n",
29+
"The core idea of the paper is that, in the age of next-generation sequencing (NGS) and large datasets, comparing sequences requires a smart way of reducing the data, so that not every sequence has to be compared to every other sequence.\n",
30+
"\n",
31+
"To solve that problem, the authors present the concept of minimizers.\n",
32+
"\n",
33+
"Today I will implement that concept and see if I can identify meaningful minimizers.\n",
34+
"\n",
35+
"\n",
36+
"### Minimizers, what are they?\n",
37+
"\n",
38+
"Minimizers are a relatively simple idea. The key problem they try to solve is to provide good seeds, meaning locations where two sequences are identical, to kick-start an alignment of those two sequences. \n",
39+
"\n",
40+
"The most naive way of solving this would be to compute all k-mers of both sequences, find the common k-mers and try to align the sequences starting at each k-mer. But those k-mer databases would become huge. \n",
41+
"\n",
42+
"To shrink that search space, only the \"smallest\" k-mer of each window is stored instead of all k-mers: the minimizer. As long as two sequences share a stretch of identical nucleotides that is long enough to cover a full window, they also share that window's \"smallest\" k-mer. One could just as well store the \"biggest\" k-mer. It does not matter, as long as the index and the query compute the order of the k-mers the same way.\n",
43+
"\n",
44+
"All that is left to do for each minimizer is to store its location in the sequence so that one can use it as a seed for an alignment if it is found in the query.\n",
45+
"\n",
46+
"\n",
47+
"## A Simple Implementation"
48+
]
49+
},
50+
{
51+
"cell_type": "code",
52+
"execution_count": 6,
53+
"metadata": {},
54+
"outputs": [],
55+
"source": [
56+
"import hashlib\n",
57+
"\n",
58+
"from pydantic import BaseModel\n",
59+
"\n",
60+
"\n",
61+
"class Kmer(BaseModel):\n",
62+
"    kmer: str\n",
63+
"\n",
64+
"    def __hash__(self):\n",
65+
"        # Hashing function I am using to sort k-mers\n",
66+
"        # Sorting lexicographically will lead to\n",
67+
"        # uninformative k-mers such as AAAA\n",
68+
"        return int(hashlib.md5(self.kmer.encode()).hexdigest(), 16)\n",
69+
"\n",
70+
"    def __len__(self):\n",
71+
"        return len(self.kmer)\n",
72+
"\n",
73+
"    def __str__(self):\n",
74+
"        return self.kmer\n",
75+
"\n",
76+
"\n",
77+
"class Minimizer(BaseModel):\n",
78+
"    kmer: Kmer\n",
79+
"    sequence_id: str\n",
80+
"    position: int\n",
81+
"\n",
82+
"    def __hash__(self):\n",
83+
"        return int(\n",
84+
"            hashlib.md5(f\"{self.kmer}{self.position}\".encode()).hexdigest(), 16\n",
85+
"        )\n",
86+
"\n",
87+
"    def __lt__(self, other):\n",
88+
"        return hash(self.kmer) < hash(other.kmer)\n",
89+
"\n",
90+
"    def __eq__(self, other):\n",
91+
"        if not isinstance(other, Minimizer):\n",
92+
"            return NotImplemented  # let Python handle comparisons with other types\n",
93+
"        return self.kmer == other.kmer and self.position == other.position\n",
94+
"\n",
95+
"    def __str__(self):\n",
96+
"        return f\"'{self.kmer}' @ {self.sequence_id}: {self.position}\"\n",
97+
"\n",
98+
"\n",
99+
"minimizer = Minimizer(kmer=Kmer(kmer=\"ATCG\"), sequence_id=\"seq_1\", position=99)"
100+
]
101+
},
102+
{
103+
"cell_type": "markdown",
104+
"metadata": {},
105+
"source": [
106+
"The Kmer and Minimizer classes implement the basic functionality that I need next, when I want to find the smallest Minimizer in a sequence. Because Minimizer defines `__lt__` on the hash of its k-mer, I can sort a list of Minimizers by that hash."
107+
]
108+
},
109+
{
110+
"cell_type": "code",
111+
"execution_count": 7,
112+
"metadata": {},
113+
"outputs": [
114+
{
115+
"name": "stdout",
116+
"output_type": "stream",
117+
"text": [
118+
"The k-mer with the lowest value is: 'B'\n"
119+
]
120+
}
121+
],
122+
"source": [
123+
"kmers = [\"A\", \"B\", \"C\"]\n",
124+
"minimizers = [\n",
125+
"    Minimizer(kmer=Kmer(kmer=kmer), position=position, sequence_id=\"1\")\n",
126+
"    for position, kmer in enumerate(kmers)\n",
127+
"]\n",
128+
"sorted_minimizers = sorted(minimizers)\n",
129+
"print(f\"The k-mer with the lowest value is: '{sorted_minimizers[0].kmer}'\")"
130+
]
131+
},
132+
{
133+
"cell_type": "markdown",
134+
"metadata": {},
135+
"source": [
136+
"That's working well. Now I can get the Minimizers for a sequence along its windows and store them:"
137+
]
138+
},
139+
{
140+
"cell_type": "code",
141+
"execution_count": 8,
142+
"metadata": {},
143+
"outputs": [],
144+
"source": [
145+
"def windows(sequence: str, window_size: int):\n",
146+
"    for i in range(len(sequence) - window_size + 1):\n",
147+
"        yield i\n",
148+
"\n",
149+
"\n",
150+
"def get_minimizer(\n",
151+
"    sequence_id: str, sequence: str, k: int, offset: int\n",
152+
") -> Minimizer:\n",
153+
"    all_kmers = []\n",
154+
"    for i in range(len(sequence) - k + 1):\n",
155+
"        kmer = sequence[i : i + k]\n",
156+
"        all_kmers.append(\n",
157+
"            Minimizer(\n",
158+
"                kmer=Kmer(kmer=kmer),\n",
159+
"                sequence_id=sequence_id,\n",
160+
"                position=offset + i,\n",
161+
"            )\n",
162+
"        )\n",
163+
"    all_kmers.sort()\n",
164+
"\n",
165+
"    return all_kmers[0]\n",
166+
"\n",
167+
"\n",
168+
"def minimizers(\n",
169+
"    sequence_id: str, sequence: str, w: int, k: int, unique: bool = True\n",
170+
") -> list[Minimizer]:\n",
171+
"    minimizers = []\n",
172+
"\n",
173+
"    for offset, start in enumerate(windows(sequence, w)):\n",
174+
"        window = sequence[start : start + w]\n",
175+
"        minimizers.append(get_minimizer(sequence_id, window, k, offset))\n",
176+
"\n",
177+
"    if unique:\n",
178+
"        minimizers = list(set(minimizers))\n",
179+
"    return minimizers"
180+
]
181+
},
182+
{
183+
"cell_type": "markdown",
184+
"metadata": {},
185+
"source": [
186+
"These basic functions are all I really need to get the Minimizers of a sequence given a `window size` and a `k`."
187+
]
188+
},
189+
{
190+
"cell_type": "code",
191+
"execution_count": 15,
192+
"metadata": {},
193+
"outputs": [
194+
{
195+
"name": "stdout",
196+
"output_type": "stream",
197+
"text": [
198+
"Collected 7 minimizers\n",
199+
"The first three Minimizers:\n",
200+
"'Hello W' @ seq_1: 0\n",
201+
"'llo Wor' @ seq_1: 2\n",
202+
"' World,' @ seq_1: 5\n"
203+
]
204+
}
205+
],
206+
"source": [
207+
"example_sequence = \"Hello World, this is a sequence.\"\n",
208+
"\n",
209+
"k = 7\n",
210+
"window_size = 12\n",
211+
"\n",
212+
"sequence_minimizers = minimizers(\n",
213+
"    sequence_id=\"seq_1\",\n",
214+
"    sequence=example_sequence,\n",
215+
"    w=window_size,\n",
216+
"    k=k,\n",
217+
")\n",
218+
"sequence_minimizers.sort(key=lambda x: x.position) # sort by position for displaying\n",
219+
"\n",
220+
"print(f\"Collected {len(sequence_minimizers)} minimizers\")\n",
221+
"print(\"The first three Minimizers:\")\n",
222+
"for i in range(3):\n",
223+
"    print(sequence_minimizers[i])"
224+
]
225+
},
226+
{
227+
"cell_type": "markdown",
228+
"metadata": {},
229+
"source": [
230+
"Now that I can get the Minimizers I can also find the common Minimizers between two sequences."
231+
]
232+
},
233+
{
234+
"cell_type": "code",
235+
"execution_count": 16,
236+
"metadata": {},
237+
"outputs": [
238+
{
239+
"name": "stdout",
240+
"output_type": "stream",
241+
"text": [
242+
"Common kmer: ATACGCAT\n",
243+
"atgctagcATACGCATcacgcatc\n",
244+
"ggatcagctcgagcATACGCATacgcatcgcatcgat\n",
245+
"\n",
246+
"Common kmer: ATACGCAT\n",
247+
"atgctagcATACGCATcacgcatc\n",
248+
"ggatcagctcgagcatacgcATACGCATcgcatcgat\n",
249+
"\n",
250+
"Common kmer: TACGCATC\n",
251+
"atgctagcaTACGCATCacgcatc\n",
252+
"ggatcagctcgagcatacgcaTACGCATCgcatcgat\n",
253+
"\n",
254+
"Common kmer: AGCATACG\n",
255+
"atgctAGCATACGcatcacgcatc\n",
256+
"ggatcagctcgAGCATACGcatacgcatcgcatcgat\n",
257+
"\n"
258+
]
259+
}
260+
],
261+
"source": [
262+
"def visualize_minimizer(seq: str, minimizer: Minimizer) -> str:\n",
263+
"    before = seq[: minimizer.position].lower()\n",
264+
"    after = seq[minimizer.position + len(minimizer.kmer) :].lower()\n",
265+
"    return f\"{before}{str(minimizer.kmer.kmer).upper()}{after}\"\n",
266+
"\n",
267+
"\n",
268+
"def compare_sequences(seq1: str, seq2: str, w: int, k: int) -> list[tuple[Minimizer, Minimizer]]:\n",
269+
"    minimizers1 = minimizers(\"seq_1\", seq1, w, k)\n",
270+
"    minimizers2 = minimizers(\"seq_2\", seq2, w, k)\n",
271+
"\n",
272+
"    minimizer_set1 = {m.kmer for m in minimizers1}\n",
273+
"    minimizer_set2 = {m.kmer for m in minimizers2}\n",
274+
"\n",
275+
"    common_minimizers = minimizer_set1.intersection(minimizer_set2)\n",
276+
"\n",
277+
"    # Collect every pair of Minimizers (one per sequence) that share a k-mer\n",
278+
"    comparison_results = []\n",
279+
"    for minimizer in common_minimizers:\n",
280+
"        # find the substrings that match and show that\n",
281+
"        for i in [m for m in minimizers1 if m.kmer == minimizer]:\n",
282+
"            for j in [m for m in minimizers2 if m.kmer == minimizer]:\n",
283+
"                print(f\"Common kmer: {minimizer.kmer}\")\n",
284+
"                print(visualize_minimizer(seq1, i))\n",
285+
"                print(visualize_minimizer(seq2, j), end=\"\\n\\n\")\n",
286+
"                comparison_results.append((i, j))\n",
287+
"\n",
288+
"    return comparison_results\n",
289+
"\n",
290+
"\n",
291+
"sequence1 = \"ATGCTAGCATACGCATCACGCATC\"\n",
292+
"sequence2 = \"GGATCAGCTCGAGCATACGCATACGCATCGCATCGAT\"\n",
293+
"\n",
294+
"w = 10 # Window size\n",
295+
"k = 8 # k-mer size\n",
296+
"comparison_results = compare_sequences(sequence1, sequence2, w, k)"
297+
]
298+
},
299+
{
300+
"cell_type": "markdown",
301+
"metadata": {},
302+
"source": [
303+
"This implementation shows how easily this approach finds suitable seed locations for starting a pairwise alignment.\n",
304+
"\n",
305+
"The paper goes into more detail and also introduces the concept of end-minimizers. I can only recommend checking it out.\n",
306+
"\n",
307+
"That's all I have for today. I hope it was interesting."
308+
]
309+
}
310+
],
311+
"metadata": {
312+
"kernelspec": {
313+
"display_name": "reproduce_hic",
314+
"language": "python",
315+
"name": "python3"
316+
},
317+
"language_info": {
318+
"codemirror_mode": {
319+
"name": "ipython",
320+
"version": 3
321+
},
322+
"file_extension": ".py",
323+
"mimetype": "text/x-python",
324+
"name": "python",
325+
"nbconvert_exporter": "python",
326+
"pygments_lexer": "ipython3",
327+
"version": "3.11.0"
328+
}
329+
},
330+
"nbformat": 4,
331+
"nbformat_minor": 2
332+
}
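
As a minimal sketch, assuming plain strings and tuples instead of the pydantic models, the window scan from the notebook above could be condensed as shown here. The names `md5_order` and `window_minimizers` are hypothetical and not part of this commit. Ordering on the full MD5 integer is also an assumption: the notebook compares via Python's `hash()`, which may rank the k-mers differently, so the selected minimizers need not match the notebook's output.

```python
import hashlib


def md5_order(kmer: str) -> int:
    """Rank a k-mer by the integer value of its MD5 digest (assumed ordering)."""
    return int(hashlib.md5(kmer.encode()).hexdigest(), 16)


def window_minimizers(sequence: str, w: int, k: int, order=md5_order) -> set[tuple[str, int]]:
    """Return the (k-mer, position) minimizer of every window of w characters."""
    found = set()
    for start in range(len(sequence) - w + 1):
        window = sequence[start : start + w]
        # All k-mers inside this window, with their position in the full sequence.
        candidates = [(window[i : i + k], start + i) for i in range(w - k + 1)]
        # min() returns the leftmost candidate when several share the lowest rank.
        found.add(min(candidates, key=lambda c: order(c[0])))
    return found


seq1 = "ATGCTAGCATACGCATCACGCATC"
seq2 = "GGATCAGCTCGAGCATACGCATACGCATCGCATCGAT"

kmers1 = {kmer for kmer, _ in window_minimizers(seq1, w=10, k=8)}
kmers2 = {kmer for kmer, _ in window_minimizers(seq2, w=10, k=8)}

# Both sequences contain the stretch AGCATACGCAT, which covers a full window,
# so at least one minimizer k-mer has to be shared, whatever the ordering.
shared = kmers1 & kmers2
print(sorted(shared))
```

Passing `order=lambda kmer: kmer` switches to lexicographic ordering, which makes it easy to check how often low-complexity k-mers get picked compared to the hashed ordering.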