diff --git a/posts/2024-09-04-roberts-minimizer.ipynb b/posts/2024-09-04-roberts-minimizer.ipynb new file mode 100644 index 0000000..6d4b172 --- /dev/null +++ b/posts/2024-09-04-roberts-minimizer.ipynb @@ -0,0 +1,332 @@ +{ + "cells": [ + { + "cell_type": "raw", + "metadata": { + "vscode": { + "languageId": "raw" + } + }, + "source": [ + "+++\n", + "title = 'Minimizers are Just Fancy K-mers'\n", + "date = 2024-09-04T08:00:00+00:00\n", + "draft = true\n", + "+++" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Today I am picking up an old but influential paper. Cited over 400 times, the paper \"Reducing storage requirements for biological sequence comparison\" by [Roberts et al. (2004)](https://academic.oup.com/bioinformatics/article/20/18/3363/202143) has had a considerable impact on the sequencing community. If you are using modern aligners, you are relying on the ideas published in that paper. \n", + "\n", + "One notable paper citing it is the Minimap2 publication. It is the first citation in its method section, and as such it is definitely worth a read. \n", + "\n", + "## The Core Idea\n", + "\n", + "The core idea in the paper is that, to compare sequences in the age of next-generation sequencing (NGS) and large datasets, one should reduce the data in a smart way to avoid having to compare each sequence to all other sequences. \n", + "\n", + "To solve that issue the authors present the concept of minimizers. \n", + "\n", + "Today I will implement that concept and see if I can identify meaningful minimizers.\n", + "\n", + "\n", + "### Minimizers, what are they?\n", + "\n", + "Minimizers are a relatively simple idea. The key problem they try to solve is to provide good seeds, meaning locations where two sequences are identical, to kick-start an alignment of those two sequences. 
\n", + "\n", + "The most naive way of solving this would be to compute all k-mers of both sequences, find the common k-mers and try to align the sequences starting at each k-mer. But those k-mer databases would become huge. \n", + "\n", + "So to reduce that solution space, instead of storing all k-mers, minimizers are the \"smallest\" k-mers in a window. As long as the sequences share sufficiently long stretches of identical nucleotides, they will also share the odd \"smallest\" k-mer. In reality one could also store the \"biggest\" k-mer. It really does not matter as long as the index and the query operation compute the order of the k-mers the same way. \n", + "\n", + "All that is left to do for each minimizer is to store its location in the sequence so that one can use it as a seed for an alignment if it is found in the query.\n", + "\n", + "\n", + "## A Simple Implementation" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "import hashlib\n", + "\n", + "from pydantic import BaseModel\n", + "\n", + "\n", + "class Kmer(BaseModel):\n", + " kmer: str\n", + "\n", + " def __hash__(self):\n", + " # Hashing function I am using to sort k-mers\n", + " # Sorting lexicographically will lead to\n", + " # uninformative k-mers such as AAAA\n", + " return int(hashlib.md5(self.kmer.encode()).hexdigest(), 16)\n", + "\n", + " def __len__(self):\n", + " return len(self.kmer)\n", + "\n", + " def __str__(self):\n", + " return self.kmer\n", + "\n", + "\n", + "class Minimizer(BaseModel):\n", + " kmer: Kmer\n", + " sequence_id: str\n", + " position: int\n", + "\n", + " def __hash__(self):\n", + " return int(\n", + " hashlib.md5(f\"{self.kmer}{self.position}\".encode()).hexdigest(), 16\n", + " )\n", + "\n", + " def __lt__(self, other):\n", + " return hash(self.kmer) < hash(other.kmer)\n", + "\n", + " def __eq__(self, other):\n", + " if not isinstance(other, Minimizer):\n", + " raise ValueError(\"Can only compare to other 
Minimizer\")\n", + " return self.kmer == other.kmer and self.position == other.position\n", + " \n", + " def __str__(self):\n", + " return f\"'{self.kmer}' @ {self.sequence_id}: {self.position}\"\n", + "\n", + "\n", + "minimizer = Minimizer(kmer=Kmer(kmer=\"ATCG\"), sequence_id=\"seq_1\", position=99)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `Kmer` and `Minimizer` classes implement the basic functionality I need next to find the smallest Minimizer in a sequence. Because `__lt__` compares the MD5-based hash of the k-mer, I can sort a list of Minimizers by that hash." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The k-mer with the lowest value is: 'B'\n" + ] + } + ], + "source": [ + "kmers = [\"A\", \"B\", \"C\"]\n", + "minimizers = [\n", + " Minimizer(kmer=Kmer(kmer=kmer), position=position, sequence_id=\"1\")\n", + " for position, kmer in enumerate(kmers)\n", + "]\n", + "sorted_minimizers = sorted(minimizers)\n", + "print(f\"The k-mer with the lowest value is: '{sorted_minimizers[0].kmer}'\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "That's working well. 
Now I can get the Minimizers for a sequence along its windows and store them:" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "def windows(sequence: str, window_size: int):\n", + " for i in range(len(sequence) - window_size + 1):\n", + " yield i\n", + "\n", + "\n", + "def get_minimizer(\n", + " sequence_id: str, sequence: str, k: int, offset: int\n", + ") -> Minimizer:\n", + " all_kmers = []\n", + " for i in range(len(sequence) - k + 1):\n", + " kmer = sequence[i : i + k]\n", + " all_kmers.append(\n", + " Minimizer(\n", + " kmer=Kmer(kmer=kmer),\n", + " sequence_id=sequence_id,\n", + " position=offset + i,\n", + " )\n", + " )\n", + " all_kmers.sort()\n", + "\n", + " return all_kmers[0]\n", + "\n", + "\n", + "def minimizers(\n", + " sequence_id: str, sequence: str, w: int, k: int, unique: bool = True\n", + ") -> list[Minimizer]:\n", + " minimizers = []\n", + "\n", + " for offset, start in enumerate(windows(sequence, w)):\n", + " window = sequence[start : start + w]\n", + " minimizers.append(get_minimizer(sequence_id, window, k, offset))\n", + "\n", + " if unique:\n", + " minimizers = list(set(minimizers))\n", + " return minimizers" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "These basic functions are all I really need to get the Minimizers of a sequence, given a window size `w` and a k-mer length `k`."
+ ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Collected 7 minimizers\n", + "The first three Minimizers:\n", + "'Hello W' @ seq_1: 0\n", + "'llo Wor' @ seq_1: 2\n", + "' World,' @ seq_1: 5\n" + ] + } + ], + "source": [ + "example_sequence = \"Hello World, this is a sequence.\"\n", + "\n", + "k = 7\n", + "window_size = 12\n", + "\n", + "sequence_minimizers = minimizers(\n", + " sequence_id=\"seq_1\",\n", + " sequence=example_sequence,\n", + " w=window_size,\n", + " k=k,\n", + ")\n", + "sequence_minimizers.sort(key=lambda x: x.position) # sort by position for displaying\n", + "\n", + "print(f\"Collected {len(sequence_minimizers)} minimizers\")\n", + "print(\"The first three Minimizers:\")\n", + "for i in range(3):\n", + " print(sequence_minimizers[i])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that I can get the Minimizers I can also find the common Minimizers between two sequences." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Common kmer: ATACGCAT\n", + "atgctagcATACGCATcacgcatc\n", + "ggatcagctcgagcATACGCATacgcatcgcatcgat\n", + "\n", + "Common kmer: ATACGCAT\n", + "atgctagcATACGCATcacgcatc\n", + "ggatcagctcgagcatacgcATACGCATcgcatcgat\n", + "\n", + "Common kmer: TACGCATC\n", + "atgctagcaTACGCATCacgcatc\n", + "ggatcagctcgagcatacgcaTACGCATCgcatcgat\n", + "\n", + "Common kmer: AGCATACG\n", + "atgctAGCATACGcatcacgcatc\n", + "ggatcagctcgAGCATACGcatacgcatcgcatcgat\n", + "\n" + ] + } + ], + "source": [ + "def visualize_minimizer(seq: str, minimizer: Minimizer) -> str:\n", + " before = seq[: minimizer.position].lower()\n", + " after = seq[minimizer.position + len(minimizer.kmer) :].lower()\n", + " return f\"{before}{str(minimizer.kmer.kmer).upper()}{after}\"\n", + "\n", + "\n", + "def compare_sequences(seq1: str, seq2: str, w: int, k: int) -> list[dict]:\n", + " minimizers1 = minimizers(\"seq_1\", seq1, w, k)\n", + " minimizers2 = minimizers(\"seq_2\", seq2, w, k)\n", + "\n", + " minimizer_set1 = {m.kmer for m in minimizers1}\n", + " minimizer_set2 = {m.kmer for m in minimizers2}\n", + "\n", + " common_minimizers = minimizer_set1.intersection(minimizer_set2)\n", + "\n", + " # Prepare results\n", + " comparison_results = []\n", + " for minimizer in common_minimizers:\n", + " # find the substrings that match and show that\n", + " for i in [m for m in minimizers1 if m.kmer == minimizer]:\n", + " for j in [m for m in minimizers2 if m.kmer == minimizer]:\n", + " print(f\"Common kmer: {minimizer.kmer}\")\n", + " print(visualize_minimizer(seq1, i))\n", + " print(visualize_minimizer(seq2, j))\n", + " print(\"\")\n", + "\n", + " return comparison_results\n", + "\n", + "\n", + "sequence1 = \"ATGCTAGCATACGCATCACGCATC\"\n", + "sequence2 = \"GGATCAGCTCGAGCATACGCATACGCATCGCATCGAT\"\n", + "\n", + "w = 10 # Window size\n", + "k = 8 # k-mer size\n", + 
"comparison_results = compare_sequences(sequence1, sequence2, w, k)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This implementation shows how easy it is for this approach to find appropriate seed locations for starting a pairwise alignment. \n", + "\n", + "The paper goes more into detail and also introduces the concept of End-minimizers. I can only recommend checking it out.\n", + "\n", + "Thats all I have today. I hope it was interesting. " + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "reproduce_hic", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.0" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}