tracking a minimizer post

openpaul · Sep 4, 2024 · cc303f3
1 parent 5c2441e
commit cc303f3
Showing 1 changed file with 332 additions and 0 deletions.
diff --git a/posts/2024-09-04-roberts-minimizer.ipynb b/posts/2024-09-04-roberts-minimizer.ipynb
@@ -0,0 +1,332 @@
+{
+ "cells": [
+  {
+   "cell_type": "raw",
+   "metadata": {
+    "vscode": {
+     "languageId": "raw"
+    }
+   },
+   "source": [
+    "+++\n",
+    "title = 'Minimizers are Just Fancy K-mers'\n",
+    "date = 2024-09-04T08:00:00+00:00\n",
+    "draft = true\n",
+    "+++"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
+    "Today I am picking up an old but influential paper. Cited over 400 times, \"Reducing storage requirements for biological sequence comparison\" by [Roberts et al. (2004)](https://academic.oup.com/bioinformatics/article/20/18/3363/202143) has had a considerable impact on the sequencing community. If you use a modern aligner, you are relying on ideas published in that paper.\n",
+    "\n",
+    "One notable paper that cites it is the Minimap2 publication, where it is the first citation in the methods section. That alone makes it worth a read.\n",
+    "\n",
+    "## The Core Idea\n",
+    "\n",
+    "The core idea of the paper is that comparing sequences in the age of next generation sequencing (NGS) and large datasets needs a smart way of reducing the data, so that not every sequence has to be compared to every other sequence.\n",
+    "\n",
+    "To solve that problem, the authors present the concept of minimizers.\n",
+    "\n",
+    "Today I will implement that concept and see if I can identify meaningful minimizers.\n",
+    "\n",
+    "\n",
+    "### Minimizers, what are they?\n",
+    "\n",
+    "Minimizers are a relatively simple idea. The key problem they solve is providing good seeds, meaning locations where two sequences are identical, to kick-start an alignment of those two sequences.\n",
+    "\n",
+    "The most naive way of solving this would be to compute all k-mers of both sequences, find the common k-mers, and try to align the sequences starting at each of them. But those k-mer databases would become huge.\n",
+    "\n",
+    "So to reduce that solution space, only the \"smallest\" k-mer of each window is stored instead of all k-mers. That k-mer is the window's minimizer. Two identical windows always select the same minimizer, so as long as the sequences share sufficiently long stretches of identical nucleotides, they will also share the odd \"smallest\" k-mer. One could just as well store the \"biggest\" k-mer. It really does not matter, as long as the index and the query operation compute the order of the k-mers the same way.\n",
+    "\n",
+    "All that is left to do for each minimizer is to store its location in the sequence so that one can use it as a seed for an alignment if it is found in the query.\n",
+    "\n",
+    "\n",
+    "## A Simple Implementation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import hashlib\n",
+    "\n",
+    "from pydantic import BaseModel\n",
+    "\n",
+    "\n",
+    "class Kmer(BaseModel):\n",
+    "    kmer: str\n",
+    "\n",
+    "    def __hash__(self):\n",
+    "        # MD5-based value used to order k-mers.\n",
+    "        # Ordering lexicographically would favour\n",
+    "        # uninformative k-mers such as AAAA.\n",
+    "        return int(hashlib.md5(self.kmer.encode()).hexdigest(), 16)\n",
+    "\n",
+    "    def __len__(self):\n",
+    "        return len(self.kmer)\n",
+    "\n",
+    "    def __str__(self):\n",
+    "        return self.kmer\n",
+    "\n",
+    "\n",
+    "class Minimizer(BaseModel):\n",
+    "    kmer: Kmer\n",
+    "    sequence_id: str\n",
+    "    position: int\n",
+    "\n",
+    "    def __hash__(self):\n",
+    "        return int(\n",
+    "            hashlib.md5(f\"{self.kmer}{self.position}\".encode()).hexdigest(), 16\n",
+    "        )\n",
+    "\n",
+    "    def __lt__(self, other):\n",
+    "        return hash(self.kmer) < hash(other.kmer)\n",
+    "\n",
+    "    def __eq__(self, other):\n",
+    "        if not isinstance(other, Minimizer):\n",
+    "            # Let Python fall back to its default handling instead of raising\n",
+    "            return NotImplemented\n",
+    "        return self.kmer == other.kmer and self.position == other.position\n",
+    "    \n",
+    "    def __str__(self):\n",
+    "        return f\"'{self.kmer}' @ {self.sequence_id}: {self.position}\"\n",
+    "\n",
+    "\n",
+    "minimizer = Minimizer(kmer=Kmer(kmer=\"ATCG\"), sequence_id=\"seq_1\", position=99)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The `Kmer` and `Minimizer` classes implement the basic functionality I need next to find the smallest Minimizer in a sequence. Because `Minimizer` defines `__lt__` on the hash of its k-mer, I can sort a list of Minimizers by that hash."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "The k-mer with the lowest value is: 'B'\n"
+     ]
+    }
+   ],
+   "source": [
+    "kmers = [\"A\", \"B\", \"C\"]\n",
+    "minimizers = [\n",
+    "    Minimizer(kmer=Kmer(kmer=kmer), position=position, sequence_id=\"1\")\n",
+    "    for position, kmer in enumerate(kmers)\n",
+    "]\n",
+    "sorted_minimizers = sorted(minimizers)\n",
+    "print(f\"The k-mer with the lowest value is: '{sorted_minimizers[0].kmer}'\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "That's working well. Now I can get the Minimizers for a sequence along its windows and store them:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def windows(sequence: str, window_size: int):\n",
+    "    # Yield the start index of every full window of `window_size` characters\n",
+    "    for i in range(len(sequence) - window_size + 1):\n",
+    "        yield i\n",
+    "\n",
+    "\n",
+    "def get_minimizer(\n",
+    "    sequence_id: str, sequence: str, k: int, offset: int\n",
+    ") -> Minimizer:\n",
+    "    all_kmers = []\n",
+    "    for i in range(len(sequence) - k + 1):\n",
+    "        kmer = sequence[i : i + k]\n",
+    "        all_kmers.append(\n",
+    "            Minimizer(\n",
+    "                kmer=Kmer(kmer=kmer),\n",
+    "                sequence_id=sequence_id,\n",
+    "                position=offset + i,\n",
+    "            )\n",
+    "        )\n",
+    "    all_kmers.sort()\n",
+    "\n",
+    "    return all_kmers[0]\n",
+    "\n",
+    "\n",
+    "def minimizers(\n",
+    "    sequence_id: str, sequence: str, w: int, k: int, unique: bool = True\n",
+    ") -> list[Minimizer]:\n",
+    "    # Collect one minimizer per window; the window start doubles as the offset\n",
+    "    found = []\n",
+    "\n",
+    "    for start in windows(sequence, w):\n",
+    "        window = sequence[start : start + w]\n",
+    "        found.append(get_minimizer(sequence_id, window, k, start))\n",
+    "\n",
+    "    if unique:\n",
+    "        # Overlapping windows often select the same k-mer at the same position\n",
+    "        found = list(set(found))\n",
+    "    return found"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
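+    "These basic functions are all I really need to get the Minimizers of a sequence, given a window size `w` and a k-mer length `k`. Note that `w` is the window length in characters here, so each window contains `w - k + 1` k-mers."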
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Collected 7 minimizers\n",
+      "The first three Minimizers:\n",
+      "'Hello W' @ seq_1: 0\n",
+      "'llo Wor' @ seq_1: 2\n",
+      "' World,' @ seq_1: 5\n"
+     ]
+    }
+   ],
+   "source": [
+    "example_sequence = \"Hello World, this is a sequence.\"\n",
+    "\n",
+    "k = 7\n",
+    "window_size = 12\n",
+    "\n",
+    "sequence_minimizers = minimizers(\n",
+    "    sequence_id=\"seq_1\",\n",
+    "    sequence=example_sequence,\n",
+    "    w=window_size,\n",
+    "    k=k,\n",
+    ")\n",
+    "sequence_minimizers.sort(key=lambda x: x.position) # sort by position for displaying\n",
+    "\n",
+    "print(f\"Collected {len(sequence_minimizers)} minimizers\")\n",
+    "print(\"The first three Minimizers:\")\n",
+    "for i in range(3):\n",
+    "    print(sequence_minimizers[i])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now that I can get the Minimizers I can also find the common Minimizers between two sequences."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Common kmer: ATACGCAT\n",
+      "atgctagcATACGCATcacgcatc\n",
+      "ggatcagctcgagcATACGCATacgcatcgcatcgat\n",
+      "\n",
+      "Common kmer: ATACGCAT\n",
+      "atgctagcATACGCATcacgcatc\n",
+      "ggatcagctcgagcatacgcATACGCATcgcatcgat\n",
+      "\n",
+      "Common kmer: TACGCATC\n",
+      "atgctagcaTACGCATCacgcatc\n",
+      "ggatcagctcgagcatacgcaTACGCATCgcatcgat\n",
+      "\n",
+      "Common kmer: AGCATACG\n",
+      "atgctAGCATACGcatcacgcatc\n",
+      "ggatcagctcgAGCATACGcatacgcatcgcatcgat\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "def visualize_minimizer(seq: str, minimizer: Minimizer) -> str:\n",
+    "    before = seq[: minimizer.position].lower()\n",
+    "    after = seq[minimizer.position + len(minimizer.kmer) :].lower()\n",
+    "    return f\"{before}{str(minimizer.kmer.kmer).upper()}{after}\"\n",
+    "\n",
+    "\n",
+    "def compare_sequences(seq1: str, seq2: str, w: int, k: int) -> list[dict]:\n",
+    "    minimizers1 = minimizers(\"seq_1\", seq1, w, k)\n",
+    "    minimizers2 = minimizers(\"seq_2\", seq2, w, k)\n",
+    "\n",
+    "    minimizer_set1 = {m.kmer for m in minimizers1}\n",
+    "    minimizer_set2 = {m.kmer for m in minimizers2}\n",
+    "\n",
+    "    common_minimizers = minimizer_set1.intersection(minimizer_set2)\n",
+    "\n",
+    "    # Print every pairing of a shared minimizer and record its positions\n",
+    "    comparison_results = []\n",
+    "    for minimizer in common_minimizers:\n",
+    "        # find the substrings that match and show that\n",
+    "        for i in [m for m in minimizers1 if m.kmer == minimizer]:\n",
+    "            for j in [m for m in minimizers2 if m.kmer == minimizer]:\n",
+    "                print(f\"Common kmer: {minimizer.kmer}\")\n",
+    "                print(visualize_minimizer(seq1, i))\n",
+    "                print(visualize_minimizer(seq2, j))\n",
+    "                print(\"\")\n",
+    "                comparison_results.append(\n",
+    "                    {\n",
+    "                        \"kmer\": minimizer.kmer,\n",
+    "                        \"position_1\": i.position,\n",
+    "                        \"position_2\": j.position,\n",
+    "                    }\n",
+    "                )\n",
+    "\n",
+    "    return comparison_results\n",
+    "\n",
+    "\n",
+    "sequence1 = \"ATGCTAGCATACGCATCACGCATC\"\n",
+    "sequence2 = \"GGATCAGCTCGAGCATACGCATACGCATCGCATCGAT\"\n",
+    "\n",
+    "w = 10  # Window size\n",
+    "k = 8  # k-mer size\n",
+    "comparison_results = compare_sequences(sequence1, sequence2, w, k)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This implementation shows how easily this approach finds suitable seed locations for starting a pairwise alignment.\n",
+    "\n",
+    "The paper goes into more detail and also introduces the concept of end-minimizers. I can only recommend checking it out.\n",
+    "\n",
+    "That's all I have today. I hope it was interesting."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "reproduce_hic",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
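
A dependency-free sketch of the window scan this notebook implements, for trying the idea without pydantic. It assumes `w` is the window length in characters (as in the notebook) and orders k-mers by their full MD5 digest. `kmer_order` and `window_minimizers` are hypothetical names that are not part of this commit. Because the notebook passes the digest through Python's `hash()`, which reduces the integer to machine-word size, the k-mers picked here may differ from the outputs shown in the diff.

```python
import hashlib


def kmer_order(kmer: str) -> int:
    """Sort key for a k-mer: its MD5 digest read as an integer (assumed ordering)."""
    return int(hashlib.md5(kmer.encode()).hexdigest(), 16)


def window_minimizers(sequence: str, w: int, k: int) -> list[tuple[int, str]]:
    """Return the distinct (position, k-mer) minimizers of `sequence`.

    `w` is the window length in characters, so each window holds
    w - k + 1 k-mers; the one with the smallest key is selected.
    """
    if not 0 < k <= w <= len(sequence):
        raise ValueError("need 0 < k <= w <= len(sequence)")

    selected = set()
    for start in range(len(sequence) - w + 1):
        # All k-mer start positions that fit inside this window
        candidates = [
            (kmer_order(sequence[i : i + k]), i)
            for i in range(start, start + w - k + 1)
        ]
        # Smallest key wins; equal keys fall back to the leftmost position
        _, position = min(candidates)
        selected.add((position, sequence[position : position + k]))

    # The set removes picks repeated by overlapping windows
    return sorted(selected)


# Usage with the first sequence from the notebook
for position, kmer in window_minimizers("ATGCTAGCATACGCATCACGCATC", w=10, k=8):
    print(position, kmer)
```

Every window contributes exactly one k-mer, so any stretch of `w` identical characters in two sequences yields at least one shared seed, whichever ordering is used.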