From 213d5af0fe919277087fe7e80793eb35ad88fa82 Mon Sep 17 00:00:00 2001
From: Jeff Vestal <53237856+jeffvestal@users.noreply.github.com>
Date: Fri, 1 Nov 2024 11:30:36 -0500
Subject: [PATCH] adding notebook for re-ranking-elasticsearch-hosted blog
 (#346)

* adding notebook for re-ranking-elasticsearch-hosted blog

* adding cloud trial signup link

* Update supporting-blog-content/re-ranking-elasticsearch-hosted/re-ranking-elasticsearch-hosted.ipynb

fix typo

Co-authored-by: Carly Richmond <74931905+carlyrichmond@users.noreply.github.com>

* fixed index name

* typo

---------

Co-authored-by: Carly Richmond <74931905+carlyrichmond@users.noreply.github.com>
---
 .../re-ranking-elasticsearch-hosted.ipynb | 819 ++++++++++++++++++
 1 file changed, 819 insertions(+)
 create mode 100644 supporting-blog-content/re-ranking-elasticsearch-hosted/re-ranking-elasticsearch-hosted.ipynb

diff --git a/supporting-blog-content/re-ranking-elasticsearch-hosted/re-ranking-elasticsearch-hosted.ipynb b/supporting-blog-content/re-ranking-elasticsearch-hosted/re-ranking-elasticsearch-hosted.ipynb
new file mode 100644
index 00000000..281f8a47
--- /dev/null
+++ b/supporting-blog-content/re-ranking-elasticsearch-hosted/re-ranking-elasticsearch-hosted.ipynb
@@ -0,0 +1,819 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "provenance": []
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "source": [
+ "*Reranking with a locally hosted reranker model from Hugging Face*"
+ ],
+ "metadata": {
+ "id": "bW9q8qD_bPhY"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Set up the notebook"
+ ],
+ "metadata": {
+ "id": "BecBOzyDbWik"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Install required libraries"
+ ],
+ "metadata": {
+ "id": "6ayhDP72bZAe"
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "2Xz9uWQFbNkH"
+ },
+ "outputs": [],
+ "source": [
+ "!pip install -qqU elasticsearch\n",
+ "!pip install -qqU eland[pytorch]\n",
+ "!pip install -qqU datasets"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Import the required Python libraries"
+ ],
+ "metadata": {
+ "id": "LgHQaJh0bmJQ"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "import os\n",
+ "from elasticsearch import Elasticsearch, helpers, exceptions\n",
+ "from urllib.request import urlopen\n",
+ "from getpass import getpass\n",
+ "import json\n",
+ "import time\n",
+ "from datasets import load_dataset\n",
+ "import pandas as pd"
+ ],
+ "metadata": {
+ "id": "CsL466H0bjNX"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Create an Elasticsearch Python client\n",
+ "\n"
+ ],
+ "metadata": {
+ "id": "gsQ4XIpkbpd4"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Free Trial\n",
+ "If you don't have an Elasticsearch cluster, or want one to test with, head over to [cloud.elastic.co](https://cloud.elastic.co/registration?onboarding_token=search&cta=cloud-registration&tech=trial&plcmt=article%20content&pg=search-labs) and sign up. You can have a serverless project up and running in only a few minutes!"
+ ],
+ "metadata": {
+ "id": "M_JppGte1Wc5"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We are using an Elastic Cloud `cloud_id` and a deployment (cluster) API key.\n",
+ "\n",
+ "[See this guide](https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud) for finding the `cloud_id` and creating an `api_key`."
+ ],
+ "metadata": {
+ "id": "gdsEjwhV1tr5"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "cloud_id = getpass(prompt=\"Enter your Elasticsearch Cloud ID: \")\n",
+ "api_key = getpass(prompt=\"Enter your Elasticsearch API key: \")\n",
+ "\n",
+ "\n",
+ "es = Elasticsearch(cloud_id=cloud_id, api_key=api_key)\n",
+ "\n",
+ "try:\n",
+ "    es.info()\n",
+ "    print(\"Successfully connected to Elasticsearch!\")\n",
+ "except exceptions.ConnectionError as e:\n",
+ "    print(f\"Error connecting to Elasticsearch: {e}\")"
+ ],
+ "metadata": {
+ "id": "UY5WCB0HUVTb"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
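+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Self-managed alternative (optional)\n",
+ "If you are connecting to a self-managed cluster instead of Elastic Cloud, you can build the client from a URL. This is a minimal sketch, not part of the original walkthrough; the host, port, and certificate path below are placeholders you would swap for your own values."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Optional: connect to a self-managed cluster instead of Elastic Cloud.\n",
+ "# The URL and certificate path below are placeholders for your own deployment.\n",
+ "# es = Elasticsearch(\n",
+ "#     \"https://localhost:9200\",  # your cluster's HTTPS endpoint\n",
+ "#     api_key=api_key,  # or basic_auth=(\"user\", \"password\")\n",
+ "#     ca_certs=\"/path/to/http_ca.crt\",  # CA cert if the cluster is self-signed\n",
+ "# )"
+ ],
+ "metadata": {},
+ "execution_count": null,
+ "outputs": []
+ },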

\n", + "**Note**:\n", + "While we are importing the model for use as a reranker, Eland and Elasticsearch do not have a dedicated rerank task type, so we still use `text_similarity`" + ], + "metadata": { + "id": "5bsLLnqCfNKk" + } + }, + { + "cell_type": "code", + "source": [ + "model_id = \"cross-encoder/ms-marco-MiniLM-L-6-v2\"\n", + "\n", + "!eland_import_hub_model \\\n", + " --cloud-id $cloud_id \\\n", + " --es-api-key $api_key \\\n", + " --hub-model-id $model_id \\\n", + " --task-type text_similarity" + ], + "metadata": { + "id": "J2MTEYrUfk9R" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Create Inference Endpoint\n", + "Run this cell to:\n", + "- Create an inference Endpoint\n", + "- Deploy the reranking model we impoted in the previous section\n", + "We need to create an endpoint queries can use for reranking\n", + "\n", + "Key points about the `model_config`\n", + "- `service` - in this case `elasticsearch` will tell the inference API to use a locally hosted (in Elasticsearch) model\n", + "- `num_allocations` sets the number of allocations to 1\n", + " - Allocations are independent units of work for NLP tasks. Scaling this allows for an increase in concurrent throughput\n", + "- `num_threads` - sets the number of threads per allocation to 1\n", + " - Threads per allocation affect the number of threads used by each allocation during inference. Scaling this generally increased the speed of inference requests (to a point).\n", + "- `model_id` - This is the id of the model as it is named in Elasticsearch\n", + "\n" + ], + "metadata": { + "id": "-rrQV6SAgWz8" + } + }, + { + "cell_type": "code", + "source": [ + "model_config = {\n", + " \"service\": \"elasticsearch\",\n", + " \"service_settings\": {\n", + " \"num_allocations\": 1,\n", + " \"num_threads\": 1,\n", + " \"model_id\": \"cross-encoder__ms-marco-minilm-l-6-v2\",\n", + " },\n", + " \"task_settings\": {\"return_documents\": True},\n", + "}\n", + "\n", + "inference_id = \"semantic-reranking\"\n", + "\n", + "create_endpoint = es.inference.put(\n", + " inference_id=inference_id, task_type=\"rerank\", body=model_config\n", + ")\n", + "\n", + "create_endpoint.body" + ], + "metadata": { + "id": "Abu084BYgWCE", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "d58cc940-281e-4d56-d6e6-040e4881e78a" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'inference_id': 'semantic-reranking',\n", + " 'task_type': 'rerank',\n", + " 'service': 'elasticsearch',\n", + " 'service_settings': {'num_allocations': 1,\n", + " 'num_threads': 1,\n", + " 'model_id': 'cross-encoder__ms-marco-minilm-l-6-v2'},\n", + " 'task_settings': {'return_documents': True}}" + ] + }, + "metadata": {}, + "execution_count": 6 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "### Verify it was created\n", + "\n", + "- Run the two cells in this section to verify:\n", + "- The Inference Endpoint has been completed\n", + "- The model has been deployed\n", + "\n", + "You should see JSON output with information about the semantic endpoint" + ], + "metadata": { + "id": "X8rQXMrHhMkS" + } + }, + { + "cell_type": "code", + "source": [ + "check_endpoint = es.inference.get(\n", + " inference_id=inference_id,\n", + ")\n", + "\n", + "check_endpoint.body" + ], + "metadata": { + "id": "n3Yk7rgYhP-N", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "d9e68225-5796-411e-964a-6db3be5541aa" + }, + 
"execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'endpoints': [{'inference_id': 'semantic-reranking',\n", + " 'task_type': 'rerank',\n", + " 'service': 'elasticsearch',\n", + " 'service_settings': {'num_allocations': 1,\n", + " 'num_threads': 1,\n", + " 'model_id': 'cross-encoder__ms-marco-minilm-l-6-v2'},\n", + " 'task_settings': {'return_documents': True}}]}" + ] + }, + "metadata": {}, + "execution_count": 7 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Create the index mapping\n", + "\n", + "We are going to index the `title` and `abstract` from the dataset. " + ], + "metadata": { + "id": "4vqimyNWAhWb" + } + }, + { + "cell_type": "code", + "source": [ + "index_name = \"arxiv-papers\"\n", + "\n", + "index_mapping = {\n", + " \"mappings\": {\n", + " \"properties\": {\"title\": {\"type\": \"text\"}, \"abstract\": {\"type\": \"text\"}}\n", + " }\n", + "}\n", + "\n", + "\n", + "try:\n", + " es.indices.create(index=index_name, body=index_mapping)\n", + " print(f\"Index '{index_name}' created successfully.\")\n", + "except exceptions.RequestError as e:\n", + " if e.error == \"resource_already_exists_exception\":\n", + " print(f\"Index '{index_name}' already exists.\")\n", + " else:\n", + " print(f\"Error creating index '{index_name}': {e}\")" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "4bc9ba0b-9be3-410a-d1d6-f1da04bbfec7", + "id": "DPADF_7ytTmR" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Index 'arxiv-papers' created successfully.\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Ready the dataset\n", + "We are going to use the [CShorten/ML-ArXiv-Papers](https://huggingface.co/datasets/CShorten/ML-ArXiv-Papers) dataset." + ], + "metadata": { + "id": "FqQmaT5P-Nhx" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Download Dataset\n", + "**Note** You may get a warning *The secret `HF_TOKEN` does not exist in your Colab secrets*.\n", + "\n", + "You can safely ignore this." + ], + "metadata": { + "id": "aN0dbYO7oB47" + } + }, + { + "cell_type": "code", + "source": [ + "dataset = load_dataset(\"CShorten/ML-ArXiv-Papers\")" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "IVnpj5bBoEBL", + "outputId": "bc6371d9-d66f-482c-95f8-8cdb89e4f0ef" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: \n", + "The secret `HF_TOKEN` does not exist in your Colab secrets.\n", + "To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n", + "You will be able to reuse this secret in all of your notebooks.\n", + "Please note that authentication is recommended but still optional to access public models or datasets.\n", + " warnings.warn(\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "### Index into Elasticsearch\n", + "\n", + "We will loop through the dataset and send batches of rows to Elasticsearch\n", + "- This may take a couple minutes depending on your cluster sizing." 
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Query with Reranking\n",
+ "\n",
+ "This query contains a `text_similarity_reranker` retriever which:\n",
+ "1. Uses a standard retriever to:\n",
+ "   1. Perform a lexical query against the `title` field\n",
+ "2. Performs a reranking which:\n",
+ "   1. Takes as input the top 100 results from the previous search\n",
+ "      - `\"rank_window_size\": 100`\n",
+ "   2. Takes as input the query\n",
+ "      - `\"inference_text\": query`\n",
+ "   3. Uses our previously created reranking endpoint and model\n"
+ ],
+ "metadata": {
+ "id": "2bwvzLfRjJ2n"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "query = \"sparse vector embedding\"\n",
+ "\n",
+ "# Query using lexical scoring only\n",
+ "response_scored = es.search(\n",
+ "    index=\"arxiv-papers\",\n",
+ "    body={\n",
+ "        \"size\": 10,\n",
+ "        \"retriever\": {\"standard\": {\"query\": {\"match\": {\"title\": query}}}},\n",
+ "        \"fields\": [\"title\", \"abstract\"],\n",
+ "        \"_source\": False,\n",
+ "    },\n",
+ ")\n",
+ "\n",
+ "# Query with Semantic Reranker\n",
+ "response_reranked = es.search(\n",
+ "    index=\"arxiv-papers\",\n",
+ "    body={\n",
+ "        \"size\": 10,\n",
+ "        \"retriever\": {\n",
+ "            \"text_similarity_reranker\": {\n",
+ "                \"retriever\": {\"standard\": {\"query\": {\"match\": {\"title\": query}}}},\n",
+ "                \"field\": \"abstract\",\n",
+ "                \"inference_id\": \"semantic-reranking\",\n",
+ "                \"inference_text\": query,\n",
+ "                \"rank_window_size\": 100,\n",
+ "            }\n",
+ "        },\n",
+ "        \"fields\": [\"title\", \"abstract\"],\n",
+ "        \"_source\": False,\n",
+ "    },\n",
+ ")"
+ ],
+ "metadata": {
+ "id": "HWXQBS35jQ3n"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
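+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Calling the reranker directly (optional)\n",
+ "Before comparing full searches, it can help to see what the rerank endpoint returns on its own. A minimal sketch, assuming your `elasticsearch` client version exposes `es.inference.inference` for the rerank task; the toy passages below are made up for illustration:"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Rerank a few toy passages against a query, outside of any search\n",
+ "rerank_response = es.inference.inference(\n",
+ "    inference_id=\"semantic-reranking\",\n",
+ "    task_type=\"rerank\",\n",
+ "    query=\"sparse vector embedding\",\n",
+ "    input=[\n",
+ "        \"Dense vector embeddings for semantic search\",\n",
+ "        \"A cookbook of sourdough bread recipes\",\n",
+ "        \"Sparse representations in high-dimensional vector spaces\",\n",
+ "    ],\n",
+ ")\n",
+ "\n",
+ "# Inspect the raw response; each input comes back with a relevance score\n",
+ "print(rerank_response)"
+ ],
+ "metadata": {},
+ "execution_count": null,
+ "outputs": []
+ },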
"text/html": [ + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Comparison of Scored and Semantic Reranked Results - Query: 'sparse vector embedding'
 Scored ResultsReranked Results
0Compact Speaker Embedding: lrx-vectorScaling Up Sparse Support Vector Machines by Simultaneous Feature and\n", + " Sample Reduction
1Quantum Sparse Support Vector MachinesSpaceland Embedding of Sparse Stochastic Graphs
2Sparse Support Vector Infinite PushElliptical Ordinal Embedding
3The Sparse Vector Technique, RevisitedMinimum-Distortion Embedding
4L-Vector: Neural Label Embedding for Domain AdaptationFree Gap Information from the Differentially Private Sparse Vector and\n", + " Noisy Max Mechanisms
5Spaceland Embedding of Sparse Stochastic GraphsInterpolated Discretized Embedding of Single Vectors and Vector Pairs\n", + " for Classification, Metric Learning and Distance Approximation
6Sparse Signal Recovery in the Presence of Intra-Vector and Inter-Vector\n", + " CorrelationAttention Word Embedding
7Stable Sparse Subspace Embedding for Dimensionality ReductionBinary Speaker Embedding
8Auto-weighted Mutli-view Sparse Reconstructive EmbeddingNetSMF: Large-Scale Network Embedding as Sparse Matrix Factorization
9Embedding Words in Non-Vector Space with Unsupervised Graph LearningEstimating Vector Fields on Manifolds and the Embedding of Directed\n", + " Graphs
\n" + ] + }, + "metadata": {}, + "execution_count": 17 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Print out Title and Abstract\n", + "This will print the title and the abstract for the top 10 results after semantic reranking." + ], + "metadata": { + "id": "A0HyNZoWyeun" + } + }, + { + "cell_type": "code", + "source": [ + "for paper in response_scored.body[\"hits\"][\"hits\"]:\n", + " print(\n", + " f\"Title {paper['fields']['title'][0]} \\n Abstract: {paper['fields']['abstract'][0]}\"\n", + " )" + ], + "metadata": { + "id": "4ZEx-46rn3in", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "e83318ad-ca42-4aa7-98d4-37c4428eb70a" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Title Compact Speaker Embedding: lrx-vector \n", + " Abstract: Deep neural networks (DNN) have recently been widely used in speaker\n", + "recognition systems, achieving state-of-the-art performance on various\n", + "benchmarks. The x-vector architecture is especially popular in this research\n", + "community, due to its excellent performance and manageable computational\n", + "complexity. In this paper, we present the lrx-vector system, which is the\n", + "low-rank factorized version of the x-vector embedding network. The primary\n", + "objective of this topology is to further reduce the memory requirement of the\n", + "speaker recognition system. We discuss the deployment of knowledge distillation\n", + "for training the lrx-vector system and compare against low-rank factorization\n", + "with SVD. On the VOiCES 2019 far-field corpus we were able to reduce the\n", + "weights by 28% compared to the full-rank x-vector system while keeping the\n", + "recognition rate constant (1.83% EER).\n", + "\n", + "Title Quantum Sparse Support Vector Machines \n", + " Abstract: We analyze the computational complexity of Quantum Sparse Support Vector\n", + "Machine, a linear classifier that minimizes the hinge loss and the $L_1$ norm\n", + "of the feature weights vector and relies on a quantum linear programming solver\n", + "instead of a classical solver. Sparse SVM leads to sparse models that use only\n", + "a small fraction of the input features in making decisions, and is especially\n", + "useful when the total number of features, $p$, approaches or exceeds the number\n", + "of training samples, $m$. We prove a $\\Omega(m)$ worst-case lower bound for\n", + "computational complexity of any quantum training algorithm relying on black-box\n", + "access to training samples; quantum sparse SVM has at least linear worst-case\n", + "complexity. However, we prove that there are realistic scenarios in which a\n", + "sparse linear classifier is expected to have high accuracy, and can be trained\n", + "in sublinear time in terms of both the number of training samples and the\n", + "number of features.\n", + "\n", + "Title Sparse Support Vector Infinite Push \n", + " Abstract: In this paper, we address the problem of embedded feature selection for\n", + "ranking on top of the list problems. We pose this problem as a regularized\n", + "empirical risk minimization with $p$-norm push loss function ($p=\\infty$) and\n", + "sparsity inducing regularizers. We leverage the issues related to this\n", + "challenging optimization problem by considering an alternating direction method\n", + "of multipliers algorithm which is built upon proximal operators of the loss\n", + "function and the regularizer. 
Our main technical contribution is thus to\n", + "provide a numerical scheme for computing the infinite push loss function\n", + "proximal operator. Experimental results on toy, DNA microarray and BCI problems\n", + "show how our novel algorithm compares favorably to competitors for ranking on\n", + "top while using fewer variables in the scoring function.\n", + "\n", + "Title The Sparse Vector Technique, Revisited \n", + " Abstract: We revisit one of the most basic and widely applicable techniques in the\n", + "literature of differential privacy - the sparse vector technique [Dwork et al.,\n", + "STOC 2009]. This simple algorithm privately tests whether the value of a given\n", + "query on a database is close to what we expect it to be. It allows to ask an\n", + "unbounded number of queries as long as the answer is close to what we expect,\n", + "and halts following the first query for which this is not the case.\n", + " We suggest an alternative, equally simple, algorithm that can continue\n", + "testing queries as long as any single individual does not contribute to the\n", + "answer of too many queries whose answer deviates substantially form what we\n", + "expect. Our analysis is subtle and some of its ingredients may be more widely\n", + "applicable. In some cases our new algorithm allows to privately extract much\n", + "more information from the database than the original.\n", + " We demonstrate this by applying our algorithm to the shifting heavy-hitters\n", + "problem: On every time step, each of $n$ users gets a new input, and the task\n", + "is to privately identify all the current heavy-hitters. That is, on time step\n", + "$i$, the goal is to identify all data elements $x$ such that many of the users\n", + "have $x$ as their current input. We present an algorithm for this problem with\n", + "improved error guarantees over what can be obtained using existing techniques.\n", + "Specifically, the error of our algorithm depends on the maximal number of times\n", + "that a single user holds a heavy-hitter as input, rather than the total number\n", + "of times in which a heavy-hitter exists.\n", + "\n", + "Title L-Vector: Neural Label Embedding for Domain Adaptation \n", + " Abstract: We propose a novel neural label embedding (NLE) scheme for the domain\n", + "adaptation of a deep neural network (DNN) acoustic model with unpaired data\n", + "samples from source and target domains. With NLE method, we distill the\n", + "knowledge from a powerful source-domain DNN into a dictionary of label\n", + "embeddings, or l-vectors, one for each senone class. Each l-vector is a\n", + "representation of the senone-specific output distributions of the source-domain\n", + "DNN and is learned to minimize the average L2, Kullback-Leibler (KL) or\n", + "symmetric KL distance to the output vectors with the same label through simple\n", + "averaging or standard back-propagation. During adaptation, the l-vectors serve\n", + "as the soft targets to train the target-domain model with cross-entropy loss.\n", + "Without parallel data constraint as in the teacher-student learning, NLE is\n", + "specially suited for the situation where the paired target-domain data cannot\n", + "be simulated from the source-domain data. We adapt a 6400 hours\n", + "multi-conditional US English acoustic model to each of the 9 accented English\n", + "(80 to 830 hours) and kids' speech (80 hours). 
NLE achieves up to 14.1%\n", + "relative word error rate reduction over direct re-training with one-hot labels.\n", + "\n", + "Title Spaceland Embedding of Sparse Stochastic Graphs \n", + " Abstract: We introduce a nonlinear method for directly embedding large, sparse,\n", + "stochastic graphs into low-dimensional spaces, without requiring vertex\n", + "features to reside in, or be transformed into, a metric space. Graph data and\n", + "models are prevalent in real-world applications. Direct graph embedding is\n", + "fundamental to many graph analysis tasks, in addition to graph visualization.\n", + "We name the novel approach SG-t-SNE, as it is inspired by and builds upon the\n", + "core principle of t-SNE, a widely used method for nonlinear dimensionality\n", + "reduction and data visualization. We also introduce t-SNE-$\\Pi$, a\n", + "high-performance software for 2D, 3D embedding of large sparse graphs on\n", + "personal computers with superior efficiency. It empowers SG-t-SNE with modern\n", + "computing techniques for exploiting in tandem both matrix structures and memory\n", + "architectures. We present elucidating embedding results on one synthetic graph\n", + "and four real-world networks.\n", + "\n", + "Title Sparse Signal Recovery in the Presence of Intra-Vector and Inter-Vector\n", + " Correlation \n", + " Abstract: This work discusses the problem of sparse signal recovery when there is\n", + "correlation among the values of non-zero entries. We examine intra-vector\n", + "correlation in the context of the block sparse model and inter-vector\n", + "correlation in the context of the multiple measurement vector model, as well as\n", + "their combination. Algorithms based on the sparse Bayesian learning are\n", + "presented and the benefits of incorporating correlation at the algorithm level\n", + "are discussed. The impact of correlation on the limits of support recovery is\n", + "also discussed highlighting the different impact intra-vector and inter-vector\n", + "correlations have on such limits.\n", + "\n", + "Title Stable Sparse Subspace Embedding for Dimensionality Reduction \n", + " Abstract: Sparse random projection (RP) is a popular tool for dimensionality reduction\n", + "that shows promising performance with low computational complexity. However, in\n", + "the existing sparse RP matrices, the positions of non-zero entries are usually\n", + "randomly selected. Although they adopt uniform sampling with replacement, due\n", + "to large sampling variance, the number of non-zeros is uneven among rows of the\n", + "projection matrix which is generated in one trial, and more data information\n", + "may be lost after dimension reduction. To break this bottleneck, based on\n", + "random sampling without replacement in statistics, this paper builds a stable\n", + "sparse subspace embedded matrix (S-SSE), in which non-zeros are uniformly\n", + "distributed. It is proved that the S-SSE is stabler than the existing matrix,\n", + "and it can maintain Euclidean distance between points well after dimension\n", + "reduction. Our empirical studies corroborate our theoretical findings and\n", + "demonstrate that our approach can indeed achieve satisfactory performance.\n", + "\n", + "Title Auto-weighted Mutli-view Sparse Reconstructive Embedding \n", + " Abstract: With the development of multimedia era, multi-view data is generated in\n", + "various fields. Contrast with those single-view data, multi-view data brings\n", + "more useful information and should be carefully excavated. 
Therefore, it is\n", + "essential to fully exploit the complementary information embedded in multiple\n", + "views to enhance the performances of many tasks. Especially for those\n", + "high-dimensional data, how to develop a multi-view dimension reduction\n", + "algorithm to obtain the low-dimensional representations is of vital importance\n", + "but chanllenging. In this paper, we propose a novel multi-view dimensional\n", + "reduction algorithm named Auto-weighted Mutli-view Sparse Reconstructive\n", + "Embedding (AMSRE) to deal with this problem. AMSRE fully exploits the sparse\n", + "reconstructive correlations between features from multiple views. Furthermore,\n", + "it is equipped with an auto-weighted technique to treat multiple views\n", + "discriminatively according to their contributions. Various experiments have\n", + "verified the excellent performances of the proposed AMSRE.\n", + "\n", + "Title Embedding Words in Non-Vector Space with Unsupervised Graph Learning \n", + " Abstract: It has become a de-facto standard to represent words as elements of a vector\n", + "space (word2vec, GloVe). While this approach is convenient, it is unnatural for\n", + "language: words form a graph with a latent hierarchical structure, and this\n", + "structure has to be revealed and encoded by word embeddings. We introduce\n", + "GraphGlove: unsupervised graph word representations which are learned\n", + "end-to-end. In our setting, each word is a node in a weighted graph and the\n", + "distance between words is the shortest path distance between the corresponding\n", + "nodes. We adopt a recent method learning a representation of data in the form\n", + "of a differentiable weighted graph and use it to modify the GloVe training\n", + "algorithm. We show that our graph-based representations substantially\n", + "outperform vector-based methods on word similarity and analogy tasks. Our\n", + "analysis reveals that the structure of the learned graphs is hierarchical and\n", + "similar to that of WordNet, the geometry is highly non-trivial and contains\n", + "subgraphs with different local topology.\n", + "\n" + ] + } + ] + } + ] +} \ No newline at end of file