From 213d5af0fe919277087fe7e80793eb35ad88fa82 Mon Sep 17 00:00:00 2001
From: Jeff Vestal <53237856+jeffvestal@users.noreply.github.com>
Date: Fri, 1 Nov 2024 11:30:36 -0500
Subject: [PATCH] adding notebook for re-ranking-elasticsearch-hosted blog
 (#346)

* adding notebook for re-ranking-elasticsearch-hosted blog

* adding cloud trial signup link

* Update supporting-blog-content/re-ranking-elasticsearch-hosted/re-ranking-elasticsearch-hosted.ipynb

fix typo

Co-authored-by: Carly Richmond <74931905+carlyrichmond@users.noreply.github.com>

* fixed index name

* typo

---------

Co-authored-by: Carly Richmond <74931905+carlyrichmond@users.noreply.github.com>
---
 .../re-ranking-elasticsearch-hosted.ipynb | 819 ++++++++++++++++++
 1 file changed, 819 insertions(+)
 create mode 100644 supporting-blog-content/re-ranking-elasticsearch-hosted/re-ranking-elasticsearch-hosted.ipynb

diff --git a/supporting-blog-content/re-ranking-elasticsearch-hosted/re-ranking-elasticsearch-hosted.ipynb b/supporting-blog-content/re-ranking-elasticsearch-hosted/re-ranking-elasticsearch-hosted.ipynb
new file mode 100644
index 00000000..281f8a47
--- /dev/null
+++ b/supporting-blog-content/re-ranking-elasticsearch-hosted/re-ranking-elasticsearch-hosted.ipynb
@@ -0,0 +1,819 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "provenance": []
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "source": [
+ "*Reranking with a locally hosted reranker model from Hugging Face*"
+ ],
+ "metadata": {
+ "id": "bW9q8qD_bPhY"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Set up the notebook"
+ ],
+ "metadata": {
+ "id": "BecBOzyDbWik"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Install required libraries"
+ ],
+ "metadata": {
+ "id": "6ayhDP72bZAe"
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "2Xz9uWQFbNkH"
+ },
+ "outputs": [],
+ "source": [
+ "!pip install -qqU elasticsearch\n",
+ "!pip install -qqU eland[pytorch]\n",
+ "!pip install -qqU datasets"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Import the required Python libraries"
+ ],
+ "metadata": {
+ "id": "LgHQaJh0bmJQ"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "import os\n",
+ "from elasticsearch import Elasticsearch, helpers, exceptions\n",
+ "from urllib.request import urlopen\n",
+ "from getpass import getpass\n",
+ "import json\n",
+ "import time\n",
+ "from datasets import load_dataset\n",
+ "import pandas as pd"
+ ],
+ "metadata": {
+ "id": "CsL466H0bjNX"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Create an Elasticsearch Python client\n",
+ "\n"
+ ],
+ "metadata": {
+ "id": "gsQ4XIpkbpd4"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Free Trial\n",
+ "If you don't have an Elasticsearch cluster, or want one to test with, head over to [cloud.elastic.co](https://cloud.elastic.co/registration?onboarding_token=search&cta=cloud-registration&tech=trial&plcmt=article%20content&pg=search-labs) and sign up. You can have a serverless project up and running in only a few minutes!"
+ ],
+ "metadata": {
+ "id": "M_JppGte1Wc5"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We are using an Elastic Cloud `cloud_id` and a deployment (cluster) API key.\n",
+ "\n",
+ "[See this guide](https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud) for finding the `cloud_id` and creating an `api_key`."
+ ],
+ "metadata": {
+ "id": "gdsEjwhV1tr5"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "cloud_id = getpass(prompt=\"Enter your Elasticsearch Cloud ID: \")\n",
+ "api_key = getpass(prompt=\"Enter your Elasticsearch API key: \")\n",
+ "\n",
+ "\n",
+ "es = Elasticsearch(cloud_id=cloud_id, api_key=api_key)\n",
+ "\n",
+ "try:\n",
+ "    es.info()\n",
+ "    print(\"Successfully connected to Elasticsearch!\")\n",
+ "except exceptions.ConnectionError as e:\n",
+ "    print(f\"Error connecting to Elasticsearch: {e}\")"
+ ],
+ "metadata": {
+ "id": "UY5WCB0HUVTb"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
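+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Self-managed alternative (optional)\n",
+ "If you are connecting to a self-managed cluster instead of Elastic Cloud, you can build the client from a URL. This is a minimal sketch, not part of the original walkthrough; the host, port, and certificate path below are placeholders you would swap for your own values."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Optional: connect to a self-managed cluster instead of Elastic Cloud.\n",
+ "# The URL and certificate path below are placeholders for your own deployment.\n",
+ "# es = Elasticsearch(\n",
+ "#     \"https://localhost:9200\",  # your cluster's HTTPS endpoint\n",
+ "#     api_key=api_key,  # or basic_auth=(\"user\", \"password\")\n",
+ "#     ca_certs=\"/path/to/http_ca.crt\",  # CA cert if the cluster is self-signed\n",
+ "# )"
+ ],
+ "metadata": {},
+ "execution_count": null,
+ "outputs": []
+ },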

\n", + "**Note**:\n", + "While we are importing the model for use as a reranker, Eland and Elasticsearch do not have a dedicated rerank task type, so we still use `text_similarity`" + ], + "metadata": { + "id": "5bsLLnqCfNKk" + } + }, + { + "cell_type": "code", + "source": [ + "model_id = \"cross-encoder/ms-marco-MiniLM-L-6-v2\"\n", + "\n", + "!eland_import_hub_model \\\n", + " --cloud-id $cloud_id \\\n", + " --es-api-key $api_key \\\n", + " --hub-model-id $model_id \\\n", + " --task-type text_similarity" + ], + "metadata": { + "id": "J2MTEYrUfk9R" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Create Inference Endpoint\n", + "Run this cell to:\n", + "- Create an inference Endpoint\n", + "- Deploy the reranking model we impoted in the previous section\n", + "We need to create an endpoint queries can use for reranking\n", + "\n", + "Key points about the `model_config`\n", + "- `service` - in this case `elasticsearch` will tell the inference API to use a locally hosted (in Elasticsearch) model\n", + "- `num_allocations` sets the number of allocations to 1\n", + " - Allocations are independent units of work for NLP tasks. Scaling this allows for an increase in concurrent throughput\n", + "- `num_threads` - sets the number of threads per allocation to 1\n", + " - Threads per allocation affect the number of threads used by each allocation during inference. Scaling this generally increased the speed of inference requests (to a point).\n", + "- `model_id` - This is the id of the model as it is named in Elasticsearch\n", + "\n" + ], + "metadata": { + "id": "-rrQV6SAgWz8" + } + }, + { + "cell_type": "code", + "source": [ + "model_config = {\n", + " \"service\": \"elasticsearch\",\n", + " \"service_settings\": {\n", + " \"num_allocations\": 1,\n", + " \"num_threads\": 1,\n", + " \"model_id\": \"cross-encoder__ms-marco-minilm-l-6-v2\",\n", + " },\n", + " \"task_settings\": {\"return_documents\": True},\n", + "}\n", + "\n", + "inference_id = \"semantic-reranking\"\n", + "\n", + "create_endpoint = es.inference.put(\n", + " inference_id=inference_id, task_type=\"rerank\", body=model_config\n", + ")\n", + "\n", + "create_endpoint.body" + ], + "metadata": { + "id": "Abu084BYgWCE", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "d58cc940-281e-4d56-d6e6-040e4881e78a" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'inference_id': 'semantic-reranking',\n", + " 'task_type': 'rerank',\n", + " 'service': 'elasticsearch',\n", + " 'service_settings': {'num_allocations': 1,\n", + " 'num_threads': 1,\n", + " 'model_id': 'cross-encoder__ms-marco-minilm-l-6-v2'},\n", + " 'task_settings': {'return_documents': True}}" + ] + }, + "metadata": {}, + "execution_count": 6 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "### Verify it was created\n", + "\n", + "- Run the two cells in this section to verify:\n", + "- The Inference Endpoint has been completed\n", + "- The model has been deployed\n", + "\n", + "You should see JSON output with information about the semantic endpoint" + ], + "metadata": { + "id": "X8rQXMrHhMkS" + } + }, + { + "cell_type": "code", + "source": [ + "check_endpoint = es.inference.get(\n", + " inference_id=inference_id,\n", + ")\n", + "\n", + "check_endpoint.body" + ], + "metadata": { + "id": "n3Yk7rgYhP-N", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "d9e68225-5796-411e-964a-6db3be5541aa" + }, + 
"execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'endpoints': [{'inference_id': 'semantic-reranking',\n", + " 'task_type': 'rerank',\n", + " 'service': 'elasticsearch',\n", + " 'service_settings': {'num_allocations': 1,\n", + " 'num_threads': 1,\n", + " 'model_id': 'cross-encoder__ms-marco-minilm-l-6-v2'},\n", + " 'task_settings': {'return_documents': True}}]}" + ] + }, + "metadata": {}, + "execution_count": 7 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Create the index mapping\n", + "\n", + "We are going to index the `title` and `abstract` from the dataset. " + ], + "metadata": { + "id": "4vqimyNWAhWb" + } + }, + { + "cell_type": "code", + "source": [ + "index_name = \"arxiv-papers\"\n", + "\n", + "index_mapping = {\n", + " \"mappings\": {\n", + " \"properties\": {\"title\": {\"type\": \"text\"}, \"abstract\": {\"type\": \"text\"}}\n", + " }\n", + "}\n", + "\n", + "\n", + "try:\n", + " es.indices.create(index=index_name, body=index_mapping)\n", + " print(f\"Index '{index_name}' created successfully.\")\n", + "except exceptions.RequestError as e:\n", + " if e.error == \"resource_already_exists_exception\":\n", + " print(f\"Index '{index_name}' already exists.\")\n", + " else:\n", + " print(f\"Error creating index '{index_name}': {e}\")" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "4bc9ba0b-9be3-410a-d1d6-f1da04bbfec7", + "id": "DPADF_7ytTmR" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Index 'arxiv-papers' created successfully.\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Ready the dataset\n", + "We are going to use the [CShorten/ML-ArXiv-Papers](https://huggingface.co/datasets/CShorten/ML-ArXiv-Papers) dataset." + ], + "metadata": { + "id": "FqQmaT5P-Nhx" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Download Dataset\n", + "**Note** You may get a warning *The secret `HF_TOKEN` does not exist in your Colab secrets*.\n", + "\n", + "You can safely ignore this." + ], + "metadata": { + "id": "aN0dbYO7oB47" + } + }, + { + "cell_type": "code", + "source": [ + "dataset = load_dataset(\"CShorten/ML-ArXiv-Papers\")" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "IVnpj5bBoEBL", + "outputId": "bc6371d9-d66f-482c-95f8-8cdb89e4f0ef" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: \n", + "The secret `HF_TOKEN` does not exist in your Colab secrets.\n", + "To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n", + "You will be able to reuse this secret in all of your notebooks.\n", + "Please note that authentication is recommended but still optional to access public models or datasets.\n", + " warnings.warn(\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "### Index into Elasticsearch\n", + "\n", + "We will loop through the dataset and send batches of rows to Elasticsearch\n", + "- This may take a couple minutes depending on your cluster sizing." 
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Query with Reranking\n",
+ "\n",
+ "This query contains a `text_similarity_reranker` retriever which:\n",
+ "1. Uses a standard retriever to:\n",
+ "   1. Perform a lexical query against the `title` field\n",
+ "2. Performs a reranking which:\n",
+ "   1. Takes as input the top 100 results from the previous search\n",
+ "      - `\"rank_window_size\": 100`\n",
+ "   2. Takes as input the query\n",
+ "      - `\"inference_text\": query`\n",
+ "   3. Uses our previously created reranking endpoint and model\n"
+ ],
+ "metadata": {
+ "id": "2bwvzLfRjJ2n"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "query = \"sparse vector embedding\"\n",
+ "\n",
+ "# Query using lexical scoring only\n",
+ "response_scored = es.search(\n",
+ "    index=\"arxiv-papers\",\n",
+ "    body={\n",
+ "        \"size\": 10,\n",
+ "        \"retriever\": {\"standard\": {\"query\": {\"match\": {\"title\": query}}}},\n",
+ "        \"fields\": [\"title\", \"abstract\"],\n",
+ "        \"_source\": False,\n",
+ "    },\n",
+ ")\n",
+ "\n",
+ "# Query with Semantic Reranker\n",
+ "response_reranked = es.search(\n",
+ "    index=\"arxiv-papers\",\n",
+ "    body={\n",
+ "        \"size\": 10,\n",
+ "        \"retriever\": {\n",
+ "            \"text_similarity_reranker\": {\n",
+ "                \"retriever\": {\"standard\": {\"query\": {\"match\": {\"title\": query}}}},\n",
+ "                \"field\": \"abstract\",\n",
+ "                \"inference_id\": \"semantic-reranking\",\n",
+ "                \"inference_text\": query,\n",
+ "                \"rank_window_size\": 100,\n",
+ "            }\n",
+ "        },\n",
+ "        \"fields\": [\"title\", \"abstract\"],\n",
+ "        \"_source\": False,\n",
+ "    },\n",
+ ")"
+ ],
+ "metadata": {
+ "id": "HWXQBS35jQ3n"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
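+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Calling the reranker directly (optional)\n",
+ "Before comparing full searches, it can help to see what the rerank endpoint returns on its own. A minimal sketch, assuming your `elasticsearch` client version exposes `es.inference.inference` for the rerank task; the toy passages below are made up for illustration:"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Rerank a few toy passages against a query, outside of any search\n",
+ "rerank_response = es.inference.inference(\n",
+ "    inference_id=\"semantic-reranking\",\n",
+ "    task_type=\"rerank\",\n",
+ "    query=\"sparse vector embedding\",\n",
+ "    input=[\n",
+ "        \"Dense vector embeddings for semantic search\",\n",
+ "        \"A cookbook of sourdough bread recipes\",\n",
+ "        \"Sparse representations in high-dimensional vector spaces\",\n",
+ "    ],\n",
+ ")\n",
+ "\n",
+ "# Inspect the raw response; each input comes back with a relevance score\n",
+ "print(rerank_response)"
+ ],
+ "metadata": {},
+ "execution_count": null,
+ "outputs": []
+ },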
"text/html": [ + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Comparison of Scored and Semantic Reranked Results - Query: 'sparse vector embedding'
 Scored ResultsReranked Results
0Compact Speaker Embedding: lrx-vectorScaling Up Sparse Support Vector Machines by Simultaneous Feature and\n", + " Sample Reduction
1Quantum Sparse Support Vector MachinesSpaceland Embedding of Sparse Stochastic Graphs
2Sparse Support Vector Infinite PushElliptical Ordinal Embedding
3The Sparse Vector Technique, RevisitedMinimum-Distortion Embedding
4L-Vector: Neural Label Embedding for Domain AdaptationFree Gap Information from the Differentially Private Sparse Vector and\n", + " Noisy Max Mechanisms
5Spaceland Embedding of Sparse Stochastic GraphsInterpolated Discretized Embedding of Single Vectors and Vector Pairs\n", + " for Classification, Metric Learning and Distance Approximation
6Sparse Signal Recovery in the Presence of Intra-Vector and Inter-Vector\n", + " CorrelationAttention Word Embedding
7Stable Sparse Subspace Embedding for Dimensionality ReductionBinary Speaker Embedding
8Auto-weighted Mutli-view Sparse Reconstructive EmbeddingNetSMF: Large-Scale Network Embedding as Sparse Matrix Factorization
9Embedding Words in Non-Vector Space with Unsupervised Graph LearningEstimating Vector Fields on Manifolds and the Embedding of Directed\n", + " Graphs
\n" + ] + }, + "metadata": {}, + "execution_count": 17 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Print out Title and Abstract\n", + "This will print the title and the abstract for the top 10 results after semantic reranking." + ], + "metadata": { + "id": "A0HyNZoWyeun" + } + }, + { + "cell_type": "code", + "source": [ + "for paper in response_scored.body[\"hits\"][\"hits\"]:\n", + " print(\n", + " f\"Title {paper['fields']['title'][0]} \\n Abstract: {paper['fields']['abstract'][0]}\"\n", + " )" + ], + "metadata": { + "id": "4ZEx-46rn3in", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "e83318ad-ca42-4aa7-98d4-37c4428eb70a" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Title Compact Speaker Embedding: lrx-vector \n", + " Abstract: Deep neural networks (DNN) have recently been widely used in speaker\n", + "recognition systems, achieving state-of-the-art performance on various\n", + "benchmarks. The x-vector architecture is especially popular in this research\n", + "community, due to its excellent performance and manageable computational\n", + "complexity. In this paper, we present the lrx-vector system, which is the\n", + "low-rank factorized version of the x-vector embedding network. The primary\n", + "objective of this topology is to further reduce the memory requirement of the\n", + "speaker recognition system. We discuss the deployment of knowledge distillation\n", + "for training the lrx-vector system and compare against low-rank factorization\n", + "with SVD. On the VOiCES 2019 far-field corpus we were able to reduce the\n", + "weights by 28% compared to the full-rank x-vector system while keeping the\n", + "recognition rate constant (1.83% EER).\n", + "\n", + "Title Quantum Sparse Support Vector Machines \n", + " Abstract: We analyze the computational complexity of Quantum Sparse Support Vector\n", + "Machine, a linear classifier that minimizes the hinge loss and the $L_1$ norm\n", + "of the feature weights vector and relies on a quantum linear programming solver\n", + "instead of a classical solver. Sparse SVM leads to sparse models that use only\n", + "a small fraction of the input features in making decisions, and is especially\n", + "useful when the total number of features, $p$, approaches or exceeds the number\n", + "of training samples, $m$. We prove a $\\Omega(m)$ worst-case lower bound for\n", + "computational complexity of any quantum training algorithm relying on black-box\n", + "access to training samples; quantum sparse SVM has at least linear worst-case\n", + "complexity. However, we prove that there are realistic scenarios in which a\n", + "sparse linear classifier is expected to have high accuracy, and can be trained\n", + "in sublinear time in terms of both the number of training samples and the\n", + "number of features.\n", + "\n", + "Title Sparse Support Vector Infinite Push \n", + " Abstract: In this paper, we address the problem of embedded feature selection for\n", + "ranking on top of the list problems. We pose this problem as a regularized\n", + "empirical risk minimization with $p$-norm push loss function ($p=\\infty$) and\n", + "sparsity inducing regularizers. We leverage the issues related to this\n", + "challenging optimization problem by considering an alternating direction method\n", + "of multipliers algorithm which is built upon proximal operators of the loss\n", + "function and the regularizer. 
Our main technical contribution is thus to\n", + "provide a numerical scheme for computing the infinite push loss function\n", + "proximal operator. Experimental results on toy, DNA microarray and BCI problems\n", + "show how our novel algorithm compares favorably to competitors for ranking on\n", + "top while using fewer variables in the scoring function.\n", + "\n", + "Title The Sparse Vector Technique, Revisited \n", + " Abstract: We revisit one of the most basic and widely applicable techniques in the\n", + "literature of differential privacy - the sparse vector technique [Dwork et al.,\n", + "STOC 2009]. This simple algorithm privately tests whether the value of a given\n", + "query on a database is close to what we expect it to be. It allows to ask an\n", + "unbounded number of queries as long as the answer is close to what we expect,\n", + "and halts following the first query for which this is not the case.\n", + " We suggest an alternative, equally simple, algorithm that can continue\n", + "testing queries as long as any single individual does not contribute to the\n", + "answer of too many queries whose answer deviates substantially form what we\n", + "expect. Our analysis is subtle and some of its ingredients may be more widely\n", + "applicable. In some cases our new algorithm allows to privately extract much\n", + "more information from the database than the original.\n", + " We demonstrate this by applying our algorithm to the shifting heavy-hitters\n", + "problem: On every time step, each of $n$ users gets a new input, and the task\n", + "is to privately identify all the current heavy-hitters. That is, on time step\n", + "$i$, the goal is to identify all data elements $x$ such that many of the users\n", + "have $x$ as their current input. We present an algorithm for this problem with\n", + "improved error guarantees over what can be obtained using existing techniques.\n", + "Specifically, the error of our algorithm depends on the maximal number of times\n", + "that a single user holds a heavy-hitter as input, rather than the total number\n", + "of times in which a heavy-hitter exists.\n", + "\n", + "Title L-Vector: Neural Label Embedding for Domain Adaptation \n", + " Abstract: We propose a novel neural label embedding (NLE) scheme for the domain\n", + "adaptation of a deep neural network (DNN) acoustic model with unpaired data\n", + "samples from source and target domains. With NLE method, we distill the\n", + "knowledge from a powerful source-domain DNN into a dictionary of label\n", + "embeddings, or l-vectors, one for each senone class. Each l-vector is a\n", + "representation of the senone-specific output distributions of the source-domain\n", + "DNN and is learned to minimize the average L2, Kullback-Leibler (KL) or\n", + "symmetric KL distance to the output vectors with the same label through simple\n", + "averaging or standard back-propagation. During adaptation, the l-vectors serve\n", + "as the soft targets to train the target-domain model with cross-entropy loss.\n", + "Without parallel data constraint as in the teacher-student learning, NLE is\n", + "specially suited for the situation where the paired target-domain data cannot\n", + "be simulated from the source-domain data. We adapt a 6400 hours\n", + "multi-conditional US English acoustic model to each of the 9 accented English\n", + "(80 to 830 hours) and kids' speech (80 hours). 
NLE achieves up to 14.1%\n", + "relative word error rate reduction over direct re-training with one-hot labels.\n", + "\n", + "Title Spaceland Embedding of Sparse Stochastic Graphs \n", + " Abstract: We introduce a nonlinear method for directly embedding large, sparse,\n", + "stochastic graphs into low-dimensional spaces, without requiring vertex\n", + "features to reside in, or be transformed into, a metric space. Graph data and\n", + "models are prevalent in real-world applications. Direct graph embedding is\n", + "fundamental to many graph analysis tasks, in addition to graph visualization.\n", + "We name the novel approach SG-t-SNE, as it is inspired by and builds upon the\n", + "core principle of t-SNE, a widely used method for nonlinear dimensionality\n", + "reduction and data visualization. We also introduce t-SNE-$\\Pi$, a\n", + "high-performance software for 2D, 3D embedding of large sparse graphs on\n", + "personal computers with superior efficiency. It empowers SG-t-SNE with modern\n", + "computing techniques for exploiting in tandem both matrix structures and memory\n", + "architectures. We present elucidating embedding results on one synthetic graph\n", + "and four real-world networks.\n", + "\n", + "Title Sparse Signal Recovery in the Presence of Intra-Vector and Inter-Vector\n", + " Correlation \n", + " Abstract: This work discusses the problem of sparse signal recovery when there is\n", + "correlation among the values of non-zero entries. We examine intra-vector\n", + "correlation in the context of the block sparse model and inter-vector\n", + "correlation in the context of the multiple measurement vector model, as well as\n", + "their combination. Algorithms based on the sparse Bayesian learning are\n", + "presented and the benefits of incorporating correlation at the algorithm level\n", + "are discussed. The impact of correlation on the limits of support recovery is\n", + "also discussed highlighting the different impact intra-vector and inter-vector\n", + "correlations have on such limits.\n", + "\n", + "Title Stable Sparse Subspace Embedding for Dimensionality Reduction \n", + " Abstract: Sparse random projection (RP) is a popular tool for dimensionality reduction\n", + "that shows promising performance with low computational complexity. However, in\n", + "the existing sparse RP matrices, the positions of non-zero entries are usually\n", + "randomly selected. Although they adopt uniform sampling with replacement, due\n", + "to large sampling variance, the number of non-zeros is uneven among rows of the\n", + "projection matrix which is generated in one trial, and more data information\n", + "may be lost after dimension reduction. To break this bottleneck, based on\n", + "random sampling without replacement in statistics, this paper builds a stable\n", + "sparse subspace embedded matrix (S-SSE), in which non-zeros are uniformly\n", + "distributed. It is proved that the S-SSE is stabler than the existing matrix,\n", + "and it can maintain Euclidean distance between points well after dimension\n", + "reduction. Our empirical studies corroborate our theoretical findings and\n", + "demonstrate that our approach can indeed achieve satisfactory performance.\n", + "\n", + "Title Auto-weighted Mutli-view Sparse Reconstructive Embedding \n", + " Abstract: With the development of multimedia era, multi-view data is generated in\n", + "various fields. Contrast with those single-view data, multi-view data brings\n", + "more useful information and should be carefully excavated. 
Therefore, it is\n", + "essential to fully exploit the complementary information embedded in multiple\n", + "views to enhance the performances of many tasks. Especially for those\n", + "high-dimensional data, how to develop a multi-view dimension reduction\n", + "algorithm to obtain the low-dimensional representations is of vital importance\n", + "but chanllenging. In this paper, we propose a novel multi-view dimensional\n", + "reduction algorithm named Auto-weighted Mutli-view Sparse Reconstructive\n", + "Embedding (AMSRE) to deal with this problem. AMSRE fully exploits the sparse\n", + "reconstructive correlations between features from multiple views. Furthermore,\n", + "it is equipped with an auto-weighted technique to treat multiple views\n", + "discriminatively according to their contributions. Various experiments have\n", + "verified the excellent performances of the proposed AMSRE.\n", + "\n", + "Title Embedding Words in Non-Vector Space with Unsupervised Graph Learning \n", + " Abstract: It has become a de-facto standard to represent words as elements of a vector\n", + "space (word2vec, GloVe). While this approach is convenient, it is unnatural for\n", + "language: words form a graph with a latent hierarchical structure, and this\n", + "structure has to be revealed and encoded by word embeddings. We introduce\n", + "GraphGlove: unsupervised graph word representations which are learned\n", + "end-to-end. In our setting, each word is a node in a weighted graph and the\n", + "distance between words is the shortest path distance between the corresponding\n", + "nodes. We adopt a recent method learning a representation of data in the form\n", + "of a differentiable weighted graph and use it to modify the GloVe training\n", + "algorithm. We show that our graph-based representations substantially\n", + "outperform vector-based methods on word similarity and analogy tasks. Our\n", + "analysis reveals that the structure of the learned graphs is hierarchical and\n", + "similar to that of WordNet, the geometry is highly non-trivial and contains\n", + "subgraphs with different local topology.\n", + "\n" + ] + } + ] + } + ] +} \ No newline at end of file