{
"cells": [
{
"cell_type": "markdown",
"id": "fbf190e5-3040-452b-bc63-e0cafe2a711f",
"metadata": {},
"source": [
"# Topic modeling with AutoTM"
]
},
{
"cell_type": "markdown",
"id": "5e5bc8d9-83c8-4194-a9cb-cd89636572ae",
"metadata": {},
"source": [
"Topic Modeling is a powerful technique that unveils the hidden structure of textual corpora, transforming them into intuitive topics and their representations within texts. This approach significantly enhances the interpretability of complex datasets, making it a breeze to extract meaningful insights and comprehend vast amounts of information."
]
},
{
"cell_type": "markdown",
"id": "b0238f02-ded3-4945-b1ec-4cfb72d5f9b6",
"metadata": {},
"source": [
"In this tutorial we will train topic modeling on the set of imdb reviews to understand the main topics."
]
},
{
"cell_type": "markdown",
"id": "b4947fbb-3b3e-418c-a4eb-b0288d1d807e",
"metadata": {},
"source": [
"### Installation"
]
},
{
"cell_type": "markdown",
"id": "0ca1c7ec-5da5-44b4-b033-aaf05cac2e0f",
"metadata": {},
"source": [
"Pip version is currently available only for linux system. You should also download ```en_core_web_sm``` from ```spacy``` for correct dataset preprocessing. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9e6ba89e-c8b6-48c0-9fed-5e69882055df",
"metadata": {},
"outputs": [],
"source": [
"! pip install autotm\n",
"! python -m spacy download en_core_web_sm"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2c9053b4-d848-46b9-9026-199e9b52f405",
"metadata": {},
"outputs": [],
"source": [
"from autotm.base import AutoTM\n",
"import pandas as pd\n",
"import logging"
]
},
{
"cell_type": "markdown",
"id": "aca59e21-1cac-4086-a03e-af170027b637",
"metadata": {},
"source": [
"Now let's load nesessary for English datasets nltk package"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "96f35d96-ca16-4399-90ce-28e547015d6f",
"metadata": {},
"outputs": [],
"source": [
"import nltk \n",
"nltk.download('averaged_perceptron_tagger')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8b4916f3-31d7-4cda-944f-3bfcbeae7b7b",
"metadata": {},
"outputs": [],
"source": [
"logging.basicConfig(\n",
" format=\"%(asctime)s - %(name)s - %(levelname)s - %(message)s\",\n",
" level=logging.INFO, datefmt='%Y-%m-%d %H:%M:%S'\n",
")\n",
"logger = logging.getLogger()"
]
},
{
"cell_type": "markdown",
"id": "5eb34b00-be20-4338-aa53-391603b10cac",
"metadata": {},
"source": [
"### Dataset"
]
},
{
"cell_type": "markdown",
"id": "6d866e2d-3524-4423-b746-b569e766124f",
"metadata": {},
"source": [
"First of all let's download the dataset from Huggingface Datasets. If you do not have the Datasets library you should first install it with ```pip install --quiet datasets``` or you can load your own ```csv``` dataset. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aa8663e5-fae9-44e4-bc92-d146ed0e8774",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"from datasets import load_dataset\n",
"\n",
"dataset = load_dataset(\"SetFit/20_newsgroups\")\n",
"pd_dataset = dataset['train'].to_pandas()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a522277c-0c11-4504-a4ab-1719b2456ff3",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"pd_dataset.shape"
]
},
{
"cell_type": "markdown",
"id": "02535255-8ead-4a13-8388-c712cc7e9b83",
"metadata": {},
"source": [
"In 20 Newsgroups dataset text is in \"text\" column and we will use it s a modeling target "
]
},
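{
"cell_type": "markdown",
"id": "1a2b3c4d-90ab-4cde-8f01-23456789abc1",
"metadata": {},
"source": [
"Before modeling, it is worth glancing at a few raw documents. This is plain ```pandas```, nothing AutoTM-specific:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1a2b3c4d-90ab-4cde-8f01-23456789abc2",
"metadata": {},
"outputs": [],
"source": [
"# Peek at the first few raw documents in the \"text\" column\n",
"pd_dataset['text'].head()"
]
},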
{
"cell_type": "markdown",
"id": "8a72f311-816b-4fa4-88b5-0785a6f3bf93",
"metadata": {},
"source": [
"In AutoTM we have have the basic object ```AutoTM``` that can be used with default parameters or configured for your specific dataset.\n",
"- Basically user should set ```topic_count``` - the number of topics that should be obtained; column name that contain text to process ```texts_column_name``` and ```working_dir_path``` to store the results\n",
"- AutoTm implements dataset preprocessing procedure, so user only needs to define language (special preprocessing is implemented for 'en' and 'ru')\n",
"- User can also manipulate with ```alg_params``` and change algorithms from genetic to bayesian or select another way of quality calculation"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ebbc7f46-7b1a-4d3b-bc2b-00e943adc3c6",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"autotm = AutoTM(\n",
" topic_count=25,\n",
" texts_column_name='text',\n",
" preprocessing_params={\n",
" \"lang\": \"en\", # available languages with special preprocessing options: \"ru\" and \"en\", if you have dataset in another language do not set this parameter\n",
" },\n",
" working_dir_path='tmp',\n",
" alg_params={\n",
" \"num_iterations\": 50, # setting iteration number to 30 or you can use default parameter\n",
" \"use_pipeline\": True, # the latest default version of GA-based algorithm (default version), set it to False if you want to use the previous version\n",
" # \"individual_type\": \"llm\", # if you want to use llm as a quality measure \n",
" # \"surrogate_name\": \"random-forest-regressor\" # enable surrogate modeling to speed up computation\n",
" },\n",
" )"
]
},
{
"cell_type": "markdown",
"id": "3b6734e5-5afd-420b-924b-6b6879c80c40",
"metadata": {},
"source": [
"If you worked with ```sklearn``` library than in ```AutoTM``` you should also be comfortable with ```fit```, ```predict``` and their combined version ```fit_predict```. As a reault of ```fit``` you will get a fitted ```autotm``` model that is tuned to your data."
]
},
{
"cell_type": "markdown",
"id": "6176c9d6-6e0d-4556-b60e-73332e41e5cf",
"metadata": {},
"source": [
"Let's process the dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cc718672-14a8-4292-b7b3-9845d8a168fb",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"mixtures = autotm.fit_predict(pd_dataset.sample(1000)) # we will do the modeling on 1000 random examples from ACL-23 dataset"
]
},
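{
"cell_type": "markdown",
"id": "2b3c4d5e-90ab-4cde-8f01-23456789abc3",
"metadata": {},
"source": [
"```fit_predict``` returns the per-document topic mixtures. As a small sketch, assuming ```mixtures``` is a ```pandas``` DataFrame with one column per topic (check the actual return type in the docs), we can inspect it and find the dominant topic of each document:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2b3c4d5e-90ab-4cde-8f01-23456789abc4",
"metadata": {},
"outputs": [],
"source": [
"# Assumption: mixtures is a DataFrame of per-document topic probabilities,\n",
"# one column per topic\n",
"print(mixtures.head())\n",
"\n",
"# Dominant topic per document = column with the highest probability\n",
"dominant_topic = mixtures.idxmax(axis=1)\n",
"print(dominant_topic.value_counts().head())"
]
},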
{
"cell_type": "markdown",
"id": "e57e1449-cf14-47ca-9610-7675fdc13b6d",
"metadata": {},
"source": [
"Now we are going to look at resulting topics. We defined 25 topics, so they can be accessed by \"mainN\" key"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "23707b8c-9a41-49ca-9bc7-41676ccfddd1",
"metadata": {},
"outputs": [],
"source": [
"print(autotm.topics['main11'])"
]
},
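{
"cell_type": "markdown",
"id": "3c4d5e6f-90ab-4cde-8f01-23456789abc5",
"metadata": {},
"source": [
"Since ```autotm.topics``` is accessed like a dictionary of topic names to top words, we can also print all 25 topics in one loop (a sketch assuming a standard dict-like interface with ```items()```):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3c4d5e6f-90ab-4cde-8f01-23456789abc6",
"metadata": {},
"outputs": [],
"source": [
"# Assumption: autotm.topics supports the standard dict interface\n",
"for topic_name, top_words in autotm.topics.items():\n",
"    print(topic_name, top_words)"
]
},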
{
"cell_type": "markdown",
"id": "21645619-16a5-41de-87b9-e5966783f914",
"metadata": {},
"source": [
"If user wants to save the resulting model "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f2210c27-b8c7-4a99-889e-8128fd49632d",
"metadata": {},
"outputs": [],
"source": [
"autotm.save('model_artm')"
]
},
{
"cell_type": "markdown",
"id": "68a73054-7e7c-4607-8376-bf642bf223d9",
"metadata": {},
"source": [
"Trained model structure:\n",
"```\n",
"|model_artm\n",
"| -- artm_model\n",
"| -- | -- n_wt.bin\n",
"| -- | -- p_wt.bin\n",
"| -- | -- parameters.bin\n",
"| -- | -- parameters.json\n",
"| -- | -- scre_tracker.bin\n",
"| -- autotm_data\n",
"```"
]
},
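{
"cell_type": "markdown",
"id": "4d5e6f70-90ab-4cde-8f01-23456789abc7",
"metadata": {},
"source": [
"To reuse the model later, it can be loaded back and applied to new texts with ```predict```. Here we assume ```AutoTM``` provides a ```load``` counterpart to ```save``` (check the library docs for the exact signature):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4d5e6f70-90ab-4cde-8f01-23456789abc8",
"metadata": {},
"outputs": [],
"source": [
"# Assumption: AutoTM.load() is the counterpart of save()\n",
"loaded_autotm = AutoTM.load('model_artm')\n",
"\n",
"# Infer topic mixtures for previously unseen documents\n",
"new_mixtures = loaded_autotm.predict(pd_dataset.sample(10))\n",
"new_mixtures.head()"
]
},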
{
"cell_type": "code",
"execution_count": null,
"id": "995475fd-a397-4e9d-aa81-edaf1f246ad0",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "autotm-env",
"language": "python",
"name": "autotm-env"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
